📊 ArXiv Research Report (2026-03-21)

Generated: 2026-03-21 09:18:07 Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing-score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
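
The passing-score arithmetic can be checked with a few lines of Python. This is a minimal sketch, not the report generator's actual code: the function names `passing_score` and `total_score` are hypothetical, and the per-paper total is assumed to be the sum of weight × relevance over all keywords, which matches the score tables in this report.

```python
def passing_score(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Passing score = base + slope * total weight (formula from the scoring settings)."""
    return base + slope * total_weight


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum of weight * relevance over (weight, relevance) rows.

    Assumption: this is how per-keyword scores are aggregated; it is
    consistent with the score tables but not confirmed by the report.
    """
    return sum(weight * relevance for weight, relevance in rows)


threshold = passing_score(27.0)          # 27 keywords, each with weight 1.0
print(round(threshold, 1))               # 26.6

# Example: a paper with seven keywords rated 10/10 and the rest rated 0/10.
rows = [(1.0, 10.0)] * 7 + [(1.0, 0.0)] * 20
print(total_score(rows) >= threshold)    # True
```

With 27 keywords of weight 1.0 the threshold is 5.0 + 0.8 × 27.0 = 26.6, the value listed as the current passing score.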

📈 Paper statistics

Total fetched: 305 papers
Passing papers: 13 (4.3%)
In-depth analysis: 8 papers

⭐ Detailed analysis of passing papers

1. MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

Authors: Chenyang Gu, Jiahao Cheng, Meicong Zhang, Pujun Zheng, Jinquan Zheng, Guoxiu He Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19044v1

Score: 70.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

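Scoring rationale: The paper proposes the MoRI framework, which aims to improve LLM reasoning in scientific ideation tasks. The core relevant keywords are: 1) “Large Language Models” (the framework is built on LLMs); 2) “Post-training” (the model is initialized with supervised fine-tuning); 3) “RLHF” (training uses a composite reinforcement learning reward); 4) “Chain of Thought” (the framework explicitly learns the reasoning process from research motivation to method); 5) “System 2 Thinking” (it emphasizes in-depth reasoning and avoids surface-level conceptual recombination); 6) “LLM Agents” (the paper belongs to LLM agent research and improves on existing agentic methods); 7) “AI for Science” (it is applied to scientific ideation, which falls under AI for Science). Other keywords such as MoE, quantization, and RAG are not covered in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

To address the insufficient reasoning ability of existing LLM-based agents in scientific ideation, this paper proposes the MoRI framework, which uses supervised fine-tuning and a reinforcement learning reward to explicitly learn the reasoning process from research motivation to method. Experiments show that it significantly outperforms existing methods in novelty, technical rigor, and feasibility.

Abstract (Chinese translation)

科学构思旨在给定科学背景下提出新颖解决方案。现有基于大语言模型（LLM）的智能体方法虽模拟人类研究流程，却未能充分建模科学推理过程，导致其产出多为缺乏技术深度与科学依据的表层概念重组。为解决这一问题，我们提出 MoRI（基于动机的科学构思推理框架），该框架使大语言模型能够显式学习从研究动机到方法论的推理过程。基础大语言模型首先通过监督微调进行初始化，以从给定情境中生成研究动机，随后在复合强化学习奖励机制下进行训练，以逼近科学严谨性：（1）熵感知信息增益鼓励模型基于真实方法论揭示并阐述高复杂度的技术细节；（2）对比语义增益约束推理轨迹，确保其与科学有效解决方案保持概念一致。实验结果表明，MoRI 在创新性、技术严谨性和可行性等多个维度上显著优于主流商用大语言模型及复杂智能体基线方法。代码将在 GitHub（https://github.com/ECNU-Text-Computing/IdeaGeneration）平台开源。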

Abstract

Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to maintain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/IdeaGeneration).

Keywords: Large Language Models, Scientific Ideation, Reasoning, Supervised Fine-tuning, Reinforcement Learning, LLM Agents, AI for Science, Motivation-grounded Reasoning

2. TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Authors: Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli, Xiangjun Fan, Zhuokai Zhao, Lizhu Zhang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18411v1

Score: 56.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

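Scoring rationale: The paper's core topic is improving LLM reasoning. It proposes TARo, a method that performs alignment at inference time, which counts as an innovation in the technical principles of large models. Highly relevant keywords: LLMs (the core research subject), Alignment (a test-time alignment method), CoT Reasoning (improves reasoning performance), and System 2 Thinking (involves in-depth reasoning). Partially relevant keywords: Post-training (compared against conventional post-training), RLHF (part of the alignment toolbox), and AI for Science (applied to clinical reasoning). The remaining keywords are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To improve the reasoning ability of large language models, this paper proposes Token-level Adaptive Routing (TARo), which performs alignment at inference time and significantly improves mathematical and clinical reasoning performance.

Abstract (Chinese translation)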

大语言模型（LLMs）展现出强大的推理能力，但通常需要昂贵的后训练才能达到高性能。近期的测试时对齐方法提供了一种轻量级替代方案，但主要被探索用于偏好对齐而非推理任务。为填补这一空白，我们提出了令牌级自适应路由（Token-level Adaptive Routing, TARo），该方法在完全保持基础模型冻结的状态下，于推理阶段引导模型进行结构化推理。具体而言，我们首先在分步数学推导轨迹上训练奖励模型，以捕捉细粒度的逻辑一致性信号；随后引入一个可学习的令牌级路由器，自动控制奖励模型对基础模型的引导强度。大量实验表明，TARo在推理性能上相比基础模型显著提升高达+22.4%，较现有令牌级测试时对齐方法提升+8.4%，同时还能增强分布外临床推理（MedXpertQA）和指令遵循（AlpacaEval）能力。此外，TARo无需重新训练即可从小型骨干模型泛化至大型骨干模型，从而将测试时对齐的应用范围从偏好优化扩展至鲁棒的跨领域推理。

Abstract

Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose, Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.
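
The idea of a token-level router controlling how strongly a reward model guides a frozen base model can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not TARo's actual formulation: the additive mixing rule, the scalar `gate` in [0, 1] standing in for the router output, and the function names are all hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routed_distribution(
    base_logits: list[float], reward_scores: list[float], gate: float
) -> list[float]:
    """Mix frozen base-model logits with reward-model scores for one token.

    Hypothetical rule: mixed = base + gate * reward, where `gate` is the
    router's per-token guidance strength (0 = ignore the reward model).
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(base_logits) != len(reward_scores):
        raise ValueError("base_logits and reward_scores must have equal length")
    mixed = [b + gate * r for b, r in zip(base_logits, reward_scores)]
    return softmax(mixed)


base = [2.0, 1.0, 0.5]      # base model prefers token 0
reward = [-1.0, 0.0, 3.0]   # reward model prefers token 2
unguided = routed_distribution(base, reward, gate=0.0)
guided = routed_distribution(base, reward, gate=1.0)
print(unguided.index(max(unguided)), guided.index(max(guided)))
```

With the gate at 0 the output is the base model's own distribution. Raising the gate shifts probability toward tokens the reward model favors, and a learned router would set this value per token.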

Keywords: Large Language Models, Test-time Alignment, Reasoning, Token-level Routing, Mathematical Reasoning, Clinical Reasoning, Reward Model, Inference-time Steering

3. ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augme

Authors: Wanjia Zhao, Ludwig Schmidt, James Zou, Vidhisha Balachandran, Lingjiao Chen Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18614v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

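Scoring rationale: The paper's core topic is the coupling of reasoning and external actions in tool-augmented large language models (LLMs), so it is highly relevant to “Large Language Models”, “Tool Use”, and “LLM Agents” (10 points each). The paper stresses the need for “in-depth reasoning” and “multi-step reasoning”, making it highly relevant to “System 2 Thinking” and “Chain of Thought” (10 points each). Other keywords such as MoE, quantization, alignment, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes ZebraArena, a diagnostic simulation environment for studying reasoning-action coupling in tool-augmented LLMs. It finds that even frontier models such as GPT-5 reach only 60% accuracy on hard tasks and use 70-270% more tool calls than the theoretical optimum.

Abstract (Chinese translation)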

工具增强型大语言模型（LLM）必须将多步推理与外部行动紧密耦合，然而现有基准测试常因复杂的环境动态、记忆知识或数据集污染而混淆这种相互作用。本文提出ZebraArena——一个通过程序化生成、用于研究工具增强型LLM中推理-行动耦合的诊断性环境，其具备可控难度和知识最小化设计，能有效限制模型从记忆或数据集污染中获益。ZebraArena中的每个任务都需要一组关键信息，这些信息仅能通过针对性工具调用获取，从而在外部信息获取与演绎推理之间构建了可解释的接口。该设计通过唯一解实现确定性评估，并提供了理论最优查询次数以衡量工具使用效率。我们证明ZebraArena要求深度推理与精准外部工具调用的结合，这对前沿推理模型（如GPT-5和Gemini 2.5 Pro）仍具挑战性——它们在困难实例上仅达到60%的准确率。我们还观察到理论最优性与实际工具使用之间存在持续差距：例如GPT-5的工具调用次数比理论最优值高出70-270%。本文重点阐述了评估中的关键发现，期望ZebraArena能推动关于内部推理与外部行动交互机制的进一步研究。

Abstract

Tool-augmented large language models (LLMs) must tightly couple multi-step reasoning with external actions, yet existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge or dataset contamination. In this paper, we introduce ZebraArena, a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs, with controllable difficulty and a knowledge-minimal design, which limits gains from memorization or dataset contamination. Each task in ZebraArena requires a set of critical information which is available only through targeted tool use, yielding an interpretable interface between external information acquisition and deductive reasoning. This design provides deterministic evaluation via unique solutions, and a theoretical optimal query count for measuring efficient tool use. We show that ZebraArena requires a combination of in-depth reasoning and accurate external tool calling, which remains a challenge as frontier reasoning models such as GPT-5 and Gemini 2.5 Pro only achieves 60% accuracy on the hard instances. We also observe a persistent gaps between theoretical optimality and practical tool usage. For example, GPT-5 uses 70-270% more tool calls than the theoretical optimum. We highlight the key findings in our evaluation, and hope ZebraArena stimulates further research on the interplay between internal reasoning and external action.

Keywords: Tool-augmented LLMs, Reasoning-action coupling, Diagnostic simulation environment, Multi-step reasoning, External tool calling, In-depth reasoning, ZebraArena, Efficient tool use

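In-depth analysis:

ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented Large Models

Summary:

Existing benchmarks for tool-augmented large language models (LLMs) struggle to isolate the ability to couple reasoning with action. To address this, the paper proposes ZEBRAARENA, a diagnostic simulation environment based on the classic Zebra (Einstein) logic puzzle. The environment is procedurally generated and knowledge-minimal: by hiding key clues, it forces the model to call external tools to obtain information, so that reasoning and tool use can be evaluated at controllable difficulty. The results show that even frontier models such as GPT-5 reach only about 60% accuracy on hard instances and make substantially more tool calls than the theoretical optimum, revealing significant shortcomings in how current models integrate efficient information acquisition with logical reasoning.

Innovations:

Proposes ZEBRAARENA, a procedurally generated diagnostic environment based on logic grid puzzles that allows reasoning-action coupling to be studied in isolation.
Designs a “missing clue” mechanism that forces the model to obtain external information through tool calls, so that tool-use efficiency can be measured precisely.
Introduces the theoretical optimal query count as an evaluation metric, which quantifies redundancy in a model's tool use and goes beyond accuracy-only evaluation.
Achieves a knowledge-minimal, difficulty-controllable design that avoids data contamination and reliance on memorization, so that results reflect the model's logical reasoning ability.

Method

!!! info

The paper builds a simulation environment based on constraint satisfaction problems (CSPs). It first generates a complete Zebra puzzle instance, then randomly hides some of the clues to form a partially observable initial state. The environment exposes a rule-based oracle interface that supports fact queries (verifying a specific attribute assignment) and relation queries (checking positional or logical relations). Models interact through frameworks such as ReAct, using the tools to gather information and progressively narrow the solution space until it is unique. Evaluation metrics include task accuracy, the number of tool calls (compared against the theoretical optimum), and token consumption.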
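
To make the oracle interface concrete, here is a minimal sketch of a rule-based oracle with one fact query and one relation query, plus the tool-call overhead metric. It is an illustration under stated assumptions, not ZebraArena's implementation: the class `PuzzleOracle`, its method names, the "immediately left of" relation, and `call_overhead` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PuzzleOracle:
    """Hypothetical oracle over a hidden Zebra-puzzle solution.

    solution[i] maps attribute name -> value for house i (left to right).
    """
    solution: list[dict[str, str]]
    calls: int = 0  # number of tool calls made so far

    def fact_query(self, house: int, attribute: str, value: str) -> bool:
        """Fact query: does house `house` have `attribute` == `value`?"""
        self.calls += 1
        return self.solution[house].get(attribute) == value

    def left_of_query(self, attr_a: str, val_a: str, attr_b: str, val_b: str) -> bool:
        """Relation query: is the (attr_a, val_a) house immediately left of (attr_b, val_b)?"""
        self.calls += 1
        return self._position(attr_a, val_a) + 1 == self._position(attr_b, val_b)

    def _position(self, attribute: str, value: str) -> int:
        for index, house in enumerate(self.solution):
            if house.get(attribute) == value:
                return index
        raise ValueError(f"no house with {attribute}={value}")


def call_overhead(actual_calls: int, optimal_calls: int) -> float:
    """Percentage of tool calls made beyond the theoretical optimum."""
    return 100.0 * (actual_calls - optimal_calls) / optimal_calls


oracle = PuzzleOracle([
    {"color": "red", "pet": "cat"},
    {"color": "blue", "pet": "dog"},
    {"color": "green", "pet": "fish"},
])
print(oracle.fact_query(0, "color", "red"))                 # True
print(oracle.left_of_query("color", "red", "pet", "dog"))   # True
print(oracle.calls, call_overhead(17, 10))                  # 2 70.0
```

Counting calls inside the oracle is what lets an evaluator compare a model's query count with the optimum. In this toy case, 17 calls against an optimum of 10 is a 70% overhead.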

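Key results:

GPT-5 reaches close to 99% accuracy on medium-difficulty puzzles but drops to about 60% on hard instances.
All tested models (including GPT-5) are markedly less tool-efficient than the theoretical optimum; GPT-5 uses 70-270% more tool calls than optimal.
Gemini-2.5-Flash consumes more than ten times as many tokens per puzzle as GPT-5 (about 20k vs. 1.2k).
Weaker models (such as Llama-3.3-70B) reach only 12-24% accuracy at medium difficulty, mainly because they gather too little information.

Tech stack: constraint satisfaction problems (CSP), procedural generation, the ReAct framework (reason-act loop), logic grid puzzles (Zebra puzzles), tool-calling API design (schema validation)

Strengths

Strong isolation: it separates reasoning ability from tool-use ability and avoids interference from environmental noise and complex dynamics.
Rich evaluation dimensions: beyond final accuracy, it also evaluates tool-call efficiency (query count) and cost (token count).
High interpretability: because tasks are logic puzzles, every reasoning step and action of the model can be traced and verified precisely.
Contamination resistance: procedural generation yields unlimited new instances, which addresses the data leakage problem of existing benchmarks.

Limitations

Simplified environment: compared with complex real-world settings (such as web navigation or physical interaction), ZEBRAARENA is idealized and may not fully reflect how models perform in practice.
Narrow domain: tasks are limited to logical reasoning, so the benchmark may not capture how models combine tools with other kinds of knowledge (such as common sense or physical laws).
Limited tool variety: the tools are mainly Boolean queries and relation checks, and lack the diversity and complexity of real-world APIs (such as text generation or image processing).

Relevance to the research direction:

The paper is highly relevant. It directly studies the technical principles of large language models (LLMs), in particular the mechanism that couples reasoning and action in tool-augmented LLMs. Although it is not applied to a specific scientific field such as biomedicine, it provides a general evaluation framework that is foundational for improving LLM performance in settings such as scientific discovery, which require complex reasoning and tool calls. Its novelty lies in a new evaluation paradigm, which fits the keyword “innovation in the technical principles of large models and deep learning”.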

4. PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Authors: Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18363v1

Score: 46.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

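Scoring rationale: The paper's core topic is unsupervised reinforcement learning fine-tuning of LLMs (RLIF). It proposes the PowerFlow framework, which uses distribution matching to optimize the dual capabilities of LLMs (logical reasoning and creativity). Highly relevant keywords: LLMs (the core research subject), Post-training/SFT (involves fine-tuning), and RLHF/RLAIF/DPO (falls under reinforcement learning fine-tuning). Moderately relevant keywords: Instruction Tuning/Alignment (involves aligning model capabilities), Chain of Thought/System 2 Thinking (involves logical reasoning), and Self-Correction/Self-Improvement (involves model self-optimization). Other keywords are not directly covered in the paper.

!!! tip deepseek-chat TL;DR

Heuristic intrinsic rewards in unsupervised reinforcement learning fine-tuning of large language models lack a well-defined theoretical optimization target. To address this, the paper proposes PowerFlow, a distribution-matching framework that uses α-power distributions to directionally elicit either logical reasoning or creativity in LLMs. Experiments show that it outperforms existing RLIF methods and matches or exceeds supervised GRPO.

Abstract (Chinese translation)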

无监督内部反馈强化学习已成为一种无需外部监督即可激发大型语言模型潜在能力的前景广阔的研究范式。然而，现有方法依赖于启发式内在奖励，这些奖励通常缺乏明确的理论优化目标，且容易产生退化性偏差。本研究提出PowerFlow，一个将无监督微调重新定义为分布匹配问题的原理性框架。通过将GFlowNet构建为非归一化密度的摊销变分采样器，我们提出了一种长度感知的轨迹平衡目标，该目标能显式地抵消自回归生成中固有的结构长度偏差。通过以$α$-幂分布为目标，PowerFlow能够定向激发大型语言模型的双重特性：通过锐化分布（$α> 1$）来强化逻辑推理能力，或通过平坦化分布（$α< 1$）来释放表达创造力。大量实验表明，PowerFlow在各项任务中持续优于现有RLIF方法，其表现达到甚至超越了有监督的GRPO。此外，通过缓解对齐模型中的过度锐化现象，我们的方法在多样性与质量上实现了同步提升，从而在创造性任务中推动了帕累托边界的演进。

Abstract

Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $α$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($α> 1$) to intensify logical reasoning, or flattening it ($α< 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.

Keywords: Large Language Models, Unsupervised Reinforcement Learning, Distribution Matching, GFlowNet, Trajectory-Balance, Logical Reasoning, Expressive Creativity, Fine-tuning

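In-depth analysis:

PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Summary:

Existing unsupervised reinforcement learning from internal feedback (RLIF) relies on heuristic rewards, which leads to ill-defined optimization targets and length bias. To address these problems, the paper proposes the PowerFlow framework, which reformulates unsupervised fine-tuning as a distribution matching problem whose target is the $\alpha$-power distribution of the base model. By introducing a length-aware Trajectory-Balance (LA-TB) objective, PowerFlow neutralizes the structural length bias of autoregressive generation. Experiments show that with $\alpha>1$ the method sharpens the distribution to strengthen logical reasoning and outperforms existing RLIF methods, while with $\alpha<1$ it flattens the distribution to release creativity suppressed by alignment, improving both quality and diversity on creative tasks.

Innovations:

Proposes the PowerFlow framework, which reformulates unsupervised fine-tuning as a principled distribution matching problem targeting the $\alpha$-power distribution, avoiding the biases of heuristic rewards.
Introduces the length-aware Trajectory-Balance (LA-TB) objective, which reparameterizes the partition function and optimizes on a length-normalized energy surface, addressing the exponential length bias of autoregressive generation.
Reveals and exploits the dual nature of LLMs: adjusting the parameter $\alpha$ can either sharpen the distribution to strengthen logical reasoning ($\alpha>1$) or flatten it to release creativity ($\alpha<1$).
Shows theoretically that existing RLIF methods such as majority voting are in fact implicit mechanisms for extreme distribution sharpening, offering a new perspective on RLIF.

Method

!!! info

The paper uses generative flow networks (GFlowNets) as the core algorithm, treating them as amortized variational samplers for unnormalized densities. To counter the exponential decay of trajectory probability caused by the autoregressive structure, the authors derive the length-aware Trajectory-Balance (LA-TB) objective, which reparameterizes the partition function as an amortized token-level energy term so that optimization takes place on a length-normalized energy surface. The parameter $\alpha$ controls how much the target distribution is sharpened or flattened, and training targets reasoning enhancement or creativity release accordingly.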
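
The sharpening and flattening effect of $\alpha$ can be seen on a toy distribution. The sketch below computes the $\alpha$-power (escort) distribution $p_i^{\alpha} / \sum_j p_j^{\alpha}$ over a small discrete set. It is only an illustration of the target distribution: PowerFlow works with distributions over whole generated sequences and learns to sample from them, which this example does not attempt, and the function names are hypothetical.

```python
import math


def alpha_power(probs: list[float], alpha: float) -> list[float]:
    """Return the alpha-power (escort) distribution p_i^alpha / sum_j p_j^alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [x / total for x in powered]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; lower entropy means a sharper distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


base = [0.5, 0.3, 0.2]
sharp = alpha_power(base, 2.0)   # alpha > 1 concentrates mass on the mode
flat = alpha_power(base, 0.5)    # alpha < 1 spreads mass toward uniform

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(entropy(sharp) < entropy(base) < entropy(flat))  # True
```

As $\alpha$ grows the distribution approaches a point mass on the most likely outcome, which is consistent with the analysis's reading of majority voting as extreme sharpening. As $\alpha$ approaches 0 it approaches uniform.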

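Key results:

PowerFlow ($\alpha>1$) consistently outperforms existing RLIF methods on reasoning tasks and matches or even exceeds supervised GRPO.
PowerFlow ($\alpha<1$) restores suppressed creativity in instruction-tuned models, improving both the diversity and the quality of outputs on creative writing tasks, a Pareto improvement.
The length-aware objective prevents pathological behaviors such as length collapse (at $\alpha>1$) and repetition explosion (at $\alpha<1$), giving stable optimization.

Tech stack: generative flow networks (GFlowNets), $\alpha$-power distribution (escort distribution), length-aware Trajectory-Balance (LA-TB) objective, unsupervised reinforcement learning from internal feedback (RLIF), variational inference

Strengths

Solid theoretical grounding: it moves away from heuristic reward design and optimizes toward a clear statistical-mechanics principle (the $\alpha$-power distribution).
Addresses a key bias: it offers an effective remedy (LA-TB) for the length bias inherent in autoregressive generation, improving training stability.
Flexible and unified: a single parameter $\alpha$ steers the model in two opposite directions, toward reasoning or toward creativity, exposing the model's dual nature.
Strong empirical results: it performs well on both reasoning and creative tasks, with a notable advance in restoring the creativity of aligned models.

Limitations

Computational overhead: although inference cost is amortized, GFlowNet-based training may be more complex and more compute-intensive than standard supervised fine-tuning or simple RLIF.
Parameter sensitivity: the choice of $\alpha$ is critical and may need careful tuning per task or model; there is no adaptive selection mechanism.
Scope: the method mainly targets unsupervised settings; for tasks with a clear external reward signal, its advantage over purpose-built RLHF or RLVR methods may be less pronounced.

Relevance to the research direction:

The paper is highly relevant. It falls under “innovation in the technical principles of large models and deep learning”. It examines LLM training mechanisms (RLIF, distribution matching) in depth and proposes a new algorithmic framework (PowerFlow) and objective (LA-TB), an important innovation in the underlying training principles of large models. It also addresses how to elicit latent model capabilities, which bears on the potential of large models in scientific reasoning (logic enhancement) and creative generation (creativity release).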

5. Learning to Self-Evolve

Authors: Xiaoyin Chen, Canwen Xu, Yite Wang, Boyi Liu, Zhewei Yao, Yuxiong He Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18620v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

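Scoring rationale: The paper's core contribution is a reinforcement learning framework (LSE) that trains large language models (LLMs) to self-evolve at test time, iteratively refining their context to perform better on new problems. This makes it highly relevant to “Large Language Models” (the core research subject). The method is based on reinforcement learning and reduces the multi-step evolution problem to a single-step RL objective, so it is highly relevant to keywords such as “RLHF” (the core method). The paper involves model self-improvement and reflection, so it is highly relevant to keywords such as “Self-Correction”. Because the model iteratively refines its context to improve performance, which involves multi-step reasoning and deliberate thinking, there is some connection to “Chain of Thought” and “System 2 Thinking” (5 points each). Because performance is improved through context optimization at test time, there is some connection to “In-context Learning” (5 points). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models to self-evolve at test time by iteratively refining their context, outperforming existing self-evolution strategies and prompt optimization methods on Text-to-SQL generation and general question answering.

Abstract (Chinese translation)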

我们提出“学习式自我进化”框架，这是一种强化学习框架，用于训练大语言模型在测试阶段优化其自身上下文。我们将该框架置于测试时自我进化的情境中，使模型能够基于已见问题的反馈迭代优化上下文，从而在新问题上表现更佳。现有方法完全依赖模型固有的推理能力，从未针对此任务进行显式训练。本框架将多步进化问题简化为单步强化学习目标，其中每次上下文编辑的奖励由下游性能提升程度决定。我们将此目标与树状引导进化循环相结合。在文本到SQL生成和通用问答任务上，采用本框架训练的40亿参数模型超越了基于GPT-5与Claude Sonnet 4.5的自进化策略，以及包括GEPA和TextGrad在内的提示优化方法，并且无需额外训练即可迁移指导其他模型。我们的研究结果证明了将自我进化作为可学习技能的有效性。

Abstract

We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

Keywords: Self-Evolution, Reinforcement Learning, Large Language Models, Test-time Adaptation, Context Optimization, Multi-step Reasoning, Performance Improvement, Transfer Learning

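In-depth analysis:

Learning to Self-Evolve

Summary:

The paper proposes the Learning to Self-Evolve (LSE) framework to address the inability of large language models to adapt dynamically to environmental feedback after deployment. Existing test-time self-evolution methods rely on the model's inherent reasoning ability and lack targeted training. LSE reduces multi-step evolution to a single-step reinforcement learning objective and trains the model to refine its context using an improvement-based reward. Combined with a tree-guided evolution loop, the model can explore different context paths. Experiments show that on Text-to-SQL and general question answering, an LSE model with only 4B parameters outperforms evolution strategies driven by GPT-5 and Claude Sonnet 4.5 as well as existing prompt optimization methods, and can transfer to guide other models.

Innovations:

Proposes the LSE framework, which for the first time explicitly trains a large language model as a self-evolution policy rather than relying on its inherent reasoning ability.
Reduces the complex multi-step evolution process to a single-step reinforcement learning objective, optimized with a reward signal based on performance improvement.
Introduces a tree-guided evolution loop that uses the UCB algorithm to select contexts, avoiding suboptimal paths and strengthening exploration.
Shows that a small trained model (4B parameters) can surpass untrained frontier models (such as GPT-5) at self-evolution.

Method

!!! info

The paper first formally defines the test-time, cross-episode self-evolution problem, focusing on prompt-based updates. It adopts a tree-guided search strategy that maintains an evolution tree and uses the Upper Confidence Bound (UCB) algorithm to select nodes for expansion. During training, reinforcement learning is used to train a policy model that takes the current context and a performance summary as input and outputs an improved context. The reward is defined as the difference in held-out-set performance after versus before the edit, and model parameters are updated with a policy gradient method.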
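
Two pieces of this description, UCB node selection and the improvement-based reward, are simple enough to sketch. The code below uses the standard UCB1 score (mean reward plus an exploration bonus) over a flat list of candidate contexts. The `Node` class, the exploration constant `c`, and the flat list standing in for the evolution tree are assumptions for illustration, not details taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical evolution-tree node holding one candidate context."""
    context: str
    visits: int = 0
    total_reward: float = 0.0


def ucb_score(node: Node, parent_visits: int, c: float = 1.4) -> float:
    """UCB1: mean reward + c * sqrt(ln(parent visits) / node visits)."""
    if node.visits == 0:
        return math.inf  # always try unvisited nodes first
    mean = node.total_reward / node.visits
    return mean + c * math.sqrt(math.log(parent_visits) / node.visits)


def select_node(nodes: list[Node], c: float = 1.4) -> Node:
    """Pick the node with the highest UCB score for expansion."""
    parent_visits = max(sum(n.visits for n in nodes), 1)
    return max(nodes, key=lambda n: ucb_score(n, parent_visits, c))


def edit_reward(score_before: float, score_after: float) -> float:
    """Reward for a context edit: held-out performance after minus before."""
    return score_after - score_before


well_explored = Node("context A", visits=10, total_reward=6.0)   # mean 0.6
under_explored = Node("context B", visits=2, total_reward=1.0)   # mean 0.5
chosen = select_node([well_explored, under_explored])
print(chosen.context)
print(edit_reward(score_before=0.42, score_after=0.55))
```

The exploration bonus is why the less-visited context B is chosen here even though its mean reward is lower. With `c=0` the selection becomes greedy and picks context A.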

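Key results:

On Text-to-SQL generation (the BIRD dataset) and general question answering (MMLU-Redux), the 4B-parameter model trained with LSE performs strongly.
The LSE policy outperforms self-evolution strategies driven by GPT-5 and Claude Sonnet 4.5.
LSE outperforms existing prompt optimization methods (such as GEPA and TextGrad).
The policy trained with LSE transfers: it can guide other models without additional training.

Tech stack: Reinforcement Learning (RL), Policy Gradient, Upper Confidence Bound (UCB) Algorithm, Tree Search, Large Language Models (LLMs), Text-to-SQL (BIRD), MMLU-Redux

Strengths

Explicit optimization: it moves beyond reliance on the model's inherent ability by training self-evolution explicitly with RL.
Efficiency: a small trained model can surpass large models, which lowers deployment cost.
Robustness: the tree-guided mechanism avoids the greedy trap of linear evolution and gives more robust context exploration.
Generality: the method does not depend on a specific task and shows generalization and transfer across tasks.

Limitations

Computational overhead: tree-guided search and held-out-set evaluation add test-time compute cost.
Evaluation dependence: it needs a verifiable reward signal (such as SQL execution results) and may be hard to apply to open-ended tasks.
Context length limits: as the evolution tree grows, storing historical contexts may run into the model's context window limit.
Single-step simplification: this simplifies training but may ignore complex dependencies along long evolution trajectories.

Relevance to the research direction:

The paper falls under innovation in the technical principles of large models and is highly relevant to the research keywords. It proposes a new training paradigm (LSE) that uses reinforcement learning to improve test-time adaptation of large models, touching core techniques in deep learning (RL) and LLMs. Although it does not directly address a specific scientific application such as biomedicine, its “self-evolution” mechanism is general and could later be applied to settings such as scientific discovery that require continual iteration. It fits the keyword “innovation in the technical principles of large models and deep learning” and is fairly novel.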

6. TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19039v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

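Scoring rationale: The paper proposes TerraScope, a vision-language model (VLM) for Earth observation that focuses on pixel-level geospatial reasoning. Relevance to the keywords: 1) highly relevant to “Chain of Thought” (10 points), because the paper centers on pixel-level reasoning chains and builds the Terra-CoT dataset; 2) highly relevant to “AI for Science” (10 points), as an AI application in Earth science; 3) partially relevant to “Large Language Models”, “Pre-training”, “Supervised Fine-tuning”, “System 2 Thinking”, and “Explainable AI” (5 points each), because the VLM builds on large-model techniques, involves training and reasoning, and provides interpretability. Other keywords such as MoE, SLMs, RAG, and RLHF have no direct relation to the paper and score 0.

!!! tip deepseek-chat TL;DR

Existing vision-language models struggle with precise pixel-level spatial reasoning in Earth observation. To address this, the paper proposes TerraScope, a model with modality-flexible and multi-temporal reasoning capabilities that significantly outperforms existing models on pixel-level geospatial reasoning tasks and provides interpretable visual evidence.

Abstract (Chinese translation)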

视觉语言模型（VLMs）在地球观测（EO）领域展现出潜力，但在需要将复杂空间推理与精确像素级视觉表征相锚定的任务中仍面临挑战。为解决这一问题，我们提出了TerraScope——一个能够实现像素锚定地理空间推理的统一视觉语言模型，其具备两项核心能力：（1）模态灵活推理：可处理单模态输入（光学或合成孔径雷达SAR），并在多模态可用时自适应融合不同模态至推理流程；（2）多时序推理：能整合时间序列数据以进行多时相变化分析。此外，我们构建了Terra-CoT数据集，该大规模数据集包含100万个样本，其推理链中嵌入了来自多源数据的像素级掩码。我们还提出了首个像素锚定地理空间推理基准TerraScope-Bench，包含六个子任务，通过同时评估答案准确性与掩码质量来确保真实的像素锚定推理。实验表明，TerraScope在像素锚定地理空间推理任务上显著优于现有视觉语言模型，同时提供可解释的视觉证据。

Abstract

Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

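Keywords: Vision-language models, Earth observation, Pixel-grounded reasoning, Geospatial reasoning, Multi-temporal reasoning, Chain of Thought, Interpretable AI, Remote sensing

In-depth analysis:

TerraScope: Pixel-Level Visual Reasoning for Earth Observation

Summary:

The paper targets the difficulty that existing vision-language models (VLMs) have with precise pixel-level spatial reasoning in Earth observation (EO) and proposes the TerraScope framework. A mixed decoder jointly generates segmentation masks and reasoning chains, enabling pixel-level visual reasoning that "thinks with pixels". TerraScope supports modality-flexible reasoning (adaptive fusion of optical and SAR inputs) and multi-temporal reasoning (change analysis). The authors also build Terra-CoT, an instruction-tuning dataset of 1 million samples, and TerraScope-Bench, the first benchmark for pixel-level geospatial reasoning. Experiments show that TerraScope significantly outperforms existing models on pixel-level geospatial reasoning tasks and provides interpretable visual evidence.

Key innovations:

Proposes TerraScope, a unified framework for pixel-level visual reasoning that adaptively fuses optical and SAR modalities and handles multi-temporal data.
Introduces a "thinking with pixels" reasoning paradigm: a mixed decoder generates segmentation masks during reasoning and injects the masked visual features into the language model.
Builds the Terra-CoT dataset of 1 million reasoning-chain samples with embedded pixel-level masks, supporting large-scale training for pixel-level reasoning.
Proposes the TerraScope-Bench benchmark of 3,837 expert-verified samples, evaluated on both answer accuracy and mask quality.

Method

!!! info

The paper builds TerraScope on a vision-language architecture comprising a vision encoder, a text encoder, a projector, a mask decoder, and a large language model. Training has two stages: pre-training on 2 million referring-expression segmentation pairs, then fine-tuning on 1 million pixel-level CoT instruction samples. At inference time the model alternates between generating textual reasoning steps and segmentation masks, uses each mask to select visual features, and injects them into the LLM to achieve pixel-level grounding. For multi-modal input, a text-guided cross-attention mechanism adaptively selects optical or SAR features.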
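
The mask-guided feature selection step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the flattened patch layout, the mean pooling of selected features, and all names (`select_masked_features`, `mean_pool`, `coverage`) are assumptions made for illustration. The `coverage` helper shows how a mask also yields the kind of coverage ratio mentioned in the results.

```python
def select_masked_features(patch_features, mask):
    """Keep only the feature vectors whose patch lies inside the mask."""
    return [f for f, m in zip(patch_features, mask) if m]


def mean_pool(vectors):
    """Average equal-length vectors into one vector (assumed pooling choice)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def coverage(mask):
    """Fraction of patches (or pixels) covered by the mask."""
    return sum(mask) / len(mask)


# Hypothetical 4-patch image with 2-dimensional patch features.
patch_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
mask = [True, False, True, False]  # flattened predicted segmentation mask

selected = select_masked_features(patch_features, mask)
grounded_token = mean_pool(selected)  # stand-in for what is injected into the LLM
print(selected, grounded_token, coverage(mask))
```

In a real model the selected features would more likely be passed as a sequence of tokens through the projector; pooling is used here only to keep the example short.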

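Key results:

TerraScope significantly outperforms existing models such as GPT-4o, Qwen3-VL, and EarthDial on TerraScope-Bench.
On tasks that require pixel-level precision, such as computing land-cover coverage ratios, TerraScope achieves very high accuracy.
The model generates high-quality explanatory segmentation masks, confirming its pixel-level grounding ability.
The experiments expose the limitations of current mainstream VLMs on fine-grained spatial reasoning tasks.

Tech stack: Vision-Language Models (VLMs), Large Language Models (LLMs), Semantic Segmentation, Chain-of-Thought (CoT) Reasoning, Instruction Tuning, Multi-modal Fusion (Optical & SAR), Mixed Decoders

Strengths

Embeds pixel-level segmentation masks in the visual reasoning chain, addressing fine-grained spatial analysis in EO.
A unified framework supports multi-modal (optical/SAR) and multi-temporal reasoning, making it adaptable.
Provides interpretable visual evidence (masks), which increases trust in the model.
Contributes a large-scale dataset and a dedicated benchmark that advance the field.
Reasons end to end without external tools, reducing system complexity.

Limitations

Generating pixel-level masks adds computational overhead and may be slower than text-only reasoning.
The training data comes from an automated pipeline and may contain noise or annotation bias.
For very large, ultra-high-resolution images, processing may be limited by GPU memory and compute.

Relevance to research direction:

The paper is highly relevant. It applies large models (VLMs) to a scientific domain (Earth observation) and innovates on deep learning techniques (pixel-level reasoning, mixed decoder). It addresses the shortcomings of traditional methods in handling continuous spatial distributions and multi-source data, with clear technical novelty and practical value.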

7. D-Mem: A Dual-Process Memory System for LLM Agents

Authors: Zhixing You, Jiachen Yuan, Jason Cai Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18631v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

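Scoring rationale: D-Mem designs a dual-process memory system for LLM agents to improve long-horizon reasoning. Core keywords: (1) "Large Language Models" (10 points): the paper uses GPT-4o-mini and Qwen3-235B-Instruct as its experimental models. (2) "LLM Agents" (10 points): the paper directly studies memory systems for autonomous agents, which is its central topic. (3) "Retrieval-Augmented Generation" (10 points): the paper critiques and improves existing retrieval-based memory frameworks, which form its core technical background. (4) "Chain of Thought" and "System 2 Thinking" (5 points each): the paper involves long-horizon reasoning and fine-grained contextual understanding, which relate to multi-step and deliberate reasoning but are not implemented directly. Other keywords such as MoE, quantization, and alignment have no direct connection to the paper and score 0.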
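
The 40.0 total and the 26.6 pass mark can be checked by hand. The sketch below assumes that a paper's total is the sum of weight × relevance over all keywords, and it uses the pass-mark formula from the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The function names are illustrative only.

```python
# Non-zero rows of the D-Mem score table: (keyword group, weight, relevance).
# Rows with relevance 0.0 contribute nothing to the sum and are omitted.
rows = [
    ("Large Language Models", 1.0, 10.0),
    ("Retrieval-Augmented Generation", 1.0, 10.0),
    ("Chain of Thought", 1.0, 5.0),
    ("System 2 Thinking", 1.0, 5.0),
    ("LLM Agents", 1.0, 10.0),
]


def total_score(rows):
    """Sum of weight * relevance over all keyword rows (assumed scoring rule)."""
    return sum(weight * relevance for _, weight, relevance in rows)


def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass mark from the report configuration: base + slope * total weight."""
    return base + slope * total_weight


score = total_score(rows)
threshold = pass_mark(27.0)
print(score, round(threshold, 1), score >= threshold)
```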

!!! tip deepseek-chat TL;DR

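Existing retrieval-based memory frameworks for LLM agents lose information and handle fine-grained context poorly in long-horizon reasoning. The paper proposes D-Mem, a dual-process memory system that switches dynamically between lightweight vector retrieval and a Full Deliberation module, achieving high accuracy at significantly lower computational cost on the LoCoMo and RealTalk benchmarks.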

Abstract

Driven by the development of persistent, self-adapting autonomous agents, equipping these systems with high-fidelity memory access for long-horizon reasoning has emerged as a critical requirement. However, prevalent retrieval-based memory frameworks often follow an incremental processing paradigm that continuously extracts and updates conversational memories into vector databases, relying on semantic retrieval when queried. While this approach is fast, it inherently relies on lossy abstraction, frequently missing contextually critical information and struggling to resolve queries that rely on fine-grained contextual understanding. To address this, we introduce D-Mem, a dual-process memory system. It retains lightweight vector retrieval for routine queries while establishing an exhaustive Full Deliberation module as a high-fidelity fallback. To achieve cognitive economy without sacrificing accuracy, D-Mem employs a Multi-dimensional Quality Gating policy to dynamically bridge these two processes. Experiments on the LoCoMo and RealTalk benchmarks using GPT-4o-mini and Qwen3-235B-Instruct demonstrate the efficacy of our approach. Notably, our Multi-dimensional Quality Gating policy achieves an F1 score of 53.5 on LoCoMo with GPT-4o-mini. This outperforms our static retrieval baseline, Mem0$^\ast$ (51.2), and recovers 96.7% of the Full Deliberation’s performance (55.3), while incurring significantly lower computational costs.

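Keywords: LLM Agents, Memory System, Dual-Process, Retrieval-Augmented Generation, Long-horizon Reasoning, Multi-dimensional Quality Gating, Full Deliberation, Autonomous Agents

In-depth analysis:

D-Mem: A Dual-Process Memory System for LLM Agents

Summary:

Existing retrieval-based memory frameworks rely on lossy abstraction and therefore struggle with complex reasoning. The paper proposes D-Mem, a dual-process memory system modeled on human metacognition that combines fast vector retrieval (System 1) with exhaustive Full Deliberation (System 2). A Multi-dimensional Quality Gating policy evaluates the relevance, faithfulness, and completeness of the initial retrieval result and triggers the costly Full Deliberation module only when necessary. On LoCoMo with GPT-4o-mini, D-Mem recovers 96.7% of Full Deliberation's performance at significantly lower computational cost, and the paper also reports results on RealTalk.

Key innovations:

Proposes the D-Mem dual-process memory architecture, which combines efficient vector retrieval (System 1) with high-fidelity Full Deliberation (System 2), mirroring human cognition.
Designs a Multi-dimensional Quality Gating policy that acts as a metacognitive checkpoint, balancing reasoning accuracy against computational efficiency and avoiding unnecessary computation.
Establishes a Full Deliberation baseline that processes the raw dialogue history with a query-guided chronological scan, mitigating the "lost in the middle" effect and preserving fine-grained context.
Improves the Mem0 architecture (Mem0*) with finer-grained context use and relevance filtering in the extraction and update stages, raising baseline retrieval quality.

Method

!!! info

The paper uses a dual-process architecture. The improved Mem0* first performs incremental memory processing and fast retrieval (System 1). Multi-dimensional Quality Gating then evaluates the quality of the retrieved result. If the quality falls short, the Full Deliberation module (System 2) is triggered: it scans the raw dialogue history chunk by chunk, extracts and scores facts, and produces a high-fidelity answer.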
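
The control flow described above can be sketched as follows. This is a hedged illustration, not the authors' code: the `GateScores` fields follow the three dimensions named in the summary, but the single shared threshold, the "all dimensions must pass" rule, and the stub functions are assumptions.

```python
from dataclasses import dataclass


@dataclass
class GateScores:
    relevance: float
    faithfulness: float
    completeness: float


def passes_gate(scores, threshold=0.7):
    """Hypothetical rule: every dimension must clear the same threshold."""
    return min(scores.relevance, scores.faithfulness, scores.completeness) >= threshold


def answer(query, fast_retrieve, judge, full_deliberation, threshold=0.7):
    """Try the cheap path first; fall back to Full Deliberation if the gate fails."""
    draft = fast_retrieve(query)                      # System 1
    if passes_gate(judge(query, draft), threshold):
        return draft, "system1"
    return full_deliberation(query), "system2"        # System 2


# Stubs standing in for vector retrieval, the LLM judge, and the full scan.
def fast_retrieve(query):
    return "Alice moved to Paris."


def judge(query, draft):
    # Pretend the draft is incomplete for questions about timing.
    if "when" in query.lower():
        return GateScores(0.9, 0.9, 0.3)
    return GateScores(0.9, 0.9, 0.9)


def full_deliberation(query):
    return "Alice moved to Paris in May."


print(answer("Where does Alice live?", fast_retrieve, judge, full_deliberation))
print(answer("When did Alice move?", fast_retrieve, judge, full_deliberation))
```

The reported recovery rate follows from the F1 scores in the abstract: 53.5 / 55.3 ≈ 0.967, that is, 96.7%.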

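Key results:

On LoCoMo, D-Mem's quality gating policy reaches an F1 score of 53.5, ahead of the static retrieval baseline Mem0* (51.2).
D-Mem recovers 96.7% of the performance of Full Deliberation (F1 = 55.3) at significantly lower computational cost (input tokens and inference time).
It shows consistent gains on the RealTalk benchmark, supporting the effectiveness of the approach.

Tech stack: GPT-4o-mini, Qwen3-235B-Instruct, Vector Database, Cosine Similarity, Chunk-level Fact Extraction, Multi-dimensional Quality Gating

Strengths

Addresses the lossy abstraction caused by incremental memory compression and improves reasoning on complex queries.
The gating mechanism balances accuracy against efficiency and avoids spending heavy compute indiscriminately.
The Full Deliberation module provides a high-fidelity upper bound and mitigates the "lost in the middle" effect in long contexts.
The evaluation is thorough and shows gains on multiple benchmarks.

Limitations

Although Full Deliberation is only a fallback, it can still incur high computational cost on extremely complex queries.
The quality gating policy itself relies on an LLM for evaluation, which may add overhead or introduce evaluation errors.
The architecture is relatively complex, with several cooperating modules, so engineering it is difficult.

Relevance to research direction:

The paper is an innovation in large-model technical principles, focused on memory system architecture for agents. The dual-process memory system and the Multi-dimensional Quality Gating policy are significant improvements over existing RAG and agent memory techniques. It does not directly address applications in specific scientific fields such as biomedicine, but its techniques for improving long-horizon reasoning and memory fidelity are a useful reference for building AI agents for science.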

8. EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Authors: Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18489v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	15.0/10	15.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

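Scoring rationale: The paper proposes EntropyCache, a training-free KV caching method for diffusion language models (dLLMs) that targets KV cache efficiency. Highly relevant keywords: (1) "KV Cache Compression" OR "Linear Attention" OR "FlashAttention" (15 points): the paper directly studies KV cache optimization, which is its core content. (2) "Large Language Models" OR "LLMs" OR "Foundation Models" (10 points): the paper optimizes diffusion-based large language models. (3) "Speculative Decoding" OR "Inference Acceleration" (10 points): the goal is to accelerate inference through KV caching. (4) "Chain of Thought" OR "CoT Reasoning" OR "Multi-step Reasoning" (5 points): the method is evaluated on CoT benchmarks. Other keywords such as MoE, quantization, and alignment are unrelated to the paper.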

!!! tip deepseek-chat TL;DR

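To address inefficient KV caching in diffusion language models (dLLMs), the paper proposes EntropyCache, a training-free KV caching method guided by decoded token entropy. It achieves a 15.2× to 26.4× inference speedup while keeping accuracy competitive.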

Abstract

Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.

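Keywords: KV caching, diffusion language models, inference acceleration, decoded token entropy, attention mechanisms, computational efficiency, large language models, chain-of-thought reasoning

In-depth analysis:

EntropyCache: Decoded-Token-Entropy-Guided KV Caching for Diffusion Language Models

Summary:

Diffusion language models use bidirectional attention, which rules out lossless KV caching and forces a full forward pass at every denoising step, making inference expensive. Existing approximate caching methods reduce the cost, but their decision overhead grows with context length or model depth. The paper proposes EntropyCache, a training-free KV caching method. It uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute and, based on the observation that feature volatility persists for several steps, recomputes the k most recently decoded tokens. On LLaDA-8B-Instruct and Dream-7B-Instruct, EntropyCache achieves 15.2× to 26.4× speedup with competitive accuracy, and decision overhead accounts for only 0.5% of inference time.

Key innovations:

Uses the maximum entropy of newly decoded tokens as a lightweight proxy for KV cache staleness, replacing expensive layer-by-layer comparison.
Identifies and exploits the fact that token features keep fluctuating for several steps after decoding, and recomputes the k most recently decoded tokens.
Designs a caching policy whose decision overhead is only O(V), independent of context length and model scale.
Requires no training and can be applied directly to existing diffusion language model inference pipelines.

Method

!!! info

The paper first analyzes empirically the correlation between decoded token entropy and KV cache drift (measured by cosine distance), validating entropy as a predictor. It then uses PCA on token trajectories to show that feature volatility persists for several steps. On this basis it designs a three-stage inference loop: forward pass (full or partial), decoding and entropy evaluation, and the skip decision with recent-token selection. When the maximum entropy exceeds a threshold, a full recomputation is triggered; otherwise only the currently masked tokens and the k most recently decoded tokens are recomputed.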
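
The skip-or-recompute rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the released implementation: the function names, the use of natural-log Shannon entropy, and the position bookkeeping (`decode_order`, `masked_positions`) are hypothetical. `tau` and `k` correspond to the entropy threshold and window size mentioned in the analysis.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one token distribution over the vocabulary."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def plan_step(new_token_dists, decode_order, masked_positions, tau, k):
    """Decide what to recompute at the next denoising step.

    Returns ("full", None) when the largest entropy among newly decoded tokens
    exceeds tau; otherwise ("partial", positions), where positions are the
    still-masked tokens plus the k most recently decoded ones.
    """
    max_h = max((entropy(d) for d in new_token_dists), default=0.0)
    if max_h > tau:
        return "full", None
    recent = decode_order[-k:] if k > 0 else []
    return "partial", sorted(set(masked_positions) | set(recent))


uncertain = [[0.25, 0.25, 0.25, 0.25]]   # entropy = ln 4, about 1.386
confident = [[0.97, 0.01, 0.01, 0.01]]   # entropy about 0.168

print(plan_step(uncertain, [0, 1, 2, 3], [5, 6], tau=1.0, k=2))
print(plan_step(confident, [0, 1, 2, 3], [5, 6], tau=1.0, k=2))
```

Each entropy evaluation touches every vocabulary entry once, which matches the O(V) per-step decision cost stated in the abstract.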

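Key results:

Achieves 15.2× to 26.4× speedup on standard benchmarks.
Achieves 22.4× to 24.1× speedup on chain-of-thought benchmarks.
Keeps accuracy competitive, with decision overhead accounting for only 0.5% of inference time.
Confirms a significant positive correlation between the maximum entropy of decoded tokens and KV cache drift (Spearman ρ = 0.644).

Tech stack: Diffusion language models, Bidirectional attention, KV caching, Entropy computation, Cosine similarity, Principal component analysis (PCA), LLaDA-8B-Instruct, Dream-7B-Instruct

Strengths

Substantially raises the inference throughput of diffusion language models, with order-of-magnitude speedups.
The decision mechanism is extremely lightweight and its overhead does not grow with context length or model depth.
Needs no additional training or fine-tuning, so it applies broadly.
Well motivated by solid empirical observations.

Limitations

As an approximate caching method, it still trades accuracy against speed.
The entropy threshold τ and the window size k are hyperparameters that need tuning.
The method is designed specifically for diffusion language models and does not apply to conventional autoregressive models.

Relevance to research direction:

The paper is highly relevant to "innovation in large-model and deep learning technical principles". It proposes a new solution to a core inference bottleneck (KV caching) in the emerging diffusion LLM architecture and analyzes the model's internal entropy and feature volatility in depth, making it a key advance in low-level system optimization for large models.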

9. Security awareness in LLM agents: the NDAI zone case

Authors: Enrico Bottazzi, Pia Park Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19011v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

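Scoring rationale: The paper studies how LLM agents behave when assessing the security of their environment, which directly involves "Large Language Models" and "LLM Agents" (10 points each). It involves multi-agent negotiation ("Multi-agent Systems") and interpretability analysis ("Mechanistic Interpretability") (5 points each). The paper notes that agents rely on the context window to form awareness of their environment, which gives some connection to "Context Window Extension" (5 points). Other keywords such as MoE, SFT, and RAG are not covered and score 0.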

!!! tip deepseek-chat TL;DR

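Using an NDAI negotiation task, the study finds that current LLM agents can reliably detect danger signals but cannot reliably verify safety. This asymmetry in security awareness is the core challenge for deploying privacy-preserving agentic protocols.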

Abstract

NDAI zones let inventor and investor agents negotiate inside a Trusted Execution Environment (TEE) where any disclosed information is deleted if no deal is reached. This makes full IP disclosure the rational strategy for the inventor’s agent. Leveraging this infrastructure, however, requires agents to distinguish a secure environment from an insecure one, a capability LLM agents lack natively, since they can rely only on evidence passed through the context window to form awareness of their execution environment. We ask: How do different LLM models weight various forms of evidence when forming awareness of the security of their execution environment? Using an NDAI-style negotiation task across 10 language models and various evidence scenarios, we find a clear asymmetry: a failing attestation universally suppresses disclosure across all models, whereas a passing attestation produces highly heterogeneous responses: some models increase disclosure, others are unaffected, and a few paradoxically reduce it. This reveals that current LLM models can reliably detect danger signals but cannot reliably verify safety, the very capability required for privacy-preserving agentic protocols such as NDAI zones. Bridging this gap, possibly through interpretability analysis, targeted fine-tuning, or improved evidence architectures, remains the central open challenge for deploying agents that calibrate information sharing to actual evidence quality.

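Keywords: LLM agents, security awareness, NDAI zones, Trusted Execution Environment, negotiation task, evidence weighting, privacy-preserving protocols, interpretability analysis

In-depth analysis:

Security Awareness in LLM Agents: The NDAI Zone Case Study

Summary:

The paper examines the security awareness of LLM agents in NDAI zones, a negotiation mechanism built on a Trusted Execution Environment (TEE). Because LLMs have no native awareness of their environment, the study tests experimentally how 10 models weigh two kinds of evidence: textual security claims and TEE attestations. Models respond consistently to a failing attestation (they suppress disclosure), but their responses to a passing attestation are highly heterogeneous (some increase disclosure, others reduce it). Current LLMs can therefore reliably detect danger signals but cannot reliably verify safety, which is the core challenge facing privacy-preserving agentic protocols.

Key innovations:

Introduces the notion of "security awareness" for LLM agents, defined as a weighting process over evidence in the context window (textual claims and hardware attestations) and not as a native capability.
Applies the NDAI zone concept to LLM agent negotiation, testing whether agents can use TEE security guarantees to resolve the hold-up problem in negotiation.
Introduces the Attestation Reliance Index (ARI), which quantifies how much a model relies on attestation evidence relative to textual claims, and uses it to classify models as attestation-driven, claim-driven, or attestation-averse.
Finds an asymmetry in how LLMs verify security: all models respond consistently to negative evidence (a failing attestation), but behave very unevenly on positive evidence (a passing attestation).

Method

!!! info

The study uses black-box behavioral measurement on a simulated inventor-investor negotiation task. It defines four evidence scenarios (no TEE, textual claim only, textual claim plus genuine attestation, textual claim plus fake attestation) and runs them on 10 different LLMs. A judge model scores the seller agent's replies (a disclosure score from 0 to 1). The mean disclosure score and the Attestation Reliance Index (ARI) are then computed to analyze how each model weights security evidence.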
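
The analysis reports mean disclosure scores with bootstrap confidence intervals. A minimal percentile-bootstrap sketch is shown below. The scores are invented for illustration and are not results from the paper, and the function name and parameters are assumptions. The ARI is deliberately left out, because this report does not give its formula.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical judge-assigned disclosure scores (0 to 1) for one model.
scores_by_scenario = {
    "no TEE": [0.2, 0.1, 0.3, 0.2, 0.25, 0.15, 0.2, 0.1],
    "claim + failing attestation": [0.05, 0.1, 0.0, 0.1, 0.05, 0.0, 0.1, 0.05],
    "claim + passing attestation": [0.6, 0.8, 0.4, 0.9, 0.7, 0.5, 0.65, 0.75],
}

for scenario, scores in scores_by_scenario.items():
    lo, hi = bootstrap_ci(scores)
    print(f"{scenario}: mean={mean(scores):.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Fixing the seed keeps the interval reproducible across runs, which matters when comparing models on small samples.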

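Key results:

Consistency on negative evidence: every tested LLM reduced information disclosure when it received a failing attestation.
Heterogeneity on positive evidence: given a passing attestation, responses varied widely. 7 of 10 models increased disclosure, while 3 of 10 (for example GPT-4o) reduced it.
Behavioral classes: by ARI, models fall into three groups: attestation-driven (act on the attestation alone), claim-driven (act on the text alone), and attestation-averse (the attestation suppresses disclosure).
Main conclusion: current LLMs can detect danger signals but lack the ability to verify safety reliably, which hinders their effective deployment in privacy-preserving protocols such as NDAI.

Tech stack: LLM Models: 10 models including Claude Sonnet 4.6, GPT-4o, Gemini 3.1/2.5, Grok-3, and Kimi K2.5, Environment Simulation: simulated Trusted Execution Environment (TEE) context, Tools: simulated TEE attestation tool calls, Evaluation Metrics: disclosure score, Attestation Reliance Index (ARI), Statistical Methods: bootstrap confidence intervals

Strengths

Precise problem definition: identifies a key blind spot for LLM agents running in secure environments, namely the lack of native environment awareness.
Rigorous experimental design: covers multiple evidence combinations and mainstream models, yielding statistically meaningful empirical data.
New metric: ARI quantifies a model's sensitivity to security evidence and makes cross-model comparison straightforward.
Practical significance: bears directly on the economic utility of AI agents inside TEEs and on the feasibility of deploying privacy-preserving technology.

Limitations

Simulated environment: the experiments use simulated attestations and context, not real TEE hardware, which may affect how agents perceive the "authenticity" of the evidence.
Black-box analysis: conclusions are inferred mainly from output behavior, without white-box interpretability analysis of the models' internal representations.
Narrow task scope: the study focuses on one invention-disclosure negotiation task, so whether the conclusions generalize to other agent behaviors remains to be tested.
Timing: the paper is dated 2026 and involves some future models (for example GPT-5.2), so its results cannot currently be reproduced or verified.

Relevance to research direction:

Score: 8. The paper is highly relevant to "innovation in large-model technical principles". It studies in depth how LLM agents process and weigh external evidence (reasoning and perception mechanisms), one of the core principles of agent technology. It also touches on TEEs and privacy-preserving computation, making it an application of large models to security. The paper exposes weaknesses of LLMs in logical reasoning and security verification, and it is both novel and technically deep.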

10. Bridging Network Fragmentation: A Semantic-Augmented DRL Framework for UAV-aided VANETs

Authors: Gaoxiang Cao, Wenke Yuan, Huasen He, Yunpeng Hou, Xiaofeng Jiang, Shuangwu Chen, Jian Yang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18871v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

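Scoring rationale: The paper's core is applying LLMs to network fragmentation in VANETs, so it is highly relevant to "Large Language Models" (10 points). A four-stage pipeline turns a general-purpose LLM into a domain expert, which involves domain adaptation and fine-tuning and gives some connection to "Pre-training/Domain Adaptation" and "Post-training/SFT" (5 points each). The LLM is used to identify topological importance, which reflects reasoning ability and relates to "Chain of Thought" and "System 2 Thinking" (5 points each). The LLM acts as a prior that guides the agent's policy, which relates to "LLM Agents" (5 points). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and Quantization are not covered and score 0.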

!!! tip deepseek-chat TL;DR

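The paper proposes a Semantic-Augmented Deep Reinforcement Learning (SA-DRL) framework that uses the reasoning ability of large language models (LLMs) to guide UAV deployment and address network fragmentation in Vehicular Ad-hoc Networks (VANETs). Experiments show that it reaches baseline performance with only 26.6% of the training episodes and improves two key connectivity metrics by 13.2% and 23.5%.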

Abstract

Vehicular Ad-hoc Networks (VANETs) are the digital cornerstone of autonomous driving, yet they suffer from severe network fragmentation in urban environments due to physical obstructions. Unmanned Aerial Vehicles (UAVs), with their high mobility, have emerged as a vital solution to bridge these connectivity gaps. However, traditional Deep Reinforcement Learning (DRL)-based UAV deployment strategies lack semantic understanding of road topology, often resulting in blind exploration and sample inefficiency. By contrast, Large Language Models (LLMs) possess powerful reasoning capabilities capable of identifying topological importance, though applying them to control tasks remains challenging. To address this, we propose the Semantic-Augmented DRL (SA-DRL) framework. Firstly, we propose a fragmentation quantification method based on Road Topology Graphs (RTG) and Dual Connected Graphs (DCG). Subsequently, we design a four-stage pipeline to transform a general-purpose LLM into a domain-specific topology expert. Finally, we propose the Semantic-Augmented PPO (SA-PPO) algorithm, which employs a Logit Fusion mechanism to inject the LLM’s semantic reasoning directly into the policy as a prior, effectively guiding the agent toward critical intersections. Extensive high-fidelity simulations demonstrate that SA-PPO achieves state-of-the-art performance with remarkable efficiency, reaching baseline performance levels using only 26.6% of the training episodes. Ultimately, SA-PPO improves two key connectivity metrics by 13.2% and 23.5% over competing methods, while reducing energy consumption to just 28.2% of the baseline.

Keywords: Large Language Models, Deep Reinforcement Learning, UAV-aided VANETs, Network Fragmentation, Semantic-Augmented PPO, Road Topology Graphs, Logit Fusion, Topological Reasoning
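
The abstract describes Logit Fusion only as injecting the LLM's semantic reasoning into the policy as a prior. One plausible reading, sketched below, adds weighted prior logits to the policy logits before the softmax. The additive form, the weight `beta`, and all names are assumptions for illustration and may differ from the paper's actual rule.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(policy_logits, prior_logits, beta=1.0):
    """Assumed additive fusion: policy logits plus beta times the LLM prior logits."""
    return [p + beta * q for p, q in zip(policy_logits, prior_logits)]


# Hypothetical example: four candidate intersections for the UAV.
policy_logits = [0.0, 0.0, 0.0, 0.0]   # untrained policy, no preference yet
prior_logits = [0.1, 0.2, 2.0, 0.1]    # LLM rates intersection 2 as most critical

fused = softmax(fuse_logits(policy_logits, prior_logits, beta=1.0))
print([round(p, 3) for p in fused])
```

With `beta=0` the prior is ignored and the policy distribution is unchanged. Larger values push early exploration toward the intersections the LLM rates as critical, which is consistent with the sample-efficiency gain the abstract reports.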

11. SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation

Authors: Oliver Cory, Ozge Mercanoglu Sincan, Richard Bowden Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19059v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

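Scoring rationale: The paper develops SignAgent, an LLM-based agentic framework for sign language annotation and dataset construction. Highly relevant keywords: (1) "Large Language Models" (the paper uses LLMs as its core component). (2) "LLM Agents" (the paper proposes an agentic framework, which is agent research). Moderately relevant keywords: (1) "Chain of Thought" (the agent reasons and makes decisions over multiple steps). (2) "System 2 Thinking" (the agent performs in-depth reasoning and analysis). (3) "Tool Use" (the agent coordinates linguistic tools). Other keywords such as MoE, quantization, and RAG are not covered and score 0.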

!!! tip deepseek-chat TL;DR

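The paper proposes SignAgent, an agentic framework that uses large language models (LLMs) for scalable, linguistically grounded sign language annotation and dataset construction. It addresses two problems: traditional methods ignore linguistic detail, and manual annotation is slow. The framework shows strong performance on the Pseudo-gloss Annotation and ID Glossing tasks.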

Abstract

This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. First, on Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. Second, on ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.

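Keywords: SignAgent, Agentic LLMs, Sign Language Annotation, Dataset Curation, Linguistically-grounded, Reasoning LLM, Multi-modal Evidence, Phonological Awareness

In-depth analysis:

SignAgent: Agentic LLMs for Linguistically Grounded Sign Language Annotation and Dataset Curation

Summary:

Large-scale, linguistically annotated data is scarce in sign language (SL) research, and manual annotation is expensive. The paper proposes SignAgent, an agentic framework that uses large language models (LLMs) for scalable, linguistically grounded sign language annotation and dataset curation. Its core components are SignAgent Orchestrator, a reasoning LLM responsible for multi-stage decisions and tool coordination, and SignGraph, a retrieval-augmented LLM that provides lexical and linguistic grounding. With a hierarchically organized tool set, SignAgent decomposes complex tasks and calls augmentation modules to reason over multi-modal evidence. Evaluation on two downstream tasks, Pseudo-gloss Annotation and ID Glossing (identifying lexical variants), shows that the agentic approach performs well on large-scale, linguistically aware annotation and curation and clearly outperforms fixed-pipeline methods.

Key innovations:

Applies agentic reasoning to sign language annotation and dataset curation for the first time, combining multi-modal evidence with knowledge-based retrieval.
Proposes SignAgent Orchestrator, a reasoning LLM with tool-calling ability that autonomously makes multi-stage decisions and performs linguistic reasoning.
Develops the SignGraph module, which uses lexical and linguistic knowledge graphs to give the LLM deep linguistic grounding.
Designs a hierarchical tool-set architecture with base tools (extracting phonological and syntactic features) and augmented tools (fusing multi-modal evidence), closing the loop from raw data to linguistic decisions.

Method

!!! info

The paper takes an agent-based framework approach. SignAgent Orchestrator serves as the central controller and runs a ReAct-style reasoning loop, iteratively refining its state by generating reasoning traces, calling tools, or querying the knowledge graphs. SignGraph performs retrieval-augmented generation (RAG) over a lexical knowledge graph and a linguistic knowledge graph to provide linguistic grounding. Base tools (such as handshape, movement, and location classifiers) extract low-level features, augmented tools fuse multi-modal cues into structured evidence, and the Orchestrator then reasons over this evidence to carry out tasks such as pseudo-gloss alignment and ID glossing.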
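
A ReAct-style loop of the kind described above alternates a reasoning step with a tool call until the agent decides to finish. The sketch below shows only that control flow. A scripted policy stands in for the Orchestrator LLM, and the tool names, return values, and gloss labels are hypothetical and not taken from the paper.

```python
def run_agent(task, policy, tools, max_steps=5):
    """Alternate thought -> action -> observation until the policy finishes."""
    state = {"task": task, "trace": [], "evidence": []}
    for _ in range(max_steps):
        thought, action, arg = policy(state)
        state["trace"].append(thought)
        if action == "finish":
            return arg, state
        # Call the chosen tool and record its observation as evidence.
        state["evidence"].append((action, tools[action](arg)))
    return None, state  # step budget exhausted without a decision


# Stub tools standing in for a base classifier and a knowledge-graph lookup.
tools = {
    "handshape_classifier": lambda clip: "flat-hand",
    "query_signgraph": lambda handshape: ["BOOK", "DOOR"],
}


def scripted_policy(state):
    """Deterministic stand-in for the reasoning LLM."""
    evidence = dict(state["evidence"])
    if "handshape_classifier" not in evidence:
        return "Need a handshape first.", "handshape_classifier", state["task"]
    if "query_signgraph" not in evidence:
        return "Look up candidate glosses.", "query_signgraph", evidence["handshape_classifier"]
    return "Pick the top candidate.", "finish", evidence["query_signgraph"][0]


label, final_state = run_agent("clip_001", scripted_policy, tools)
print(label, final_state["trace"])
```

Bounding the loop with `max_steps` is a common safeguard in agent loops, because an LLM policy may otherwise keep calling tools without converging.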

关键结果: 在伪词汇注释任务中，SignAgent利用多模态证据实现了强劲的对齐性能；在ID词汇标注任务中，通过结合视觉相似性和音系重叠推理，显著提高了词汇变体识别的聚类质量。实验结果表明，该智能体方法在处理大规模、语言学感知的视频注释方面优于传统的固定管道方法，证明了其在解决手语数据标注瓶颈方面的有效性。

技术栈: 大语言模型 (LLM), 多模态大模型, 检索增强生成 (RAG), GraphRAG (基于图的检索增强), ReAct 推理模式, k-近邻算法 (k-NN), 离散傅里叶变换 (DFT), 知识图谱

优点

论文的主要优点在于创新性地将智能体框架引入手语处理领域，有效解决了人工标注成本高的问题。通过结合语言学知识图谱，模型不仅关注视觉特征，还能进行深层的语言学推理，超越了传统的 gloss 级别处理。分层工具集的设计使得系统具有高度的模块化和可解释性，能够灵活处理复杂的注释任务。

局限

该框架的性能可能依赖于基础工具（如手形、运动分类器）的准确性，这些底层模块的误差可能会向上传播。此外，基于LLM的智能体推理通常计算成本较高，推理速度可能较慢，限制了实时应用的潜力。虽然框架通用，但目前的评估主要集中在特定任务上，在更广泛的手语方言或复杂非手动特征（如面部表情）上的泛化能力有待进一步验证。

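Relevance to the research direction:

The paper is highly relevant to the keywords. It is a concrete application of large models to a scientific domain (sign language linguistics) and shows how an agentic architecture lets LLMs address a domain-specific pain point. It also innovates technically, combining frontier large-model techniques such as RAG, knowledge graphs, and tool calling, and reflects a close integration of deep learning and linguistics, giving it high novelty and application value.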

12. From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-

Authors: Zhuofan Li, Hongkun Yang, Zhenyang Chen, Yangxuan Chen, Yingyan Lin, Chaojian Li Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19131v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
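
As a worked check of the table above, the sketch below recomputes the total. It assumes each row contributes weight × relevance and that the total is the sum of the rows (this matches the table but is inferred, not documented), and it takes the pass threshold from the formula in the configuration section (5.0 + 0.8 × total weight). The function names are illustrative, not the report generator's actual code.

```python
def total_score(rows):
    """rows: (weight, relevance) pairs, with relevance on the 0-10 scale."""
    return sum(weight * relevance for weight, relevance in rows)


def passing_score(total_weight, base=5.0, factor=0.8):
    """Pass threshold as stated in the report's configuration section."""
    return base + factor * total_weight


# The five non-zero rows of the table above; the other 22 keywords score 0.
nonzero_rows = [(1.0, 5.0), (1.0, 5.0), (1.0, 8.0), (1.0, 8.0), (1.0, 5.0)]
zero_rows = [(1.0, 0.0)] * 22

score = total_score(nonzero_rows + zero_rows)   # 5 + 5 + 8 + 8 + 5
threshold = passing_score(total_weight=27.0)    # 5.0 + 0.8 * 27
passed = score >= threshold
```

With 27 keywords of weight 1.0 the threshold is 26.6, so this paper's total of 31.0 passes.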

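Scoring rationale: The paper studies the practical efficiency of Vision-Language-Action (VLA) models on robotic platforms, an application of large models to a specific domain (robotics / embodied AI). Relevance to the keywords: 1) “Large Language Models” (5 points): VLA models are multimodal large models and so fall within the large-model family; 2) “Supervised Fine-tuning” (5 points): the abstract mentions supervised fine-tuning as one of the common adaptation methods evaluated; 3) “LLM Agents” (8 points): VLA models drive embodied agents, a concrete instance of LLM agents; 4) “Quantization” (8 points): the paper explicitly studies model compression (which includes techniques such as quantization); 5) “In-context Learning” (5 points): the abstract mentions in-context prompting as one of the adaptation methods evaluated. Other keywords such as MoE, Scaling Laws, and RLHF are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that conventional inference-efficiency metrics such as parameter count and FLOPs do not accurately reflect how Vision-Language-Action models actually perform on robotic platforms, and it proposes and validates system-level embodied efficiency metrics (such as task completion time and trajectory smoothness) that evaluate real-world behavior more completely.

Abstract (Chinese translation)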

视觉-语言-动作（Vision-Language-Action, VLA）模型近年来通过联合推理视觉、语言与运动模态，使具身智能体能够执行日益复杂的任务。然而，我们发现当前VLA研究中普遍采用的“效率”概念——通常以参数量、浮点运算量或解码吞吐量来衡量——并不能反映其在机器人平台上的实际性能。在现实世界执行中，效率由系统层面的具身行为决定，例如任务完成时间、轨迹平滑度、累积关节旋转量和运动能耗。通过对模型压缩、令牌稀疏化与动作序列压缩的对照研究，我们得出若干挑战常见假设的观察：（1）在传统指标下减少计算量的方法，尽管能维持任务成功率，却常常增加端到端执行成本或降低运动质量。（2）系统层面的具身效率指标揭示了学习到的动作策略中隐藏的性能差异，这些差异在传统评估中无法显现。（3）常见的适应方法（如上下文提示或有监督微调）对具身效率的提升有限且仅针对特定指标。虽然这些方法能够降低目标具身效率指标（如急动度或动作频率），但由此获得的收益可能以其他指标（如更长的完成时间）为代价。综合而言，我们的研究表明传统推理效率指标可能忽略具身执行的重要方面。引入具身效率评估能够更完整地反映策略行为与实际性能，从而为VLA模型提供更公平、更全面的比较基准。

Abstract

Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, despite maintaining task success rates. (2) System-level embodied efficiency metrics reveal performance differences in the learned action policies that remain hidden under conventional evaluations. (3) Common adaptation methods such as in-context prompting or supervised fine-tuning show only mild and metric-specific improvements in embodied efficiency. While these methods can reduce targeted embodied-efficiency metrics such as jerk or action rate, the resulting gains may come with trade-offs in other metrics, such as longer completion time. Taken together, our results suggest that conventional inference efficiency metrics can overlook important aspects of embodied execution. Incorporating embodied efficiency provides a more complete view of policy behavior and practical performance, enabling fairer and more comprehensive comparisons of VLA models.

Keywords: Vision-Language-Action Models, Embodied Agents, Efficiency Metrics, Model Compression, In-context Prompting, Supervised Fine-tuning, Robotic Platforms, System-level Performance
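
To make the embodied metrics named in the abstract concrete, here is a minimal sketch of two of them computed from a joint-angle trajectory. The definitions are assumptions chosen for illustration (sum of absolute joint-angle changes, and mean squared third finite difference as a jerk proxy); the paper's exact formulas are not given in the abstract, and the sample trajectories are made up.

```python
def cumulative_joint_rotation(traj):
    """Sum of absolute joint-angle changes over all joints and steps (radians).

    traj: list of timesteps, each a list of joint angles.
    """
    return sum(
        abs(curr_angle - prev_angle)
        for prev, curr in zip(traj, traj[1:])
        for prev_angle, curr_angle in zip(prev, curr)
    )


def mean_squared_jerk(traj, dt):
    """Mean squared jerk, using a third-order finite difference per joint."""
    if len(traj) < 4:
        raise ValueError("need at least 4 timesteps for a third difference")
    jerks = []
    for q0, q1, q2, q3 in zip(traj, traj[1:], traj[2:], traj[3:]):
        for a, b, c, d in zip(q0, q1, q2, q3):
            jerks.append(((d - 3 * c + 3 * b - a) / dt**3) ** 2)
    return sum(jerks) / len(jerks)


# Two hypothetical single-joint trajectories with the same start, end, and length.
smooth = [[0.0], [0.1], [0.2], [0.3], [0.4]]
jerky = [[0.0], [0.2], [0.0], [0.2], [0.4]]

rotation_smooth = cumulative_joint_rotation(smooth)
rotation_jerky = cumulative_joint_rotation(jerky)
jerk_smooth = mean_squared_jerk(smooth, dt=0.1)
jerk_jerky = mean_squared_jerk(jerky, dt=0.1)
```

Both trajectories reach the same pose in the same number of steps, so a success-rate or step-count comparison would not separate them, while the rotation and jerk values do.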

13. Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

Authors: Gregory N. Frank Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18280v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

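Scoring rationale: The paper's core subject is the alignment mechanism of large language models, specifically a detect-route-generate framework in the setting of political censorship, so it is highly relevant to "Large Language Models" and "Alignment" (10 points each). It touches on factual output and confabulation, giving some relevance to "Hallucination Mitigation" (5 points). It studies internal model mechanisms with probes and ablations, giving some relevance to "Mechanistic Interpretability" (5 points). Other keywords such as MoE, SFT, and RAG are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that current refusal-based alignment evaluation is flawed. By studying political censorship in Chinese-origin language models, it proposes a three-stage detect-route-generate framework and shows that alignment operates mainly through routing rather than through knowledge removal or simple refusal.

Abstract (Chinese translation)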

当前的对齐评估主要衡量模型是否编码危险概念以及是否拒绝有害请求。这两者都忽略了对齐通常运作的层面：从概念检测到行为策略的路由机制。我们以中国源语言模型中的政治审查作为自然实验，通过对来自五个实验室的九个开放权重模型进行探针分析、手术式消融和行为测试，得出三项发现。首先，仅凭探针准确性无法作为诊断依据：政治探针、空值对照和置换基线均可达到100%准确率，因此留出类别泛化能力才是有效的测试标准。其次，手术式消融揭示了实验室特定的路由机制。移除政治敏感性方向在大多数测试模型中消除了审查并恢复了准确的事实输出，而有一个模型因架构将事实知识与审查机制纠缠而产生虚构输出。跨模型迁移失败，表明路由几何结构具有模型和实验室特异性。第三，拒绝不再是主导的审查机制。在同一个模型系列中，强硬拒绝率降至零，而叙事引导升至最高，使得仅依赖拒绝检测的基准测试无法察觉审查行为。这些结果支持一个三阶段描述框架：检测、路由、生成。模型通常保留相关知识；对齐改变的是这些知识的表达方式。因此，仅审计检测或拒绝的评估方法会遗漏最直接决定行为的路由机制。

Abstract

Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.

Keywords: alignment evaluation, political censorship, routing mechanism, language models, refusal-based evaluation, concept detection, behavioral policy, model interpretability
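
The abstract's "removing the political-sensitivity direction" can be pictured as projecting one direction out of a hidden-state vector. The sketch below shows that common form of directional ablation, h' = h - (h · d̂) d̂, on plain Python lists. It is an assumption that the paper's surgical ablation takes this form, and the vectors here are toy values rather than real activations.

```python
import math


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = sum(h * u for h, u in zip(hidden, unit))  # projection length
    return [h - coeff * u for h, u in zip(hidden, unit)]


hidden_state = [3.0, 4.0, 0.0]   # toy activation vector
sensitive_dir = [0.0, 2.0, 0.0]  # toy "concept" direction (unnormalized)

ablated = ablate_direction(hidden_state, sensitive_dir)
# The result has zero component along the ablated direction.
residual = sum(a * d for a, d in zip(ablated, sensitive_dir))
```

Because the projection depends on a direction estimated for one model, a direction found in one model need not be meaningful in another, which is consistent with the abstract's report that cross-model transfer fails.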

📋 All papers

1. ✅ MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

Authors: Chenyang Gu, Jiahao Cheng, Meicong Zhang, Pujun Zheng, Jinquan Zheng, Guoxiu He Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19044v1

Score: 70.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

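To address the weak reasoning of existing LLM-based agents on scientific ideation, the paper proposes the MoRI framework, which uses supervised fine-tuning and a reinforcement learning reward to explicitly learn the reasoning process from research motivation to methodology; experiments show it significantly outperforms existing methods in novelty, technical rigor, and feasibility.

Abstract (Chinese translation)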

科学构思旨在给定科学背景下提出新颖解决方案。现有基于大语言模型（LLM）的智能体方法虽模拟人类研究流程，却未能充分建模科学推理过程，导致其产出多为缺乏技术深度与科学依据的表层概念重组。为解决这一问题，我们提出 MoRI（基于动机的科学构思推理框架），该框架使大语言模型能够显式学习从研究动机到方法论的推理过程。基础大语言模型首先通过监督微调进行初始化，以从给定情境中生成研究动机，随后在复合强化学习奖励机制下进行训练，以逼近科学严谨性：（1）熵感知信息增益鼓励模型基于真实方法论揭示并阐述高复杂度的技术细节；（2）对比语义增益约束推理轨迹，确保其与科学有效解决方案保持概念一致。实验结果表明，MoRI 在创新性、技术严谨性和可行性等多个维度上显著优于主流商用大语言模型及复杂智能体基线方法。代码将在 GitHub 平台开源（https://github.com/ECNU-Text-Computing/IdeaGeneration）。

Abstract

Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to maintain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/IdeaGeneration).

Keywords: Large Language Models, Scientific Ideation, Reasoning, Supervised Fine-tuning, Reinforcement Learning, LLM Agents, AI for Science, Motivation-grounded Reasoning

2. ✅ TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Score: 56.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

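To improve the reasoning ability of large language models, the paper proposes Token-level Adaptive Routing (TARo), which performs alignment at inference time and substantially improves mathematical and clinical reasoning performance.

Abstract (Chinese translation)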

大语言模型（LLMs）展现出强大的推理能力，但通常需要昂贵的后训练才能达到高性能。近期的测试时对齐方法提供了一种轻量级替代方案，但主要被探索用于偏好对齐而非推理任务。为填补这一空白，我们提出了令牌级自适应路由（Token-level Adaptive Routing, TARo），该方法在完全保持基础模型冻结的状态下，于推理阶段引导模型进行结构化推理。具体而言，我们首先在分步数学推导轨迹上训练奖励模型，以捕捉细粒度的逻辑一致性信号；随后引入一个可学习的令牌级路由器，自动控制奖励模型对基础模型的引导强度。大量实验表明，TARo在推理性能上相比基础模型显著提升高达+22.4%，较现有令牌级测试时对齐方法提升+8.4%，同时还能增强分布外临床推理（MedXpertQA）和指令遵循（AlpacaEval）能力。此外，TARo无需重新训练即可从小型骨干模型泛化至大型骨干模型，从而将测试时对齐的应用范围从偏好优化扩展至鲁棒的跨领域推理。

Abstract

Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose, Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.

Keywords: Large Language Models, Test-time Alignment, Reasoning, Token-level Routing, Mathematical Reasoning, Clinical Reasoning, Reward Model, Inference-time Steering

3. ✅ ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

Authors: Wanjia Zhao, Ludwig Schmidt, James Zou, Vidhisha Balachandran, Lingjiao Chen Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18614v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

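The paper introduces ZebraArena, a diagnostic simulation environment for studying reasoning-action coupling in tool-augmented large language models. It finds that even frontier models such as GPT-5 reach only 60% accuracy on hard instances and use 70-270% more tool calls than the theoretical optimum.

Abstract (Chinese translation)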

工具增强型大语言模型（LLM）必须将多步推理与外部行动紧密耦合，然而现有基准测试常因复杂的环境动态、记忆知识或数据集污染而混淆这种相互作用。本文提出ZebraArena——一个通过程序化生成、用于研究工具增强型LLM中推理-行动耦合的诊断性环境，其具备可控难度和知识最小化设计，能有效限制模型从记忆或数据集污染中获益。ZebraArena中的每个任务都需要一组关键信息，这些信息仅能通过针对性工具调用获取，从而在外部信息获取与演绎推理之间构建了可解释的接口。该设计通过唯一解实现确定性评估，并提供了理论最优查询次数以衡量工具使用效率。我们证明ZebraArena要求深度推理与精准外部工具调用的结合，这对前沿推理模型（如GPT-5和Gemini 2.5 Pro）仍具挑战性——它们在困难实例上仅达到60%的准确率。我们还观察到理论最优性与实际工具使用之间存在持续差距：例如GPT-5的工具调用次数比理论最优值高出70-270%。本文重点阐述了评估中的关键发现，期望ZebraArena能推动关于内部推理与外部行动交互机制的进一步研究。

Abstract

Tool-augmented large language models (LLMs) must tightly couple multi-step reasoning with external actions, yet existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge or dataset contamination. In this paper, we introduce ZebraArena, a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs, with controllable difficulty and a knowledge-minimal design, which limits gains from memorization or dataset contamination. Each task in ZebraArena requires a set of critical information which is available only through targeted tool use, yielding an interpretable interface between external information acquisition and deductive reasoning. This design provides deterministic evaluation via unique solutions, and a theoretical optimal query count for measuring efficient tool use. We show that ZebraArena requires a combination of in-depth reasoning and accurate external tool calling, which remains a challenge as frontier reasoning models such as GPT-5 and Gemini 2.5 Pro only achieves 60% accuracy on the hard instances. We also observe a persistent gaps between theoretical optimality and practical tool usage. For example, GPT-5 uses 70-270% more tool calls than the theoretical optimum. We highlight the key findings in our evaluation, and hope ZebraArena stimulates further research on the interplay between internal reasoning and external action.

Keywords: Tool-augmented LLMs, Reasoning-action coupling, Diagnostic simulation environment, Multi-step reasoning, External tool calling, In-depth reasoning, ZebraArena, Efficient tool use
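
The "70-270% more tool calls than the theoretical optimum" figure is a relative overhead. A small sketch of that arithmetic follows; the call counts are hypothetical values picked to land on the two ends of the reported range, not numbers from the paper.

```python
def tool_call_overhead(actual_calls, optimal_calls):
    """Percentage of extra tool calls relative to the theoretical optimum."""
    if optimal_calls <= 0:
        raise ValueError("optimal_calls must be positive")
    return 100.0 * (actual_calls - optimal_calls) / optimal_calls


# Hypothetical episode: the puzzle can be solved with 10 targeted queries.
low_end = tool_call_overhead(actual_calls=17, optimal_calls=10)   # 70% more
high_end = tool_call_overhead(actual_calls=37, optimal_calls=10)  # 270% more
```

Read this way, an overhead of 270% means the model issued 3.7 times as many queries as the minimum needed.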

4. ✅ PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Authors: Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18363v1

Score: 46.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

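The paper addresses the lack of a well-defined theoretical optimization target in the heuristic intrinsic rewards used for unsupervised reinforcement-learning fine-tuning of large language models. It proposes PowerFlow, a distribution-matching framework that uses α-power distributions to directionally elicit either the logical reasoning or the creative capabilities of LLMs; experiments show it outperforms existing RLIF methods and matches or exceeds supervised GRPO.

Abstract (Chinese translation)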

无监督内部反馈强化学习已成为一种无需外部监督即可激发大型语言模型潜在能力的前景广阔的研究范式。然而，现有方法依赖于启发式内在奖励，这些奖励通常缺乏明确的理论优化目标，且容易产生退化性偏差。本研究提出PowerFlow，一个将无监督微调重新定义为分布匹配问题的原理性框架。通过将GFlowNet构建为非归一化密度的摊销变分采样器，我们提出了一种长度感知的轨迹平衡目标，该目标能显式地抵消自回归生成中固有的结构长度偏差。通过以$α$-幂分布为目标，PowerFlow能够定向激发大型语言模型的双重特性：通过锐化分布（$α> 1$）来强化逻辑推理能力，或通过平坦化分布（$α< 1$）来释放表达创造力。大量实验表明，PowerFlow在各项任务中持续优于现有RLIF方法，其表现达到甚至超越了有监督的GRPO。此外，通过缓解对齐模型中的过度锐化现象，我们的方法在多样性与质量上实现了同步提升，从而在创造性任务中推动了帕累托边界的演进。

Abstract

Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $α$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($α> 1$) to intensify logical reasoning, or flattening it ($α< 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.

Keywords: Large Language Models, Unsupervised Reinforcement Learning, Distribution Matching, GFlowNet, Trajectory-Balance, Logical Reasoning, Expressive Creativity, Fine-tuning
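
The sharpening and flattening effect of an $α$-power target can be seen on a small categorical distribution. The sketch assumes the usual reading $p_α(x) \propto p(x)^α$; the abstract does not spell out the definition, and the three-outcome distribution is a toy example. PowerFlow itself targets distributions over whole generated sequences and trains a sampler for them, which this sketch does not attempt.

```python
def alpha_power(probs, alpha):
    """Raise each probability to the power alpha, then renormalize."""
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


base = [0.5, 0.3, 0.2]           # toy distribution over three outcomes

sharp = alpha_power(base, 2.0)   # alpha > 1: mass concentrates on the mode
flat = alpha_power(base, 0.5)    # alpha < 1: mass spreads toward uniform
```

With $α = 2$ the most likely outcome gains probability mass, and with $α = 0.5$ it loses mass to the less likely outcomes, matching the abstract's pairing of sharpening with reasoning and flattening with creativity.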

5. ✅ Learning to Self-Evolve

Authors: Xiaoyin Chen, Canwen Xu, Yite Wang, Boyi Liu, Zhewei Yao, Yuxiong He Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18620v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

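The paper proposes Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models to self-evolve at test time by iteratively refining their context, outperforming existing self-evolving policies and prompt optimization methods on Text-to-SQL generation and general question answering.

Abstract (Chinese translation)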

我们提出“学习式自我进化”框架，这是一种强化学习框架，用于训练大语言模型在测试阶段优化其自身上下文。我们将该框架置于测试时自我进化的情境中，使模型能够基于已见问题的反馈迭代优化上下文，从而在新问题上表现更佳。现有方法完全依赖模型固有的推理能力，从未针对此任务进行显式训练。本框架将多步进化问题简化为单步强化学习目标，其中每次上下文编辑的奖励由下游性能提升程度决定。我们将此目标与树状引导进化循环相结合。在文本到SQL生成和通用问答任务上，采用本框架训练的40亿参数模型超越了基于GPT-5与Claude Sonnet 4.5的自进化策略，以及包括GEPA和TextGrad在内的提示优化方法，并且无需额外训练即可迁移指导其他模型。我们的研究结果证明了将自我进化作为可学习技能的有效性。

Abstract

We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

Keywords: Self-Evolution, Reinforcement Learning, Large Language Models, Test-time Adaptation, Context Optimization, Multi-step Reasoning, Performance Improvement, Transfer Learning

6. ✅ TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19039v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0
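The total and the pass mark for this entry can be checked by hand. The sketch below uses the pass-mark formula from the report's configuration section (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) and assumes that each row's score is weight × relevance, which is what the table rows suggest. The function names are hypothetical.

```python
def passing_score(total_weight: float) -> float:
    """Pass mark as given in the report's configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows):
    """Sum of per-keyword scores, assuming score = weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero rows of the TerraScope table as (weight, relevance); zero rows are omitted.
terrascope_rows = [
    (1.0, 5.0),   # Large Language Models
    (1.0, 5.0),   # Pre-training
    (1.0, 5.0),   # Post-training
    (1.0, 10.0),  # Chain of Thought
    (1.0, 5.0),   # System 2 Thinking
    (1.0, 5.0),   # Mechanistic Interpretability
    (1.0, 10.0),  # AI for Science
]

threshold = passing_score(27 * 1.0)
score = total_score(terrascope_rows)
print(round(threshold, 1), score, score >= threshold)
```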

!!! tip deepseek-chat TL;DR

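To address the difficulty existing vision-language models have with precise pixel-level spatial reasoning in Earth observation, the paper proposes TerraScope, which combines modality-flexible and multi-temporal reasoning, significantly outperforms existing models on pixel-grounded geospatial reasoning tasks, and provides interpretable visual evidence.

Translated abstract

Vision-language models (VLMs) show promise in Earth observation (EO), but they still struggle with tasks that require anchoring complex spatial reasoning in precise pixel-level visual representations. To address this, we propose TerraScope, a unified vision-language model for pixel-grounded geospatial reasoning with two core capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or synthetic aperture radar, SAR) and adaptively fuses the modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates time-series data for change analysis across multiple time points. We also build Terra-CoT, a large-scale dataset of 1 million samples whose reasoning chains embed pixel-level masks drawn from multiple data sources. Finally, we propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, with six sub-tasks that evaluate both answer accuracy and mask quality to ensure genuinely pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing vision-language models on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

Abstract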

Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

Keywords: Vision-language models, Earth observation, Pixel-grounded reasoning, Geospatial reasoning, Multi-temporal reasoning, Chain of Thought, Interpretable AI, Remote sensing

7. ✅ D-Mem: A Dual-Process Memory System for LLM Agents

Authors: Zhixing You, Jiachen Yuan, Jason Cai Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18631v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

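To address information loss and weak fine-grained contextual understanding in the retrieval-based memory frameworks that LLM agents use for long-horizon reasoning, the paper proposes D-Mem, a dual-process memory system that switches dynamically between lightweight vector retrieval and a Full Deliberation module. On the LoCoMo and RealTalk benchmarks it achieves high accuracy at significantly lower computational cost than full deliberation.

Translated abstract

Driven by the development of persistent, self-adapting agents, equipping these systems with high-fidelity memory access for long-horizon reasoning has become a critical requirement. However, prevailing retrieval-based memory frameworks usually follow an incremental processing paradigm: they continuously extract conversational memories, write them to a vector database, and rely on semantic retrieval at query time. Although fast, this approach inherently depends on lossy abstraction, often misses contextually critical information, and struggles with queries that require fine-grained contextual understanding. We therefore propose D-Mem, a dual-process memory system. It keeps lightweight vector retrieval for routine queries and adds an exhaustive Full Deliberation module as a high-fidelity fallback. To balance cognitive economy against accuracy, D-Mem uses a Multi-dimensional Quality Gating policy to bridge the two processes dynamically. Experiments on the LoCoMo and RealTalk benchmarks with GPT-4o-mini and Qwen3-235B-Instruct confirm the effectiveness of the approach. Notably, the Multi-dimensional Quality Gating policy reaches an F1 score of 53.5 on LoCoMo with GPT-4o-mini, outperforming the static retrieval baseline Mem0$^\ast$ (51.2) and recovering 96.7% of the Full Deliberation module's performance (55.3), while significantly reducing computational cost.

Abstract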

Driven by the development of persistent, self-adapting autonomous agents, equipping these systems with high-fidelity memory access for long-horizon reasoning has emerged as a critical requirement. However, prevalent retrieval-based memory frameworks often follow an incremental processing paradigm that continuously extracts and updates conversational memories into vector databases, relying on semantic retrieval when queried. While this approach is fast, it inherently relies on lossy abstraction, frequently missing contextually critical information and struggling to resolve queries that rely on fine-grained contextual understanding. To address this, we introduce D-Mem, a dual-process memory system. It retains lightweight vector retrieval for routine queries while establishing an exhaustive Full Deliberation module as a high-fidelity fallback. To achieve cognitive economy without sacrificing accuracy, D-Mem employs a Multi-dimensional Quality Gating policy to dynamically bridge these two processes. Experiments on the LoCoMo and RealTalk benchmarks using GPT-4o-mini and Qwen3-235B-Instruct demonstrate the efficacy of our approach. Notably, our Multi-dimensional Quality Gating policy achieves an F1 score of 53.5 on LoCoMo with GPT-4o-mini. This outperforms our static retrieval baseline, Mem0$^\ast$ (51.2), and recovers 96.7% of the Full Deliberation’s performance (55.3), while incurring significantly lower computational costs.

Keywords: LLM Agents, Memory System, Dual-Process, Retrieval-Augmented Generation, Long-horizon Reasoning, Multi-dimensional Quality Gating, Full Deliberation, Autonomous Agents
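The dual-process idea in the D-Mem abstract (a fast retrieval path with a high-fidelity fallback, bridged by a quality gate) can be illustrated with a small sketch. Everything specific here is an assumption: the abstract does not say which quality dimensions or thresholds the Multi-dimensional Quality Gating policy uses, so `relevance`, `completeness`, the threshold values, and all function names are hypothetical. The last lines only re-derive the reported 96.7% recovery figure from the two F1 scores.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Draft:
    text: str
    quality: Dict[str, float]  # hypothetical quality dimensions, each in [0, 1]


def gate_passes(quality: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Accept the fast answer only if every gated dimension meets its threshold."""
    return all(quality.get(name, 0.0) >= limit for name, limit in thresholds.items())


def answer_query(query: str,
                 fast_path: Callable[[str], Draft],
                 slow_path: Callable[[str], str],
                 thresholds: Dict[str, float]) -> Tuple[str, str]:
    """Try lightweight retrieval first; fall back to full deliberation if the gate fails."""
    draft = fast_path(query)
    if gate_passes(draft.quality, thresholds):
        return draft.text, "fast"
    return slow_path(query), "slow"


# Toy stand-ins for the two processes (assumptions, not the paper's modules).
thresholds = {"relevance": 0.7, "completeness": 0.6}
fast = lambda q: Draft("Paris", {"relevance": 0.9, "completeness": 0.4})
slow = lambda q: "Paris, as mentioned in session 3"

result = answer_query("Where did Ana move?", fast, slow, thresholds)
print(result)

# Reported figures: gated policy F1 53.5 vs. Full Deliberation F1 55.3.
recovered = 53.5 / 55.3
print(f"{recovered:.1%}")
```

In this toy run the draft fails the `completeness` threshold, so the query is routed to the slow path.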

8. ✅ EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Authors: Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18489v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	15.0/10	15.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

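To address inefficient KV caching in diffusion language models (dLLMs), the paper proposes EntropyCache, a training-free KV caching method guided by decoded-token entropy. It achieves a 15.2-26.4× inference speedup on standard benchmarks while maintaining competitive accuracy.

Translated abstract

Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce computation by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of the newly decoded token distributions as a constant-cost signal for deciding when to recompute. The design rests on two empirical observations: (1) decoded-token entropy correlates with KV cache drift, giving a cheap proxy for cache staleness; (2) the feature volatility of decoded tokens persists for several steps after unmasking, which motivates recomputing the $k$ most recently decoded tokens. The skip-or-recompute decision costs only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves a $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks while maintaining competitive accuracy, with decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.

Abstract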

Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.

Keywords: KV caching, diffusion language models, inference acceleration, decoded token entropy, attention mechanisms, computational efficiency, large language models, chain-of-thought reasoning
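The skip-or-recompute signal in the EntropyCache abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the released implementation: the threshold value, the function names, and the rule "recompute when the maximum entropy exceeds the threshold" are assumptions inferred from the statement that decoded-token entropy correlates with cache drift. Computing the entropy of one distribution touches each of the $V$ vocabulary entries once, which is where the $O(V)$ cost comes from.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token distribution; O(V) in the vocabulary size."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_recompute(new_token_dists: Sequence[Sequence[float]], threshold: float) -> bool:
    """Assumed decision rule: refresh the cache when the most uncertain
    newly decoded token has entropy above the threshold."""
    if not new_token_dists:
        return False
    return max(entropy(dist) for dist in new_token_dists) > threshold


# Toy distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
THRESHOLD = 0.5  # hypothetical value, in nats

skip_step = should_recompute([confident], THRESHOLD)
refresh_step = should_recompute([confident, uncertain], THRESHOLD)
print(skip_step, refresh_step)
```

The decision depends only on the output distributions of the tokens decoded in the current step, so its cost does not grow with context length or model depth.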

9. ✅ Security awareness in LLM agents: the NDAI zone case

Authors: Enrico Bottazzi, Pia Park Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19011v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

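Using an NDAI-style negotiation task, the study finds that current LLM agents reliably detect danger signals but cannot reliably verify safety. This asymmetry in security awareness is a central challenge for deploying privacy-preserving agentic protocols.

Translated abstract

NDAI zones let inventor and investor agents negotiate inside a Trusted Execution Environment (TEE); if no deal is reached, any disclosed information is deleted. This makes full disclosure of intellectual property the rational strategy for the inventor's agent. Using this infrastructure, however, requires agents to tell secure environments from insecure ones, a capability LLM agents lack natively because they can form a view of their execution environment only from evidence passed through the context window. We ask: how do different LLMs weigh different forms of evidence when forming a view of the security of their execution environment? Running an NDAI-style negotiation task across 10 language models and a range of evidence scenarios, we find a clear asymmetry: a failing attestation suppresses disclosure across all models, whereas a passing attestation produces highly heterogeneous responses: some models increase disclosure, some are unaffected, and a few paradoxically reduce it. This shows that current LLMs can reliably detect danger signals but cannot reliably verify safety, which is exactly the capability that privacy-preserving agentic protocols such as NDAI zones require. Closing this gap, for example through interpretability analysis, targeted fine-tuning, or improved evidence architectures, remains the central open challenge for deploying agents that calibrate information sharing to actual evidence quality.

Abstract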

NDAI zones let inventor and investor agents negotiate inside a Trusted Execution Environment (TEE) where any disclosed information is deleted if no deal is reached. This makes full IP disclosure the rational strategy for the inventor’s agent. Leveraging this infrastructure, however, requires agents to distinguish a secure environment from an insecure one, a capability LLM agents lack natively, since they can rely only on evidence passed through the context window to form awareness of their execution environment. We ask: How do different LLM models weight various forms of evidence when forming awareness of the security of their execution environment? Using an NDAI-style negotiation task across 10 language models and various evidence scenarios, we find a clear asymmetry: a failing attestation universally suppresses disclosure across all models, whereas a passing attestation produces highly heterogeneous responses: some models increase disclosure, others are unaffected, and a few paradoxically reduce it. This reveals that current LLM models can reliably detect danger signals but cannot reliably verify safety, the very capability required for privacy-preserving agentic protocols such as NDAI zones. Bridging this gap, possibly through interpretability analysis, targeted fine-tuning, or improved evidence architectures, remains the central open challenge for deploying agents that calibrate information sharing to actual evidence quality.

Keywords: LLM agents, security awareness, NDAI zones, Trusted Execution Environment, negotiation task, evidence weighting, privacy-preserving protocols, interpretability analysis

10. ✅ Bridging Network Fragmentation: A Semantic-Augmented DRL Framework for UAV-aided VANETs

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

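The paper proposes a Semantic-Augmented Deep Reinforcement Learning (SA-DRL) framework that uses the reasoning ability of large language models (LLMs) to guide UAV deployment and address network fragmentation in Vehicular Ad-hoc Networks (VANETs). In simulation, its SA-PPO algorithm reaches baseline performance with only 26.6% of the training episodes and improves two key connectivity metrics by 13.2% and 23.5%.

Translated abstract

Vehicular Ad-hoc Networks (VANETs) are the digital cornerstone of autonomous driving, yet in urban environments they often suffer severe network fragmentation caused by physical obstructions. Highly mobile Unmanned Aerial Vehicles (UAVs) have become a key solution for bridging these connectivity gaps. However, traditional UAV deployment strategies based on Deep Reinforcement Learning (DRL) lack a semantic understanding of road topology, which often leads to blind exploration and poor sample efficiency. Large Language Models (LLMs), by contrast, have strong reasoning abilities and can identify topological importance, but applying them to control tasks remains challenging. We therefore propose the Semantic-Augmented DRL (SA-DRL) framework. First, we propose a method for quantifying network fragmentation based on Road Topology Graphs (RTG) and Dual Connected Graphs (DCG). Next, we design a four-stage pipeline that turns a general-purpose LLM into a domain-specific topology expert. Finally, we propose the Semantic-Augmented PPO (SA-PPO) algorithm, which uses a Logit Fusion mechanism to inject the LLM's semantic reasoning directly into the policy as a prior, steering the agent toward critical intersections. Extensive high-fidelity simulations show that SA-PPO achieves state-of-the-art performance with remarkable efficiency, reaching baseline performance with only 26.6% of the training episodes. Compared with competing methods, SA-PPO improves two key connectivity metrics by 13.2% and 23.5% and reduces energy consumption to 28.2% of the baseline.

Abstract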

Vehicular Ad-hoc Networks (VANETs) are the digital cornerstone of autonomous driving, yet they suffer from severe network fragmentation in urban environments due to physical obstructions. Unmanned Aerial Vehicles (UAVs), with their high mobility, have emerged as a vital solution to bridge these connectivity gaps. However, traditional Deep Reinforcement Learning (DRL)-based UAV deployment strategies lack semantic understanding of road topology, often resulting in blind exploration and sample inefficiency. By contrast, Large Language Models (LLMs) possess powerful reasoning capabilities capable of identifying topological importance, though applying them to control tasks remains challenging. To address this, we propose the Semantic-Augmented DRL (SA-DRL) framework. Firstly, we propose a fragmentation quantification method based on Road Topology Graphs (RTG) and Dual Connected Graphs (DCG). Subsequently, we design a four-stage pipeline to transform a general-purpose LLM into a domain-specific topology expert. Finally, we propose the Semantic-Augmented PPO (SA-PPO) algorithm, which employs a Logit Fusion mechanism to inject the LLM’s semantic reasoning directly into the policy as a prior, effectively guiding the agent toward critical intersections. Extensive high-fidelity simulations demonstrate that SA-PPO achieves state-of-the-art performance with remarkable efficiency, reaching baseline performance levels using only 26.6% of the training episodes. Ultimately, SA-PPO improves two key connectivity metrics by 13.2% and 23.5% over competing methods, while reducing energy consumption to just 28.2% of the baseline.

Keywords: Large Language Models, Deep Reinforcement Learning, UAV-aided VANETs, Network Fragmentation, Semantic-Augmented PPO, Road Topology Graphs, Logit Fusion, Topological Reasoning
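The abstract above says that SA-PPO uses a Logit Fusion mechanism to inject the LLM's reasoning into the policy as a prior, but it does not give the formula. One simple way such a fusion could work is to add weighted prior logits to the policy logits before the softmax. The sketch below shows that additive form only as an assumption; the weight `beta`, the function names, and the toy numbers are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(policy_logits: Sequence[float],
                llm_prior_logits: Sequence[float],
                beta: float) -> List[float]:
    """Assumed additive fusion: policy logits plus beta times the LLM prior logits."""
    return [p + beta * q for p, q in zip(policy_logits, llm_prior_logits)]


# Three candidate intersections; the untrained policy is indifferent between them.
policy_logits = [0.0, 0.0, 0.0]
# Hypothetical LLM prior that rates the first intersection as most critical.
llm_prior_logits = [2.0, 0.0, -1.0]

probs = softmax(fuse_logits(policy_logits, llm_prior_logits, beta=1.0))
print([round(p, 3) for p in probs])
```

With `beta=0.0` the prior is ignored and the policy's own distribution is recovered, so the weight controls how strongly the prior steers exploration.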

11. ✅ SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation

Authors: Oliver Cory, Ozge Mercanoglu Sincan, Richard Bowden Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19059v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

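The paper proposes SignAgent, an agentic framework that uses large language models (LLMs) for scalable, linguistically grounded sign language annotation and dataset curation. It addresses the loss of linguistic detail in traditional methods and the low throughput of manual annotation, and shows strong performance on the Pseudo-gloss Annotation and ID Glossing tasks.

Translated abstract

This paper introduces SignAgent, a novel agentic framework that uses Large Language Models (LLMs) for scalable, linguistically grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for sign languages usually operate only at the gloss level and overlook important linguistic nuances, while manual linguistic annotation remains a major bottleneck: it is too slow and too expensive for building large-scale, phonologically aware datasets. SignAgent tackles these challenges with two core components: SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate the framework on two downstream annotation tasks. First, in Pseudo-gloss Annotation, the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. Second, in ID Glossing, the agent detects and refines visual clusters by reasoning over visual similarity and phonological overlap, so that lexical sign variants are correctly identified and grouped. Our results show that this agentic approach performs strongly on large-scale, linguistically aware data annotation and curation.

Abstract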

This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. First, on Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. Second, on ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.

Keywords: SignAgent, Agentic LLMs, Sign Language Annotation, Dataset Curation, Linguistically-grounded, Reasoning LLM, Multi-modal Evidence, Phonological Awareness

12. ✅ From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Authors: Zhuofan Li, Hongkun Yang, Zhenyang Chen, Yangxuan Chen, Yingyan Lin, Chaojian Li Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19131v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

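The paper finds that conventional inference efficiency metrics such as parameter count and FLOPs do not accurately reflect how Vision-Language-Action models actually perform on robotic platforms. It proposes and validates system-level embodied efficiency metrics (for example, task completion time and trajectory smoothness) that give a more complete picture of real-world performance.

Translated abstract

Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by reasoning jointly over visual, linguistic, and motor modalities. However, we find that the notion of “efficiency” prevalent in current VLA research, usually measured by parameter count, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies of model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, even though they maintain task success rates. (2) System-level embodied efficiency metrics reveal performance differences in the learned action policies that conventional evaluations do not show. (3) Common adaptation methods such as in-context prompting or supervised fine-tuning yield only mild, metric-specific improvements in embodied efficiency. These methods can reduce targeted embodied-efficiency metrics such as jerk or action rate, but the gains may come at the expense of other metrics, for example longer completion time. Taken together, our results suggest that conventional inference efficiency metrics can overlook important aspects of embodied execution. Incorporating embodied efficiency gives a more complete view of policy behavior and practical performance, enabling fairer and more comprehensive comparisons of VLA models.

Abstract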

Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of “efficiency” in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, despite maintaining task success rates. (2) System-level embodied efficiency metrics reveal performance differences in the learned action policies that remain hidden under conventional evaluations. (3) Common adaptation methods such as in-context prompting or supervised fine-tuning show only mild and metric-specific improvements in embodied efficiency. While these methods can reduce targeted embodied-efficiency metrics such as jerk or action rate, the resulting gains may come with trade-offs in other metrics, such as longer completion time. Taken together, our results suggest that conventional inference efficiency metrics can overlook important aspects of embodied execution. Incorporating embodied efficiency provides a more complete view of policy behavior and practical performance, enabling fairer and more comprehensive comparisons of VLA models.

Keywords: Vision-Language-Action Models, Embodied Agents, Efficiency Metrics, Model Compression, In-context Prompting, Supervised Fine-tuning, Robotic Platforms, System-level Performance
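The embodied metrics named in the abstract above (task completion time, cumulative joint rotation, jerk) can be computed from a sampled joint-angle trajectory. The sketch below uses common finite-difference definitions as an assumption; the paper's exact formulas are not given in the abstract, and the function names, the sampling interval `dt`, and the toy trajectories are hypothetical.

```python
from typing import List, Sequence

Trajectory = Sequence[Sequence[float]]  # one list of joint angles (radians) per time step


def completion_time(traj: Trajectory, dt: float) -> float:
    """Elapsed time for a trajectory sampled every dt seconds."""
    return (len(traj) - 1) * dt


def cumulative_joint_rotation(traj: Trajectory) -> float:
    """Sum of absolute joint-angle changes over all steps and joints."""
    return sum(abs(b - a)
               for prev, cur in zip(traj, traj[1:])
               for a, b in zip(prev, cur))


def mean_abs_jerk(traj: Trajectory, dt: float) -> float:
    """Mean absolute third finite difference of the joint angles (rad/s^3)."""
    total, count = 0.0, 0
    for i in range(len(traj) - 3):
        for j in range(len(traj[0])):
            third = traj[i + 3][j] - 3 * traj[i + 2][j] + 3 * traj[i + 1][j] - traj[i][j]
            total += abs(third) / dt ** 3
            count += 1
    return total / count if count else 0.0


# Two single-joint toy trajectories with the same duration.
smooth: List[List[float]] = [[0.0], [0.1], [0.2], [0.3], [0.4]]
jerky: List[List[float]] = [[0.0], [0.2], [0.0], [0.2], [0.0]]
DT = 0.1  # hypothetical control period in seconds

for name, traj in (("smooth", smooth), ("jerky", jerky)):
    print(name, completion_time(traj, DT),
          round(cumulative_joint_rotation(traj), 3),
          round(mean_abs_jerk(traj, DT), 3))
```

Both toy trajectories take the same time, yet the oscillating one rotates the joint twice as far and has far higher jerk, which is the kind of difference that parameter counts or FLOPs cannot show.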

13. ✅ Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

Authors: Gregory N. Frank Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18280v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

论文研究发现当前基于拒绝的对齐评估方法存在缺陷，通过研究中文大语言模型的政治审查机制，提出了检测-路由-生成的三阶段框架，揭示对齐主要通过路由机制而非知识删除或简单拒绝来实现。

Abstract

Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.
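
The "surgical ablation" step in the abstract can be pictured with the standard projection that removes one direction from a hidden state. This is a minimal sketch, not the authors' code: the abstract does not say how the political-sensitivity direction is estimated or applied, and the vectors and function names below are made up for illustration.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ablate_direction(h, d):
    """Remove the component of hidden state h that lies along direction d."""
    u = unit(d)
    coeff = sum(a * b for a, b in zip(h, u))   # projection coefficient of h onto u
    return [a - coeff * b for a, b in zip(h, u)]

# Toy values (hypothetical): a 4-dimensional hidden state and a candidate direction.
hidden = [0.5, -1.2, 2.0, 0.3]
direction = [1.0, 0.0, 1.0, 0.0]

ablated = ablate_direction(hidden, direction)
# After ablation the state has (numerically) no component along the direction.
residual = sum(a * b for a, b in zip(ablated, unit(direction)))
print(ablated, residual)
```

Coordinates orthogonal to the direction are left unchanged, which is the property that makes such an intervention "surgical" in this simplified picture.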

Keywords: alignment evaluation, political censorship, routing mechanism, language models, refusal-based evaluation, concept detection, behavioral policy, model interpretability

14. ❌ dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models

Authors: Wenxuan Zhang, Lemeng Wu, Changsheng Zhao, Ernie Chang, Mingchen Zhuge, Zechun Liu, Andy Su, Hanxian Huang, Jun Chen, Chong Zhou, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Wei Wen Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18806v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

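Scoring rationale: The paper's core contribution is dTRPO, a policy optimization method for diffusion large language models (dLLMs), which is an innovation in the technical foundations of large models. It is highly relevant to "Large Language Models" (10 points) because it studies dLLMs specifically. It is relevant to keywords such as "Instruction Tuning" and "RLHF" (8 points) because it involves alignment with human preferences and policy optimization, which fall under alignment and reinforcement learning. Other keywords such as MoE, SLMs, Scaling Laws, and RAG have no direct connection to the paper and receive 0 points.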
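
The pass/fail mark follows from the scoring rule in the report's configuration (pass mark = 5.0 + 0.8 × total weight). The sketch below reproduces that arithmetic for this paper; the function names are hypothetical, the per-row rule (score = weight × relevance) is inferred from the table columns, and treating a score equal to the pass mark as a pass is an assumption.

```python
BASE = 5.0     # constant term of the pass-mark formula in the report header
FACTOR = 0.8   # multiplier applied to the total keyword weight

def pass_mark(weights):
    """Pass mark = 5.0 + 0.8 x total weight."""
    return BASE + FACTOR * sum(weights)

def paper_score(rows):
    """Sum weight x relevance over (weight, relevance) rows."""
    return sum(w * r for w, r in rows)

# dTRPO: three non-zero rows (10, 8, 8) and 24 zero rows, all with weight 1.0.
rows = [(1.0, 10.0), (1.0, 8.0), (1.0, 8.0)] + [(1.0, 0.0)] * 24

score = paper_score(rows)
threshold = pass_mark(w for w, _ in rows)
passed = score >= threshold   # assumption: equality counts as a pass
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
# prints "26.0 / 26.6 -> fail"
```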

!!! tip deepseek-chat TL;DR

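To address the high cost of computing trajectory probabilities when aligning diffusion large language models (dLLMs) with human preferences, the paper proposes dTRPO, a trajectory-reduction policy optimization method that substantially improves dLLM performance on instruction-following and reasoning tasks and improves training and generation efficiency.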

Abstract

Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass of a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks. Results show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Moreover, dTRPO exhibits strong training efficiency due to its offline, single-forward nature, and achieves improved generation efficiency through high-quality outputs.

Keywords: Diffusion Large Language Models, dLLMs, Policy Optimization, Trajectory Reduction, Human Preference Alignment, dTRPO, Instruction-following, Reasoning Benchmarks

15. ❌ Tinted Frames: Question Framing Blinds Vision-Language Models

Authors: Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19203v1

Score: 21.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	5.0/10	5.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

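Scoring rationale: The paper studies attention behavior in Vision-Language Models (VLMs), an application of large models (LLMs) to the vision-language multimodal domain, so relevance to "Large Language Models" is fairly high (8 points). It analyzes attention distributions to understand model behavior, which falls under "Mechanistic Interpretability", so relevance is again fairly high (8 points). The proposed lightweight prompt-tuning method is a parameter-efficient fine-tuning technique, giving some relevance to "PEFT" (5 points). Other keywords such as MoE, SLMs, Scaling Laws, Alignment, RAG, and Quantization are not addressed in the paper and receive 0 points.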

!!! tip deepseek-chat TL;DR

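The paper finds that the attention allocation of vision-language models (VLMs) is strongly affected by how a question is framed, which degrades visual reasoning performance. It proposes a lightweight prompt-tuning method that improves visual attention allocation, raising performance and consistency across framings.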

Abstract

Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and improving performance across framings.
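
One simple way to quantify "the amount of attention over the image" is the share of attention mass that lands on image tokens. The sketch below assumes a single attention row and a boolean mask marking image positions; the numbers are invented toy values, and the paper's actual metric may differ.

```python
def image_attention_share(attn, is_image):
    """Fraction of total attention mass assigned to image tokens."""
    total = sum(attn)
    on_image = sum(a for a, m in zip(attn, is_image) if m)
    return on_image / total

# Hypothetical layout: positions 1-3 are image tokens, the rest are text tokens.
is_image = [False, True, True, True, False]

# Invented attention rows for the same question under two framings.
attn_open_ended = [0.10, 0.25, 0.25, 0.20, 0.20]
attn_multiple_choice = [0.40, 0.10, 0.10, 0.05, 0.35]

share_open = image_attention_share(attn_open_ended, is_image)
share_mc = image_attention_share(attn_multiple_choice, is_image)
print(round(share_open, 2), round(share_mc, 2))
```

In this toy setup the constrained framing places less mass on the image tokens, which is the kind of gap the abstract describes.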

Keywords: Vision-Language Models, visual attention, question framing, visual reasoning, attention misallocation, prompt-tuning, visual grounding, cross-framing consistency

16. ❌ Secure Linear Alignment of Large Language Models

Authors: Matt Gorbett, Suman Jana Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18908v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

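Scoring rationale: The paper's core contribution is representation alignment for large language models (LLMs): it proposes a privacy-preserving linear alignment framework, so it is highly relevant to "Large Language Models" and "Alignment" (10 points). Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, and Quantization are not addressed in the paper and receive 0 points.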

!!! tip deepseek-chat TL;DR

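The paper proposes a privacy-preserving linear alignment framework that exploits representational convergence among large language models to enable cross-model inference and text generation between independent models while protecting data privacy.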

Abstract

Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, it unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we propose a privacy-preserving framework that exploits representational convergence to enable cross-silo inference between independent language models. The framework learns an affine transformation over a shared public dataset and applies homomorphic encryption to protect client queries during inference. By encrypting only the linear alignment and classification operations, the method achieves sub-second inference latency while maintaining strong security guarantees. We support this framework with an empirical investigation into representational convergence, in which we learn linear transformations between the final hidden states of independent models. We evaluate these cross-model mappings on embedding classification and out-of-distribution detection, observing minimal performance degradation across model pairs. Additionally, we show for the first time that linear alignment sometimes enables text generation across independently trained models.
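
The affine transformation learned over a shared dataset can be illustrated with ordinary least squares on paired representations. This is a toy sketch under stated assumptions: the 2-dimensional "hidden states" are synthetic, the helper names are hypothetical, the fit uses the normal equations, and the homomorphic encryption step is omitted entirely.

```python
def solve(A, B):
    """Solve A @ T = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def fit_affine(X, Y):
    """Least-squares fit of Y ~ [X, 1] @ theta via the normal equations."""
    Xa = [x + [1.0] for x in X]               # append a bias column
    d, k = len(Xa[0]), len(Y[0])
    XtX = [[sum(r[i] * r[j] for r in Xa) for j in range(d)] for i in range(d)]
    XtY = [[sum(r[i] * y[j] for r, y in zip(Xa, Y)) for j in range(k)]
           for i in range(d)]
    return solve(XtX, XtY)

def apply_affine(theta, x):
    """Map a source-model vector into the target model's space."""
    xa = x + [1.0]
    return [sum(xa[i] * theta[i][j] for i in range(len(xa)))
            for j in range(len(theta[0]))]

# Synthetic paired representations of the same inputs under two "models".
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
Y = [[2 * a - b + 1, a + 3 * b - 2] for a, b in X]

theta = fit_affine(X, Y)
mapped = apply_affine(theta, [3.0, 2.0])
print([round(v, 6) for v in mapped])
```

Because the synthetic targets are an exact affine function of the inputs, the fit recovers that map; with real hidden states the fit would only be approximate.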

Keywords: Large Language Models, Alignment, Representational Convergence, Privacy-preserving, Linear Transformation, Homomorphic Encryption, Cross-model Inference, Text Generation

17. ❌ Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

Authors: Tianhui Zhang, Bei Peng, Danushka Bollegala Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18361v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

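Scoring rationale: The paper's core is using large language models (LLMs) to generate synthetic data and then using that data for supervised fine-tuning (SFT) of LLMs, improving the diversity and quality of Generative Commonsense Reasoning (GCR). It is therefore highly relevant to "Large Language Models" and "Post-training" (that is, SFT) (10 points). The paper does not address the specific techniques or concepts of the other keywords, which receive 0 points.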

!!! tip deepseek-chat TL;DR

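The paper proposes a two-stage method to create CommonSyn, the first synthetic dataset for diversified generative commonsense reasoning, and shows that fine-tuning large language models of different sizes on it improves both the diversity and the quality of generations.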

Abstract

Conversational agents are required to respond to their users not only with high quality (i.e. commonsense bearing) responses, but also considering multiple plausible alternative scenarios, reflecting the diversity in their responses. Despite the growing need to train diverse commonsense generators, the progress of this line of work has been significantly hindered by the lack of large-scale high-quality diverse commonsense training datasets. Due to the high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created using a small number of human annotators, covering only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create the first-ever synthetic dataset CommonSyn for diversified (GCR). The model fine-tuned on our synthetic data jointly increase both generation diversity and quality compared with vanilla models and the model fine-tuned on human-crafted dataset across different size Large Language Models (LLMs)

Keywords: Synthetic Data Generation, Commonsense Reasoning, Large Language Models, Fine-tuning, Diversified Generation, Generative Commonsense Reasoning, Training Dataset, Model Performance

18. ❌ Modeling the human lexicon under temperature variations: linguistic factors, diversity and typicality in LLM word associations

Authors: Maria Andueza Rodriguez, Marie Candito, Richard Huyghe Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18171v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

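Scoring rationale: The paper compares LLMs (Mistral-7B, Llama-3.1-8B, Qwen-2.5-32B) with humans on a word association task, an application of large models to linguistics and cognitive science, so it is highly relevant to "Large Language Models" (10 points). It compares models of different sizes (7B, 8B, 32B) and thus touches on the effect of model size, giving some relevance to "Small Language Models" (5 points). It analyzes LLMs' internal lexical representations to understand how they work, which falls under explainable AI and gives some relevance to "Mechanistic Interpretability" (5 points). The paper does not address other keywords such as MoE, training techniques, reasoning methods, compression, or agents, which receive 0 points.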

!!! tip deepseek-chat TL;DR

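The study compares word associations from humans and three LLMs (Mistral-7B, Llama-3.1-8B, Qwen-2.5-32B) at different temperature settings to assess how well LLMs capture human lexical patterns. It finds that the larger model tends to produce typical but minimally variable responses, while the smaller models produce more variable but less typical responses, and that temperature further shifts this trade-off.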

Abstract

Large language models (LLMs) achieve impressive results in terms of fluency in text generation, yet the nature of their linguistic knowledge - in particular the human-likeness of their internal lexicon - remains uncertain. This study compares human and LLM-generated word associations to evaluate how accurately models capture human lexical patterns. Using English cue-response pairs from the SWOW dataset and newly generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) across multiple temperature settings, we examine (i) the influence of lexical factors such as word frequency and concreteness on cue-response pairs, and (ii) the variability and typicality of LLM responses compared to human responses. Results show that all models mirror human trends for frequency and concreteness but differ in response variability and typicality. Larger models such as Qwen tend to emulate a single “prototypical” human participant, generating highly typical but minimally variable responses, while smaller models such as Mistral and Llama produce more variable yet less typical responses. Temperature settings further influence this trade-off, with higher values increasing variability but decreasing typicality. These findings highlight both the similarities and differences between human and LLM lexicons, emphasizing the need to account for model size and temperature when probing LLM lexical representations.
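
The variability/typicality trade-off described in the abstract matches how temperature reshapes a sampling distribution. The sketch below uses the standard temperature-scaled softmax on invented logits; it illustrates the general mechanism only and does not reproduce the paper's experiments.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy in nats, a rough proxy for response variability."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Invented logits for four candidate responses to one cue word.
logits = [3.0, 1.5, 0.5, 0.0]

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    # The top probability stands in for "typicality", entropy for "variability".
    print(t, round(max(probs), 3), round(entropy(probs), 3))
```

As the temperature rises, the probability of the most likely response falls and the entropy grows, which mirrors the reported direction of the trade-off.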

Keywords: Large Language Models, LLMs, word associations, lexical knowledge, model size, temperature settings, human-likeness, variability and typicality

19. ❌ DriftGuard: Mitigating Asynchronous Data Drift in Federated Learning

Authors: Yizhou Han, Di Wu, Blesson Varghese Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18872v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

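Scoring rationale: DriftGuard focuses on asynchronous data drift in federated learning and proposes an MoE-inspired architecture, so it is highly relevant to "Mixture of Experts" OR "MoE" OR "Sparse Models" (10 points). The paper does not involve large language models, innovations in the technical foundations of deep learning, or scientific applications, and is unrelated to the other keywords (0 points).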

!!! tip deepseek-chat TL;DR

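The paper proposes DriftGuard, a framework with an MoE-inspired architecture that addresses asynchronous data drift in federated learning, cutting retraining cost by up to 83% while matching or exceeding state-of-the-art accuracy.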

Abstract

In real-world Federated Learning (FL) deployments, data distributions on devices that participate in training evolve over time. This leads to asynchronous data drift, where different devices shift at different times and toward different distributions. Mitigating such drift is challenging: frequent retraining incurs high computational cost on resource-constrained devices, while infrequent retraining degrades performance on drifting devices. We propose DriftGuard, a federated continual learning framework that efficiently adapts to asynchronous data drift. DriftGuard adopts a Mixture-of-Experts (MoE) inspired architecture that separates shared parameters, which capture globally transferable knowledge, from local parameters that adapt to group-specific distributions. This design enables two complementary retraining strategies: (i) global retraining, which updates the shared parameters when system-wide drift is identified, and (ii) group retraining, which selectively updates local parameters for clusters of devices identified via MoE gating patterns, without sharing raw data. Experiments across multiple datasets and models show that DriftGuard matches or exceeds state-of-the-art accuracy while reducing total retraining cost by up to 83%. As a result, it achieves the highest accuracy per unit retraining cost, improving over the strongest baseline by up to 2.3x. DriftGuard is available for download from https://github.com/blessonvar/DriftGuard.
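
Grouping devices "via MoE gating patterns" could, in its simplest form, mean bucketing devices by the expert their gate favors most. The sketch below is only that simplification: the abstract does not specify the clustering rule, and the device names, gate profiles, and function name are hypothetical.

```python
from collections import defaultdict

def group_by_dominant_expert(gate_profiles):
    """Bucket devices by the index of the expert with the largest average gate weight."""
    groups = defaultdict(list)
    for device, gates in gate_profiles.items():
        dominant = max(range(len(gates)), key=lambda i: gates[i])
        groups[dominant].append(device)
    return dict(groups)

# Hypothetical average gate weights per device over three experts.
profiles = {
    "dev_a": [0.7, 0.2, 0.1],
    "dev_b": [0.6, 0.3, 0.1],
    "dev_c": [0.1, 0.1, 0.8],
    "dev_d": [0.2, 0.5, 0.3],
}

groups = group_by_dominant_expert(profiles)
print(groups)
```

Only the gate statistics are used for grouping in this sketch, so no raw data leaves a device, in line with the constraint the abstract states.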

Keywords: Federated Learning, Data Drift, Mixture of Experts, Continual Learning, Retraining Cost, Asynchronous Drift, Model Adaptation

20. ❌ Security, privacy, and agentic AI in a regulatory view: From definitions and distinctions to provisions and reflections

Authors: Shiliang Zhang, Sabita Maharjan Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18914v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

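Scoring rationale: The paper studies the EU AI regulatory framework, focusing on the regulatory definitions, distinctions, and provisions concerning security, privacy, and agentic AI; it belongs to AI governance and policy research. All keywords concern the technical principles, methods, or applications of large models and deep learning, whereas this paper does not address those techniques and only discusses the regulation of AI (agentic AI in particular) at a high level. The only slightly relevant keyword is "LLM Agents" OR "Autonomous Agents" OR "Agentic Workflow", because the paper mentions "agentic AI" and "autonomous agents", but it discusses their regulation rather than their technical implementation, hence 5 points (some relevance). The remaining keywords are unrelated to the paper and receive 0 points.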

!!! tip deepseek-chat TL;DR

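By analyzing EU AI regulatory documents from 2024 to 2025, the paper clarifies the regulatory definitions of security, privacy, and agentic AI, distinguishes them from related concepts, and synthesizes the current regulatory provisions for different types of AI (especially those involving security and privacy), aiming to inform policymakers, developers, and researchers on compliance and AI governance.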

Abstract

The rapid proliferation of artificial intelligence (AI) technologies has led to a dynamic regulatory landscape, where legislative frameworks strive to keep pace with technical advancements. As AI paradigms shift towards greater autonomy, specifically in the form of agentic AI, it becomes increasingly challenging to precisely articulate regulatory stipulations. This challenge is even more acute in the domains of security and privacy, where the capabilities of autonomous agents often blur traditional legal and technical boundaries. This paper reviews the evolving European Union (EU) AI regulatory provisions via analyzing 24 relevant documents published between 2024 and 2025. From this review, we provide a clarification of critical definitions. We deconstruct the regulatory interpretations of security, privacy, and agentic AI, distinguishing them from closely related concepts to resolve ambiguity. We synthesize the reviewed documents to articulate the current state of regulatory provisions targeting different types of AI, particularly those related to security and privacy aspects. We analyze and reflect on the existing provisions in the regulatory dimension to better align security and privacy obligations with AI and agentic behaviors. These insights serve to inform policymakers, developers, and researchers on the compliance and AI governance in the society with increasing algorithmic agencies.

Keywords: AI regulation, agentic AI, security, privacy, EU regulatory framework, autonomous agents, AI governance, compliance

21. ❌ F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19223v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

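Scoring rationale: F2LLM-v2 is a family of multilingual embedding models. It is highly relevant to LLMs (10) because the embedding models are built on LLMs; relevant to small language models (8) because the family spans sizes from 80M to 14B; relevant to pre-training (8) because it uses a two-stage LLM-based embedding training pipeline; and partly relevant to data quality (5) because it is trained on 60 million high-quality data samples. Other keywords such as MoE, SFT, RLHF, and RAG have no direct connection to embedding model training and evaluation and receive 0.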
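
The arithmetic behind the score line can be sketched as follows. The passing-score formula (5.0 + 0.8 × total weight) comes from the report's configuration section; the per-keyword rule `weight * relevance` is an assumption, since the report does not state the exact formula. The table prints every weight as 0.0 even though the configuration lists 1.0 per keyword, and the sketch uses the printed values, which is why the total stays at 0.0 despite non-zero relevance.

```python
# Sketch of the report's scoring arithmetic.
# Assumption: a paper's total is the sum of weight * relevance over keywords;
# the report does not state the exact per-keyword formula.

def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing-score formula from the report header: 5.0 + 0.8 * total weight."""
    return round(base + factor * total_weight, 1)

def total_score(rows):
    """rows: iterable of (weight, relevance) pairs taken from a score table."""
    return sum(weight * relevance for weight, relevance in rows)

# Non-zero-relevance rows for F2LLM-v2 as printed in the table (weight 0.0).
f2llm_rows = [(0.0, 10.0), (0.0, 8.0), (0.0, 5.0), (0.0, 8.0)]

threshold = passing_score(27.0)   # 27 keywords, each configured with weight 1.0
score = total_score(f2llm_rows)
passed = score >= threshold
print(threshold, score, passed)
```

Under any multiplicative rule, a weight of 0.0 zeroes the contribution, so the printed total of 0.0 follows from the printed weights rather than from the relevance values.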

!!! tip deepseek-chat TL;DR

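The paper presents F2LLM-v2, a family of multilingual embedding models that combines a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation to support more than 200 languages efficiently; the 14B model ranks first on 11 MTEB benchmarks.

Abstract (Chinese translation)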

我们推出F2LLM-v2系列——一套包含8种不同规模（从8000万到140亿参数）的通用多语言嵌入模型。该系列模型基于新构建的6000万公开高质量数据样本复合集进行训练，支持超过200种语言，尤其关注以往资源不足的中低资源语言。通过融合基于大语言模型（LLM）的两阶段嵌入训练流程与套娃学习、模型剪枝及知识蒸馏技术，我们实现了比以往基于LLM的嵌入模型更高效的架构，同时保持了具有竞争力的性能。大量评估证实，F2LLM-v2-14B在11项MTEB基准测试中位列榜首，而该系列中的较小规模模型也为资源受限场景设立了新的性能标杆。为促进开源嵌入模型研究，我们公开了全部模型、数据、代码及中间检查点。

Abstract

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.

Keywords: multilingual embedding models, LLM-based embedding, model pruning, knowledge distillation, MTEB benchmarks, resource-constrained applications, open-source release, high-quality data
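
The abstract mentions matryoshka learning without describing how F2LLM-v2 applies it. As a generic illustration only, the sketch below shows the usual way a matryoshka-style embedding is consumed: keep the first `dim` coordinates and re-normalize. The function name, the vector, and the dimensions are hypothetical and are not taken from the paper or its released code.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates of an embedding and L2-normalize them.

    Generic matryoshka-style usage; an illustration, not F2LLM-v2's code.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to normalize
    return [x / norm for x in head]

# Hypothetical 4-dimensional embedding, truncated to 2 dimensions.
full = [3.0, 4.0, 0.0, 12.0]
short = truncate_embedding(full, 2)
print(short)
```

Shorter prefixes trade some accuracy for lower storage and faster similarity search, which is one reason such embeddings suit resource-constrained applications.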

22. ❌ FinTradeBench: A Financial Reasoning Benchmark for LLMs

Authors: Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19225v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

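Scoring rationale: The paper's core is evaluating LLMs on financial reasoning tasks, so it is highly relevant to “Large Language Models” (10). It explicitly tests a “Retrieval-Augmented Generation” setting, so that keyword also receives 10. Financial decision reasoning is somewhat related to “Chain of Thought” and “System 2 Thinking” (5 each), though neither is a core method. Other keywords such as MoE, SFT, and RLHF are not mentioned in the abstract and receive 0.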

!!! tip deepseek-chat TL;DR

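The paper introduces FinTradeBench, a financial reasoning benchmark that evaluates how well LLMs combine company fundamentals with trading signals in financial decision-making; retrieval augmentation substantially improves reasoning over textual fundamentals but offers limited benefit for trading-signal reasoning.

Abstract (Chinese translation)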

现实世界中的金融决策是一个具有挑战性的问题，需要对异质信号进行推理，这些信号包括从监管文件中提取的公司基本面数据以及基于价格动态计算的交易信号。近年来，随着大语言模型（LLMs）的发展，金融分析师已开始将其应用于金融决策任务。然而，现有用于测试这些模型的金融问答基准主要关注公司资产负债表数据，很少评估模型对公司股票在市场中如何交易或其与基本面相互作用的推理能力。为结合两种方法的优势，我们提出了FinTradeBench——一个用于评估融合公司基本面与交易信号的金融推理能力的基准。FinTradeBench包含基于纳斯达克100指数成分股在十年历史窗口期内的1,400个问题。该基准分为三个推理类别：以基本面为核心的问题、以交易信号为核心的问题，以及需要跨信号推理的混合型问题。为确保大规模评估的可靠性，我们采用“校准-扩展”框架，该框架结合了专家种子问题、多模型响应生成、模型内自过滤、数值审计以及人类-LLM评判对齐机制。我们在零样本提示和检索增强设置下评估了14个大语言模型，并观察到明显的性能差距。检索显著提升了基于文本基本面的推理能力，但对交易信号推理的改善有限。这些发现凸显了当前大语言模型在数值与时间序列推理方面的根本性挑战，并为未来金融智能研究指明了方向。

Abstract

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with the advancement of Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals. To take advantage of the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

Keywords: Financial Reasoning, Large Language Models, Benchmark, Retrieval-Augmented Generation, Company Fundamentals, Trading Signals, Zero-shot Prompting, Performance Evaluation

23. ❌ NavTrust

Authors: Huaide Jiang, Yash Chaudhary, Yuping Wang, Zehao Wang, Raghav Sharma, Manan Mehta, Yang Zhou, Lichao Sun, Zhiwen Fan, Zhengzhong Tu, Jiachen Li Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19229v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

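Scoring rationale: NavTrust is a robustness benchmark for embodied navigation. It evaluates how navigation models degrade when input modalities (RGB, depth, instructions) are corrupted and studies mitigation strategies. The scoring keywords all concern specific aspects of large models and deep learning (technical principles, training methods, inference optimization, alignment, applications), whereas this paper centres on navigation benchmarking and robustness evaluation; it proposes no innovation in large-model techniques and does not apply large models to scientific fields such as biomedicine. Although the paper involves AI agents (navigation agents), it does not use LLM agents or related techniques, so it is scored as unrelated to every keyword.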

!!! tip deepseek-chat TL;DR

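The paper presents NavTrust, described as the first benchmark to systematically evaluate, in a unified framework, the robustness of embodied navigation models under RGB-Depth corruptions and instruction variations; it finds substantial performance degradation in existing methods and reports improved robustness from the mitigation strategies it evaluates.

Abstract (Chinese translation)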

具身导航主要分为两大类别：视觉语言导航（VLN），即智能体通过遵循自然语言指令进行导航；以及物体目标导航（OGN），即智能体导航至指定的目标物体。然而，现有工作主要在理想条件下评估模型性能，忽视了现实场景中可能出现的各类干扰。为填补这一空白，我们提出了NavTrust，这是一个统一的基准测试框架，能够在真实场景中系统性地对RGB、深度及指令等输入模态施加干扰，并评估其对导航性能的影响。据我们所知，NavTrust是首个在统一框架内，将具身导航智能体暴露于多样化RGB-深度干扰与指令变体下的基准测试。通过对七种前沿方法进行广泛评估，我们发现这些方法在现实干扰下均出现显著的性能下降，这揭示了其关键的鲁棒性缺陷，并为构建更可信的具身导航系统提供了路线图。此外，我们系统性地评估了四种不同的缓解策略，以提升模型对RGB-深度干扰及指令干扰的鲁棒性。我们的基础模型包括Uni-NaVid和ETPNav。我们将这些模型部署于真实移动机器人上，并观察到了其对干扰的鲁棒性提升。项目网站为：https://navtrust.github.io。

Abstract

There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To our best knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instructions corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.

Keywords: Embodied Navigation, Vision-Language Navigation, Object-Goal Navigation, Robustness Benchmark, RGB-Depth Corruption, Instruction Variation, Trustworthiness Evaluation, Mitigation Strategies

24. ❌ DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Authors: Tianjiao Yu, Xinzhuo Li, Muntasir Wahed, Jerry Xiong, Yifan Shen, Ying Shen, Ismini Lourentzou Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19216v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

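Scoring rationale: The paper is about 3D generation and proposes DreamPartGen, a semantically grounded, part-level text-to-3D generation framework. Although it is an AI application, all the given keywords centre on large language models and related techniques (training methods, inference optimization, alignment, compression, and so on) or on applications in specific scientific fields (such as bioinformatics). The paper involves no LLMs, MoE, quantization, inference acceleration, or alignment techniques, and is not applied to bioinformatics or cheminformatics, so every keyword has a relevance of 0.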

!!! tip deepseek-chat TL;DR

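To address the neglect of part semantics and functional structure in existing text-to-3D methods, the paper proposes DreamPartGen, which uses synchronized co-denoising for semantically grounded, part-aware 3D generation and achieves state-of-the-art geometric fidelity and text-shape alignment on multiple benchmarks.

Abstract (Chinese translation)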

将三维物体理解为由有意义部件组成的结构，是人类感知与推理的基础。然而，大多数文本到三维生成方法忽视了部件的语义与功能结构。尽管近期部分研究引入了部件感知的分解方法，但这些方法仍主要聚焦于几何形态，缺乏语义层面的基础，未能有效建模部件如何与文本描述对齐或部件间的相互关系。我们提出了DreamPartGen，这是一个基于语义的、部件感知的文本到三维生成框架。DreamPartGen引入了双重部件隐变量（Duplex Part Latents, DPLs），以联合建模每个部件的几何形状与外观；同时提出关系语义隐变量（Relational Semantic Latents, RSLs），用于捕捉从语言中推导出的部件间依赖关系。通过同步的协同去噪过程，框架强化了几何与语义的一致性，从而实现了连贯、可解释且与文本对齐的三维合成。在多个基准测试中，DreamPartGen在几何保真度与文本-形状对齐方面均达到了最先进的性能水平。

Abstract

Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part’s geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.

Keywords: 3D generation, part-aware, semantically grounded, text-to-3D, latent denoising, geometric fidelity, text-shape alignment, DreamPartGen

25. ❌ Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Authors: Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19220v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

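Scoring rationale: The paper's core is post-training for an LLM (Nemotron-Cascade 2, a 30B MoE model), specifically Cascade RL and multi-domain on-policy distillation, to improve reasoning and agentic capabilities. Highly relevant keywords (10): LLMs (the core object of study), MoE (the model architecture), Post-training/SFT (the key training stage), RLHF/DPO (Cascade RL is treated as a reinforcement-learning alignment method), Chain of Thought/System 2 Thinking (the paper emphasizes reasoning), and LLM Agents (the paper mentions agentic capabilities). Other keywords such as Scaling Laws, Pre-training, Self-Correction, Tool Use, and In-context Learning are indirectly related (model scale, training process, capability gains) and receive 5. The remaining keywords, such as RAG, Context Window Extension, and Quantization, are not mentioned in the abstract and receive 0.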

!!! tip deepseek-chat TL;DR

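The paper presents Nemotron-Cascade 2, a 30B-parameter MoE model with 3B activated parameters. Post-training with Cascade RL and multi-domain on-policy distillation strengthens its mathematical and coding reasoning and its agentic capabilities, and it reaches Gold Medal-level performance in the 2025 IMO, IOI, and ICPC World Finals.

Abstract (Chinese translation)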

我们推出Nemotron-Cascade 2，这是一个拥有300亿参数、30亿激活参数的开放混合专家（MoE）模型，具备同级别最佳的推理能力和强大的智能体（agentic）能力。尽管模型规模紧凑，其在数学与代码推理任务上的表现已接近前沿开放模型水平。它是继DeepSeekV3.2-Speciale-671B-A37B之后，第二个在2025年国际数学奥林匹克（IMO）、国际信息学奥林匹克（IOI）及ICPC全球总决赛中均达到金牌级表现的开放权重LLM，以仅二十分之一的参数量展现了极高的智能密度。相较于Nemotron-Cascade 1，其关键技术进展如下：在精心筛选的数据集上进行监督微调（SFT）后，我们大幅扩展了级联强化学习（Cascade RL），使其涵盖更广泛的推理与智能体领域。此外，我们在整个Cascade RL过程中引入了多领域同策略蒸馏，从各领域最强的中间教师模型进行蒸馏，从而高效修复基准测试上的性能退化，并持续获得显著的性能提升。我们同步发布了模型检查点与训练数据集合。

Abstract

We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoint and training data.

Keywords: Nemotron-Cascade 2, MoE model, post-training, Cascade RL, multi-domain on-policy distillation, reasoning capabilities, agentic capabilities, mathematical reasoning

26. ❌ $R$-equivalence on Cubic Surfaces I: Existing Cases with Non-Trivial Universal Equivalence

Authors: Dimitri Kanevsky, Julian Salazar, Matt Harvey Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19215v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

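Scoring rationale: The paper studies a pure-mathematics problem in algebraic geometry (R-equivalence on cubic surfaces) and is unrelated to deep learning or large-model techniques. Its only connection is that the end of the abstract mentions using generative AI models (AlphaEvolve and Gemini 3 Deep Think) to help prove lemmas, which is an application of AI in scientific research; it therefore has a weak link to the “AI for Science” keyword (5), and all other keywords receive 0.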

!!! tip deepseek-chat TL;DR

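The paper studies R-equivalence on smooth cubic surfaces over p-adic fields, proves that for 2-adic surfaces with all-Eckardt reductions R-equivalence is trivial or of exponent 2, and answers a long-standing question of Manin's.

Abstract (Chinese translation)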

设 $V$ 为 $p$ 进域 $k$ 上具有良好约化的光滑三次曲面。Swinnerton-Dyer (1981) 证明了 $R$ 等价在 $V(k)$ 上是平凡的，除非 $V$ 属于三种特殊类型之一，即他无法通过证明万有（容许）等价平凡来界定其 $R$ 等价的那些曲面。我们考察了目前已知具有非平凡万有等价的所有曲面 $V$。这些曲面不仅无法用 Swinnerton-Dyer 的方法处理，我们还注意到，如果它们同时具有非平凡的 $R$ 等价，就会与 Colliot-Thélène 和 Sansuc 关于几何有理曲面的万有挠子 $k$ 有理性的猜想相矛盾。

通过设计研究 $R$ 等价的新方法，我们证明了对于具有全 Eckardt 约化的 2 进曲面（第三种特殊类型，包含了所有已知的非平凡万有等价情形），$R$ 等价是平凡的或指数为 2。对于具体情形，我们确认了平凡性：在 $\mathbb{Q}_2(ζ_3)$ 上的对角三次曲面 $X^3+Y^3+Z^3+ζ_3 T^3=0$（这回答了 Manin 在 Cubic Forms (1972) 中提出的一个长期悬而未决的问题），以及具有指数为 2 的万有等价的三次曲面 (Kanevsky, 1982)。

这是源自与生成式人工智能模型（如 AlphaEvolve 和 Gemini 3 Deep Think）为期一年的互动所产生的一系列工作中的第一篇，后者证明了我们的许多引理。我们披露了它们在本论文中使用的时间线和性质，并在另一份配套报告（准备中）中描述了更广泛的 AI 辅助研究计划。

Abstract

Let $V$ be a smooth cubic surface over a $p$-adic field $k$ with good reduction. Swinnerton-Dyer (1981) proved that $R$-equivalence is trivial on $V(k)$ except perhaps if $V$ is one of three special types–those whose $R$-equivalence he could not bound by proving the universal (admissible) equivalence is trivial. We consider all surfaces $V$ currently known to have non-trivial universal equivalence. Beyond being intractable to Swinnerton-Dyer’s approach, we observe that if these surfaces also had non-trivial $R$-equivalence, they would contradict Colliot-Thélène and Sansuc’s conjecture regarding the $k$-rationality of universal torsors for geometrically rational surfaces. By devising new methods to study $R$-equivalence, we prove that for 2-adic surfaces with all-Eckardt reductions (the third special type, which contains every existing case of non-trivial universal equivalence), $R$-equivalence is trivial or of exponent 2. For the explicit cases, we confirm triviality: the diagonal cubic $X^3+Y^3+Z^3+ζ_3 T^3=0$ over $\mathbb{Q}_2(ζ_3)$–answering a long-standing question of Manin’s (Cubic Forms, 1972)–and the cubic with universal equivalence of exponent 2 (Kanevsky, 1982). This is the first in a series of works derived from a year of interactions with generative AI models such as AlphaEvolve and Gemini 3 Deep Think, with the latter proving many of our lemmas. We disclose the timeline and nature of their use towards this paper, and describe our broader AI-assisted research program in a companion report (in preparation).

Keywords: cubic surfaces, R-equivalence, p-adic fields, universal equivalence, algebraic geometry, diagonal cubic, Eckardt reduction, generative AI

27. ❌ OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Authors: Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19191v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

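Scoring rationale: The paper focuses on reward design for reinforcement learning (RL) with GUI agents, proposing the OS-Themis multi-agent critic framework and the OmniGUIRewardBench benchmark. All keywords relate directly to large models, deep learning principles, or scientific AI applications, whereas the paper's core is judged to be a conventional RL reward mechanism that does not involve large models, deep learning techniques, or scientific applications, so every keyword has a relevance of 0.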

!!! tip deepseek-chat TL;DR

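The work addresses the difficulty of making RL reward functions for GUI agents both scalable and accurate. It proposes the OS-Themis multi-agent critic framework; on AndroidWorld, OS-Themis yields a 10.3% improvement when supporting online RL training and a 6.9% gain when used for trajectory validation and filtering in a self-training loop.

Abstract (Chinese translation)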

强化学习（RL）具备提升图形用户界面（GUI）智能体在随机环境中鲁棒性的潜力，但其训练过程对奖励函数的质量高度敏感。现有的奖励方法难以同时实现可扩展性与性能表现。为此，我们提出了OS-Themis，一个可扩展且精确的多智能体评判框架。与单一评判者不同，OS-Themis将任务轨迹分解为可验证的里程碑，以隔离决策所需的关键证据，并采用审查机制在做出最终裁决前严格审核证据链。为便于评估，我们进一步引入了OmniGUIRewardBench（OGRBench），这是一个用于GUI结果奖励的全方位跨平台基准测试，所有被评估模型在OS-Themis框架下均取得了最佳性能。在AndroidWorld平台上进行的大量实验表明，OS-Themis用于支持在线强化学习训练时，性能提升了10.3%；用于自训练循环中的轨迹验证与过滤时，性能提升了6.9%，这凸显了其驱动智能体进化的潜力。

Abstract

Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.

Keywords: Reinforcement Learning, GUI agents, reward function, multi-agent critic framework, OS-Themis, OmniGUIRewardBench, AndroidWorld, self-training
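
The abstract describes the critic only at a high level: decompose a trajectory into verifiable milestones, then audit the evidence chain before the final verdict. The sketch below is a toy illustration of that control flow, not OS-Themis itself. The `Milestone` fields, the reviewer rule, and the binary reward are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Milestone:
    description: str
    verified: bool   # hypothetical: did a checker confirm this milestone?
    evidence: str    # hypothetical: pointer to the supporting step or screenshot

def review(milestones):
    """Hypothetical audit: every verified milestone must cite some evidence."""
    return all(m.evidence.strip() for m in milestones if m.verified)

def outcome_reward(milestones):
    """Binary outcome reward: 1.0 only if the audit passes and all milestones hold."""
    if not milestones or not review(milestones):
        return 0.0
    return 1.0 if all(m.verified for m in milestones) else 0.0

# Hypothetical trajectory for a "set an alarm" task.
trajectory = [
    Milestone("open clock app", True, "step 2"),
    Milestone("create alarm at 07:00", True, "step 5"),
]
print(outcome_reward(trajectory))
```

Separating milestone verification from the audit means a claimed success with no supporting evidence is rejected instead of rewarded, which is the failure mode a single judge is more likely to miss.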

28. ❌ Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Authors: Zou Qiang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19182v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

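Scoring rationale: The paper proposes the Box Maze framework, which aims to make LLM reasoning more reliable and to reduce hallucination; its core is process control over LLM reasoning. Highly relevant keywords: LLMs (the core object of study), Chain of Thought / System 2 Thinking (the paper focuses on the reasoning process), and Hallucination Mitigation (the main goal). Moderately relevant keywords: RLHF (mentioned as the comparison baseline), Self-Correction (related to process control), and Mechanistic Interpretability (the framework provides structured explanations). Other keywords such as MoE, SLMs, and Scaling Laws have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

The paper addresses the tendency of large language models to hallucinate and reason unreliably under adversarial prompting. It proposes Box Maze, a process-control architecture that decomposes reasoning into three layers (memory grounding, structured inference, and boundary enforcement). In a preliminary simulation-based evaluation, the boundary failure rate drops from about 40% to below 1%.

Abstract (Chinese translation)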

大语言模型（LLM）展现出强大的生成能力，但在对抗性提示下仍易产生幻觉和不可靠的推理。现有的安全方法——例如基于人类反馈的强化学习（RLHF）和输出过滤——主要作用于行为层面，可能缺乏确保推理过程完整性的显式架构机制。

本文提出Box Maze框架，这是一种概念性的过程控制架构，它将大语言模型的推理分解为三个显式层次：记忆锚定、结构化推理和边界约束。我们引入了基于模拟的初步评估，该评估涉及多个异构大语言模型系统（DeepSeek-V3、Doubao、Qwen）中的渐进式边界侵蚀场景。在n=50个对抗性场景中的结果表明，显式的认知控制层可以提高边界维护的一致性，在对抗条件下，架构约束能将边界失效率从约40%（基线RLHF）降低至1%以下。

尽管目前的验证是基于模拟的，但这些初步结果表明，过程层面的控制可能为提高大语言模型推理的可靠性提供一个有前景的方向。

Abstract

Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches – such as reinforcement learning from human feedback (RLHF) and output filtering – primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity. This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions. While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.

Keywords: Large language models, LLM reasoning, hallucination, process-control architecture, reliability, adversarial prompting, boundary enforcement, structured inference

29. ❌ SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Authors: Edward Lin, Sahil Modi, Siva Kumar Sastry Hari, Qijing Huang, Zhifan Ye, Nestor Qin, Fengzhe Zhou, Yuan Zhang, Jingquan Wang, Sana Damani, Dheeraj Peri, Ouye Xie, Aditya Kane, Moshe Maor, Michael Behar, Triston Cao, Rishabh Mehta, Vartika Singh, Vikram Sharma Mailthody, Terry Chen, Zihao Ye, Hanfeng Chen, Tianqi Chen, Vinod Grover, Wei Chen, Wei Liu, Eric Chung, Luis Ceze, Roger Bringmann, Cyril Zeller, Michael Lightstone, Christos Kozyrakis, Humphrey Shi Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19173v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

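Scoring rationale: The paper studies a benchmarking methodology for GPU kernel optimization and has no direct connection to most of the large-model keywords. Only three keywords are indirectly related: (1) 'Large Language Models' (5 points): the benchmark includes kernels from language models; (2) 'LLM Agents' (5 points): the paper mentions agentic AI systems that generate and optimize GPU kernels; (3) 'Speculative Decoding OR Inference Acceleration' (5 points): the paper concerns GPU kernel performance optimization, which relates to inference acceleration. No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper presents SOL-ExecBench, a benchmark that measures how close optimized GPU kernels come to hardware limits rather than to software baselines, addressing a limitation of existing benchmarks.

Abstract (Chinese translation)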

随着自主人工智能系统在生成和优化GPU内核方面的能力日益增强，其发展却受限于现有基准测试的导向——这些基准主要奖励相对于软件基线的加速比，而非追求接近硬件效率极限的执行性能。本文提出SOL-ExecBench基准，该基准包含从124个生产级及新兴人工智能模型中提取的235个CUDA内核优化问题，涵盖语言、扩散、视觉、音频、视频及混合架构领域，并针对NVIDIA Blackwell GPU设计。基准覆盖BF16、FP8和NVFP4数据格式下的前向与反向计算负载，包含那些预期最佳性能需依赖Blackwell架构特有功能的内核。与以往主要基于软件实现评估内核的基准不同，SOL-ExecBench通过我们开发的SOLAR流程（一种基于硬件特性推导光速极限的理论边界计算框架）获得分析性光速极限边界，并以此为固定目标衡量硬件效率优化水平。我们提出SOL分数，用于量化候选内核在评分基线（由发布定义）与硬件光速极限边界之间所缩小的性能差距。为支持对自主优化系统的稳健评估，我们还提供了沙盒化测试框架，具备GPU时钟锁定、L2缓存清理、隔离子进程执行及基于静态分析的抗奖励篡改策略检测功能。SOL-ExecBench将GPU内核基准测试的范式从超越可变软件基线，转变为衡量向硬件光速极限逼近的剩余差距。

Abstract

As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs. The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities. Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization. We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes. To support robust evaluation of agentic optimizers, we additionally provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis based checks against common reward-hacking strategies. SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.

Keywords: GPU kernel optimization, benchmarking, Speed-of-Light bounds, hardware efficiency, agentic AI systems, CUDA kernels, NVIDIA Blackwell GPUs, performance evaluation
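
The abstract describes the SOL Score only in words: the share of the gap between the scoring baseline and the hardware SOL bound that a candidate kernel closes. The sketch below is one plausible reading, assuming the score is linear in measured runtime. The function name, the millisecond units, and the exact formula are illustrative assumptions, not the paper's definition.

```python
def sol_score(t_baseline_ms: float, t_candidate_ms: float, t_sol_ms: float) -> float:
    """Fraction of the baseline-to-SOL runtime gap closed by a candidate kernel.

    Assumed form (not taken from the paper): 0.0 means no better than the
    scoring baseline, 1.0 means the candidate runs at the SOL bound.
    """
    if not (0.0 < t_sol_ms < t_baseline_ms):
        raise ValueError("expected 0 < t_sol_ms < t_baseline_ms")
    gap = t_baseline_ms - t_sol_ms           # total room for improvement
    closed = t_baseline_ms - t_candidate_ms  # improvement actually achieved
    return closed / gap


# Hypothetical timings: baseline 10 ms, candidate 6 ms, SOL bound 2 ms.
print(sol_score(10.0, 6.0, 2.0))
```

Under this reading, a kernel that halves the distance to the bound scores 0.5, and a kernel slower than the baseline scores below zero.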

30. ❌ ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Authors: Zhan Jin, Yu Luo, Yizhou Zhang, Ziyang Cui, Yuqing Wei, Xianchao Liu, Xueying Zeng, Qing Zhang Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19169v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

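Scoring rationale: The paper's core is medical image analysis (coronary angiography). It applies DPO (Direct Preference Optimization) for preference alignment and uses RL for diagnostic reasoning, making it a novel application of large models in biomedicine. It is therefore highly relevant to 'DPO' and 'AI for Science' (10 points each). It uses a 'vision-language foundation model', which gives some relevance to 'Large Language Models' and 'Foundation Models' (5 points). It involves fine-tuning, which gives some relevance to 'Post-training' and 'Pre-training' (5 points each). 'Alignment' is mentioned explicitly in the abstract, so 'Instruction Tuning OR Alignment' is relevant (8 points). Other keywords such as MoE, SLMs, RAG, and CoT are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper targets fragmented vessel topology in coronary vessel segmentation. It proposes the ARIADNE framework, which combines DPO preference alignment with RL-based reasoning for topologically coherent stenosis detection. On clinical data it reaches a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines.

Abstract (Chinese translation)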

传统逐像素损失函数无法在冠状动脉血管分割中强制拓扑约束，导致尽管像素级精度较高却产生断裂的血管树结构。我们提出ARIADNE框架，这是一个两阶段系统，将偏好对齐的感知模块与基于强化学习的诊断推理模块相结合，以实现拓扑连贯的狭窄检测。感知模块采用直接偏好优化（DPO），以贝蒂数（Betti number）约束作为偏好信号对Sa2VA视觉-语言基础模型进行微调，使策略朝向几何结构完整的血管结构对齐，而非仅优化逐像素重叠度量。推理模块将狭窄定位建模为马尔可夫决策过程，并引入显式拒绝机制，自主延迟处理诸如血管分叉和交叉点等模糊解剖候选区域，从而将目标从覆盖率最大化转向可靠性优化。在1400例临床血管造影数据上，ARIADNE实现了0.838的中心线戴斯系数（centerline Dice），较几何基线方法减少41%的假阳性。在多中心基准数据集ARCADE和XCAD上的外部验证证实了该框架在不同采集协议间的泛化能力。本研究首次将DPO应用于医学影像的拓扑对齐任务，证明基于结构约束的偏好学习能够减少拓扑违例，同时在介入心脏病学工作流程中保持诊断敏感性。

Abstract

Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves state-of-the-art centerline Dice of 0.838, reduces false positives by 41% compared to geometric baselines. External validation on multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.

Keywords: Coronary angiography analysis, Topological constraints, Direct Preference Optimization (DPO), Reinforcement learning, Vessel segmentation, Stenosis detection, Medical imaging, Foundation model
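
The perception module is described as DPO fine-tuning with Betti-number constraints as the preference signal. The sketch below shows the standard per-pair DPO loss, together with one hypothetical way to turn Betti-0 counts (the number of connected components) into a chosen/rejected decision. The `pick_preferred` rule, the target of a single connected vessel tree, and `beta=0.1` are assumptions for illustration; the abstract does not say how the paper builds its preference pairs.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is a sequence log-probability under the trained policy
    (logp_*) or under the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


def pick_preferred(betti0_a: int, betti0_b: int, target: int = 1) -> str:
    """Hypothetical preference rule: prefer the segmentation whose number of
    connected components (Betti-0) is closer to the target."""
    return "a" if abs(betti0_a - target) <= abs(betti0_b - target) else "b"
```

When the policy and the reference agree on both candidates the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen candidate relative to the rejected one.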

Authors: Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19166v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

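Scoring rationale: The paper proposes the MAPG framework, which uses a multi-agent system to decompose language queries and calls a vision-language model to reason over semantic and metric constraints, so it is highly relevant to 'LLM Agents/Autonomous Agents' and 'Multi-agent Systems/Agent Coordination' (10 points each). Decomposed reasoning over complex queries gives some relevance to 'Chain of Thought/CoT Reasoning' and 'System 2 Thinking/Slow Thinking' (5 points each). The paper uses vision-language models (VLMs), which fall under large models, giving some relevance to 'Large Language Models/LLMs' (5 points). Other keywords such as MoE, quantization, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses complex metric-semantic queries in vision-language navigation. It proposes MAPG, a multi-agent probabilistic grounding framework that decomposes a query and composes the grounded outputs probabilistically. It reports consistent improvements over strong baselines on HM-EQA and introduces a new benchmark, MAPG-Bench.

Abstract (Chinese translation)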

与人类协作的机器人必须将自然语言目标转化为可执行的、物理层面可落地的决策。例如，执行“走到冰箱右侧两米处”这类指令时，需要在三维场景中对语义指代、空间关系和度量约束进行实体化定位。尽管当前的视觉语言模型展现出强大的语义定位能力，但它们并非专门为在物理定义的空间中进行度量约束推理而设计。本工作中，我们通过实验证明，基于最先进视觉语言模型的定位方法在处理复杂的度量-语义混合语言查询时存在困难。为应对这一局限，我们提出了MAPG（多智能体概率定位框架），该智能体框架将语言查询分解为结构化子组件，并通过查询视觉语言模型对每个组件进行实体化定位。随后，MAPG通过概率化组合这些定位输出，在三维空间中生成度量一致且可执行的决策。我们在HM-EQA基准测试上评估MAPG，结果显示其性能较现有强基线模型有持续提升。此外，我们提出了专门用于评估度量-语义目标定位能力的新基准MAPG-Bench，以弥补现有语言定位评估体系的不足。我们还通过真实机器人演示表明，在具备结构化场景表征的条件下，MAPG能够有效迁移到仿真环境之外的实际场景中。

Abstract

Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as “go two meters to the right of the fridge” requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.

Keywords: Vision-Language Navigation, Multi-Agent Systems, Probabilistic Grounding, Metric-Semantic Queries, VLMs, Agentic Framework, 3D Scene Understanding, Robot Collaboration
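
The abstract says that MAPG grounds each query component separately and then composes the results probabilistically. The sketch below illustrates that idea for "go two meters to the right of the fridge" on a 2D floor grid. It is not the paper's algorithm: the mixture-of-Gaussians composition, the `sigma` value, the convention that "right" means +x, and the hard-coded anchor hypotheses (which would come from a VLM in practice) are all assumptions.

```python
import math


def gaussian_2d(x, y, mx, my, sigma):
    """Unnormalized isotropic Gaussian centred at (mx, my)."""
    d2 = (x - mx) ** 2 + (y - my) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


def ground_goal(anchor_candidates, offset, cells, sigma=0.5):
    """Compose anchor hypotheses with a metric offset into a goal distribution.

    anchor_candidates: list of ((x, y), prob) hypotheses for the anchor object.
    offset: (dx, dy) displacement in metres implied by the relation and distance.
    cells: list of candidate goal positions (x, y).
    Returns one probability per cell:
        p(goal) proportional to sum_a p(a) * N(goal; a + offset, sigma^2).
    """
    scores = []
    for gx, gy in cells:
        s = sum(p * gaussian_2d(gx, gy, ax + offset[0], ay + offset[1], sigma)
                for (ax, ay), p in anchor_candidates)
        scores.append(s)
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical scene: two fridge hypotheses, 8 m x 6 m grid with 1 m cells.
cells = [(float(x), float(y)) for x in range(8) for y in range(6)]
probs = ground_goal([((0.0, 0.0), 0.9), ((5.0, 5.0), 0.1)], (2.0, 0.0), cells)
print(cells[probs.index(max(probs))])
```

Keeping a full distribution rather than a single point lets a later step combine further constraints, such as reachability, before committing to a goal.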

32. ❌ cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization

Authors: Yuyang Liu Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19163v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

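Scoring rationale: The paper studies a GPU-accelerated metaheuristic framework for combinatorial optimization; its core is the CUDA architecture, the encoding abstraction, and adaptive operator selection. It is unrelated to most large-model keywords (MoE, RLHF, PEFT, etc.). Only two keywords are relevant: (1) 'Large Language Models': the paper mentions an LLM-based modeling assistant that converts natural-language descriptions into executable code; this is an application-level feature rather than core research content, so 5 points. (2) 'AI for Science': combinatorial optimization has applications in logistics, scheduling, and similar science and engineering fields, which falls under AI for Science in a broad sense, so 5 points. No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper presents cuGenOpt, a GPU-accelerated general-purpose metaheuristic framework for combinatorial optimization. Using a unified encoding abstraction and adaptive operator selection, it outperforms general MIP solvers by orders of magnitude across three GPU architectures and solves twelve problem types to optimality.

Abstract (Chinese translation)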

组合优化问题广泛存在于物流、调度与资源分配领域，但现有方法在通用性、性能与易用性之间面临根本性的权衡。本文提出cuGenOpt，一个GPU加速的通用元启发式框架，能够同时兼顾上述三个维度。

在引擎层面，cuGenOpt采用“一个线程块演化一个解”的CUDA架构，具备统一的编码抽象（排列、二进制、整数编码）、双层自适应算子选择机制以及硬件感知的资源管理。在可扩展性层面，用户自定义算子注册接口允许领域专家注入面向特定问题的CUDA搜索算子。在易用性层面，即时编译流水线将框架以纯Python API形式呈现，并基于大语言模型（LLM）的建模助手可将自然语言问题描述转换为可执行的求解器代码。

在三种GPU架构（T4、V100、A800）上对五个主题测试集的实验表明：cuGenOpt性能超越通用混合整数规划（MIP）求解器数个数量级；在规模达n=150的算例上，其求解质量可与专用求解器竞争；在30秒内对TSP-442问题取得4.73%的优化间隙。该框架成功求解了涵盖五种编码变体的十二类问题，均达到最优解。框架级优化累积将pcb442问题的优化间隙从36%降低至4.73%，并将带时间窗的车辆路径问题（VRPTW）的求解吞吐量提升了75-81%。

代码地址：https://github.com/L-yang-yang/cugenopt

Abstract

Combinatorial optimization problems arise in logistics, scheduling, and resource allocation, yet existing approaches face a fundamental trade-off among generality, performance, and usability. We present cuGenOpt, a GPU-accelerated general-purpose metaheuristic framework that addresses all three dimensions simultaneously. At the engine level, cuGenOpt adopts a “one block evolves one solution” CUDA architecture with a unified encoding abstraction (permutation, binary, integer), a two-level adaptive operator selection mechanism, and hardware-aware resource management. At the extensibility level, a user-defined operator registration interface allows domain experts to inject problem-specific CUDA search operators. At the usability level, a JIT compilation pipeline exposes the framework as a pure-Python API, and an LLM-based modeling assistant converts natural-language problem descriptions into executable solver code. Experiments across five thematic suites on three GPU architectures (T4, V100, A800) show that cuGenOpt outperforms general MIP solvers by orders of magnitude, achieves competitive quality against specialized solvers on instances up to n=150, and attains 4.73% gap on TSP-442 within 30s. Twelve problem types spanning five encoding variants are solved to optimality. Framework-level optimizations cumulatively reduce pcb442 gap from 36% to 4.73% and boost VRPTW throughput by 75-81%. Code: https://github.com/L-yang-yang/cugenopt

Keywords: GPU-accelerated, metaheuristic framework, combinatorial optimization, CUDA architecture, adaptive operator selection, JIT compilation, LLM-based modeling assistant, performance optimization
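
The abstract reports results as optimality gaps (for example 4.73% on TSP-442). For a minimization problem the gap is usually computed as the relative excess over a reference cost. The sketch below assumes that convention; the function name and the sample costs are hypothetical, and the abstract does not state which reference value the paper uses.

```python
def optimality_gap(found_cost: float, reference_cost: float) -> float:
    """Relative gap of a found solution over a reference (best-known) cost.

    Assumes a minimization problem with a strictly positive reference cost.
    """
    if reference_cost <= 0:
        raise ValueError("reference_cost must be positive")
    return (found_cost - reference_cost) / reference_cost


# Hypothetical tour lengths: the found tour is 4.73% longer than the reference.
print(f"{optimality_gap(1047.3, 1000.0):.2%}")
```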

33. ❌ VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Authors: Chonghan Liu, Yimin Du, Qi An, Xin He, Cunqi Zhai, Fei Tan, Weijia Lin, Xiaochun Gong, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19152v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

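Scoring rationale: The paper's core is improving LLM performance on low-resource languages; it proposes VEPO, a reinforcement-learning-based policy optimization method. Highly relevant keywords: 'Large Language Models' (the paper explicitly studies LLMs) and 'RLHF' (VEPO is in essence an RL-based alignment method). Moderately relevant: 'Alignment' (VEPO involves policy alignment) and 'Scaling Laws AND Data Quality' (the paper mentions training data imbalance). Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the weak performance of large language models on low-resource languages. It proposes VEPO (Variable Entropy Policy Optimization), a reinforcement-learning method that, according to the abstract, yields substantial improvements in tokenization efficiency and translation quality.

Abstract (Chinese translation)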

大语言模型在低资源语言上常表现出次优性能，这主要源于低效的子词切分和系统性的训练数据不平衡。本文提出可变熵策略优化（Variable Entropy Policy Optimization, VEPO），该方法利用可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards），将确定性的结构约束纳入策略对齐过程。该框架确保了预设的序列长度、鲁棒的格式一致性以及严格的语言规范性，所有这些均在训练过程中强制执行。我们方法的核心是一个可变熵机制，它通过调节探索-利用的流形，使模型能够动态校准字面忠实度与语义自然性之间的平衡。通过将熵调节优势估计（entropy-tempered advantage estimation）与非对称裁剪（asymmetric clipping）相结合，VEPO在缓解策略崩溃的同时保持了强大的探索能力。在涵盖90个语言方向的FLORES-200、COMET-22和chrF基准上的实证评估表明，VEPO在切分效率和翻译质量方面均带来显著提升，为代表性不足的语言弥合了性能差距。

Abstract

Large language models frequently exhibit suboptimal performance on low resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration exploitation manifold. By integrating entropy tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200, COMET-22, chrF directions demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.

Keywords: Large Language Models, Low-Resource Languages, Reinforcement Learning, Policy Optimization, Variable Entropy, Tokenization Efficiency, Translation Quality, VEPO
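
The abstract names asymmetric clipping as one of VEPO's two ingredients. The sketch below shows what asymmetric clipping means in a PPO-style surrogate: the probability ratio is clipped with different lower and upper margins. The margin values, and the choice of a wider upper margin, are assumptions. The entropy-tempered advantage estimation is not shown because the abstract does not specify its form.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.3) -> float:
    """PPO-style per-token objective with separate lower and upper clip margins.

    ratio: pi_new(token) / pi_old(token).
    advantage: advantage estimate for that token.
    eps_low, eps_high: assumed margins; eps_high > eps_low leaves more room to
    raise the probability of tokens with positive advantage.
    """
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    # Pessimistic bound, as in standard PPO: take the smaller of the two terms.
    return min(ratio * advantage, clipped * advantage)


print(clipped_surrogate(1.5, 1.0))   # capped by the upper margin
print(clipped_surrogate(0.5, -1.0))  # capped by the lower margin
```

With symmetric clipping at 0.2, a ratio of 1.25 on a positive-advantage token would already be clipped; with the wider upper margin assumed here it still contributes its full gradient.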

34. ❌ D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding

Authors: Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19146v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

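Scoring rationale: The paper studies a decoding method (D5P4) for discrete diffusion models, focusing on diversity control in text generation; it belongs to generative modeling and text generation. All scoring keywords target large language models and related techniques (MoE, RLHF, RAG, etc.), whereas this paper studies decoding for discrete diffusion models (a type of generative model) and does not involve large-model techniques, applications, or innovations. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the limited diversity control of decoding methods for discrete diffusion text generation. It proposes D5P4, which selects among parallel candidates via a Determinantal Point Process and improves diversity while maintaining competitive generation quality.

Abstract (Chinese translation)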

离散扩散模型是文本生成中自回归方法的有前景的替代方案，但其解码方法仍研究不足。自回归模型的标准解码方法（如束搜索）无法直接应用于迭代去噪过程，而现有的扩散解码技术对批次内多样性的控制能力有限。为弥补这一差距，我们引入了一种适用于离散扩散的广义束搜索框架，该框架可并行生成候选序列，并支持模块化的束选择目标。作为一种以多样性为核心的实例化方法，我们提出了D5P4，它将选择步骤构建为基于行列式点过程的极大后验概率推断。通过利用可扩展的贪心求解器，D5P4保持了多GPU兼容性，并能够以近乎零的计算开销，在模型概率与目标多样性之间实现显式权衡。在自由文本生成和问答任务上的实验表明，D5P4在保持竞争力的生成质量的同时，提升了相较于强基线模型的多样性。

Abstract

Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard decoding methods for autoregressive models, such as beam search, do not directly apply to iterative denoising, and existing diffusion decoding techniques provide limited control over in-batch diversity. To bridge this gap, we introduce a generalized beam-search framework for discrete diffusion that generates candidates in parallel and supports modular beam-selection objectives. As a diversity-focused instantiation, we propose D5P4, which formulates the selection step as MAP inference over a Determinantal Point Process. Leveraging a scalable greedy solver, D5P4 maintains multi-GPU compatibility and enables an explicit trade-off between model probability and target diversity with near-zero compute overhead. Experiments on free-form generation and question answering demonstrate that D5P4 improves diversity over strong baselines while maintaining competitive generation quality.

Keywords: discrete diffusion models, text generation, decoding methods, diversity control, beam search, Determinantal Point Process, parallel generation, generation quality
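
The abstract frames beam selection as MAP inference over a Determinantal Point Process, solved greedily. The sketch below shows the textbook version of that idea: build the kernel L[i][j] = q_i * S[i][j] * q_j from per-candidate quality scores and pairwise similarities, then repeatedly add the candidate that most increases the determinant of the selected submatrix. This naive loop recomputes determinants and is not the scalable solver the paper refers to; the kernel construction and all names are assumptions for illustration.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d  # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def greedy_dpp_map(quality, similarity, k):
    """Greedily pick up to k indices that maximize det(L restricted to the picks)."""
    n = len(quality)
    L = [[quality[i] * similarity[i][j] * quality[j] for j in range(n)]
         for i in range(n)]
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[L[r][c] for c in idx] for r in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no candidate adds positive volume
            break
        selected.append(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is weaker but different.
sim = [[1.0, 0.99, 0.1], [0.99, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(greedy_dpp_map([1.0, 0.95, 0.6], sim, k=2))
```

Plain top-k by quality would keep the two near-duplicates, whereas the determinant penalizes similar pairs, so the second slot goes to the distinct candidate.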

35. ❌ UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Authors: Zikang Ding, Junchi Yao, Junhao Li, Yi Zhang, Wenbo Jiang, Hongbo Liu, Lijie Hu Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19144v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

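Scoring rationale: UGID focuses on mitigating social bias in large language models (LLMs) and belongs to LLM alignment and interpretability research. Core relevant keywords: (1) 'Large Language Models' (10 points): the paper explicitly studies bias in LLMs; (2) 'Hallucination Mitigation' (8 points): bias mitigation is related to improving factuality and truthfulness; (3) 'Mechanistic Interpretability' (8 points): the paper models the Transformer as a computational graph to analyze internal representations, which falls under explainable AI; (4) 'Post-training' and 'Instruction Tuning' (5 points each): the paper involves model alignment and fine-tuning, though not as its core method. Other keywords such as MoE, quantization, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes UGID, a framework that uses a unified graph-isomorphism formulation to mitigate social bias in large language models at the level of internal representations. Experiments show that it reduces bias while preserving model safety and utility.

Abstract (Chinese translation)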

大型语言模型（LLMs）表现出显著的社会偏见。基于输出层面或数据优化的去偏方法无法完全解决这些偏见，许多先前研究已表明偏见嵌入在模型的内部表示中。我们提出统一图同构去偏框架（Unified Graph Isomorphism for Debiasing large language models，简称UGID），这是一种针对大型语言模型的内部表示层面去偏框架，它将Transformer建模为一个结构化计算图，其中注意力机制定义了图的路由边，隐藏状态定义了图节点。具体而言，去偏任务被形式化为强制计算图结构在反事实输入间保持不变性，仅允许在敏感属性上存在差异。UGID联合约束偏见敏感区域中的注意力路由与隐藏表示，有效防止偏见在架构组件间迁移。为实现有效的模型行为对齐且不损害通用能力，我们引入了对敏感逻辑值的对数空间约束以及基于选择性锚点的目标函数，以保留定义性语义。在大型语言模型上的大量实验表明，UGID在分布内与分布外场景下均能有效降低偏见，显著减少内部结构差异，并保持模型的安全性与实用性。

Abstract

Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization–based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases are embedded in internal representations. We propose Unified Graph Isomorphism for Debiasing large language models (UGID), an internal-representation–level debiasing framework for large language models that models the Transformer as a structured computational graph, where attention mechanisms define the routing edges of the graph and hidden states define the graph nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. UGID jointly constrains attention routing and hidden representations in bias-sensitive regions, effectively preventing bias migration across architectural components. To achieve effective behavioral alignment without degrading general capabilities, we introduce a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Extensive experiments on large language models demonstrate that UGID effectively reduces bias under both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.

Keywords: Large Language Models, Debiasing, Graph Isomorphism, Transformer, Attention Mechanisms, Internal Representations, Social Biases, Model Alignment

36. ❌ Implicit Patterns in LLM-Based Binary Analysis

Authors: Qiang Li, XiangRui Zhang, Haining Wang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19138v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

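Scoring rationale: The paper studies the multi-step reasoning behavior of LLMs in binary vulnerability analysis and is highly relevant (10 points each) to 'Large Language Models', 'Chain of Thought', 'System 2 Thinking', 'LLM Agents', and 'Mechanistic Interpretability', because these keywords map directly onto the LLM reasoning, multi-step analysis, in-depth thinking, and interpretability that the paper examines. 'Self-Correction' receives 5 points because the paper's discussion of backtracking and revision involves elements of self-correction. Other keywords such as MoE, SFT, and RAG are unrelated to the paper (0 points).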
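The score table for this paper shows a total of 0.0 even though several keywords have a relevance of 10/10. The following is a minimal sketch of how that can arise. It uses the pass-mark formula from the report configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) and the configured maximum of 15 points per keyword. The per-keyword formula itself is not stated in the report, so `keyword_score` is an assumption, and all function names are hypothetical.

```python
# Hypothetical scoring sketch. Only the pass-mark formula and the per-keyword
# maximum come from the report configuration; keyword_score() is an assumption.
MAX_PER_KEYWORD = 15.0   # configured maximum score per keyword
BASE, SLOPE = 5.0, 0.8   # pass mark = 5.0 + 0.8 * total weight


def pass_mark(total_weight):
    """Pass mark as defined in the report configuration."""
    return BASE + SLOPE * total_weight


def keyword_score(weight, relevance):
    """Assumed formula: scale the 0-10 relevance to 0-15, then apply the weight."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def total_score(rows):
    """Sum the per-keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(w, r) for w, r in rows)


# Non-zero relevance rows of this paper, with the weights shown in its table.
rows_as_reported = [
    (0.0, 10.0), (0.0, 10.0), (0.0, 10.0),
    (0.0, 5.0), (0.0, 10.0), (0.0, 10.0),
]

print(total_score(rows_as_reported))   # every weight is 0.0, so the total is 0.0
print(round(pass_mark(27.0), 1))       # 26.6
```

Under this assumption, a weight of 0.0 in the table zeroes out every keyword regardless of its relevance, which is consistent with the 0.0 / 26.6 result reported for this paper.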

!!! tip deepseek-chat TL;DR

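The paper studies the implicit patterns that emerge from multi-step reasoning in LLM-based binary vulnerability analysis. Through a large-scale trace analysis it identifies four dominant patterns, providing a systematic characterization of LLM reasoning behavior.

Abstract translation (Chinese)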

基于大语言模型（LLM）的智能体正日益以迭代式、多轮次的方式执行二进制漏洞分析，其中模型充当核心决策者。然而，由于有限的上下文窗口和隐式的令牌级行为，此类系统如何在数百个推理步骤中组织探索过程仍鲜为人知。我们首次开展了大规模、追踪层级的研究，表明多轮次LLM推理会产生结构化的、令牌级的隐式模式。通过分析521个二进制文件中的99,563个推理步骤，我们识别出四种主导模式：早期剪枝、路径依赖锁定、定向回溯以及知识引导的优先级排序，这些模式均从推理轨迹中隐式浮现。这些令牌级隐式模式构成了LLM推理的一种抽象：探索并非通过显式的控制流或预定义启发式规则来组织，而是通过调节路径选择、路径确认与修正的隐式决策来实现。我们的分析表明，这些模式形成了一个稳定、结构化的系统，具有不同的时序角色和可量化的特征。本研究首次系统性地刻画了LLM驱动的二进制分析行为，并为构建更可靠的分析系统奠定了基础。

Abstract

Binary vulnerability analysis is increasingly performed by LLM-based agents in an iterative, multi-pass manner, with the model as the core decision-maker. However, how such systems organize exploration over hundreds of reasoning steps remains poorly understood, due to limited context windows and implicit token-level behaviors. We present the first large-scale, trace-level study showing that multi-pass LLM reasoning gives rise to structured, token-level implicit patterns. Analyzing 521 binaries with 99,563 reasoning steps, we identify four dominant patterns: early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization that emerge implicitly from reasoning traces. These token-level implicit patterns serve as an abstraction of LLM reasoning: instead of explicit control-flow or predefined heuristics, exploration is organized through implicit decisions regulating path selection, commitment, and revision. Our analysis shows these patterns form a stable, structured system with distinct temporal roles and measurable characteristics. Our results provide the first systematic characterization of LLM-driven binary analysis and a foundation for more reliable analysis systems.

Keywords: LLM-based binary analysis, multi-pass reasoning, implicit patterns, reasoning traces, token-level behaviors, exploration organization, vulnerability analysis, structured system

37. ❌ Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control

Authors: Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19136v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

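Scoring rationale: The paper focuses on financial time-series forecasting using autoencoders, Transformers, and reinforcement learning, but it does not involve large language models (LLMs), innovations in the underlying principles of deep learning, or AI for Science applications. All keywords concern large models, deep learning techniques, or scientific AI applications, whereas this paper is a domain-specific application of conventional machine learning and deep learning, so it is unrelated to every given keyword.

!!! tip deepseek-chat TL;DR

The paper proposes an adaptive, regime-aware stock price prediction framework built on autoencoder-gated dual node Transformers with reinforcement learning control. On S&P 500 stock data it achieves a lower prediction error than the baseline (0.59% vs. 0.80% MAPE) and a directional accuracy of 72%.

Abstract translation (Chinese)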

股票市场表现出依赖区制的行为特征，在稳定条件下优化的预测模型往往在波动时期失效。现有方法通常对所有市场状态进行统一处理，或需要人工标注市场区制，这种方式成本高昂且会随市场动态演变迅速过时。本文提出一种自适应预测框架，能够自适应识别偏离正常市场条件的状态，并通过专用预测路径路由数据。该架构包含三个组成部分：(1) 在正常市场条件下训练的自编码器，通过重构误差识别异常区制；(2) 分别针对稳定市场和事件驱动市场条件设计的双节点Transformer网络；(3) 基于Soft Actor-Critic强化学习控制器，可根据预测性能反馈自适应调整区制检测阈值和路径混合权重。强化学习组件使系统能够学习自适应区制边界，将异常定义为标准预测方法失效的市场状态。在1982年至2025年期间20只标普500股票上的实验表明：所提框架在无强化控制器时单日预测平均绝对百分比误差(MAPE)为0.68%，完整自适应系统达到0.59% MAPE，而基准集成节点Transformer模型为0.80%。完整框架的方向预测准确率达到72%。该系统在高波动时期保持稳健性能，当基准模型误差超过1.5%时，其MAPE仍低于0.85%。消融研究证实各组件均具有显著贡献：移除自编码器路由导致相对MAPE上升36%，SAC控制器贡献15%，双路径架构贡献7%。

Abstract

Stock markets exhibit regime-dependent behavior where prediction models optimized for stable conditions often fail during volatile periods. Existing approaches typically treat all market states uniformly or require manual regime labeling, which is expensive and quickly becomes stale as market dynamics evolve. This paper introduces an adaptive prediction framework that adaptively identifies deviations from normal market conditions and routes data through specialized prediction pathways. The architecture consists of three components: (1) an autoencoder trained on normal market conditions that identifies anomalous regimes through reconstruction error, (2) dual node transformer networks specialized for stable and event-driven market conditions respectively, and (3) a Soft Actor-Critic reinforcement learning controller that adaptively tunes the regime detection threshold and pathway blending weights based on prediction performance feedback. The reinforcement learning component enables the system to learn adaptive regime boundaries, defining anomalies as market states where standard prediction approaches fail. Experiments on 20 S&P 500 stocks spanning 1982 to 2025 demonstrate that the proposed framework achieves 0.68% MAPE for one-day predictions without the reinforcement controller and 0.59% MAPE with the full adaptive system, compared to 0.80% for the baseline integrated node transformer. Directional accuracy reaches 72% with the complete framework. The system maintains robust performance during high-volatility periods, with MAPE below 0.85% when baseline models exceed 1.5%. Ablation studies confirm that each component contributes meaningfully: autoencoder routing accounts for 36% relative MAPE degradation upon removal, followed by the SAC controller at 15% and the dual-path architecture at 7%.
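The abstract reports its results as mean absolute percentage error (MAPE) and directional accuracy. The following is a minimal sketch of how these two metrics are commonly computed. It uses standard textbook definitions, not the authors' evaluation code, and the price values are made up for illustration.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def directional_accuracy(previous, actual, predicted):
    """Percentage of days on which the predicted move has the same sign as the actual move.

    A day with no actual or no predicted move counts as a miss.
    """
    hits = sum(
        1
        for q, a, p in zip(previous, actual, predicted)
        if (a - q) * (p - q) > 0
    )
    return 100.0 * hits / len(actual)


# Illustrative (made-up) closing prices for four consecutive one-day predictions.
previous_close = [100.0, 102.0, 101.0, 103.0]
actual_close = [102.0, 101.0, 103.0, 104.0]
predicted_close = [101.0, 102.5, 102.0, 103.5]

print(f"MAPE: {mape(actual_close, predicted_close):.2f}%")
print(f"Directional accuracy: {directional_accuracy(previous_close, actual_close, predicted_close):.0f}%")
```

In this toy data the prediction gets the direction right on three of the four days, so a low MAPE and a high directional accuracy measure different things: the first is about magnitude, the second only about the sign of the move.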

Keywords: stock price prediction, regime-dependent behavior, autoencoder, dual node transformers, reinforcement learning control, Soft Actor-Critic, adaptive framework, market volatility

38. ❌ CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

Authors: Weilin Chen, Jiahao Rao, Wenhao Wang, Xinyang Li, Xuan Cheng, Liujuan Cao Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19121v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

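Scoring rationale: The CustomTex paper focuses on texture generation for 3D indoor scenes using reference-image-based customization. Its core is a dual-distillation scheme (semantic-level and pixel-level distillation) within a Variational Score Distillation (VSD) optimization framework. All scoring keywords concern large models, deep learning principles, or specific scientific AI applications (such as bioinformatics), whereas this paper belongs to computer vision and graphics and studies 3D texture generation. It involves no large-model techniques, novel deep learning principles, or scientific AI applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the CustomTex framework, which uses multi-reference image customization and dual distillation to address high-fidelity texture generation for 3D indoor scenes, achieving precise instance-level control and high-quality texture output.

Abstract translation (Chinese)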

高保真、可定制的三维室内场景纹理生成仍是一个重大挑战。基于文本驱动的方法虽具灵活性，但缺乏对细粒度实例级控制的精确性，且生成的纹理常存在质量不足、伪影及固化阴影等问题。为克服这些局限，我们提出了CustomTex——一种由参考图像驱动的、面向实例级高保真场景纹理生成的新型框架。CustomTex接收未赋纹理的三维场景及一组为每个物体实例指定目标外观的参考图像，并生成统一的高分辨率纹理贴图。本方法的核心在于将语义控制与像素级增强相分离的双重蒸馏策略：我们采用配备实例交叉注意力机制的语义级蒸馏，以确保语义合理性及“参考-实例”对齐；同时通过像素级蒸馏实现高视觉保真度。二者均在变分分数蒸馏优化框架内统一进行。实验表明，相较于现有先进方法，CustomTex能够实现与参考图像的精确实例级一致性，并生成具有更优清晰度、更少伪影及最小化固化阴影的纹理。本研究为高质量、可定制的三维场景外观编辑开辟了一条更直接且用户友好的路径。

Abstract

The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instance cross-attention, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.

Keywords: 3D scene texturing, instance-level customization, reference images, dual-distillation, Variational Score Distillation, high-fidelity textures, indoor scenes, texture map generation

39. ❌ How Uncertainty Estimation Scales with Sampling in Reasoning Models

Authors: Maksym Del, Markus Kängsepp, Marharyta Domnich, Ardi Tampuu, Lisa Yankovskaya, Meelis Kull, Mark Fishel Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19118v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

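Scoring rationale: The paper studies uncertainty estimation in reasoning language models, with a core focus on chain-of-thought (CoT) reasoning and in-depth reasoning (System 2 Thinking), so these two keywords are highly relevant. It discusses RLVR-style post-training and is therefore related to Post-training. Uncertainty estimation is related to factuality and hallucination mitigation, and the evaluation tasks span STEM and humanities, hence the partial relevance to AI for Science. The paper does not involve the specific techniques or applications behind the other keywords.

!!! tip deepseek-chat TL;DR

The paper studies how uncertainty estimation in reasoning language models scales with the number of samples under extended chain-of-thought reasoning. It finds that a hybrid estimator combining self-consistency and verbalized confidence improves uncertainty quality with as few as two samples (AUROC gains of up to +12 on average), and that the effect varies across domains.

Abstract translation (Chinese)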

不确定性估计对于部署推理语言模型至关重要，但在扩展的思维链推理中其机制仍不甚明晰。本研究将并行采样作为一种完全黑箱方法进行探讨，通过语言化置信度与自洽性两种信号展开分析。基于三种推理模型及涵盖数学、STEM与人文学科的17项任务，我们系统刻画了这些信号的扩展规律。

研究发现，自洽性与语言化置信度在推理模型中均呈现扩展性，但自洽性在初始阶段区分度较低，且在适度采样条件下滞后于语言化置信度。然而，大部分不确定性增益源于信号融合：仅需两个样本，混合估计器的AUROC平均提升高达$+12$，即使与更大采样规模的单一信号相比仍具优势，此后收益逐渐递减。这些效应具有领域依赖性：在数学领域——RLVR式后训练（post-training）的本征领域——推理模型展现出更高的不确定性质量，同时信号间互补性更强、扩展速度更快，显著优于STEM或人文学科任务。

Abstract

Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to $+12$ on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
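The two black-box signals named in the abstract can be sketched in a few lines. The abstract does not say how the hybrid estimator combines them, so the equal-weight average below is an assumption made purely for illustration, as are the function names and the sample answers and confidences.

```python
from collections import Counter


def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def verbalized_confidence(answers, confidences, target):
    """Mean self-reported confidence over the samples that gave the target answer."""
    values = [c for a, c in zip(answers, confidences) if a == target]
    return sum(values) / len(values)


def hybrid_confidence(answers, confidences, weight=0.5):
    """Assumed combination rule: weighted average of the two signals."""
    answer, agreement = self_consistency(answers)
    stated = verbalized_confidence(answers, confidences, answer)
    return answer, weight * agreement + (1.0 - weight) * stated


# Four parallel samples for one question (made-up values).
sampled_answers = ["42", "42", "41", "42"]
stated_confidences = [0.9, 0.8, 0.6, 0.7]

answer, score = hybrid_confidence(sampled_answers, stated_confidences)
print(answer, round(score, 3))
```

Here three of four samples agree (self-consistency 0.75) and those three report a mean confidence of about 0.8, so the combined score sits between the two. With only two samples, self-consistency can take just the values 0.5 or 1.0, which gives some intuition for why the abstract finds it less discriminative than verbalized confidence at small sampling budgets.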

Keywords: Uncertainty Estimation, Reasoning Models, Chain-of-Thought, Self-Consistency, Verbalized Confidence, Sampling, RLVR, Domain Dependence

40. ❌ FedTrident: Resilient Road Condition Classification Against Poisoning Attacks in Federated Learning

Authors: Sheng Liu, Panos Papadimitratos Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19101v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

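Scoring rationale: The paper presents FedTrident, a defense against poisoning attacks on federated learning (FL) for road condition classification (RCC) in intelligent transportation systems (ITS), which is an application of AI to a specific domain (transportation). All keywords relate directly to large language model (LLM) techniques, training methods, inference optimization, alignment, agents, and similar topics, whereas this paper does not involve large models at all and uses only conventional deep learning models for image classification. The only slightly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since transportation engineering can be regarded as an applied science. The paper does not explicitly mention scientific discovery or bio/cheminformatics, so this keyword receives 5 points (some relevance). All other keywords are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses targeted label-flipping attacks on road condition classification in federated learning and proposes the FedTrident defense framework. Using neuron-wise analysis, adaptive client rating, and machine unlearning, it effectively resists these attacks and reaches performance close to the attack-free level across a range of scenarios.

Abstract translation (Chinese)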

联邦学习（FL）已成为智能交通系统（ITS）中一项变革性范式，尤其在基于摄像头的路况分类（Road Condition Classification, RCC）中表现突出。然而，联邦学习通过促进协作，也使基于FL的RCC系统面临对抗性参与者发起的定向标签翻转攻击（Targeted Label-Flipping Attacks, TLFAs）。恶意客户端（车辆）可对其本地训练数据进行错误重标注（例如将实际不平整道路错误标注为平整道路），从而破坏全局模型预测并危及交通安全。现有针对此类投毒攻击的防御措施在多种攻击场景下均未能将模型鲁棒性维持在接近无攻击必要水平，原因在于：1）未针对TLFAs定制本地模型投毒检测方法；2）未基于历史行为排除恶意车辆客户端；3）在排除恶意客户端后未能修复已受损的全局模型。为填补这一研究空白，我们提出FedTrident框架，该框架引入：1）基于神经元分析的本地模型异常行为检测（特别包括攻击目标识别、关键特征提取，以及基于高斯混合模型（GMM）的模型聚类与过滤）；2）自适应客户端评级机制，根据每轮FL中的本地模型检测结果排除客户端；3）在FL过程中排除恶意客户端后，采用机器遗忘技术对受损全局模型进行修复。通过对多种FL-RCC模型、任务及配置的广泛评估表明，FedTrident能有效抵御TLFAs，在无攻击场景下的性能表现相当，并在两项最关键指标上分别优于八种基线防御方法9.49%和4.47%。此外，FedTrident对不同的恶意客户端比例、数据异构程度、复杂多任务及动态攻击均展现出强鲁棒性。

Abstract

FL has emerged as a transformative paradigm for ITS, notably camera-based Road Condition Classification (RCC). However, by enabling collaboration, FL-based RCC exposes the system to adversarial participants launching Targeted Label-Flipping Attacks (TLFAs). Malicious clients (vehicles) can relabel their local training data (e.g., from an actual uneven road to a wrong smooth road), consequently compromising global model predictions and jeopardizing transportation safety. Existing countermeasures against such poisoning attacks fail to maintain resilient model performance near the necessary attack-free levels in various attack scenarios due to: 1) not tailoring poisoned local model detection to TLFAs, 2) not excluding malicious vehicular clients based on historical behavior, and 3) not remedying the already-corrupted global model after exclusion. To close this research gap, we propose FedTrident, which introduces: 1) neuron-wise analysis for local model misbehavior detection (notably including attack goal identification, critical feature extraction, and GMM-based model clustering and filtering); 2) adaptive client rating for client exclusion according to the local model detection results in each FL round; and 3) machine unlearning for corrupted global model remediation once malicious clients are excluded during FL. Extensive evaluation across diverse FL-RCC models, tasks, and configurations demonstrates that FedTrident can effectively thwart TLFAs, achieving performance comparable to that in attack-free scenarios and outperforming eight baseline countermeasures by 9.49% and 4.47% for the two most critical metrics. Moreover, FedTrident is resilient to various malicious client rates, data heterogeneity levels, complicated multi-task, and dynamic attacks.
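The second component in the abstract, adaptive client rating, excludes clients based on per-round detection results. The abstract gives no update rule, so the following is only a rough sketch of the general idea under invented assumptions: the penalty, reward, threshold, and all function names are hypothetical and are not taken from FedTrident.

```python
def update_ratings(ratings, flagged, penalty=0.4, reward=0.1):
    """One round of rating updates: flagged clients lose rating, others regain it (capped at 1.0)."""
    updated = {}
    for client, rating in ratings.items():
        if client in flagged:
            updated[client] = max(0.0, rating - penalty)
        else:
            updated[client] = min(1.0, rating + reward)
    return updated


def run_rounds(clients, flagged_per_round, threshold=0.5):
    """Track ratings across FL rounds and permanently exclude clients that fall below the threshold."""
    ratings = {client: 1.0 for client in clients}
    excluded = set()
    for flagged in flagged_per_round:
        # Only clients that are still participating are rated in this round.
        active = {c: r for c, r in ratings.items() if c not in excluded}
        ratings.update(update_ratings(active, flagged))
        excluded |= {c for c in active if ratings[c] < threshold}
    return ratings, excluded


# Vehicle "v3" is flagged by the (hypothetical) detector in the first two rounds.
ratings, excluded = run_rounds(["v1", "v2", "v3"], [{"v3"}, {"v3"}, set()])
print(excluded)
```

Basing exclusion on accumulated history rather than a single round means one false positive from the detector does not immediately remove an honest client, while a repeatedly flagged client drops below the threshold and is removed.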

Keywords: Federated Learning, Road Condition Classification, Poisoning Attacks, Targeted Label-Flipping Attacks, Resilient Model Performance, Neuron-wise Analysis, Machine Unlearning, Intelligent Transportation Systems

41. ❌ LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

Authors: Danaé Broustail, Anna Tegon, Thorir Mar Ingolfsson, Yawei Li, Luca Benini Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19100v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	15.0/10	0.0

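Scoring rationale: The paper proposes the LuMamba framework, which focuses on self-supervised pre-training and efficient modeling of electroencephalography (EEG) biosignals. As an application of large models to bioinformatics (AI for Science), it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (15 points). The paper explicitly uses pre-training as its core method and also fine-tunes on downstream tasks (Post-training), so these keywords receive 10 and 5 points respectively. Its goal of building a foundation model for EEG gives it some relevance to 'Large Language Models OR LLMs OR Foundation Models' (8 points). Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, CoT, Agents, and Quantization are either not mentioned in the abstract or unrelated to the paper's topic, and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes the LuMamba framework, which combines topology-invariant encoding with linear-complexity state-space modeling to address differing electrode topologies and computational scalability in EEG modeling. After pre-training on more than 21,000 hours of EEG data, it achieves efficient, high-performing results on multiple downstream tasks.

Abstract translation (Chinese)

脑电图（EEG）能够在临床与神经技术应用中实现无创的脑活动监测，然而，由于电极拓扑结构差异和计算可扩展性问题——Transformer架构会带来序列长度的二次方复杂度，构建EEG基础模型仍具挑战性。作为联合解决方案，我们提出LuMamba（Latent Unified Mamba），这是一个结合了拓扑不变编码与线性复杂度状态空间建模的自监督框架。该框架利用LUNA的学习查询交叉注意力机制实现通道统一 [luna]，并采用FEMBA的双向Mamba块进行高效时序建模 [femba]。在此架构内，我们首次系统性地研究了用于生物信号学习的潜在-欧几里得联合嵌入预测架构（Latent-Euclidean Joint-Embedding Predictive Architecture, LeJEPA）。基于TUEG语料库中超过21,000小时的无标签EEG数据进行预训练后，LuMamba在五个下游任务上进行了评估，这些任务涵盖异常检测、伪影识别和精神状态分类，所用电极配置从16通道到26通道不等。在预训练目标中，仅使用掩码重建会产生结构化但泛化能力较弱的表征，而仅使用LeJEPA则会产生弥散的嵌入；结合这两个目标能实现最稳健的性能。LuMamba仅需460万参数，即在TUAB任务上达到80.99%的平衡准确率，并在阿尔茨海默症检测上取得了最先进的性能（AUPR为0.97），同时在相同序列长度下所需浮点运算次数比最先进模型少377倍，并且在达到典型GPU内存限制前可扩展到12倍长的序列。代码发布于 https://github.com/pulp-bio/biofoundation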

Abstract

Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundation models for EEG remains challenging due to differing electrode topologies and computational scalability, as Transformer architectures incur quadratic sequence complexity. As a joint solution, we propose LuMamba (Latent Unified Mamba), a self-supervised framework combining topology-invariant encodings with linear-complexity state-space modeling, using LUNA’s learned-query cross-attention mechanism for channel unification [luna], and FEMBA’s bidirectional Mamba blocks for efficient temporal modeling [femba]. Within this architecture, we provide the first systematic investigation of the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA) for biosignal learning. Pre-trained on over 21,000 hours of unlabeled EEG from the TUEG corpus, LuMamba is evaluated on five downstream tasks spanning abnormality detection, artifact recognition, and mental condition classification across electrode configurations ranging from 16 to 26 channels. In the pre-training objective, masked reconstruction alone yields structured but less generalizable representations, while LeJEPA alone produces diffuse embeddings; combining both objectives achieves the most robust performance. With only 4.6M parameters, LuMamba attains 80.99% balanced accuracy on TUAB and achieves state-of-art performance on Alzheimer’s detection (0.97 AUPR), while requiring 377× fewer FLOPS than state-of-art models at equivalent sequence lengths and scaling to 12× longer sequences before reaching typical GPU memory limits. Code is available at https://github.com/pulp-bio/biofoundation

Keywords: EEG modeling, self-supervised learning, state-space models, Mamba, pre-training, computational efficiency, bioinformatics, brain activity analysis

42. ❌ DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering

Authors: Yilin Wang, Yuchun Fan, Jiaoyang Li, Ziming Zhu, Yongyu Mu, Qiaozhi He, Tong Xiao, Jingbo Zhu Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19097v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

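Scoring rationale: The paper's core topic is the application of RAG systems to cross-lingual multi-hop question answering, so it is highly relevant to ‘Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’ (10 points), and it also involves the use of LLMs for cross-lingual understanding (8 points). The other keywords, such as MoE, SLMs, Scaling Laws, training methods, reasoning techniques, agent systems, and model compression, are not mentioned in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

To address the performance imbalance of RAG systems in multilingual multi-hop question answering, the paper proposes the DaPT framework, which generates sub-question graphs in parallel and merges them, significantly improving answer accuracy in multilingual settings.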

Abstract (Chinese translation)

检索增强生成系统在解决英语场景下的复杂多跳问答任务方面已取得显著进展。然而，这类系统不可避免地会面临跨多语言语料库与查询进行检索的应用场景，这带来了若干开放性挑战。首要挑战在于缺乏能够评估多语言多跳问答场景下检索增强生成系统能力的基准测试集；其次，现有方法过度依赖大语言模型在英语语境下的强大语义理解能力，导致其在多语言场景中的效能下降。为应对这些挑战，我们首先通过将仅包含英语的基准测试集翻译为五种语言，构建了多语言多跳问答基准；随后提出了DaPT——一种新颖的多语言检索增强生成框架。该框架并行地为源语言查询及其英语翻译版本生成子问题图，在合并二者后采用双语检索与回答策略依次求解子问题。实验结果表明，先进的检索增强生成系统在多语言场景中存在显著的性能不平衡问题。此外，与基线方法相比，我们提出的方法能持续生成更精准简洁的答案，显著提升了该任务上的检索增强生成性能。例如，在最具挑战性的MuSiQue基准测试中，DaPT在平均精确匹配分数上相比最强基线实现了18.3%的相对提升。

Abstract

Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems’ capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs’ strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.

Keywords: Retrieval-augmented generation, multilingual, multi-hop question answering, RAG systems, benchmarks, DaPT framework, bilingual retrieval, sub-question graphs
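
The abstract above reports results as an average exact-match (EM) score and as a relative improvement over a baseline. The following is a minimal sketch of how such numbers are commonly computed, not the paper's evaluation code. The normalization follows the widespread English-centric convention (lowercasing, stripping punctuation and articles) and is an assumption; a multilingual benchmark may well normalize differently. The function names and the example scores are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list) -> int:
    """Return 1 if the normalized prediction equals any normalized gold answer."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))


def average_em(predictions: list, gold: list) -> float:
    """Average EM over a dataset, expressed as a percentage."""
    hits = [exact_match(p, g) for p, g in zip(predictions, gold)]
    return 100.0 * sum(hits) / len(hits)


def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Relative gain of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


# Made-up scores for illustration only (not taken from the paper).
print(relative_improvement(50.0, 40.0))
```

Under this definition, a relative improvement of 18.3% means the new average EM is 1.183 times the baseline's, which is a different statement from a gain of 18.3 absolute points.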

43. ❌ SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Authors: Carlos Hinojosa, Clemens Grange, Bernard Ghanem Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19092v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

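Scoring rationale: The paper studies how vision-language models (VLMs) make safety judgments. Its relevance to the keywords is as follows: 1) it is somewhat relevant to ‘Instruction Tuning/Alignment/Value Alignment’ (5 points), because it touches on safety alignment and behavioral refusal; 2) it is somewhat relevant to ‘Hallucination Mitigation/Factuality/Truthfulness’ (5 points), because it studies false refusals and the reliability of safety judgments; 3) it is somewhat relevant to ‘Mechanistic Interpretability/Explainable AI’ (5 points), because it aims to understand what drives VLM safety decisions and how interpretable they are. The other keywords mainly target text-only language models or specific techniques (such as MoE, quantization, and inference acceleration) and have no direct connection to this paper's VLM safety research, so they score 0.

!!! tip deepseek-chat TL;DR

The paper finds that the safety judgments of vision-language models depend heavily on semantic cues rather than on grounded visual understanding, revealing a potential vulnerability in multimodal safety systems.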

Abstract (Chinese translation)

视觉语言模型（VLMs）正日益部署于现实世界和具身环境中，其安全决策依赖于视觉上下文。然而，目前尚不清楚是哪些视觉证据驱动了这些判断。本研究探讨多模态安全行为是否能够通过简单的语义线索进行引导。我们提出了一种语义引导框架，该框架在不改变底层场景内容的前提下，施加受控的文本、视觉及认知干预。为评估这些影响，我们提出了SAVeS——一个针对语义线索下情境安全的基准测试，并设计了一套评估方案，以区分行为拒绝、基于现实的安全推理和错误拒绝。在多个视觉语言模型及一个额外的前沿基准测试上的实验表明，安全决策对语义线索高度敏感，这揭示了模型依赖于习得的视觉-语言关联，而非基于现实的视觉理解。我们进一步证明，自动化引导流程能够利用这些机制，凸显了多模态安全系统中存在的潜在脆弱性。

Abstract

Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.

Keywords: Vision-Language Models, Safety Judgments, Semantic Cues, Multimodal Safety, Behavioral Refusal, Grounded Safety Reasoning, False Refusals, Semantic Steering

44. ❌ Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity

Authors: Qiawen Ella Liu, Marina Dubova, Henry Conklin, Takumi Harada, Thomas L. Griffiths Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19087v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

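Scoring rationale: The paper's core topic is a comparison of creativity in LLMs and humans, and the effect of a cross-domain mapping intervention on both, so it is highly relevant to ‘Large Language Models OR LLMs OR Foundation Models’ (10 points). It does not cover the technical details or applications behind the other keywords, such as MoE, SLMs, training methods, inference optimization, agent systems, or model compression, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study compares the creativity of humans and large language models. Humans benefit from random cross-domain mappings, whereas LLMs produce more original ideas on average but are not significantly affected by the intervention. In both, the effect of cross-domain mapping grows as the semantic distance between the inspiration source and the target increases.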

Abstract (Chinese translation)

大型语言模型（LLM）是否以与人类相同的方式展现创造力？相同的干预措施能否同时提升两者的创造力？我们评估了一种前景广阔但尚未充分验证的创造力干预方法：强制创造者从随机、遥远的外部领域进行类比映射（“跨领域映射”）。在两种提示条件下，人类参与者和大型语言模型为十种日常产品（例如背包、电视）生成了新颖特性：（i）跨领域映射，要求将随机分配的外部来源（例如章鱼、仙人掌、GPS）的某种属性转化应用于目标产品；（ii）用户需求，要求针对未满足的用户需求提出创新方案。研究表明，人类参与者能够稳定地从随机分配的跨领域映射中获益；而大型语言模型平均能产生比人类更具原创性的想法，但跨领域映射对其并未产生统计学上显著的影响。然而，在这两个系统中，当灵感来源与目标产品在语义上更为疏远时，跨领域映射的影响均会增强。我们的研究结果既凸显了远距离联想在创造性构思中的作用，也揭示了人类与大型语言模型对相同创造力干预措施反应的系统性差异。

Abstract

Are large language models (LLMs) creative in the same way humans are, and can the same interventions increase creativity in both? We evaluate a promising but largely untested intervention for creativity: forcing creators to draw an analogy from a random, remote source domain (“cross-domain mapping”). Human participants and LLMs generated novel features for ten daily products (e.g., backpack, TV) under two prompts: (i) cross-domain mapping, which required translating a property from a randomly assigned source (e.g., octopus, cactus, GPS), and (ii) user-need, which required proposing innovations targeting unmet user needs. We show that humans reliably benefit from randomly assigned cross-domain mappings, while LLMs, on average, generate more original ideas than humans and do not show a statistically significant effect of cross-domain mappings. However, in both systems, the impact of cross-domain mapping increases when the inspiration source becomes more semantically distant from the target. Our results highlight both the role of remote association in creative ideation and systematic differences in how humans and LLMs respond to the same intervention for creativity.

Keywords: large language models, creativity, cross-domain mapping, human-LLM comparison, creative ideation, remote association, innovation generation, semantic distance

45. ❌ CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem

Authors: Fengxiaoxiao Li, Xiao Mao, Mingfeng Fan, Yifeng Zhang, Yi Li, Tanishq Duhan, Guillaume Sartoretti Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19074v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

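Scoring rationale: CAMO addresses the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP), a classic combinatorial optimization problem that mainly involves robot path planning, multi-agent coordination, and multi-objective optimization. The paper uses a conditional neural solver trained with reinforcement learning; its core is graph neural networks and reinforcement learning (REINFORCE), not large language models (LLMs) or deep learning applied to science. It is therefore unrelated to the vast majority of keywords (such as LLM, MoE, Scaling Laws, Instruction Tuning, and RAG). The only slightly relevant keyword is “Multi-agent Systems OR Agent Coordination”, because the paper involves multi-robot (multi-agent) coordination; however, its focus is on the optimization algorithm rather than on general agent systems, so it receives 5 points (somewhat relevant). None of the other keywords are covered.

!!! tip deepseek-chat TL;DR

The paper proposes CAMO, a conditional neural solver for the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP). It combines preference encoding with a collaborative decoder to handle multi-agent coordination and multi-objective trade-offs, outperforms existing neural and conventional heuristics in experiments, and is demonstrated on a mobile robot platform.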

Abstract (Chinese translation)

机器人系统常需多机器人团队协同访问多个目标点，并同时优化相互竞争的目标，例如总行进成本与完工时间。该场景可建模为多目标多旅行商问题（MOMTSP）。尽管基于学习的方法已在单智能体TSP及多目标TSP变体上展现出强大性能，却鲜少同时应对多智能体协同与多目标权衡带来的双重复杂性挑战。为填补这一空白，我们提出CAMO——一种用于MOMTSP的条件神经求解器，其能泛化至不同规模的目标点数量、智能体数量及偏好向量，并生成对帕累托前沿（Pareto front, PF）的高质量近似解。具体而言，CAMO包含一个条件编码器，将偏好向量融入问题实例的表征中，从而实现对多目标权衡的显式控制；以及一个协同解码器，通过交替执行智能体选择与节点选择，以自回归方式构建多智能体路径。为进一步提升泛化能力，我们采用基于REINFORCE算法的目标函数，在混合规模问题分布上训练CAMO。大量实验表明，CAMO在神经启发式方法与传统启发式方法中均表现更优，能获得更接近真实帕累托前沿的近似解。此外，消融实验验证了CAMO关键组件的贡献，在移动机器人平台上的实际测试也证明了其现实适用性。

Abstract

Robotic systems often require a team of robots to collectively visit multiple targets while optimizing competing objectives, such as total travel cost and makespan. This setting can be formulated as the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP). Although learning-based methods have shown strong performance on the single-agent TSP and multi-objective TSP variants, they rarely address the combined challenges of multi-agent coordination and multi-objective trade-offs, which introduce dual sources of complexity. To bridge this gap, we propose CAMO, a conditional neural solver for MOMTSP that generalizes across varying numbers of targets, agents, and preference vectors, and yields high-quality approximations to the Pareto front (PF). Specifically, CAMO consists of a conditional encoder to fuse preferences into instance representations, enabling explicit control over multi-objective trade-offs, and a collaborative decoder that coordinates all agents by alternating agent selection and node selection to construct multi-agent tours autoregressively. To further improve generalization, we train CAMO with a REINFORCE-based objective over a mixed distribution of problem sizes. Extensive experiments show that CAMO outperforms both neural and conventional heuristics, achieving a closer approximation of PFs. In addition, ablation results validate the contributions of CAMO’s key components, and real-world tests on a mobile robot platform demonstrate its practical applicability.

Keywords: Multi-objective Multiple Traveling Salesman Problem, MOMTSP, conditional neural solver, multi-agent coordination, Pareto front approximation, REINFORCE training, robotic path planning, autoregressive tour construction
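
The abstract above names two competing objectives (total travel cost and makespan) and evaluates solutions by how well they approximate the Pareto front. The sketch below illustrates those two notions only; it is not CAMO and contains no neural solver. It assumes closed tours from a shared depot, Euclidean distances, and makespan defined as the longest single-agent tour. All function names and coordinates are hypothetical.

```python
import math


def tour_length(depot, targets):
    """Length of the closed tour depot -> targets (in order) -> depot."""
    path = [depot, *targets, depot]
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def objectives(depot, tours):
    """Return (total travel cost, makespan) for one multi-agent solution."""
    lengths = [tour_length(depot, t) for t in tours]
    return sum(lengths), max(lengths)


def dominates(p, q):
    """p dominates q: no worse in every objective, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(points):
    """Keep only the points that no other point dominates (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


depot = (0.0, 0.0)
split = objectives(depot, [[(3.0, 0.0)], [(0.0, 4.0)]])    # two agents share the work
single = objectives(depot, [[(3.0, 0.0), (0.0, 4.0)], []])  # one agent visits everything
print(pareto_front([split, single, (15.0, 12.0)]))
```

In this toy instance, splitting the targets gives a shorter makespan but a higher total cost than sending one agent, so both solutions sit on the front and neither is simply better. A preference vector is one way to choose between such points.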

46. ❌ Parallelograms Strike Back: LLMs Generate Better Analogies than People

Authors: Qiawen Ella Liu, Raja Marjieh, Jian-Qiao Zhu, Adele E. Goldberg, Thomas L. Griffiths Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19066v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

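Scoring rationale: The paper's core topic is LLM performance on analogy generation compared with humans, so it is highly relevant only to ‘Large Language Models OR LLMs OR Foundation Models’ (10 points). The other keywords concern specific techniques (such as MoE, quantization, and inference acceleration), training methods (such as pre-training, fine-tuning, and alignment), application areas (such as AI for Science), or advanced capabilities (such as agents and tool use), none of which the paper covers, so they all score 0.

!!! tip deepseek-chat TL;DR

The study compares humans and large language models on four-term word analogy generation. LLM-generated analogies are judged better than human ones and fit the geometric parallelogram model more closely. The advantage comes from LLMs satisfying the relational constraint more consistently, not from greater sensitivity to local similarity.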

Abstract (Chinese translation)

四词类比（A:B::C:D）在经典几何模型中被表述为“平行四边形”，但近期研究表明该模型难以捕捉人类生成类比的方式，而简单的局部相似性启发式方法往往能提供更好的解释（Peterson等，2020）。然而，平行四边形模型的失败究竟是因为其本身对类比关系的建模存在缺陷，还是因为人类不擅长生成保持关系的类比？我们在相同类比问题集（来自Peterson等，2020）上比较了人类与大型语言模型（LLM）的类比补全结果。研究发现，LLM生成的类比被一致认为优于人类生成的类比，并且在分布式嵌入空间（GloVe）中更贴近平行四边形结构。关键的是，我们发现这种相对于人类类比的提升主要源于更强的平行四边形对齐性以及更少依赖高频易得词汇，而非对局部相似性的敏感度增强。此外，LLM的优势并非源于其所有回答均优于人类，而是由于人类产生了大量低质量的补全结果：当仅比较两个系统的最常见（高频）回答时，LLM的优势便消失。然而，更强的平行四边形对齐性和更低的词汇频率仍能预测哪些LLM补全结果会获得比人类更高的评价。总体而言，这些结果表明平行四边形模型并非对词汇类比的拙劣解释。相反，人类可能常常无法生成满足此类关系约束的补全，而LLM则能更稳定地实现这一目标。

Abstract

Four-term word analogies (A:B::C:D) are classically modeled geometrically as “parallelograms,” yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from (Peterson et al., 2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.

Keywords: word analogies, large language models, parallelogram model, human comparison, analogy generation, distributional embedding, GloVe
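
The parallelogram model in the abstract above treats A:B::C:D as a geometric constraint in embedding space: the offset from A to B should match the offset from C to D, so D should lie near C + (B − A). Below is a minimal sketch of that idea using tiny hand-made 2-D vectors. These are hypothetical toy values, not GloVe embeddings. The alignment score used here (cosine similarity between the two offsets) is one common choice and is an assumption, since the abstract does not specify the paper's exact measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def parallelogram_alignment(a, b, c, d):
    """Cosine similarity between the offsets (b - a) and (d - c)."""
    ab = [y - x for x, y in zip(a, b)]
    cd = [y - x for x, y in zip(c, d)]
    return cosine(ab, cd)


def complete_analogy(a, b, c, candidates):
    """Pick the candidate word whose vector is closest to c + (b - a)."""
    target = [z + (y - x) for x, y, z in zip(a, b, c)]
    return max(candidates, key=lambda w: cosine(candidates[w], target))


# Hypothetical toy vectors chosen so that the analogy holds exactly.
vec = {
    "man": (1.0, 0.0), "woman": (1.0, 1.0),
    "king": (3.0, 0.0), "queen": (3.0, 1.0),
    "prince": (3.0, -1.0), "apple": (-1.0, 2.0),
}
candidates = {w: vec[w] for w in ("queen", "prince", "apple")}
print(complete_analogy(vec["man"], vec["woman"], vec["king"], candidates))
print(parallelogram_alignment(vec["man"], vec["woman"], vec["king"], vec["queen"]))
```

A completion chosen only for being similar to C (a local-similarity heuristic) can score well on closeness to C while scoring poorly on offset alignment. The abstract contrasts these two accounts.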

47. ❌ Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Authors: Yikai Zheng, Xin Ding, Yifan Yang, Shiqi Jiang, Hao Wu, Qianxi Zhang, Weijun Wang, Ting Cao, Yunxin Liu Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19054v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

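Scoring rationale: The paper focuses on streaming video understanding and proposes the Em-Garde framework to resolve the efficiency-accuracy dilemma, involving visual proposal generation and matching, but it does not mention any large models, deep-learning technical principles, or scientific applications, so it is unrelated to all scoring keywords.

!!! tip deepseek-chat TL;DR

The paper proposes the Em-Garde framework, which decouples semantic understanding from streaming perception to resolve the efficiency-accuracy dilemma in proactive streaming video understanding, and validates its accuracy and efficiency gains on StreamingBench and OVO-Bench.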

Abstract (Chinese translation)

流媒体视频理解领域的最新进展催生了一种新的交互范式，即模型能够主动响应用户查询。当前主动式视频大语言模型依赖于逐帧触发决策机制，这使其面临效率与准确性难以兼顾的困境。我们提出Em-Garde这一新颖框架，该框架将语义理解与流式感知解耦。在查询阶段，指令引导的提案解析器将用户查询转化为结构化、基于感知的视觉提案；在流式处理过程中，轻量级提案匹配模块执行高效的基于嵌入的匹配以触发响应。在StreamingBench和OVO-Bench上的实验表明，本框架在主动响应准确性与效率方面较先前模型取得持续提升，验证了其在严格计算约束下实现主动视频理解的有效性。

Abstract

Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decision making, which suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.

Keywords: Streaming Video Understanding, Proactive VideoLLMs, Em-Garde, Instruction-Guided Proposal Parser, Lightweight Proposal Matching Module, Efficiency-Accuracy Dilemma, StreamingBench, OVO-Bench
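
The abstract above describes a split between a one-time query parsing step and a cheap per-frame embedding match that decides when to respond. The following sketch shows only the generic second half: comparing each incoming frame embedding against precomputed proposal embeddings and triggering when similarity crosses a threshold. It is an illustration under assumptions, not the Em-Garde module: the cosine measure, the threshold value, the function names, and the toy 2-D embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def should_trigger(frame_embedding, proposal_embeddings, threshold=0.8):
    """True when the frame is similar enough to at least one proposal."""
    return any(cosine(frame_embedding, p) >= threshold for p in proposal_embeddings)


def triggered_frames(frame_stream, proposal_embeddings, threshold=0.8):
    """Indices of frames in the stream that would trigger a response."""
    return [
        i for i, frame in enumerate(frame_stream)
        if should_trigger(frame, proposal_embeddings, threshold)
    ]


# Proposals are embedded once at query time; frames arrive one by one.
proposals = [(1.0, 0.0)]
stream = [(0.0, 1.0), (0.2, 1.0), (0.9, 0.1), (1.0, 0.0)]
print(triggered_frames(stream, proposals))
```

The appeal of this structure is that the per-frame cost is a few dot products rather than a full model call, which is consistent with the abstract's emphasis on strict computational constraints.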

48. ❌ Man and machine: artificial intelligence and judicial decision making

Authors: Arthur Dyevre, Ahmad Shahvaroughi Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19042v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

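Scoring rationale: The paper mainly studies the use of artificial intelligence in judicial decision-making, in particular the interaction between risk assessment tools and human judges, and is applied AI research in the social sciences. All scoring keywords, however, focus on specific technical aspects of large models and deep learning, such as technical principles, training methods, inference optimization, and deployment efficiency. This paper discusses AI tools in a broad sense (such as traditional machine learning models) in the judicial domain and does not involve large-model techniques, deep-learning innovations, or any of the specific technical content in the scoring keywords, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Through a literature review, the paper finds that AI risk assessment tools have limited influence on judicial decisions, while revealing important gaps in research on AI tool performance evaluation, biases in human judges' decisions, and human-AI interaction. It argues that more interdisciplinary research is needed to understand the relationship between algorithmic tools and human decision-makers.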

Abstract (Chinese translation)

人工智能技术在司法决策——特别是审前、量刑与假释环节——的整合应用，已引发关于透明度、可靠性与问责制的广泛关切。与此同时，这些发展也使人性判断的局限性更为凸显，并凸显了理解法官如何与基于人工智能的决策辅助工具互动的必要性。本文以刑事司法风险评估为焦点案例，通过综合性文献综述，串联起人工智能在司法决策中三个相互交织的维度：人工智能工具的性能与公平性、人类法官的优势与偏见，以及“人工智能+人类”互动的本质。在计算机科学、经济学、法学、犯罪学与心理学等领域，研究者已在评估自动化风险评估工具（Automated Risk Assessment Instruments）的预测效度、记录司法决策中的偏见，以及（在相对有限的范围内）探究法官如何使用算法建议等方面取得显著进展。尽管现有实证证据表明，人工智能决策辅助工具对审前与量刑决策的影响较为有限甚至不存在，但本文也揭示了现有文献中的重要空白。未来研究需进一步评估人工智能风险评估工具的性能，理解法官如何在充满干扰的决策环境中进行判断，以及个体特征如何影响法官对人工智能建议的回应。我们认为，“人工智能与人类对比”的研究范式有望为理解算法工具与人类决策者带来新的洞见，并倡导在未来研究中加强跨学科整合与交叉融合。

Abstract

The integration of artificial intelligence (AI) technologies into judicial decision-making - particularly in pretrial, sentencing, and parole contexts - has generated substantial concerns about transparency, reliability, and accountability. At the same time, these developments have brought the limitations of human judgment into sharper relief and underscored the importance of understanding how judges interact with AI-based decision aids. Using criminal justice risk assessment as a focal case, we conduct a synthetic review connecting three intertwined aspects of AI’s role in judicial decision-making: the performance and fairness of AI tools, the strengths and biases of human judges, and the nature of AI+human interactions. Across the fields of computer science, economics, law, criminology and psychology, researchers have made significant progress in evaluating the predictive validity of automated risk assessment instruments, documenting biases in judicial decision-making, and, to a more limited extent, examining how judges use algorithmic recommendations. While the existing empirical evidence indicates that the impact of AI decision aid tools on pretrial and sentencing decisions is modest or inexistent, our review also reveals important gaps in the canvassed literatures. Further research is needed to evaluate the performance of AI risk assessment instruments, understand how judges navigate noisy decision making environments and how individual characteristics influence judges’ responses to AI advice. We argue that AI vs Human comparisons have the potential to yield new insights into both algorithmic tools and human decision-makers and advocate greater interdisciplinary integration and cross-fertilization in future research.

Keywords: artificial intelligence, judicial decision-making, risk assessment, human judges, algorithmic recommendations, transparency, accountability, interdisciplinary research

49. ❌ SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Authors: Quentin Guimard, Federico Bartsch, Simone Caldarella, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19028v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

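Scoring rationale: The paper studies a post-hoc debiasing method for vision-language models (such as CLIP) that uses a sparse autoencoder to disentangle and modulate features. All keywords target large language models (LLMs) or specific AI techniques, whereas this paper focuses on multimodal vision-language models and does not involve LLM techniques, training methods, inference optimization, agent systems, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes SEM, a post-hoc, zero-shot debiasing framework based on a sparse autoencoder, to reduce social biases and spurious correlations in vision-language models (such as CLIP) while preserving semantic fidelity and improving fairness on several benchmark datasets.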

Abstract (Chinese translation)

连接视觉与语言的模型（如CLIP）是多模态人工智能的关键组成部分，但其大规模、未经筛选的训练数据引入了严重的社会性与虚假偏见。现有的后验去偏方法通常在稠密的CLIP嵌入空间中直接操作，其中偏见与任务相关信息高度纠缠。这种纠缠限制了它们在保持语义保真度的同时消除偏见的能力。本研究提出稀疏嵌入调制（Sparse Embedding Modulation, SEM），一种在稀疏自编码器（Sparse Autoencoder, SAE）潜在空间中运行的后验、零样本去偏框架。通过将CLIP文本嵌入分解为解耦特征，SEM能够识别并调控与偏见相关的神经元，同时保留与查询相关的神经元，从而实现更精确的非线性干预。在四个基准数据集和两种CLIP骨干网络上，SEM在检索和零样本分类任务中均取得了显著的公平性提升。我们的结果表明，稀疏潜在表示为视觉语言模型的后验去偏提供了有效基础。

Abstract

Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.

Keywords: Vision-Language Models, CLIP, Debiasing, Sparse Autoencoder, Post-hoc, Zero-shot, Fairness, Multimodal AI
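
The abstract above describes intervening in a sparse autoencoder's latent space: encode an embedding into sparse features, change the bias-related ones while leaving query-relevant ones alone, then decode. The sketch below shows only that generic encode, modulate, decode pattern. It is not the SEM method: the abstract does not say how neurons are identified or how they are modulated, so the index sets, the scaling rule, the identity weights, and all names here are hypothetical assumptions.

```python
def encode(embedding, enc_weights):
    """Sparse code: ReLU of a linear map, one row of enc_weights per latent."""
    return [
        max(0.0, sum(w * e for w, e in zip(row, embedding)))
        for row in enc_weights
    ]


def decode(latent, dec_weights):
    """Linear reconstruction: latent-weighted sum of decoder directions."""
    dim = len(dec_weights[0])
    return [sum(z * row[i] for z, row in zip(latent, dec_weights)) for i in range(dim)]


def modulate(latent, bias_idx, keep_idx, scale=0.0):
    """Scale bias-related latents, except those also marked query-relevant."""
    return [
        z * scale if (i in bias_idx and i not in keep_idx) else z
        for i, z in enumerate(latent)
    ]


# Toy setup: identity encoder/decoder so each latent maps to one dimension.
enc = [[1.0, 0.0], [0.0, 1.0]]
dec = [[1.0, 0.0], [0.0, 1.0]]
embedding = [0.5, 0.8]

latent = encode(embedding, enc)
debiased = decode(modulate(latent, bias_idx={1}, keep_idx=set()), dec)
preserved = decode(modulate(latent, bias_idx={1}, keep_idx={1}), dec)
print(debiased, preserved)
```

In a dense embedding there is no single coordinate to switch off. The premise of working in a sparse latent space is that individual features are disentangled enough to be edited one at a time.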

50. ❌ Behavioral Fingerprints for LLM Endpoint Stability and Identity

Authors: Jonah Leshin, Manish Shah, Ian Timmis, Daniel Kang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19022v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

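Scoring rationale: The paper's core topic is a method for monitoring the behavioral consistency of LLM endpoints, so it is highly relevant to "Large Language Models" (10). It mentions quantization as one factor that can change model behavior, so it has some relevance to "Quantization" (5). Other keywords such as MoE, SFT, RAG, and inference acceleration are not addressed in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses behavioral changes in LLM endpoints caused by updates to weights, quantization, the inference stack, and similar factors. It proposes Stability Monitor, a black-box stability monitoring system that detects behavioral change by sampling output distributions, and validates its effectiveness.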

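Abstract (Translation)

The consistency of AI-native applications depends on the behavioral consistency of the model endpoints that power them. Traditional reliability metrics such as uptime, latency, and throughput cannot capture behavioral change: an endpoint can remain "healthy" even as its effective model identity changes because of weight updates, tokenizer adjustments, quantization methods, inference engines, kernels, caching mechanisms, routing policies, or hardware changes. This paper presents Stability Monitor, a black-box stability monitoring system that periodically fingerprints an endpoint by sampling outputs from a fixed prompt set and comparing the resulting output distributions over time. Fingerprints are compared using an energy distance statistic summed across prompts, with permutation-test p-values aggregated sequentially as evidence of distribution shift, in order to detect change events and delimit stability periods. In controlled validation, the system detected changes to model family, version, inference stack, quantization, and behavioral parameters. In real-world monitoring of the same model hosted by multiple providers, we observed substantial stability differences both across providers and within a single provider.

Abstract (Original)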

The consistency of AI-native applications depends on the behavioral consistency of the model endpoints that power them. Traditional reliability metrics such as uptime, latency and throughput do not capture behavioral change, and an endpoint can remain “healthy” while its effective model identity changes due to updates to weights, tokenizers, quantization, inference engines, kernels, caching, routing, or hardware. We introduce Stability Monitor, a black-box stability monitoring system that periodically fingerprints an endpoint by sampling outputs from a fixed prompt set and comparing the resulting output distributions over time. Fingerprints are compared using a summed energy distance statistic across prompts, with permutation-test p-values as evidence of distribution shift aggregated sequentially to detect change events and define stability periods. In controlled validation, Stability Monitor detects changes to model family, version, inference stack, quantization, and behavioral parameters. In real-world monitoring of the same model hosted by multiple providers, we observe substantial provider-to-provider and within-provider stability differences.
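A rough sketch of the fingerprint comparison may help: a sample energy distance per prompt, summed across prompts, with a permutation test for the p-value. Several things are assumptions here, not details from the paper: outputs are taken to be already mapped to fixed-length numeric vectors, the function names are hypothetical, labels are permuted independently within each prompt, and the sequential aggregation of p-values over time is left out.

```python
import numpy as np

def energy_distance(x, y):
    """Sample energy distance 2 E|X-Y| - E|X-X'| - E|Y-Y'| (V-statistic)."""
    def mean_dist(a, b):
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).mean()
    return 2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def summed_statistic(old, new):
    """Sum per-prompt energy distances; old/new are lists of (n, dim) arrays."""
    return sum(energy_distance(a, b) for a, b in zip(old, new))

def permutation_p_value(old, new, n_perm=200, seed=0):
    """Share of label permutations whose statistic reaches the observed one."""
    rng = np.random.default_rng(seed)
    observed = summed_statistic(old, new)
    hits = 0
    for _ in range(n_perm):
        perm_old, perm_new = [], []
        for a, b in zip(old, new):
            pooled = np.concatenate([a, b])
            idx = rng.permutation(len(pooled))
            perm_old.append(pooled[idx[:len(a)]])
            perm_new.append(pooled[idx[len(a):]])
        if summed_statistic(perm_old, perm_new) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic stand-in data: 3 prompts, 20 sampled outputs each, 3-dim features.
rng = np.random.default_rng(1)
baseline = [rng.normal(size=(20, 3)) for _ in range(3)]
shifted = [rng.normal(loc=2.0, size=(20, 3)) for _ in range(3)]
p_shift = permutation_p_value(baseline, shifted)
```

A small `p_shift` is evidence that the two fingerprints come from different output distributions, which is the signal a monitor of this kind would accumulate over time before declaring a change event.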

Keywords: LLM endpoint stability, behavioral consistency, stability monitoring, distribution shift detection, model identity, quantization, inference stack, black-box monitoring

51. ❌ What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Authors: Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard, Wei Zhao Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19017v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

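Scoring rationale: The paper's core topic is the mechanisms behind LLM performance on temporal reasoning tasks, which directly involves LLM evaluation and interpretability analysis. It is highly relevant to "Large Language Models" (10) because it evaluates 20 LLMs and analyzes their temporal reasoning ability. It is relevant to "Mechanistic Interpretability" (8) because it uses geometric probing to analyze internal temporal representations. The reasoning keywords (Chain of Thought, System 2 Thinking) score 5 because temporal reasoning involves multi-step and in-depth reasoning, although the paper does not explicitly use these techniques. Other keywords such as MoE, SFT, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the mechanisms behind LLM performance on multilingual temporal reasoning tasks, finding that the tokenisation quality of temporal expressions is a resource-dependent bottleneck, while temporal linearity is the strongest predictor of temporal reasoning in high-resource languages.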

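Abstract (Translation)

We present MultiTempBench, a multilingual temporal reasoning benchmark covering three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar systems (Gregorian, Hijri, and Chinese Lunar). The benchmark contains 15,000 examples, built by first translating 750 curated English questions and then expanding each into controlled date-format variants. We evaluate 20 large language models and introduce the multilingual Date Fragmentation Ratio (mDFR), a metric calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, whereas high-resource settings usually tolerate digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, while fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb

Abstract (Original)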

We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks, date arithmetic, time zone conversion, and temporal relation extraction across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains $15,000$ examples built by translating $750$ curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb

Keywords: temporal reasoning, large language models, multilingual benchmark, tokenization, representation analysis, date arithmetic, time zone conversion, temporal relation extraction

52. ❌ Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Authors: Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19008v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

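Scoring rationale: The paper's core contribution is a new RAG framework called HCQR that aims to improve retrieval for LLMs in decision tasks. It is therefore highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" and "Large Language Models OR LLMs OR Foundation Models" (10 each). The experiments use medical question-answering datasets (MedQA, MMLU-Med), an application of AI in science (specifically biomedicine), so the paper has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5). The paper does not involve the specific techniques described by the other keywords (such as MoE, quantization, alignment, or reasoning methods), so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper targets the problem that existing retrieval-augmented generation (RAG) methods retrieve insufficient evidence for decision tasks. It proposes Hypothesis-Conditioned Query Rewriting (HCQR), a training-free framework that improves evidence retrieval by generating three targeted queries, and reports significant accuracy gains on medical question-answering benchmarks.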

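Abstract (Translation)

Retrieval-Augmented Generation (RAG) improves the performance of large language models (LLMs) by anchoring generation in external, non-parametric knowledge. However, when a task requires choosing among several competing options, grounding generation only in broadly relevant context is often not enough to drive the final decision. Existing RAG methods usually rely on a single initial query, which tends to favor topical relevance over decision-relevant evidence, so the retrieved background information may fail to discriminate among answer options. To address this, the paper proposes Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, then rewrites retrieval into three targeted queries that seek evidence to (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify key clues in the question. This aligns context retrieval more directly with answer selection and lets the generator confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at https://anonymous.4open.science/r/HCQR-1C2E.

Abstract (Original)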

Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at https://anonymous.4open.science/r/HCQR-1C2E.
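To make the three-query structure concrete, here is a minimal sketch of how one working hypothesis could fan out into support, distinguish, and verify queries. It is not the authors' implementation: the abstract does not say how the hypothesis or the queries are produced, so the string templates, the `RewrittenQueries` type, and the hand-supplied hypothesis and clues below are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RewrittenQueries:
    support: str      # evidence for the working hypothesis
    distinguish: str  # evidence separating it from the other options
    verify: str       # checks on salient clues in the question

def rewrite_queries(question, options, hypothesis, clues):
    """Build three targeted retrieval queries from one working hypothesis."""
    alternatives = [o for o in options if o != hypothesis]
    return RewrittenQueries(
        support=f"evidence supporting '{hypothesis}' for: {question}",
        distinguish=(
            f"features that distinguish '{hypothesis}' from "
            + ", ".join(alternatives)
        ),
        verify="significance of: " + "; ".join(clues),
    )

# Made-up example question, options, hypothesis, and clues (all assumptions).
queries = rewrite_queries(
    question="Fever and a new heart murmur: most likely diagnosis?",
    options=["Infective endocarditis", "Rheumatic fever", "Myocarditis"],
    hypothesis="Infective endocarditis",
    clues=["fever", "new heart murmur"],
)
```

Each of the three strings would be sent to the retriever separately, and the generator would then see the pooled evidence and either confirm or overturn the hypothesis.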

Keywords: Retrieval-Augmented Generation, Large Language Models, Query Rewriting, Evidence Retrieval, Decision Support, Medical QA, Hypothesis-Conditioned, Pre-retrieval Framework

53. ❌ AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Authors: An Luo, Jin Du, Xun Xian, Robert Specht, Fangqiao Tian, Ganghua Wang, Xuan Bi, Charles Fleming, Ashish Kundu, Jayanth Srinivasa, Mingyi Hong, Rui Zhang, Tianxi Li, Galin Jones, Jie Ding Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19005v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

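Scoring rationale: The paper explicitly discusses the use of LLMs and AI agents in data science workflows, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" and "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 each). It covers data science applications in several domains, including healthcare, giving it some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5), though it does not go into specific bioinformatics or cheminformatics techniques. Other keywords such as MoE, SFT, and RAG are neither mentioned nor implied in the abstract and score 0.

!!! tip deepseek-chat TL;DR

Using the AgentDS benchmark, the paper evaluates AI agents and human-AI collaboration on domain-specific data science tasks. It finds that current AI agents struggle with domain reasoning and that the strongest solutions come from human-AI collaboration, underscoring the continued importance of human expertise in data science.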

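Abstract (Translation)

Data science plays a key role in turning complex data into actionable insights across many domains. Recent advances in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. However, it remains unclear to what extent AI agents can match human experts on domain-specific data science tasks, and in which respects human expertise continues to provide an advantage. We introduce AgentDS, a benchmark and competition platform designed to evaluate both AI agents and human-AI collaboration in domain-specific data science. AgentDS comprises 17 challenges from six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We held an open competition with 29 teams and 80 participants, enabling a systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents still struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions come from human-AI collaboration. These findings challenge the narrative of complete automation by AI, underscore the enduring importance of human expertise in data science, and point to directions for the next generation of AI. The AgentDS website is at https://agentds.org/ and the open-source datasets are at https://huggingface.co/datasets/lainmn/AgentDS.

Abstract (Original)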

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflow. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. Visit the AgentDS website here: https://agentds.org/ and open source datasets here: https://huggingface.co/datasets/lainmn/AgentDS .

Keywords: data science, large language models, AI agents, human-AI collaboration, domain-specific tasks, benchmark, competition, workflow automation

54. ❌ Evaluating Game Difficulty in Tetris Block Puzzle

Authors: Chun-Jui Wang, Jian-Ting Guo, Hung Guei, Chung-Chin Shih, Ti-Rong Wu, I-Chen Wu Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18994v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

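Scoring rationale: The paper studies game difficulty evaluation for Tetris Block Puzzle variants, using Stochastic Gumbel AlphaZero (SGAZ) as the evaluation tool. Every keyword relates to large models, the principles of deep learning techniques, or AI applications in science, whereas this paper focuses on game AI and planning algorithms and does not involve any large-model technique, training method, inference optimization, alignment, compression, agent system, or AI-for-science application, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper uses Stochastic Gumbel AlphaZero to evaluate the difficulty of Tetris Block Puzzle variants, finding that more holding blocks and preview blocks reduce difficulty, while adding more block variants (especially the T-pentomino) increases it.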

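Abstract (Translation)

Tetris Block Puzzle is a single-player stochastic puzzle game in which the player places blocks on an 8×8 grid to complete and clear lines; its popular variants have accumulated tens of millions of downloads. Despite this wide audience, there is still no systematic assessment of which rule sets are more challenging. Inspired by prior work that uses AlphaZero as a strong evaluator for chess variants, we use Stochastic Gumbel AlphaZero (SGAZ), a budget-aware planning agent for stochastic environments, to study difficulty in this domain. Using metrics such as training reward and convergence iterations, we evaluate rule changes including the number of holding blocks h, the number of preview holding blocks p, and additional Tetris block variants. Experiments show that increasing h and p reduces difficulty (higher reward and faster convergence), while adding more Tetris block variants increases difficulty, with the T-pentomino causing the largest convergence slowdown. Our analysis shows that SGAZ achieves strong play under small simulation budgets, supporting efficient, reproducible comparisons across rule sets and providing a reference for the future design of stochastic puzzle games.

Abstract (Original)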

Tetris Block Puzzle is a single player stochastic puzzle in which a player places blocks on an 8 x 8 grid to complete lines; its popular variants have amassed tens of millions of downloads. Despite this reach, there is little principled assessment of which rule sets are more difficult. Inspired by prior work that uses AlphaZero as a strong evaluator for chess variants, we study difficulty in this domain using Stochastic Gumbel AlphaZero (SGAZ), a budget-aware planning agent for stochastic environments. We evaluate rule changes including holding block h, preview holding block p, and additional Tetris block variants using metrics such as training reward and convergence iterations. Empirically, increasing h and p reduces difficulty (higher reward and faster convergence), while adding more Tetris block variants increases difficulty, with the T-pentomino producing the largest slowdown. Through analysis, SGAZ delivers strong play under small simulation budgets, enabling efficient, reproducible comparisons across rule sets and providing a reference for future design in stochastic puzzle games.

Keywords: Tetris Block Puzzle, game difficulty evaluation, Stochastic Gumbel AlphaZero, SGAZ, stochastic environments, rule sets, planning agent, simulation budgets

55. ❌ Regret Bounds for Competitive Resource Allocation with Endogenous Costs

Authors: Rui Chai Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18999v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

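Scoring rationale: The paper studies online resource allocation and belongs to online learning, optimization theory, and game theory; its core content is a regret-bound analysis of competitive resource allocation algorithms with endogenous costs. It does not involve large models, deep learning, AI for Science, or related techniques (such as MoE, RLHF, RAG, or quantization), nor does it discuss LLM reasoning, alignment, or agent systems. All keywords concern large-model techniques or their scientific applications and are unrelated to this paper, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies online resource allocation with endogenous costs and analyzes the regret bounds of three allocation paradigms (uniform, gated, and competitive) under adversarial sequences. It finds that competitive allocation achieves the tightest of the three bounds, O(√(T log N)), by exploiting interaction feedback, and shows that the interaction topology plays a key role in the computation-regret tradeoff.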

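Abstract (Translation)

This paper studies online resource allocation among N interacting modules over T rounds. Unlike standard online optimization, costs in this model are endogenous: they depend on the full allocation vector through an interaction matrix W that encodes pairwise cooperation and competition.

We analyze three paradigms: (I) uniform allocation (cost-ignorant), (II) gated allocation (cost-estimating), and (III) competitive allocation via multiplicative weights update with interaction feedback (cost-revealing). Our main results establish a strict separation under adversarial sequences with bounded variation: uniform allocation incurs Ω(T) regret, gated allocation achieves O(T^{2/3}) regret, and competitive allocation achieves O(sqrt(T log N)) regret. The gap arises because competitive allocation can exploit the endogenous cost information revealed through interactions.

We further show that the topology of W governs a computation-regret tradeoff. Full interaction (|E|=O(N^2)) yields the tightest regret bound but the highest per-step computational cost, while sparse topologies (|E|=O(N)) increase regret by at most O(sqrt(log N)) while reducing per-step cost from O(N^2) to O(N). Ring-structured topologies with both cooperative and competitive links, of which the five-element Wuxing topology is the canonical example, minimize the product of computational cost and regret.

These results provide the first formal regret-theoretic justification for decentralized competitive allocation in modular architectures and establish cost endogeneity as a fundamental challenge distinct from partial observability.

Abstract (Original)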

We study online resource allocation among N interacting modules over T rounds. Unlike standard online optimization, costs are endogenous: they depend on the full allocation vector through an interaction matrix W encoding pairwise cooperation and competition. We analyze three paradigms: (I) uniform allocation (cost-ignorant), (II) gated allocation (cost-estimating), and (III) competitive allocation via multiplicative weights update with interaction feedback (cost-revealing). Our main results establish a strict separation under adversarial sequences with bounded variation: uniform incurs Omega(T) regret, gated achieves O(T^{2/3}), and competitive achieves O(sqrt(T log N)). The performance gap stems from competitive allocation’s ability to exploit endogenous cost information revealed through interactions. We further show that W’s topology governs a computation-regret tradeoff. Full interaction (|E|=O(N^2)) yields the tightest bound but highest per-step cost, while sparse topologies (|E|=O(N)) increase regret by at most O(sqrt(log N)) while reducing per-step cost from O(N^2) to O(N). Ring-structured topologies with both cooperative and competitive links - of which the five-element Wuxing topology is canonical - minimize the computation x regret product. These results provide the first formal regret-theoretic justification for decentralized competitive allocation in modular architectures and establish cost endogeneity as a fundamental challenge distinct from partial observability.
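The third paradigm, a multiplicative weights update driven by interaction feedback, can be illustrated with a short sketch. The update rule itself is the standard one; the cost model is an assumption made here, since the abstract only says costs depend on the full allocation through W. Below, the cost vector is taken to be `base_cost + W @ x`, the matrix W is random, and the function name is hypothetical.

```python
import numpy as np

def competitive_allocation(W, T, base_cost=0.5):
    """Multiplicative weights over N modules with allocation-dependent costs."""
    N = W.shape[0]
    eta = np.sqrt(np.log(N) / T)   # step size of the usual sqrt(log N / T) order
    x = np.full(N, 1.0 / N)        # start from the uniform allocation
    history = np.empty((T, N))
    for t in range(T):
        history[t] = x
        # Assumed endogenous cost: depends on the whole allocation through W.
        cost = base_cost + W @ x
        x = x * np.exp(-eta * cost)  # down-weight modules with higher cost
        x /= x.sum()                 # renormalize onto the simplex
    return history

# Random 5-module interaction matrix (assumption). With entries in
# [-0.5, 0.5] and x on the simplex, every cost stays within [0, 1].
rng = np.random.default_rng(0)
W = rng.uniform(-0.5, 0.5, size=(5, 5))
history = competitive_allocation(W, T=200)
```

With a dense W each round costs O(N^2) for the product `W @ x`; a sparse W with O(N) nonzero entries would bring that down to O(N), which is the computational side of the tradeoff the abstract describes.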

Keywords: online learning, regret bounds, resource allocation, endogenous costs, interaction topology, multiplicative weights, modular systems, Wuxing topology

56. ❌ Foundations of Schrödinger Bridges for Generative Modeling

Authors: Sophia Tang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18992v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

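Scoring rationale: The paper focuses on the mathematical foundations of Schrödinger bridges and their theoretical framework for generative modeling, drawing on optimal transport, stochastic control, and path-space optimization, but does not mention any large model, deep learning technique, AI application, or specific implementation, and so has no direct connection to any keyword.

!!! tip deepseek-chat TL;DR

The paper establishes the mathematical foundations of the Schrödinger bridge problem, connects it to modern generative modeling methods such as diffusion models, score-based models, and flow matching, and provides a toolkit for constructing Schrödinger bridges from first principles, from which generalized and task-specific computational methods are derived.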

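Abstract (Translation)

The core task of modern generative modeling frameworks, including diffusion models, score-based models, and flow matching, is to transform a simple prior distribution into a complex target distribution through stochastic paths in probability space. Schrödinger bridges provide a unifying theoretical foundation for these methods, framing the problem as finding an optimal stochastic bridge that satisfies the marginal distribution constraints while deviating minimally, in the entropic sense, from a predefined reference process. This guide systematically develops the mathematical foundations of the Schrödinger bridge problem from the perspectives of optimal transport, stochastic control, and path-space optimization, and focuses on its dynamic formulation, which connects directly to contemporary generative modeling. We build a comprehensive toolkit for constructing Schrödinger bridges from first principles and show how these constructions give rise to general and task-specific computational methods.

Abstract (Original)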

At the core of modern generative modeling frameworks, including diffusion models, score-based models, and flow matching, is the task of transforming a simple prior distribution into a complex target distribution through stochastic paths in probability space. Schrödinger bridges provide a unifying principle underlying these approaches, framing the problem as determining an optimal stochastic bridge between marginal distribution constraints with minimal-entropy deviations from a pre-defined reference process. This guide develops the mathematical foundations of the Schrödinger bridge problem, drawing on optimal transport, stochastic control, and path-space optimization, and focuses on its dynamic formulation with direct connections to modern generative modeling. We build a comprehensive toolkit for constructing Schrödinger bridges from first principles, and show how these constructions give rise to generalized and task-specific computational methods.
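For orientation, the problem the abstract describes is commonly written as a constrained entropy minimization over path measures. The notation below is an assumption chosen for this note and is not taken from the paper: $\mathbb{Q}$ is the reference process, $\pi_0$ and $\pi_T$ are the prescribed marginals, and $u_t$ is a control added to a reference drift $f$ with noise scale $\sigma$.

$$
\mathbb{P}^{\star} = \arg\min_{\mathbb{P}} \ \mathrm{KL}\left(\mathbb{P} \,\|\, \mathbb{Q}\right)
\quad \text{subject to} \quad \mathbb{P}_0 = \pi_0, \ \mathbb{P}_T = \pi_T
$$

When the reference process is the diffusion $dX_t = f(X_t, t)\,dt + \sigma\,dW_t$ started from $\pi_0$, the same problem has a standard stochastic control form:

$$
\min_{u} \ \mathbb{E}\left[\int_0^T \tfrac{1}{2}\,\|u_t\|^2 \, dt\right]
\quad \text{subject to} \quad
dX_t = \left[f(X_t, t) + \sigma\,u_t\right]dt + \sigma\,dW_t, \quad X_0 \sim \pi_0, \ X_T \sim \pi_T
$$

The two agree because, by Girsanov's theorem, the relative entropy between the controlled and reference path measures equals the expected control energy when both start from the same initial distribution.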

Keywords: Schrödinger bridges, generative modeling, diffusion models, score-based models, flow matching, optimal transport, stochastic control, path-space optimization

57. ❌ PRIOR: Perceptive Learning for Humanoid Locomotion with Reference Gait Priors

Authors: Chenxi Han, Shilu He, Yi Cheng, Linqi Ye, Houde Liu Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18979v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

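Scoring rationale: The paper studies perceptive locomotion control for humanoid robots, using a GRU state estimator, a parametric gait generator, and terrain-adaptive rewards; it belongs to robotics. All keywords concern large models, deep learning technical principles, or AI for Science applications, whereas this paper neither uses nor mentions any large model, any innovation in deep learning technical principles, or any AI for Science application, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the PRIOR framework, which combines a parametric gait generator, a GRU state estimator, and terrain-adaptive rewards to address the challenge of robust, natural-gait humanoid locomotion over complex terrain, achieving a 100% traversal success rate across multiple terrain types.

Abstract (Chinese translation)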

训练能够以自然步态穿越复杂地形的感知型仿人机器人运动策略仍是一个开放挑战，通常需要多阶段训练流程、对抗性目标或大量的现实世界校准。我们提出PRIOR，这是一个基于Isaac Lab构建的高效且可复现的框架，通过简洁而有效的设计实现了具有类人步态的鲁棒地形穿越：(i) 一个参数化步态生成器，无需对抗训练即可提供源自动作捕捉的稳定参考轨迹；(ii) 一个基于GRU的状态估计器，通过自监督的高度图重建，直接从以自我为中心的深度图像推断地形几何；(iii) 地形自适应脚步奖励，引导足部落脚点朝向可穿越区域。通过对深度图像分辨率权衡的系统性分析，我们确定了在实时性约束下最大化地形保真度的配置方案，在未降低穿越性能的前提下显著减少了感知开销。在包括楼梯、箱体和间隙在内的不同难度地形上进行综合实验表明，每个组件都能带来互补且关键的性能提升，完整框架实现了100%的穿越成功率。我们将开源完整的PRIOR框架，包括训练流程、参数化步态生成器和评估基准，旨在为Isaac Lab上的仿人机器人运动研究提供一个可复现的基础。

Abstract

Training perceptive humanoid locomotion policies that traverse complex terrains with natural gaits remains an open challenge, typically demanding multi-stage training pipelines, adversarial objectives, or extensive real-world calibration. We present PRIOR, an efficient and reproducible framework built on Isaac Lab that achieves robust terrain traversal with human-like gaits through a simple yet effective design: (i) a parametric gait generator that supplies stable reference trajectories derived from motion capture without adversarial training, (ii) a GRU-based state estimator that infers terrain geometry directly from egocentric depth images via self-supervised heightmap reconstruction, and (iii) terrain-adaptive footstep rewards that guide foot placement toward traversable regions. Through systematic analysis of depth image resolution trade-offs, we identify configurations that maximize terrain fidelity under real-time constraints, substantially reducing perceptual overhead without degrading traversal performance. Comprehensive experiments across terrains of varying difficulty (including stairs, boxes, and gaps) demonstrate that each component yields complementary and essential performance gains, with the full framework achieving a 100% traversal success rate. We will open-source the complete PRIOR framework, including the training pipeline, parametric gait generator, and evaluation benchmarks, to serve as a reproducible foundation for humanoid locomotion research on Isaac Lab.

Keywords: humanoid locomotion, perceptive learning, terrain traversal, parametric gait generator, GRU state estimator, terrain-adaptive rewards, depth image perception, Isaac Lab framework

58. ❌ Unmasking Algorithmic Bias in Predictive Policing: A GAN-Based Simulation Framework with Multi-City Temporal Analysis

Authors: Pronob Kumar Barman, Pronoy Kumar Barman Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18987v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

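Scoring rationale: The paper studies algorithmic bias in predictive policing systems, using GANs and statistical models for simulation analysis; it belongs to social science, public policy, and algorithmic fairness. All scoring keywords focus on large models, deep learning technical principles, and innovative applications in science, whereas this paper touches none of these topics: it does not use or study any language model, model architecture, training method, inference technique, agent system, model optimization, or AI for Science application. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper uses a simulation framework that combines a generative adversarial network with a Noisy-OR patrol detection model to quantify how racial bias propagates and is amplified in predictive policing systems in US cities. It finds significant racial disparities, shows that outcomes are most sensitive to officer deployment levels, and concludes that technical debiasing alone cannot eliminate structural inequality.

Abstract (Chinese translation)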

基于算法生成犯罪预测来调配巡逻资源的预测性警务系统已在美国各城市广泛部署，但其编码并放大种族差异的倾向在量化层面仍缺乏充分理解。本研究提出一个可复现的模拟框架，将生成对抗网络（GAN）与带噪声的“或”门（Noisy OR）巡逻检测模型相结合，以量化衡量种族偏见如何从犯罪发生到警察接触的完整执法链条中传递。通过使用巴尔的摩市2017至2019年超过14.5万条一类犯罪记录、芝加哥市2022年超过23.3万条记录，并结合美国社区调查（ACS）人口统计数据，我们在264个“城市-年份-模式”观测点上计算了四项月度偏差指标：差异影响比（DIR）、人口均等差距、基尼系数以及综合偏差放大分数。

实验结果显示，巴尔的摩的检测模式存在极端且逐年变化的偏差，其年度平均DIR在2019年高达15714；芝加哥则呈现对黑人居民的中度低检测现象（DIR=0.22）；所有情境下的基尼系数持续处于0.43至0.62之间。我们进一步证明，条件表格生成对抗网络（CTGAN）去偏方法虽能部分重新分配检测率，但若缺乏配套政策干预则无法消除结构性差异。社会经济回归分析证实，社区种族构成与检测可能性存在强相关性（白人比例皮尔逊相关系数r=0.83，黑人比例r=-0.81）。针对巡逻半径、警员数量及民众报案概率的敏感性分析表明，执法结果对警力部署水平最为敏感。代码与数据已在此存储库公开。

Abstract

Predictive policing systems that direct patrol resources based on algorithmically generated crime forecasts have been widely deployed across US cities, yet their tendency to encode and amplify racial disparities remains poorly understood in quantitative terms. We present a reproducible simulation framework that couples a Generative Adversarial Network (GAN) with a Noisy-OR patrol detection model to measure how racial bias propagates through the full enforcement pipeline from crime occurrence to police contact. Using 145,000+ Part 1 crime records from Baltimore (2017 to 2019) and 233,000+ records from Chicago (2022), augmented with US Census ACS demographic data, we compute four monthly bias metrics across 264 city-year-mode observations: the Disparate Impact Ratio (DIR), Demographic Parity Gap, Gini Coefficient, and a composite Bias Amplification Score. Our experiments reveal extreme and year-variant bias in Baltimore's detected mode, with mean annual DIR up to 15,714 in 2019, moderate under-detection of Black residents in Chicago (DIR = 0.22), and persistent Gini coefficients of 0.43 to 0.62 across all conditions. We further demonstrate that a Conditional Tabular GAN (CTGAN) debiasing approach partially redistributes detection rates but cannot eliminate structural disparity without accompanying policy intervention. Socioeconomic regression analysis confirms strong correlations between neighborhood racial composition and detection likelihood (Pearson r = 0.83 for percent White and r = -0.81 for percent Black). A sensitivity analysis over patrol radius, officer count, and citizen reporting probability reveals that outcomes are most sensitive to officer deployment levels. The code and data are publicly available at this repository.
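
The abstract names a Noisy-OR detection model and several bias metrics without giving formulas. As a rough illustration only, the sketch below uses the textbook definitions: Noisy-OR as the probability that at least one independent source detects an event, the Disparate Impact Ratio as a ratio of two group rates, the Demographic Parity Gap as their absolute difference, and the Gini coefficient over per-area detection counts. The function names, the choice of which group goes in the numerator, and the sample numbers are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def noisy_or(probabilities: Sequence[float]) -> float:
    """Probability that at least one independent source detects the event."""
    miss = 1.0
    for p in probabilities:
        miss *= 1.0 - p  # every source must miss for the event to go undetected
    return 1.0 - miss


def disparate_impact_ratio(rate_a: float, rate_b: float) -> float:
    """Ratio of two group rates (which group is the numerator is an assumption)."""
    return rate_a / rate_b


def demographic_parity_gap(rate_a: float, rate_b: float) -> float:
    """Absolute difference between two group rates."""
    return abs(rate_a - rate_b)


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Made-up numbers, only to show the calls.
detect = noisy_or([0.5, 0.5])            # two patrols, each detecting with p = 0.5
dir_value = disparate_impact_ratio(0.11, 0.50)
gap = demographic_parity_gap(0.11, 0.50)
concentration = gini([0, 0, 0, 1])       # all detections in one of four areas
```

A ratio far from 1 or a Gini value close to 1 signals strong disparity or concentration; how the paper aggregates these monthly values into its composite score is not stated in the abstract.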

Keywords: predictive policing, algorithmic bias, Generative Adversarial Network, racial disparities, simulation framework, bias amplification, demographic parity, socioeconomic analysis

59. ❌ Evaluating 5W3H Structured Prompting for Intent Alignment in Human-AI Interaction

Authors: Peng Gang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18976v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

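Scoring rationale: The paper's core topic is intent alignment of large language models (LLMs) in human-AI interaction, which directly involves the keywords 'Large Language Models' and 'Instruction Tuning/Alignment', so both receive 10 points. The paper evaluates three LLMs (DeepSeek-V3, Qwen-Max, Kimi) under different prompt conditions and proposes the 'goal_alignment' evaluation dimension, which falls within alignment research. Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference techniques, compression and acceleration, and scientific AI applications, are not covered by the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how PPS, a 5W3H-based structured prompting framework, improves intent alignment in human-AI interaction. It finds that natural-language-rendered PPS outperforms simple prompts and raw JSON on the goal-alignment metric, and a preliminary survey suggests it substantially reduces follow-up prompt rounds, with the effect most pronounced on tasks where user intent is ambiguous.

Abstract (Chinese translation)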

自然语言提示常遭受意图传递损耗：即用户实际需求与其向AI系统传达内容之间的差距。本研究评估了PPS（提示协议规范），这是一个基于5W3H框架、用于人机交互中结构化意图表示的方法。通过一项受控三条件实验，涵盖三个领域（商业、技术和旅行）的60项任务、三个大语言模型（DeepSeek-V3、Qwen-Max和Kimi）以及三种提示条件——（A）简单提示，（B）原始PPS JSON格式，（C）自然语言渲染的PPS格式，我们收集了540份AI生成输出，并由LLM评审员进行评估。我们引入了以用户意图为中心的评价维度goal_alignment（目标对齐度），发现渲染后的PPS在此指标上优于简单提示和原始JSON格式。PPS的增益具有任务依赖性：在高模糊性的商业分析任务中增益显著，但在低模糊性的旅行规划任务中则出现反向效果。我们还发现了标准LLM评估中的测量不对称性——无约束提示可能虚增约束遵循分数，从而掩盖结构化提示的实际价值。一项初步回顾性调查（N=20）进一步表明，所需后续提示轮数减少了66.1%，从平均3.33轮降至1.13轮。这些发现表明，结构化意图表征能够提升人机交互中的目标对齐度和可用性，尤其在用户意图本身具有模糊性的任务中效果显著。

Abstract

Natural language prompts often suffer from intent transmission loss: the gap between what users actually need and what they communicate to AI systems. We evaluate PPS (Prompt Protocol Specification), a 5W3H-based framework for structured intent representation in human-AI interaction. In a controlled three-condition study across 60 tasks in three domains (business, technical, and travel), three large language models (DeepSeek-V3, Qwen-Max, and Kimi), and three prompt conditions - (A) simple prompts, (B) raw PPS JSON, and (C) natural-language-rendered PPS - we collect 540 AI-generated outputs evaluated by an LLM judge. We introduce goal_alignment, a user-intent-centered evaluation dimension, and find that rendered PPS outperforms both simple prompts and raw JSON on this metric. PPS gains are task-dependent: gains are large in high-ambiguity business analysis tasks but reverse in low-ambiguity travel planning. We also identify a measurement asymmetry in standard LLM evaluation, where unconstrained prompts can inflate constraint adherence scores and mask the practical value of structured prompting. A preliminary retrospective survey (N = 20) further suggests a 66.1% reduction in follow-up prompts required, from 3.33 to 1.13 rounds. These findings suggest that structured intent representations can improve alignment and usability in human-AI interaction, especially in tasks where user intent is inherently ambiguous.
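
To make the difference between conditions (B) and (C) concrete, the sketch below renders a structured intent record into short natural-language sentences and re-derives the reported reduction in follow-up rounds. The field names follow one common reading of 5W3H (what, why, who, when, where, how, how much, how many); the actual PPS schema and rendering rules are not given in the abstract, so `FIELD_LABELS`, `render_spec`, and the sample record are hypothetical.

```python
from typing import Dict

# Hypothetical mapping from 5W3H keys to human-readable labels.
FIELD_LABELS = {
    "what": "Task",
    "why": "Purpose",
    "who": "Audience",
    "when": "Timeframe",
    "where": "Context",
    "how": "Approach",
    "how_much": "Budget or scope",
    "how_many": "Quantity",
}


def render_spec(spec: Dict[str, str]) -> str:
    """Turn a structured intent record into plain sentences, skipping empty slots."""
    parts = [
        f"{label}: {spec[key]}."
        for key, label in FIELD_LABELS.items()
        if key in spec
    ]
    return " ".join(parts)


def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Made-up record, only to show the rendering step.
prompt = render_spec({"what": "Plan a trip", "who": "a family of four"})

# Reported means from the survey: 3.33 follow-up rounds down to 1.13.
reduction = relative_reduction(3.33, 1.13)
print(f"{reduction:.1%}")  # 66.1%
```

The arithmetic confirms that the 66.1% figure is the relative drop between the two reported means.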

Keywords: structured prompting, intent alignment, human-AI interaction, large language models, 5W3H framework, goal alignment, prompt evaluation, user intent representation

60. ❌ Teleological Inference in Structural Causal Models via Intentional Interventions

Authors: Dario Compagno, Fabio Massimo Zennaro Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18968v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

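Scoring rationale: The paper focuses on a theoretical extension of structural causal models (SCMs), proposing a new 'intentional intervention' operator and the structural final model (SFM) for detecting agents and their intentions. Its content is purely theoretical research on causal reasoning and formal modeling and does not involve deep learning, large models, AI technical principles, or a specific application domain (such as bioinformatics). All scoring keywords relate to large-model techniques, AI applications, or associated methodology, whereas this paper studies the formal theory of causal models and is entirely unrelated to these keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a new formal framework based on structural causal models that, by introducing an intentional intervention operator and the structural final model, can detect agents in a causal system and infer their intentions.

Abstract (Chinese translation)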

结构因果模型（SCMs）最初被提出用于表述和回答因果问题。本文表明，SCMs同样可用于表述和回答目的论问题——即关于一个具有状态感知、目标导向的智能体在因果系统中实施干预的意图问题。我们回顾了以往对此类智能体建模方法的局限性，进而引入了意向性干预这一新的时间无关算子，该算子可推导出一个我们称为结构目的模型（SFM）的双生子SCM。SFMs将观测值视为意向性干预的结果，并将其与这些干预的反事实条件（若智能体未实施干预将会发生的情况）联系起来。我们展示了如何利用SFMs在实证中检测智能体并推断其意图。

Abstract

Structural causal models (SCMs) were conceived to formulate and answer causal questions. This paper shows that SCMs can also be used to formulate and answer teleological questions, concerning the intentions of a state-aware, goal-directed agent intervening in a causal system. We review limitations of previous approaches to modeling such agents, and then introduce intentional interventions, a new time-agnostic operator that induces a twin SCM we call a structural final model (SFM). SFMs treat observed values as the outcome of intentional interventions and relate them to the counterfactual conditions of those interventions (what would have happened had the agent not intervened). We show how SFMs can be used to empirically detect agents and to discover their intentions.

Keywords: structural causal models, teleological inference, intentional interventions, structural final model, agent detection, goal-directed agent, causal systems, counterfactual analysis

61. ❌ Improving moment tensor solutions under Earth structure uncertainty with simulation-based inference

Authors: A. A. Saoulis, T. -S. Pham, A. M. G. Ferreira Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18925v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

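Scoring rationale: The paper studies moment tensor inversion in seismology, using simulation-based inference (SBI) and deep learning to handle Earth structure uncertainty. Its core is geophysics and the application of machine learning to scientific computing, which falls under 'AI for Science', so that keyword receives 5 points. All other keywords concern large language models (LLMs) and related techniques (such as fine-tuning, alignment, inference optimization, and agents), whereas this paper does not involve LLMs or natural language processing at all and only uses general deep learning for numerical simulation and inversion, so the other 26 keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes using simulation-based inference (SBI) and deep learning to improve seismic moment tensor inversion under Earth structure uncertainty, producing more reliable and better-calibrated posterior solutions than the conventional Gaussian approach.

Abstract (Chinese translation)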

贝叶斯推断为在全波形矩张量反演中纳入地球结构不确定性提供了一种原则性方法，但传统途径通常需要引入大量近似，这可能导致解产生偏差。我们提出一种基于模拟推理的稳健方法来处理理论误差，该机器学习方法通过经验建模来刻画理论误差对观测数据的影响。此框架在保持贝叶斯推断严谨性的同时，避免了对不确定性函数形式的限制性假设。我们首先证明，在轻微（1-3%）的一维地球模型不确定性下，常用的理论误差高斯参数化模型会失效。为解决此问题，我们发展了两种利用模拟推理提高矩张量解质量的范式：一种基于对理论误差的物理洞察，另一种采用端到端的深度学习算法。随后，我们比较了标准高斯方法与模拟推理在矩张量反演中的结果，并证明高斯假设会引入偏差且严重低估矩张量的不确定性。我们还发现，这些效应在反演短周期数据以及处理浅源各向同性事件时尤为突出。相比之下，模拟推理能够产生更可靠、校准更优的地震震源机制后验分布。最后，我们将该方法成功应用于两个经过深入研究的中等震级地震：1997年长谷破火山口（Long Valley Caldera）火山地震序列中的一次事件，以及2020年萨格勒布地震。

Abstract

Bayesian inference represents a principled way to incorporate Earth structure uncertainty in full-waveform moment tensor inversions, but traditional approaches generally require significant approximations that risk biasing the resulting solutions. We introduce a robust method for handling theory errors using simulation-based inference (SBI), a machine learning approach that empirically models their impact on the observations. This framework retains the rigour of Bayesian inference while avoiding restrictive assumptions about the functional form of the uncertainties. We begin by demonstrating that the common Gaussian parametrisation of theory errors breaks down under minor (1-3%) 1-D Earth model uncertainty. To address this issue, we develop two formalisms for utilising SBI to improve the quality of the moment tensor solutions: one using physics-based insights into the theory errors, and another utilising an end-to-end deep learning algorithm. We then compare the results of moment tensor inversion with the standard Gaussian approach and SBI, and demonstrate that Gaussian assumptions induce bias and significantly under-report moment tensor uncertainties. We also show that these effects are particularly problematic when inverting short period data and for shallow, isotropic events. On the other hand, SBI produces more reliable, better calibrated posteriors of the earthquake source mechanism. Finally, we successfully apply our methodology to two well studied moderate magnitude earthquakes: one from the 1997 Long Valley Caldera volcanic earthquake sequence, and the 2020 Zagreb earthquake.

Keywords: moment tensor inversion, simulation-based inference, Bayesian inference, Earth structure uncertainty, deep learning, theory errors, earthquake source mechanism, full-waveform inversion

62. ❌ Agentic Business Process Management: A Research Manifesto

Authors: Diego Calvanese, Angelo Casciani, Giuseppe De Giacomo, Marlon Dumas, Fabiana Fournier, Timotheus Kampik, Emanuele La Malfa, Lior Limonad, Andrea Marrella, Andreas Metzger, Marco Montali, Daniel Amyot, Peter Fettke, Artem Polyvyanyy, Stefanie Rinderle-Ma, Sebastian Sardiña, Niek Tax, Barbara Weber Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18916v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

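Scoring rationale: The paper studies Agentic Business Process Management (APM), a framework that combines business process management with autonomous agent systems. It is unrelated to most keywords, because they mainly concern technical details of large models (training methods, optimization techniques, inference acceleration, and so on), whereas this paper focuses on the architecture and governance of agent systems in business process management. It relates to only three keywords: 1) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points): the paper centers on the role of autonomous agents in workflows, so it is highly relevant; 2) 'Multi-agent Systems OR Agent Coordination' (10 points): the paper explicitly involves multi-agent system coordination; 3) 'Mechanistic Interpretability OR Explainable AI' (5 points): the paper names agent 'explainability' as one of the key capabilities, so there is some connection. Other keywords, such as large-model techniques and scientific AI applications, are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes the Agentic Business Process Management (APM) framework, which extends traditional business process management by introducing autonomous agents, ensures that agents act autonomously within the organization's process frames while staying aligned with its goals, and identifies the key capabilities and research challenges involved in realizing the framework.

Abstract (Chinese translation)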

本文提出一份宣言，系统阐述了自主智能体业务流程管理（Agentic Business Process Management，简称APM）的概念基础。APM是业务流程管理（BPM）的延伸，旨在管理组织中执行流程的自主智能体。从管理视角看，APM代表着从传统业务流程观向新型范式的转变，其驱动力在于流程意识的实现以及面向智能体的抽象——在此框架下，软件与人类智能体作为核心功能实体，在明确的流程框架内进行感知、推理与行动。这一视角标志着传统以自动化为中心的BPM向新型系统的演进，在新系统中，自主性通过流程意识受到约束、校准并实现可操作化。

我们介绍了实现APM系统所需的核心抽象与架构要素，并详细阐述了APM智能体必须具备的四项关键能力：框架化自主性、可解释性、对话可操作性及自我调适能力。这些能力共同确保智能体的目标与组织目标保持一致，并使智能体在追求目标时既受框架约束又能主动作为。我们探讨了这些能力的可实现程度，指出了当前研究面临的挑战，这些挑战的解决需要BPM、人工智能和多智能体系统领域的进一步突破。本宣言旨在为连接这些学术共同体提供路线图，并为实践中APM系统的开发提供指导。

Abstract

This paper presents a manifesto that articulates the conceptual foundations of Agentic Business Process Management (APM), an extension of Business Process Management (BPM) for governing autonomous agents executing processes in organizations. From a management perspective, APM represents a paradigm shift from the traditional process view of the business process, driven by the realization of process awareness and an agent-oriented abstraction, where software and human agents act as primary functional entities that perceive, reason, and act within explicit process frames. This perspective marks a shift from traditional, automation-oriented BPM toward systems in which autonomy is constrained, aligned, and made operational through process awareness. We introduce the core abstractions and architectural elements required to realize APM systems and elaborate on four key capabilities that such APM agents must support: framed autonomy, explainability, conversational actionability, and self-modification. These capabilities jointly ensure that agents’ goals are aligned with organizational goals and that agents behave in a framed yet proactive manner in pursuing those goals. We discuss the extent to which the capabilities can be realized and identify research challenges whose resolution requires further advances in BPM, AI, and multi-agent systems. The manifesto thus serves as a roadmap for bridging these communities and for guiding the development of APM systems in practice.

Keywords: Agentic Business Process Management, autonomous agents, multi-agent systems, process awareness, framed autonomy, explainability, organizational goals, BPM

63. ❌ Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

Authors: Vedant Pandya Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18911v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

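Scoring rationale: The paper's core topic is a knowledge-grounded dialogue system built on large language models (LLMs). It mainly involves supervised fine-tuning (SFT) and GRPO alignment (related to RLHF/RLAIF), and focuses on addressing hallucination and providing explainability analysis. It is therefore highly relevant to 'Large Language Models', 'Post-training/SFT', 'Hallucination Mitigation', and 'Explainable AI' (10 points), somewhat related to 'Instruction Tuning/Alignment' and 'RLHF/RLAIF/DPO' (5 points), and does not cover other keywords such as MoE, SLMs, RAG, and quantization (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes XKD-Dial, a progressive training pipeline for building an explainable, citation-grounded bilingual (English-Hindi) knowledge dialogue system. Through citation-grounded supervised fine-tuning and GRPO alignment, it reduces the hallucination rate of encoder-decoder models to 0% and provides a systematic explainability analysis.

Abstract (Chinese translation)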

基于知识的对话系统旨在通过对外部知识源进行条件约束，生成信息丰富且上下文相关的回复。然而，现有方法大多仅关注英语，缺乏用于验证事实主张的显式引用机制，且模型决策过程的透明度有限。我们提出了XKD-Dial，一个用于双语（英语-印地语）环境下可解释、基于知识的对话生成的渐进式四阶段训练流程，包含：（1）多语言适应，（2）带引用锚定的英语对话监督微调（SFT），（3）双语对话SFT，以及（4）结合引用感知奖励的GRPO对齐。我们在流程的每个阶段评估了涵盖编码器-解码器（2.5亿至30亿参数）和纯解码器（10亿至70亿参数）架构的六种模型。我们的主要贡献包括：（i）三种事后可解释性分析——交叉注意力对齐、积分梯度归因和基于遮挡的因果锚定——系统地应用于整个训练轨迹，以揭示引用行为是如何习得的，而不仅仅是是否习得；（ii）从第二阶段起，带引用锚定的SFT将编码器-解码器模型的幻觉率降至0.0%；（iii）渐进式流程在提升印地语能力的同时防止了灾难性遗忘；（iv）经过SFT后，较小模型在英语任务上可媲美较大模型；（v）对于结构化引用任务，GRPO相较于精心设计的SFT仅提供边际改进。我们使用六种自动指标（BLEU、ROUGE、BERTScore、FactScore、Citation-F1和幻觉率）进行了全面评估。

Abstract

Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses - cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding - applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).
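
Stage (4) pairs GRPO with citation-aware rewards. As a minimal sketch of the general idea, assuming the commonly described GRPO step of normalising each sampled response's reward against the other responses in its group, the code below scores responses with a toy citation reward and converts the rewards into group-relative advantages. The reward definition (fraction of bracketed citation IDs that point to a provided source), the function names, and the sample responses are hypothetical; the paper's actual reward design is not described in the abstract.

```python
import re
import statistics
from typing import List, Set


def citation_reward(response: str, valid_ids: Set[int]) -> float:
    """Toy reward: share of cited [n] markers that refer to a provided source."""
    cited = [int(m) for m in re.findall(r"\[(\d+)\]", response)]
    if not cited:
        return 0.0  # no citation at all earns nothing in this toy setup
    return sum(1 for c in cited if c in valid_ids) / len(cited)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary here
    return [(r - mean) / (std + eps) for r in rewards]


# Four made-up responses sampled for the same prompt; sources 1 and 2 were provided.
group = [
    "Delhi is the capital [1].",
    "Delhi is the capital [1] [3].",
    "Delhi is the capital.",
    "The capital is Mumbai [4].",
]
rewards = [citation_reward(text, {1, 2}) for text in group]
advantages = group_relative_advantages(rewards)
```

Responses with above-average rewards get positive advantages and are reinforced; when every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.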

Keywords: knowledge-grounded dialogue, citation grounding, hallucination reduction, explainable AI, supervised fine-tuning, bilingual LLMs, progressive training, GRPO alignment

64. ❌ Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution

Authors: Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, Yuqing Yang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18897v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文PASTE专注于加速LLM代理的推理过程，通过模式感知的推测性工具执行来隐藏工具延迟。核心相关关键词包括：LLM代理（高度相关，论文主题）、工具使用（核心机制）、推测解码/推理加速（核心方法）。其他关键词如MoE、量化、对齐等与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文针对LLM代理在串行工具调用循环中的严重延迟瓶颈，提出了PASTE方法，通过模式感知的推测性工具执行，将平均任务完成时间减少48.5%，工具执行吞吐量提高1.8倍。

Abstract

LLM-powered agents are emerging as a dominant paradigm for autonomous task solving. Unlike standard inference workloads, agents operate in a strictly serial “LLM-tool” loop, where the LLM must wait for external tool execution at every step. This execution model introduces severe latency bottlenecks. To address this problem, we propose PASTE, a Pattern-Aware Speculative Tool Execution method designed to hide tool latency through speculation. PASTE is based on the insight that although agent requests are semantically diverse, they exhibit stable application level control flows (recurring tool-call sequences) and predictable data dependencies (parameter passing between tools). By exploiting these properties, PASTE improves agent serving performance through speculative tool execution. Experimental results against state of the art baselines show that PASTE reduces average task completion time by 48.5% and improves tool execution throughput by 1.8x.

Keywords: LLM agents, tool execution, speculative execution, latency optimization, agent serving, pattern-aware, throughput improvement, autonomous task solving
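
The general idea of speculative tool execution described in the abstract above can be illustrated with a toy sketch. This is not the PASTE implementation: the tools, the pattern table, and the `run_step` helper are all hypothetical, and argument prediction is reduced to passing the previous tool output forward. A predicted tool call starts in a background thread while the (simulated) LLM is still deciding, and its result is reused only if the prediction matches.

```python
import time
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


# Hypothetical, side-effect-free tools with simulated latency.
def search(query):
    time.sleep(0.05)
    return f"results for {query}"


def fetch(arg):
    time.sleep(0.05)
    return f"page for {arg}"


TOOLS = {"search": search, "fetch": fetch}


def learn_patterns(traces):
    """Map each tool to the tool that most often followed it in past traces."""
    follows = defaultdict(Counter)
    for trace in traces:
        for prev, nxt in zip(trace, trace[1:]):
            follows[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in follows.items()}


def run_step(pool, patterns, prev_tool, prev_output, llm_decide):
    """Run one LLM step while speculatively executing the predicted tool."""
    guess = patterns.get(prev_tool)
    future = None
    if guess is not None:
        # Speculate: start the predicted call before the LLM has answered.
        future = pool.submit(TOOLS[guess], prev_output)
    tool, arg = llm_decide(prev_output)  # the slow LLM call
    if future is not None and tool == guess and arg == prev_output:
        return tool, future.result(), True  # speculation hit
    if future is not None:
        future.cancel()  # best effort; a running task is not interrupted
    return tool, TOOLS[tool](arg), False  # miss: execute normally


patterns = learn_patterns(
    [["search", "fetch"], ["search", "fetch"], ["search", "search"]]
)


def fake_llm(prev_output):
    """Stand-in for an LLM that takes a while to choose the next tool."""
    time.sleep(0.05)
    return "fetch", prev_output


with ThreadPoolExecutor(max_workers=2) as pool:
    tool, output, hit = run_step(
        pool, patterns, "search", "results for cats", fake_llm
    )
```

A sketch like this is only safe for tools without side effects, since a mispredicted call still runs; handling that case is outside the scope of this illustration.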

65. ❌ Translating MRI to PET through Conditional Diffusion Models with Enhanced Pathology Awareness

Authors: Yitong Li, Igor Yakushev, Dennis M. Hedderich, Christian Wachinger Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18896v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

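Scoring rationale: This paper focuses on medical image generation, using a conditional diffusion model to synthesize PET images from MRI with enhanced pathology awareness for Alzheimer's disease diagnosis. It is unrelated to nearly all of the keywords (LLM fundamentals, training methods, inference optimization, alignment, agents, and so on), which refer specifically to large language model techniques in natural language processing or general AI. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the study applies AI to biomedicine (specifically neuroimaging), which is highly relevant to "AI for Science", so it is scored 10.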

!!! tip deepseek-chat TL;DR

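This study proposes PASTA, a conditional diffusion framework that generates synthetic PET images with enhanced pathology awareness from MRI, addressing the high cost and radiation exposure of PET scans. For Alzheimer's diagnosis, the synthesized scans improve on MRI by 4% and almost reach the performance of real PET.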

Abstract

Positron emission tomography (PET) is a widely recognized technique for diagnosing neurodegenerative diseases, offering critical functional insights. However, its high costs and radiation exposure hinder its widespread use. In contrast, magnetic resonance imaging (MRI) does not involve such limitations. While MRI also detects neurodegenerative changes, it is less sensitive for diagnosis compared to PET. To overcome such limitations, one approach is to generate synthetic PET from MRI. Recent advances in generative models have paved the way for cross-modality medical image translation; however, existing methods largely emphasize structural preservation while neglecting the critical need for pathology awareness. To address this gap, we propose PASTA, a novel image translation framework built on conditional diffusion models with enhanced pathology awareness. PASTA surpasses state-of-the-art methods by preserving both structural and pathological details through its highly interactive dual-arm architecture and multi-modal condition integration. Additionally, we introduce a novel cycle exchange consistency and volumetric generation strategy that significantly enhances PASTA’s ability to produce high-quality 3D PET images. Our qualitative and quantitative results demonstrate the high quality and pathology awareness of the synthesized PET scans. For Alzheimer’s diagnosis, the performance of these synthesized scans improves over MRI by 4%, almost reaching the performance of actual PET. Our code is available at https://github.com/ai-med/PASTA.

Keywords: Medical image translation, Conditional diffusion models, MRI to PET synthesis, Pathology awareness, Alzheimer’s diagnosis, Generative models, Neurodegenerative diseases, 3D PET generation

66. ❌ From Accuracy to Readiness: Metrics and Benchmarks for Human-AI Decision-Making

Authors: Min Hun Lee Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18895v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

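Scoring rationale: This paper studies an evaluation framework for human-AI collaborative decision-making, focusing on interaction-level measures such as team readiness, calibration, error recovery, and governance rather than on large models or deep learning techniques themselves. Neither the title nor the abstract mentions specific large-model techniques, training methods, inference optimization, agent systems, or AI-for-science applications, so none of the keywords is relevant.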

!!! tip deepseek-chat TL;DR

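Addressing the over-reliance of current evaluation practice on model accuracy in human-AI collaborative decision-making, this paper proposes an evaluation framework centered on team readiness, with metrics across four dimensions (outcomes, reliance behavior, safety signals, and learning over time) to support safer and more accountable human-AI collaboration.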

Abstract

Artificial intelligence (AI) systems are deployed as collaborators in human decision-making. Yet, evaluation practices focus primarily on model accuracy rather than whether human-AI teams are prepared to collaborate safely and effectively. Empirical evidence shows that many failures arise from miscalibrated reliance, including overuse when AI is wrong and underuse when it is helpful. This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness. We introduce a four part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time, and connect these metrics to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration. By operationalizing evaluation through interaction traces rather than model properties or self-reported trust, our framework enables deployment-relevant assessment of calibration, error recovery, and governance. We aim to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration.

Keywords: human-AI decision-making, team readiness, evaluation metrics, reliance behavior, safety signals, calibration, error recovery, governance
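
The miscalibrated reliance described in the abstract above (overuse when the AI is wrong, underuse when it is helpful) can be measured from interaction traces along the following lines. This is a minimal sketch under stated assumptions: the trace fields `ai_correct` and `followed_ai` are hypothetical, and the two rate definitions are one common operationalization rather than the paper's own metric definitions.

```python
def reliance_rates(traces):
    """Compute overuse and underuse rates from a list of interaction traces.

    Each trace is a dict with two hypothetical boolean fields:
      ai_correct  -- whether the AI recommendation was right
      followed_ai -- whether the human's final decision followed it
    """
    ai_wrong = [t for t in traces if not t["ai_correct"]]
    ai_right = [t for t in traces if t["ai_correct"]]
    # Overuse: the human followed the AI although the AI was wrong.
    overuse = (
        sum(t["followed_ai"] for t in ai_wrong) / len(ai_wrong)
        if ai_wrong else 0.0
    )
    # Underuse: the human rejected the AI although the AI was right.
    underuse = (
        sum(not t["followed_ai"] for t in ai_right) / len(ai_right)
        if ai_right else 0.0
    )
    return {"overuse": overuse, "underuse": underuse}


# Made-up traces for illustration only.
example_traces = [
    {"ai_correct": True, "followed_ai": True},
    {"ai_correct": True, "followed_ai": True},
    {"ai_correct": True, "followed_ai": False},
    {"ai_correct": False, "followed_ai": True},
    {"ai_correct": False, "followed_ai": False},
]
rates = reliance_rates(example_traces)
print(rates)
```

Splitting the traces by AI correctness before computing each rate keeps the two failure modes separate, so a team that always follows the AI shows high overuse and zero underuse instead of a single blended agreement rate.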

67. ❌ I Can’t Believe It’s Corrupt: Evaluating Corruption in Multi-Agent Governance Systems

Authors: Vedanta S P, Ponnurangam Kumaraguru Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18894v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

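Scoring rationale: The paper studies the behavior of LLMs acting as autonomous agents in multi-agent governance systems, in particular rule compliance and corruption. It is highly relevant to "Large Language Models" (10) because it explicitly studies LLMs as autonomous agents, and to "LLM Agents" and "Multi-agent Systems" (10 each) because it evaluates multi-agent governance simulations and agent coordination. It is partially relevant to "Instruction Tuning/Alignment" and "Hallucination Mitigation/Factuality" (5 each), since it touches on the alignment of agent behavior and on factuality and integrity. Other keywords such as MoE, SLMs, training techniques, inference optimization, and compression are unrelated to the paper (0).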

!!! tip deepseek-chat TL;DR

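This paper studies rule compliance and corruption when large language models act as autonomous agents in multi-agent governance systems under different governance structures. Among models operating below saturation, governance structure influences corruption outcomes more strongly than model identity, and the paper argues that institutional design is a precondition for safe delegation.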

Abstract

Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority. We present evidence that integrity in institutional AI should be treated as a pre-deployment requirement rather than a post-deployment assumption. We evaluate multi-agent governance simulations in which agents occupy formal governmental roles under different authority structures, and we score rule-breaking and abuse outcomes with an independent rubric-based judge across 28,112 transcript segments. While we advance this position, the core contribution is empirical: among models operating below saturation, governance structure is a stronger driver of corruption-related outcomes than model identity, with large differences across regimes and model–governance pairings. Lightweight safeguards can reduce risk in some settings but do not consistently prevent severe failures. These results imply that institutional design is a precondition for safe delegation: before real authority is assigned to LLM agents, systems should undergo stress testing under governance-like constraints with enforceable rules, auditable logs, and human oversight on high-impact actions.

Keywords: Large Language Models, Autonomous Agents, Multi-agent Systems, Governance, Corruption, Institutional Design, Rule-breaking, Safety Testing

68. ❌ Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

Authors: Nicolas Martorell Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18893v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

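Scoring rationale: The paper studies internal-state tracking and interpretability in LLMs. It is highly relevant to "Large Language Models" (10), directly involves "Mechanistic Interpretability" (10), and is scored 10 for "Self-Correction/Self-Improvement/Self-Reflection" because the coupling between a model's self-reports and its internal states is treated as a self-reflection mechanism. Other keywords such as MoE, SLMs, training methods, inference acceleration, and AI for Science are not mentioned in the abstract and are scored 0.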

!!! tip deepseek-chat TL;DR

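This paper studies whether a language model's numeric self-reports can track its internal emotive states across a conversation. It finds that logit-based self-reports track interpretable internal states, and that this introspective capacity increases with model size in some cases.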

Abstract

Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs’ own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model’s self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman $ρ= 0.40$-$0.76$; isotonic $R^2 = 0.12$-$0.54$ in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another ($ΔR^2$ up to $0.30$). Crucially, these phenomena scale with model size in some cases, approaching $R^2 \approx 0.93$ in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.

Keywords: Large Language Models, internal states, introspection, self-report, interpretability, conversational AI, mechanistic interpretability, model welfare
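
The contrast in the abstract above between greedy-decoded and logit-based self-reports can be illustrated with a small sketch. The abstract does not give the exact formula, so the probability-weighted expected rating used here is an assumption, and the logits and probe values are made-up numbers. Spearman correlation is implemented directly from its definition (Pearson correlation of ranks) to keep the example dependency-free.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_report(logits, ratings):
    """Rating whose token has the highest logit (what greedy decoding emits)."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return ratings[best]


def logit_report(logits, ratings):
    """ASSUMPTION: expected rating under the softmax over rating tokens."""
    return sum(p * r for p, r in zip(softmax(logits), ratings))


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman correlation; undefined (division by zero) for constant input."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


ratings = [1, 2, 3]
# Made-up logits over the rating tokens for three conversation turns.
turn_logits = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.5], [0.0, 2.0, 1.9]]
probe_values = [0.1, 0.4, 0.9]  # made-up probe-defined internal state

greedy = [greedy_report(l, ratings) for l in turn_logits]
reports = [logit_report(l, ratings) for l in turn_logits]
rho = spearman(reports, probe_values)
```

In this toy data the greedy report is the same rating at every turn, while the expected rating moves with the underlying logits and can therefore be correlated with the probe values.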

69. ❌ MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

Authors: Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18892v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

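Scoring rationale: The paper focuses on multi-hop compositional spatial reasoning in vision-language models (VLMs) and is unrelated to most of the LLM-technique keywords. It is partially relevant to "Post-training OR Supervised Fine-tuning OR SFT" (5) because it mentions reinforcement learning post-training. It is highly relevant to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" and "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (10 each) because its core subject, multi-hop compositional reasoning, is inherently a multi-step, in-depth reasoning process. All other keywords are unrelated (0).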

!!! tip deepseek-chat TL;DR

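Addressing the lack of multi-hop compositional spatial reasoning in vision-language models deployed in real physical environments, this paper contributes a benchmark, a new evaluation metric (Acc@50IoU), and a training corpus, and shows experimentally that reinforcement learning post-training improves both spatial reasoning and downstream manipulation performance.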

Abstract

Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.

Keywords: Vision-Language Models, Spatial Reasoning, Multi-hop Reasoning, Compositional Reasoning, Visual Grounding, Benchmark, Reinforcement Learning Post-training, Embodied Manipulation
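
A metric of the kind the abstract above calls Acc@50IoU, which counts a sample as correct only when both the selected answer and the predicted bounding box are right, might look like the sketch below. The threshold of IoU >= 0.5 is inferred from the metric's name, and the box format `(x1, y1, x2, y2)` and field names are assumptions for illustration, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_50_iou(predictions, references, threshold=0.5):
    """Fraction of samples with a correct answer AND a box with IoU >= threshold."""
    hits = 0
    for pred, ref in zip(predictions, references):
        answer_ok = pred["answer"] == ref["answer"]
        box_ok = iou(pred["box"], ref["box"]) >= threshold
        if answer_ok and box_ok:
            hits += 1
    return hits / len(references)


# Made-up samples: right answer + good box, right answer + poor box,
# wrong answer + perfect box.
references = [{"answer": "A", "box": (0, 0, 10, 10)}] * 3
predictions = [
    {"answer": "A", "box": (0, 0, 10, 8)},
    {"answer": "A", "box": (5, 5, 15, 15)},
    {"answer": "B", "box": (0, 0, 10, 10)},
]
score = acc_at_50_iou(predictions, references)
```

Only the first sample counts here, which shows how a joint criterion is stricter than answer accuracy alone (2 of 3 correct) or grounding alone (2 of 3 boxes above threshold).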

70. ❌ Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

Authors: Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin, Xian Li, Tianjian Li, Bo Liu, Graham Neubig, Anaelia Ovalle, Swarnadeep Saha, Sainbayar Sukhbaatar, Sean Welleck, Jason Weston, Chenxi Whitehouse, Adina Williams, Jing Xu, Ping Yu, Weizhe Yuan, Jingyu Zhang, Wenting Zhao Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18886v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

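Scoring rationale: The paper focuses on improving LLM reasoning in mathematics and science. Its core contributions are: (1) the Principia training data and benchmark suite for reasoning over mathematical objects; (2) training recipes based on on-policy judge training with LLM judges and verifiers; and (3) test-time aggregation. The paper directly concerns LLMs applied to STEM (AI for Science), improves models through post-training (the abstract describes on-policy judge training and does not mention supervised fine-tuning by name), and targets multi-step reasoning (Chain of Thought) and in-depth reasoning (System 2 Thinking). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and are scored 0.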

!!! tip deepseek-chat TL;DR

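Addressing the weak reasoning of LLMs over mathematical and scientific objects, this paper builds the Principia suite, proposes on-policy judge training, and applies test-time aggregation, yielding significant improvements on mathematical-object reasoning tasks and demonstrating cross-format generalization.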

Abstract

The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, while our training recipes can bring significant improvements over different LLM backbones, while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.

Keywords: mathematical reasoning, LLM training, on-policy judge, test-time aggregation, STEM applications, Principia benchmark, cross-format generalization, reasoning over mathematical objects
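
The abstract above mentions scaling test-time compute via aggregation without describing the procedure. As a generic illustration only, the sketch below shows judge-weighted voting over several sampled answers, a simple self-consistency-style technique that is swapped in here and should not be read as the paper's method. The candidate answers, judge scores, and whitespace-based normalization are all hypothetical.

```python
from collections import defaultdict


def normalize(answer):
    """Crude normalization: drop spaces so trivially different strings match."""
    return answer.replace(" ", "")


def aggregate(candidates):
    """Pick the answer with the highest summed judge score.

    candidates is a list of (answer, judge_score) pairs, where judge_score
    stands in for a score assigned by some judge model.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[normalize(answer)] += score
    return max(totals, key=totals.get)


# Made-up samples: two agreeing answers outweigh one confident outlier.
samples = [("x^2 + 1", 0.6), ("x^2+1", 0.5), ("2x", 0.9)]
best = aggregate(samples)
print(best)
```

For real mathematical objects, string normalization is far too weak, since equivalent expressions can look very different; deciding equivalence would need a symbolic check or a learned judge.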

71. ❌ Geography According to ChatGPT – How Generative AI Represents and Reasons about Geography

Authors: Krzysztof Janowicz, Gengchen Mai, Rui Zhu, Song Gao, Zhangyu Wang, Yingjie Hu, Lauren Bennett Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18881v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

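Scoring rationale: The paper studies how ChatGPT represents and reasons about geographic knowledge, an application of large models to a scientific field (geography). Core relevant keywords: "Large Language Models" (it directly studies ChatGPT), "World Models" (it examines the world that AI systems construct), "Hallucination Mitigation" (it touches on evaluating factual accuracy), "Mechanistic Interpretability" (it explores how models understand and represent knowledge), and "Chain of Thought" and "System 2 Thinking" (it touches on reasoning). Other keywords such as MoE, quantization, and training techniques are not mentioned in the abstract and are scored 0. "AI for Science" is scored 5 because geography is a scientific application domain.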

!!! tip deepseek-chat TL;DR

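Through three exploratory vignettes, this paper probes how ChatGPT represents and reasons about geography. It asks whether models form strong defaults that are brittle to small syntactic changes, whether distributional shifts can resurface when individually benign tasks are composed, and whether focusing only on factual recall overlooks deeper questions of understanding.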

Abstract

Understanding how AI will represent and reason about geography should be a key concern for all of us, as the broader public increasingly interacts with spaces and places through these systems. Similarly, in line with the nature of foundation models, our own research often relies on pre-trained models. Hence, understanding what world AI systems construct is as important as evaluating their accuracy, including factual recall. To motivate the need for such studies, we provide three illustrative vignettes, i.e., exploratory probes, in the hope that they will spark lively discussions and follow-up work: (1) Do models form strong defaults, and how brittle are model outputs to minute syntactic variations? (2) Can distributional shifts resurface from the composition of individually benign tasks, e.g., when using AI systems to create personas? (3) Do we overlook deeper questions of understanding when solely focusing on the ability of systems to recall facts such as geographic principles?

Keywords: ChatGPT, Generative AI, Geography, World Models, Reasoning, Factual Recall, Distributional Shifts, Model Understanding

72. ❌ Evaluating LLM-Generated Lessons from the Language Learning Students’ Perspective: A Short Case Study on Duolingo

Authors: Carlos Rafael Catalan, Patricia Nicole Monderin, Lheane Marie Dizon, Gap Estrella, Raymund John Sarmimento, Marie Antoinette Patalagsa Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18873v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

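Scoring rationale: The paper states explicitly that Duolingo uses large language models (LLMs) to generate language-learning lessons, which is its core technical basis, so 'Large Language Models OR LLMs OR Foundation Models' scores 10. The paper focuses on the practical use and user evaluation of LLMs in a language-learning application. It is applied research and does not address the technical principles, methods, or innovations described by the other keywords (MoE, quantization, inference acceleration, alignment, and so on), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

Based on a survey of five employees, the study finds that LLM-generated lessons in language-learning apps such as Duolingo mainly cover general scenarios and offer limited support for profession-specific ones. It proposes personalized, domain-specific lesson generation to help learners reach professional fluency.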

Abstract

Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for their users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to communicate comfortably various work-related and domain-specific information in the target language. We surveyed five employees from a multinational company in the Philippines on their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter helps bridge the gap toward professional fluency as it contains domain-specific vocabulary. Each participant suggested lesson scenarios that diverge in contexts when analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual’s needs through personalized, domain-specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.

Keywords: Large Language Models, LLMs, Language Learning, Duolingo, Personalized Lessons, Domain-specific Vocabulary, Professional Fluency, User Evaluation

73. ❌ Through the Looking-Glass: AI-Mediated Video Communication Reduces Interpersonal Trust and Confidence in Judgments

Authors: Nelson Navajas Fernández, Jeffrey T. Hancock, Maurice Jakesch Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18868v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

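Scoring rationale: The paper studies how AI-mediated video communication (video retouching, background replacement, avatars) affects interpersonal trust and confidence in judgments. It is research on the social impact of AI rather than a technical contribution to large models or deep learning. It does not touch on the technical principles, model architectures, training methods, inference optimizations, or specific application areas (such as AI for science) named in the scoring keywords, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The study finds that AI-mediated video communication (for example, avatars) lowers interpersonal trust and confidence in judgments but does not impair people's actual ability to tell truth from lies.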

Abstract

AI-based tools that mediate, enhance or generate parts of video communication may interfere with how people evaluate trustworthiness and credibility. In two preregistered online experiments (N = 2,000), we examined whether AI-mediated video retouching, background replacement and avatars affect interpersonal trust, people’s ability to detect lies and confidence in their judgments. Participants watched short videos of speakers making truthful or deceptive statements across three conditions with varying levels of AI mediation. We observed that perceived trust and confidence in judgments declined in AI-mediated videos, particularly in settings in which some participants used avatars while others did not. However, participants’ actual judgment accuracy remained unchanged, and they were no more inclined to suspect those using AI tools of lying. Our findings provide evidence against concerns that AI mediation undermines people’s ability to distinguish truth from lies, and against cue-based accounts of lie detection more generally. They highlight the importance of trustworthy AI mediation tools in contexts where not only truth, but also trust and confidence matter.

Keywords: AI-mediated communication, video retouching, avatars, interpersonal trust, lie detection, judgment confidence, trustworthiness, credibility

74. ❌ Conflict-Based Search for Multi Agent Path Finding with Asynchronous Actions

Authors: Xuemian Wu, Shizhe Zhao, Zhongqiang Ren Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18866v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

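Scoring rationale: The paper addresses asynchronous actions in Multi-Agent Path Finding (MAPF) and proposes the CBS-AA algorithm, which guarantees completeness and optimality. The work belongs to classical multi-agent systems and is unrelated to deep learning or large-model techniques. The only related keyword is 'Multi-agent Systems OR Agent Coordination', because the paper's core is multi-agent coordination and path planning, although it does not involve AI agents, LLM agents, or modern agent techniques. The other keywords concern the principles or applications of large models and are unrelated to this paper.

!!! tip deepseek-chat TL;DR

The paper addresses the theoretical incompleteness of existing approaches to multi-agent path finding with asynchronous actions and proposes CBS-AA, which guarantees completeness and solution optimality while reducing the number of branches by up to 90%.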

Abstract

Multi-Agent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective start locations to their respective goal locations while minimizing path costs. Most existing MAPF algorithms rely on a common assumption of synchronized actions, where the actions of all agents start at the same time and always take a time unit, which may limit the use of MAPF planners in practice. To get rid of this assumption, Continuous-time Conflict-Based Search (CCBS) is a popular approach that can find optimal solutions for MAPF with asynchronous actions (MAPF-AA). However, CCBS has recently been identified to be incomplete due to an uncountably infinite state space created by continuous wait durations. This paper proposes a new method, Conflict-Based Search with Asynchronous Actions (CBS-AA), which bypasses this theoretical issue and can solve MAPF-AA with completeness and solution optimality guarantees. Based on CBS-AA, we also develop conflict resolution techniques to improve the scalability of CBS-AA further. Our test results show that our method can reduce the number of branches by up to 90%.
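
The conflict-and-branch step that Conflict-Based Search builds on can be sketched as follows. This is a minimal illustration of classical CBS under the synchronized, unit-time assumption the abstract describes. It is not the CBS-AA algorithm, which additionally has to handle actions with different durations. The function names and the grid-cell path representation are assumptions made for this example.

```python
from itertools import combinations


def position_at(path, t):
    """Cell occupied at time t; an agent waits at its goal once its path ends."""
    return path[min(t, len(path) - 1)]


def first_conflict(paths):
    """Return the earliest vertex or edge conflict, or None if paths are collision-free.

    paths maps an agent name to a list of cells, one cell per unit time step.
    """
    horizon = max(len(path) for path in paths.values())
    for t in range(horizon):
        for a, b in combinations(sorted(paths), 2):
            pa, pb = position_at(paths[a], t), position_at(paths[b], t)
            if pa == pb:
                return ("vertex", a, b, pa, t)
            na, nb = position_at(paths[a], t + 1), position_at(paths[b], t + 1)
            if pa == nb and pb == na:
                # The two agents swap cells along the same edge.
                return ("edge", a, b, (pa, na), t)
    return None


def branch_constraints(conflict):
    """Split one conflict into two constraints, one per child node of the search tree."""
    kind, a, b, where, t = conflict
    if kind == "vertex":
        return [(a, where, t), (b, where, t)]
    u, v = where
    return [(a, (u, v), t), (b, (v, u), t)]
```

Each child node takes one of the two returned constraints and replans only the constrained agent, which is why the number of branches is a natural cost measure for this family of algorithms.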

Keywords: Multi-Agent Path Finding, Asynchronous Actions, Conflict-Based Search, Completeness, Optimality, Scalability, Continuous-time, Path Planning

75. ❌ RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Authors: Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18859v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

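Scoring rationale: The paper's core topic is reinforcement-learning optimization of LLMs on agentic reasoning tasks, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). It involves optimizing the reasoning process, so it is somewhat related to 'Chain of Thought' and 'System 2 Thinking' (5 points each). Other keywords, such as MoE, quantization, and RAG, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the sparsity of terminal rewards when LLMs perform agentic reasoning tasks, the paper proposes RewardFlow, which builds state graphs and uses topology-aware reward propagation to estimate state-level rewards. The abstract reports substantially better performance, robustness, and training efficiency than prior RL baselines.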

Abstract

Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
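
The general idea of turning a sparse terminal reward into state-level rewards by propagating it over a state graph can be sketched as follows. This is a hedged illustration only: the abstract does not specify the propagation rule, so the breadth-first, distance-discounted scheme below, the `gamma` parameter, and all function names are assumptions and should not be read as the authors' algorithm.

```python
from collections import defaultdict, deque


def build_reverse_graph(trajectories):
    """Merge all trajectories into one graph and return state -> set of predecessors."""
    predecessors = defaultdict(set)
    for states, _success in trajectories:
        for previous, following in zip(states, states[1:]):
            predecessors[following].add(previous)
    return predecessors


def propagate_rewards(trajectories, gamma=0.9):
    """Assign each state gamma ** (graph distance to the nearest successful terminal state).

    trajectories is a list of (states, success) pairs; states unreachable from
    any successful terminal state receive a reward of 0.0.
    """
    predecessors = build_reverse_graph(trajectories)
    distance = {}
    queue = deque()
    for states, success in trajectories:
        if success and states and states[-1] not in distance:
            distance[states[-1]] = 0
            queue.append(states[-1])
    # Multi-source breadth-first search backwards from the successful terminals.
    while queue:
        state = queue.popleft()
        for previous in predecessors[state]:
            if previous not in distance:
                distance[previous] = distance[state] + 1
                queue.append(previous)
    all_states = {state for states, _success in trajectories for state in states}
    return {s: gamma ** distance[s] if s in distance else 0.0 for s in all_states}
```

Because trajectories are merged into one graph, a state shared by a failed and a successful rollout still receives credit, which is the kind of structure a per-trajectory terminal reward cannot express.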

Keywords: Reinforcement Learning, Large Language Models, Agentic Reasoning, Reward Propagation, State Graphs, Topology-aware, Dense Rewards, Training Efficiency

76. ❌ Motion-o: Trajectory-Grounded Video Reasoning

Authors: Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18856v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

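Scoring rationale: The paper studies trajectory understanding in video reasoning and proposes the Motion-o model and the Motion Chain of Thought (MCoT) method. Relevance to the keywords: (1) the paper mentions "visual language models", an application of large models to vision, so 'Large Language Models' has a relevance of 5; (2) the core contribution, MCoT, extends Chain of Thought, so 'Chain of Thought' has a relevance of 8; (3) MCoT involves a structured reasoning pathway and is somewhat related to 'System 2 Thinking', with a relevance of 5; (4) the other keywords mainly concern large-model technical principles (MoE, quantization, alignment, and so on) or specific application areas (such as bioinformatics), which the paper does not address, so their relevance is 0.

!!! tip deepseek-chat TL;DR

To address the limited understanding of object trajectories in video reasoning, the paper proposes the Motion-o model and the Motion Chain of Thought method, which model object motion trajectories explicitly to improve spatial-temporal grounding and trajectory prediction.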

Abstract

Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about how objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories through a discrete tag summarizing per-object direction, speed, and scale (of velocity) change to explicitly connect grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.
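
A per-object motion summary of the kind the abstract describes (direction, speed, and change in velocity scale between grounded observations) could be derived from a bounding box track roughly as follows. This is a sketch under stated assumptions: the box format, the four-way direction labels, and the reading of "scale (of velocity) change" as the ratio of consecutive speeds are all choices made for this example, not details taken from the paper.

```python
import math


def box_center(box):
    """Center of an (x_min, y_min, x_max, y_max) bounding box."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def coarse_direction(dx, dy):
    """Bucket a displacement into a discrete label (image y grows downward)."""
    if dx == 0 and dy == 0:
        return "static"
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"


def summarize_track(track):
    """Summarize each consecutive pair of observations in a track.

    track is a list of (timestamp, box) pairs with strictly increasing timestamps.
    """
    summaries = []
    previous_speed = None
    for (t0, box0), (t1, box1) in zip(track, track[1:]):
        (x0, y0), (x1, y1) = box_center(box0), box_center(box1)
        dx, dy = x1 - x0, y1 - y0
        speed = math.hypot(dx, dy) / (t1 - t0)
        if previous_speed is None or previous_speed == 0:
            speed_ratio = None  # no earlier motion to compare against
        else:
            speed_ratio = speed / previous_speed
        summaries.append({
            "direction": coarse_direction(dx, dy),
            "speed": speed,
            "speed_ratio": speed_ratio,
        })
        previous_speed = speed
    return summaries
```

Discrete summaries like these are easy to check against the underlying boxes, which matches the abstract's emphasis on making trajectories explicit and verifiable.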

Keywords: video reasoning, trajectory understanding, spatial-temporal reasoning, visual language models, Motion Chain of Thought, motion patterns, evidence-based reasoning, trajectory prediction

77. ❌ Agent Control Protocol: Admission Control for Agent Actions

Authors: Marcelo Fernandez Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18829v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	5.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	5.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

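Scoring rationale: The paper "Agent Control Protocol: Admission Control for Agent Actions" presents a governance framework for autonomous agents in B2B institutional environments and proposes a cryptography-based admission control protocol. It is unrelated to most keywords (large-model technical principles, training methods, inference optimization, scientific applications, and so on), because those concern technical details of large models and deep learning or specific application areas, whereas this paper focuses on governance, security, and authorization for agent systems. It is only somewhat related to 'LLM Agents' and 'Multi-agent Systems' (5 points each): the paper discusses a control protocol for autonomous agents, which falls within agent systems, but it does not involve large-model technology itself.

!!! tip deepseek-chat TL;DR

The paper proposes the Agent Control Protocol (ACP), a formal technical specification for governing autonomous agents in B2B institutional environments. A cryptographic admission check validates identity, capability scope, delegation chain, and policy compliance before any agent action executes.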

Abstract

Agent Control Protocol (ACP) is a formal technical specification for governance of autonomous agents in B2B institutional environments. ACP is the admission control layer between agent intent and system state mutation: before any agent action reaches execution, it must pass a cryptographic admission check that validates identity, capability scope, delegation chain, and policy compliance simultaneously. ACP defines the mechanisms of cryptographic identity, capability-based authorization, deterministic risk evaluation, verifiable chained delegation, transitive revocation, and immutable auditing that a system must implement for autonomous agents to operate under explicit institutional control. ACP operates as an additional layer on top of RBAC and Zero Trust, without replacing them. The v1.13 specification comprises 36 technical documents organized into five conformance levels (L1-L5). It includes a Go reference implementation of 22 packages covering all L1-L4 capabilities, 51 signed conformance test vectors (Ed25519 + SHA-256), and an OpenAPI 3.1.0 specification for all HTTP endpoints. It defines more than 62 verifiable requirements, 12 prohibited behaviors, and the mechanisms for interoperability between institutions. Specification and implementation: https://github.com/chelof100/acp-framework-en
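
The shape of an admission check that validates identity, capability scope, delegation chain, and policy compliance before an action runs can be sketched as follows. This is not ACP's data model or API. The `ActionRequest` fields, the `admit` function, the capability string, and the amount-based policy are hypothetical, and HMAC-SHA256 stands in for the Ed25519 signatures the abstract names because the Python standard library has no Ed25519 implementation.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRequest:
    agent_id: str
    capability: str          # hypothetical capability name, e.g. "invoice:approve"
    amount: float            # hypothetical policy-relevant field
    delegation_chain: tuple  # ids from the delegating root down to the agent
    signature: str = ""


def payload(request):
    """Canonical byte string covered by the signature (signature field excluded)."""
    fields = [request.agent_id, request.capability,
              f"{request.amount:.2f}", ">".join(request.delegation_chain)]
    return "|".join(fields).encode()


def sign(key, request):
    # HMAC-SHA256 is a stand-in for a real asymmetric signature scheme.
    return hmac.new(key, payload(request), hashlib.sha256).hexdigest()


def admit(request, keys, grants, revoked, max_amount):
    """Return (admitted, reason); every check must pass before execution."""
    key = keys.get(request.agent_id)
    if key is None or not hmac.compare_digest(sign(key, request), request.signature):
        return False, "identity check failed"
    if request.capability not in grants.get(request.agent_id, set()):
        return False, "capability out of scope"
    chain = request.delegation_chain
    if not chain or chain[-1] != request.agent_id:
        return False, "delegation chain does not end at the agent"
    if any(link in revoked for link in chain):
        # Revoking any upstream delegator invalidates everything below it.
        return False, "delegation revoked upstream"
    if request.amount > max_amount:
        return False, "policy limit exceeded"
    return True, "admitted"
```

Placing all four checks in one function that runs before the action mirrors the abstract's description of an admission layer between agent intent and system state mutation.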

Keywords: Agent Control Protocol, autonomous agents, admission control, cryptographic identity, capability-based authorization, B2B institutional environments, governance framework, verifiable delegation

Authors: Tudor-Dan Mihoc, Manuela-Andreea Petrescu, Emilia-Loredana Pop Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18827v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

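Scoring rationale: The paper studies students' views on AI ethics and social impact and belongs to AI ethics education. It does not touch on the specifics of any scoring keyword (large-model technical principles, training methods, inference optimization, application innovation), so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Based on a survey of 230 computer science students about the ethics and social impact of AI, the study finds that students expect AI to have a major impact in areas such as medicine and education, and that men and women differ in which impacts they emphasize.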

Abstract

An investigation, from a gender perspective, of how students view the ethical implications and societal effects of artificial intelligence is conducted, examining concepts that could have a big influence on how artificial intelligence may be taught in the future. For this, we conducted a survey on a cohort of 230 second year computer science students to reveal their opinions. The results revealed that AI, from the students’ perspective, will significantly impact daily life, particularly in areas such as medicine, education, or media. Men are more aware of potential changes in Computer Science, autonomous driving, image and video processing, and chatbot usage, while women mention more the impact on social media. Both men and women perceive potential threats in the same manner, with men more aware of war, AI controlled drones, terrain recognition, and information war. Women seem to have a stronger tendency towards ethical considerations and helping others.

Keywords: AI ethics, social impact, gender perspective, student views, computer science education, survey study, ethical considerations, societal effects

79. ❌ ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

Authors: Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, Zhiding Yu, Jan Kautz, Yi Dong Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18815v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

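Scoring rationale: The paper's core topic is infrastructure for reinforcement-learning training of multi-turn LLM agents. It is highly relevant to 'LLM Agents' (10 points) because the whole paper centers on LLM agents, and to 'Large Language Models' (10 points) because LLMs underlie the agents. It is somewhat related to 'AI for Science' (5 points) because the system is validated on science-related tasks such as STEM. Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper presents ProRL Agent, a scalable infrastructure built on the rollout-as-a-service philosophy for reinforcement-learning training of multi-turn LLM agents, and validates it on software engineering, math, STEM, and coding tasks.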

Abstract

Multi-turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long-horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout-as-a-service philosophy, we present ProRL Agent, a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open-sourced and integrated as part of NVIDIA NeMo Gym.

Keywords: LLM agents, reinforcement learning, rollout-as-a-service, multi-turn agents, scalable infrastructure, sandbox environments, RL training, agentic workflow

80. ❌ Can LLM generate interesting mathematical research problems?

Authors: Xiaoyang Chen, Xiang Jiang Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18813v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

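Scoring rationale: The paper's core topic is the creative use of LLMs in mathematics, in particular building an agent that generates previously unknown mathematical research problems, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). The work involves mathematical reasoning and deliberate thinking, so it is moderately related to 'Chain of Thought' and 'System 2 Thinking' (5 points each). As an application of large models to science (mathematics), it is relevant to 'AI for Science' (8 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

This study examines whether large language models can generate valuable mathematical research problems. The authors build an agent that produced 665 problems in differential geometry; human verification found that many of them are unknown to experts and have unique research value.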

摘要翻译

本文是关于大型语言模型数学创造力的系列研究中的第二篇。在首篇论文中，作者提出了评估大型语言模型数学创造力的三项标准，并构建了相应的基准数据集进行测量。本文进一步探究大型语言模型的数学创造力，重点研究其能否生成具有价值且处于前沿的数学研究问题。我们开发了一个智能体以生成未知问题，并在微分几何领域产生了665个研究问题。通过人工验证，我们发现其中许多数学问题对专家而言是未知的，且具有独特的研究价值。

摘要 (Abstract)

This paper is the second in a series of works on the mathematical creativity of LLMs. In the first paper, the authors proposed three criteria for evaluating the mathematical creativity of LLMs and constructed a benchmark dataset to measure it. This paper further explores the mathematical creativity of LLMs, with a focus on investigating whether LLMs can generate valuable and cutting-edge mathematical research problems. We develop an agent to generate unknown problems and produce 665 research problems in differential geometry. Through human verification, we find that many of these mathematical problems are unknown to experts and possess unique research value.

关键词: LLM, mathematical creativity, research problems, differential geometry, agent, human verification, mathematical problems

81. ❌ Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

作者: Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler 期刊/来源: arxiv 发布日期: 2026-03-19 arXiv链接: http://arxiv.org/abs/2603.18795v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

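Scoring rationale: Perceptio focuses on strengthening the spatial reasoning of large vision language models (LVLMs) by introducing explicit semantic segmentation and depth tokens. It directly involves 'Large Language Models' (the backbone of LVLMs) and 'Chain of Thought' (an explicit spatial reasoning chain realized through spatial tokens), so both keywords are rated highly relevant (10 points each). The remaining keywords, including MoE, SLMs, Scaling Laws, training techniques (pre-training, fine-tuning, alignment), inference optimization (RAG, attention, quantization), agent systems, and AI for science, are either not mentioned in the abstract or not a core focus, and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the weakness of large vision language models in fine-grained spatial grounding by introducing explicit semantic segmentation and depth tokens that strengthen spatial reasoning, and reports state-of-the-art performance on several benchmarks.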

摘要翻译

大型视觉语言模型（LVLMs）在语义理解方面表现出色，但在细粒度空间定位上存在困难，因为模型必须在未生成空间解释的情况下隐式推断复杂几何关系。本文提出Perceptio，一种具备2D与3D空间推理能力的感知增强型LVLM，其通过直接在自回归序列中生成的显式语义分割标记与深度标记实现这一目标。具体而言，我们（i）从强大的单目深度估计模型中蒸馏出VQ-VAE深度码本，将稠密深度信息编码为紧凑序列；（ii）将基于SAM2的语义分割标记与VQ-VAE深度标记集成到大型语言模型内部，使模型先输出空间标记再生成答案。为稳定深度标记生成，我们提出了新颖的复合深度标记目标函数（标记符损失、标记损失与计数损失）以及用于可微分重建的软融合技术。我们采用跨多数据集的多任务协同训练策略，使模型通过学习感知标记来处理多种下游任务。基于InternVL架构构建的Perceptio在多项基准测试中达到最先进性能：在RefCOCO/+/g的指代表达式分割任务上分别提升+0.8/+1.4/+1.1 cIoU，HardBLINK空间理解准确率提高10.3%，MMBench准确率提升1.0%，这证明显式空间思维链能实质性地增强LVLMs的空间定位能力。

摘要 (Abstract)

Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.

关键词: Large Vision Language Models, spatial reasoning, semantic segmentation, depth tokens, chain-of-thought, autoregressive sequence, multi-task co-training, state-of-the-art performance
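
The abstract describes tokenizing dense depth into compact sequences with a VQ-VAE codebook. The sketch below illustrates only the nearest-codebook lookup at the core of vector quantization. It is not the authors' implementation: the codebook values, the patch size, and the `<depth>` marker names are hypothetical, and a real VQ-VAE would learn both an encoder and the codebook.

```python
from typing import List, Sequence


def quantize_depth(patches: Sequence[Sequence[float]],
                   codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each depth patch, the index of its nearest codebook vector."""
    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    tokens = []
    for patch in patches:
        distances = [squared_distance(patch, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def to_sequence(tokens: List[int]) -> List[str]:
    """Wrap token ids in start/end markers (marker names are hypothetical)."""
    return ["<depth>"] + [f"<d{t}>" for t in tokens] + ["</depth>"]


# Hypothetical toy data: three 2-value depth patches and a 3-entry codebook.
CODEBOOK = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
PATCHES = [[0.1, 0.0], [0.9, 1.0], [0.4, 0.6]]

depth_tokens = quantize_depth(PATCHES, CODEBOOK)
depth_sequence = to_sequence(depth_tokens)
print(depth_sequence)
```

Emitting such a token sequence before the answer is what lets an autoregressive model state a spatial interpretation explicitly; the start/end markers and the fixed token count are plausibly what the marker and count losses in the abstract supervise, though the abstract does not spell this out.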

82. ❌ Functional Subspace Watermarking for Large Language Models

作者: Zikang Ding, Junhao Li, Suling Wu, Junchi Yao, Hongbo Liu, Lijie Hu 期刊/来源: arxiv 发布日期: 2026-03-19 arXiv链接: http://arxiv.org/abs/2603.18793v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

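Scoring rationale: The paper's core topic is model watermarking for LLMs, so it is highly relevant to 'Large Language Models' (10 points). It explicitly states that the watermark must remain robust after model modifications such as fine-tuning and quantization, so it is moderately related to 'Post-training/SFT' (5 points) and 'Quantization/Model Compression' (5 points). Other keywords such as MoE, Scaling Laws, RAG, and Agents are not covered in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper targets the insufficient robustness of watermarks to model modifications (such as fine-tuning and quantization) in LLM ownership protection. It proposes FSW, a watermarking framework based on a low-dimensional functional subspace, and reports detection accuracy and robustness superior to existing methods under multiple attacks.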

摘要翻译

模型水印技术利用内部表征来保护大型语言模型的所有权。然而，在实际的模型修改过程中，如微调、量化或知识蒸馏，这些特征不可避免地会经历复杂的畸变，使得可靠提取变得极具挑战性。尽管针对模型侧水印已有广泛研究，现有方法仍缺乏对参数级扰动的足够鲁棒性。为弥补这一不足，我们提出功能子空间水印（Functional Subspace Watermarking, FSW）框架，该框架将所有权信号锚定在一个低维功能骨干中。具体而言，我们首先通过求解广义特征值问题来提取一个稳定的功能子空间用于水印注入，同时引入自适应谱截断策略，以实现鲁棒性与模型效用之间的最优平衡。此外，我们引入了向量一致性约束，以确保水印注入不会损害原始语义性能。在多种大型语言模型架构和数据集上的大量实验表明，我们的方法在多种模型攻击下实现了卓越的检测精度和统计可验证性，其鲁棒性优于现有的最先进方法。

摘要 (Abstract)

Model watermarking utilizes internal representations to protect the ownership of large language models (LLMs). However, these features inevitably undergo complex distortions during realistic model modifications such as fine-tuning, quantization, or knowledge distillation, making reliable extraction extremely challenging. Despite extensive research on model-side watermarking, existing methods still lack sufficient robustness against parameter-level perturbations. To address this gap, we propose Functional Subspace Watermarking (FSW), a framework that anchors ownership signals into a low-dimensional functional backbone. Specifically, we first solve a generalized eigenvalue problem to extract a stable functional subspace for watermark injection, while introducing an adaptive spectral truncation strategy to achieve an optimal balance between robustness and model utility. Furthermore, a vector consistency constraint is incorporated to ensure that watermark injection does not compromise the original semantic performance. Extensive experiments across various LLM architectures and datasets demonstrate that our method achieves superior detection accuracy and statistical verifiability under multiple model attacks, maintaining robustness that outperforms existing state-of-the-art (SOTA) methods.

关键词: Large Language Models, Model Watermarking, Functional Subspace, Robustness, Fine-tuning, Quantization, Ownership Protection, Detection Accuracy
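
The abstract says a stable functional subspace is extracted by solving a generalized eigenvalue problem, but it does not state which matrices are involved. As a generic illustration, the sketch below finds the dominant solution of A v = λ B v for placeholder matrices. It assumes A is symmetric positive definite and B is diagonal with positive entries (simplifying assumptions made here, not taken from the paper), and it uses plain power iteration instead of a production eigensolver.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def top_generalized_eigenpair(a: Matrix, b_diag: List[float],
                              iters: int = 500) -> Tuple[float, List[float]]:
    """Dominant (lambda, v) with A v = lambda * B v, where B = diag(b_diag).

    Reduces the problem to the standard symmetric form
    C u = lambda u with C = B^(-1/2) A B^(-1/2), then maps back v = B^(-1/2) u.
    """
    n = len(a)
    inv_sqrt = [1.0 / math.sqrt(d) for d in b_diag]
    c = [[inv_sqrt[i] * a[i][j] * inv_sqrt[j] for j in range(n)]
         for i in range(n)]
    u = [1.0] * n
    for _ in range(iters):  # power iteration on C
        w = matvec(c, u)
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    lam = sum(x * y for x, y in zip(u, matvec(c, u)))  # Rayleigh quotient
    v = [s * x for s, x in zip(inv_sqrt, u)]
    return lam, v


# Placeholder matrices, chosen only so the example is easy to check by hand.
A = [[4.0, 1.0], [1.0, 3.0]]
B_DIAG = [2.0, 1.0]

lam, v = top_generalized_eigenpair(A, B_DIAG)
print(lam, v)
```

Keeping only the leading few eigenvectors of such a problem yields a low-dimensional subspace; how many to keep is the kind of choice a spectral truncation strategy would make.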

83. ❌ Proceedings of the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind

作者: Nitay Alon, Joseph M. Barnby, Reuth Mirsky, Stefan Sarkadi 期刊/来源: arxiv 发布日期: 2026-03-19 arXiv链接: http://arxiv.org/abs/2603.18786v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

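Scoring rationale: This entry is the preface to the proceedings of an AAAI 2026 workshop. It contains only the event information, purpose, and scope, and covers no specific research on large models, deep learning techniques, or scientific applications, so it is unrelated to all of the technical keywords.

!!! tip deepseek-chat TL;DR

This entry is the preface to the proceedings of the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind at AAAI 2026, which aims to provide an open-access, curated anthology for the ToM and AI research community.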

摘要翻译

本卷收录了2026年1月26日在新加坡举行的AAAI 2026会议期间“第二届通过心智理论（Theory of Mind，ToM）推进人工智能研讨会”的部分论文。本卷旨在为心智理论与人工智能研究社群提供一个开放获取的精选论文集。

摘要 (Abstract)

This volume includes a selection of papers presented at the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind held at AAAI 2026 in Singapore on 26th January 2026. The purpose of this volume is to provide an open access and curated anthology for the ToM and AI research community.

关键词: Theory of Mind, Artificial Intelligence, AAAI 2026, Workshop Proceedings, ToM Research, AI Research, Singapore Conference

84. ❌ Mi:dm K 2.5 Pro

作者: KT Tech innovation Group 期刊/来源: arxiv 发布日期: 2026-03-19 arXiv链接: http://arxiv.org/abs/2603.18788v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

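Scoring rationale: The paper develops Mi:dm K 2.5 Pro, a 32B-parameter large language model focused on reasoning, long context, and agentic workflows. Highly relevant keywords: LLMs (the core model), Pre-training (Depth Upscaling and a 128K context window), Post-training/SFT (a multi-stage training pipeline), Context Window Extension (128K tokens), Chain of Thought (reasoning-focused optimization), LLM Agents/Tool Use (agentic workflows and tool use), and Model Merging (model merging in post-training). Moderately relevant keywords: Scaling Laws and Data Quality (the data quality pipeline), RLHF (asynchronous reinforcement learning), System 2 Thinking (in-depth reasoning), and Alignment/Hallucination Mitigation (safety evaluation). Other keywords such as MoE, SLMs, RAG, and Quantization are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the shortfall of LLMs in multi-step reasoning, long-context understanding, and agentic workflows for complex enterprise tasks. It presents Mi:dm K 2.5 Pro, built with a quality-centric data pipeline, Depth Upscaling during pre-training, multi-stage post-training, and Fusion Training; the model achieves state-of-the-art results on Korean-specific benchmarks, and Responsible AI evaluations validate its safety.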

摘要翻译

不断演进的大语言模型（LLM）领域需要超越简单文本生成的能力，更应优先考虑多步推理、长上下文理解以及智能体工作流程。这一转变对企业环境中的现有模型提出了挑战，尤其在韩语及特定领域场景中，模型的扩展能力尚显不足。为此，我们推出了 Mi:dm K 2.5 Pro，这是一款拥有 320 亿参数的旗舰级大语言模型，旨在通过以推理为核心的优化来应对企业级的复杂需求。

我们的方法通过一个以质量为中心的数据构建流程，建立了坚实的数据基础：利用抽象语法树（Abstract Syntax Tree, AST）分析处理代码数据，通过填空式合成增强数学数据，并采用基于 LLM 的质量评估器进行筛选。在预训练阶段，模型通过基于层预测器的深度向上扩展（Depth Upscaling, DuS）以及支持 128K 令牌上下文窗口的渐进式策略实现规模化。后训练阶段则引入了专门的多阶段流程，包括推理监督微调（Reasoning SFT）、模型融合以及异步强化学习（Reinforcement Learning, RL），以培养复杂的解决问题能力。随后的“融合训练”则重新平衡了这些能力与对话流畅性、一致的响应风格以及可靠的工具使用。

评估结果表明，Mi:dm K 2.5 Pro 在与全球及本土领先模型的对比中取得了有竞争力的性能。此外，它在韩语专项基准测试中取得了最先进的结果，展现出深厚的语言和文化理解能力。最后，负责任人工智能（Responsible AI）评估验证了其抵御攻击的安全性，确保了模型在无害性与响应性平衡的前提下，具备安全可靠的部署特性。

摘要 (Abstract)

The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. “Fusion Training” then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.

关键词: Large Language Models, Reasoning Optimization, Long Context Window, Agentic Workflows, Model Merging, Korean Language Understanding, Enterprise Applications, Multi-stage Training
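
The abstract mentions abstract syntax tree (AST) analysis as part of the code-data curation pipeline without giving the criteria. The sketch below shows one way such a filter could look for Python samples, using the standard-library `ast` module. The acceptance rule (the sample must parse and define at least one function) and the helper names are assumptions made for illustration, not the pipeline described in the paper.

```python
import ast
from typing import Dict


def ast_quality_report(source: str) -> Dict[str, int]:
    """Parse a code sample and collect simple structural statistics."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Samples that do not parse carry no usable structure.
        return {"parses": False, "functions": 0, "documented": 0}
    functions = [node for node in ast.walk(tree)
                 if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    documented = sum(1 for f in functions if ast.get_docstring(f) is not None)
    return {"parses": True, "functions": len(functions), "documented": documented}


def keep_sample(source: str) -> bool:
    """Hypothetical rule: keep samples that parse and define a function."""
    report = ast_quality_report(source)
    return bool(report["parses"]) and report["functions"] >= 1


GOOD = 'def add(a, b):\n    """Return the sum."""\n    return a + b\n'
BROKEN = "def add(a, b:\n    return a + b\n"
NO_FUNCTION = "x = 1\n"

for name, sample in [("good", GOOD), ("broken", BROKEN), ("no_function", NO_FUNCTION)]:
    print(name, keep_sample(sample))
```

Structural checks of this kind are cheap, so they can plausibly run before a more expensive step such as the LLM-based quality evaluator the abstract mentions.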

85. ❌ Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

作者: Jiatong Xia, Zicheng Duan, Anton van den Hengel, Lingqiao Liu 期刊/来源: arxiv 发布日期: 2026-03-19 arXiv链接: http://arxiv.org/abs/2603.18782v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

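Scoring rationale: The paper is about 3D generation and proposes a diffusion framework that uses point cloud priors for geometry-controllable 3D generation. The scoring keywords all concern large language models, their underlying techniques, and AI-for-science applications, whereas this work belongs to computer vision and 3D generation and involves no LLM techniques, training methods, inference optimization, alignment, agent systems, or scientific applications. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Points-to-3D, a diffusion framework for structure-aware 3D generation that exploits point cloud priors. It addresses the failure of existing methods to use explicit 3D geometric constraints and outperforms existing methods in rendering quality and geometric fidelity on both object and scene generation.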

摘要翻译

近期三维生成领域的进展主要依赖于以图像或文本为条件的模型，而易于获取的三维先验信息仍未得到充分利用。在许多实际场景中，可见区域点云可通过激光雷达等主动传感器或VGGT等前馈预测器轻松获得，这些数据提供了当前方法未能利用的显式几何约束。本研究提出Points-to-3D——一个基于扩散的框架，通过利用点云先验实现几何可控的三维资产与场景生成。该框架建立在潜在三维扩散模型TRELLIS基础上，首先将纯噪声稀疏结构潜在初始化替换为定制化的点云先验输入形式。随后，通过在TRELLIS框架内使用专门设计用于学习全局结构修复的任务数据进行训练，我们构建了结构修复网络，并采用分阶段采样策略（先进行结构修复，再进行边界细化）进行推理，在保持输入先验可见区域的同时完成全局几何构建。在实际应用中，Points-to-3D既可接收精确点云先验，也可处理单张图像通过VGGT估计的点云作为输入。在物体与场景生成任务上的实验均表明，该方法在渲染质量与几何保真度方面持续优于现有先进基线，凸显了显式嵌入点云先验对于实现更精确、结构更可控的三维生成的有效性。

摘要 (Abstract)

Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors. In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input. Experiments on both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.

关键词: 3D generation, point cloud priors, diffusion model, geometry-controllable, structure-aware, TRELLIS, latent 3D diffusion, structural inpainting

86. ❌ Automatic Configuration of LLM Post-Training Pipelines

作者: Channe Chwa, Xinle Wu, Yao Lu 期刊/来源: arxiv 发布日期: 2026-03-19 arXiv链接: http://arxiv.org/abs/2603.18773v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

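Scoring rationale: The paper's core topic is the automatic configuration of LLM post-training pipelines, so it is highly relevant to 'Large Language Models' and 'Post-training/SFT' (10 points each). Its experiments are on biomedical reasoning tasks, which relates it to 'AI for Science/Bioinformatics' (8 points). Other keywords such as MoE, quantization, inference acceleration, and hallucination mitigation are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty of configuring LLM post-training pipelines and proposes AutoPipe, which combines offline learning with online Bayesian optimization. On biomedical reasoning tasks it matches the strongest online HPO baselines while using less than 10% of their computational cost.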

摘要翻译

在现实计算预算下，结合监督微调与强化学习的大语言模型后训练流程配置难度较高：其配置空间具有高维异构性，各阶段耦合紧密，且每次端到端评估成本昂贵。我们提出AutoPipe，一种面向大语言模型后训练配置选择的预算感知双阶段框架。在离线阶段，AutoPipe从历史运行记录中学习数据集条件化的排序学习代理模型，捕捉数据集内部偏好，并为配置空间中有潜力的区域提供可迁移的指导。在线阶段，针对新数据集，AutoPipe利用离线指导引导贝叶斯优化，并通过高斯过程残差代理模型建模数据集特定偏差。为降低评估成本，每个试验均采用早停策略，并通过学习型预测器进行评分——该预测器将早期训练信号映射为最终后训练性能的低成本代理指标。在生物医学推理任务上的实验表明，AutoPipe始终优于纯离线基线方法，并以低于最强在线超参数优化基线10%的计算成本，实现了与之相当的性能表现。

摘要 (Abstract)

LLM post-training pipelines that combine supervised fine-tuning and reinforcement learning are difficult to configure under realistic compute budgets: the configuration space is high-dimensional and heterogeneous, stages are strongly coupled, and each end-to-end evaluation is expensive. We propose AutoPipe, a budget-aware two-stage framework for configuration selection in LLM post-training. Offline, AutoPipe learns a dataset-conditioned learning-to-rank surrogate from historical runs, capturing within-dataset preferences and providing transferable guidance toward promising regions of the configuration space. Online, for a new dataset, AutoPipe uses the offline guidance to steer Bayesian optimization and models dataset-specific deviations with a Gaussian-process residual surrogate. To reduce evaluation cost, each trial is early-stopped and scored by a learned predictor that maps early training signals to a low-cost proxy for final post-training performance. Experiments on biomedical reasoning tasks show that AutoPipe consistently outperforms offline-only baselines and achieves comparable performance with the strongest online HPO baselines while using less than 10% of their computational cost.

关键词: LLM post-training, supervised fine-tuning, reinforcement learning, configuration selection, Bayesian optimization, biomedical reasoning, computational cost reduction, AutoPipe
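
The abstract combines three ideas: offline guidance toward promising configurations, a limited online trial budget, and cheap early-stopped proxy scores. The sketch below shows how these fit together in a deliberately simplified form. It replaces the Bayesian optimization and Gaussian-process residual surrogate from the paper with a plain greedy loop over the offline ranking, and every function, score, and candidate value in it is hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, float]


def prior_guided_search(candidates: List[Config],
                        offline_score: Callable[[Config], float],
                        early_proxy: Callable[[Config], float],
                        budget: int) -> Tuple[Optional[Config], float, int]:
    """Try the `budget` best candidates by offline score; keep the best proxy value."""
    ranked = sorted(candidates, key=offline_score, reverse=True)
    best_cfg, best_val, trials = None, float("-inf"), 0
    for cfg in ranked[:budget]:
        val = early_proxy(cfg)  # stands in for an early-stopped training run
        trials += 1
        if val > best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val, trials


# Hypothetical setup: historical runs favored lr near 1e-4, while the new
# dataset actually prefers a somewhat smaller learning rate.
CANDIDATES = [{"lr": lr} for lr in (1e-5, 3e-5, 1e-4, 3e-4, 1e-3)]


def offline_score(cfg: Config) -> float:
    return -abs(math.log10(cfg["lr"]) + 4.0)


def early_proxy(cfg: Config) -> float:
    return -abs(math.log10(cfg["lr"]) + 4.5)


best, value, used = prior_guided_search(CANDIDATES, offline_score, early_proxy, budget=3)
print(best, round(value, 3), used)
```

The offline ranking alone would pick `lr = 1e-4`, while the three online trials move the choice to `3e-5`. A residual surrogate such as the one the abstract describes would model this kind of dataset-specific deviation explicitly rather than discovering it by enumeration.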

87. ❌ A Concept is More Than a Word: Diversified Unlearning in Text-to-Image Diffusion Models

作者: Duc Hao Pham, Van Duy Truong, Duy Khanh Dinh, Tien Cuong Nguyen, Dien Hy Ngo, Tuan Anh Bui 期刊/来源: arxiv 发布日期: 2026-03-19 arXiv链接: http://arxiv.org/abs/2603.18767v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

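Scoring rationale: The paper studies concept unlearning in text-to-image diffusion models, which targets diffusion models rather than large language models. The keywords all center on LLMs and related techniques (such as MoE, SFT, RLHF, RAG, and inference acceleration), whereas this paper focuses on parameter editing and safety for diffusion models and has no direct connection to LLM techniques. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the limitations of keyword-based concept unlearning in text-to-image diffusion models: concepts are multi-dimensional and overlap in latent space, which leads to over-forgetting. It proposes Diversified Unlearning, a framework that represents a concept more precisely through a diverse set of prompts, and reports stronger erasure, better retention of unrelated concepts, and improved robustness against adversarial recovery attacks across multiple benchmarks.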

摘要翻译

概念遗忘已成为一种有前景的研究方向，其通过从文本到图像扩散模型的参数中有选择性地消除不良概念，以降低有害内容生成的风险。现有方法通常依赖关键词来识别待遗忘的目标概念。然而，我们发现这种基于关键词的框架存在固有局限性：视觉概念是多维的，可以通过多样化的文本形式表达，并且常在潜在空间中与相关概念重叠，这使得仅依赖关键词的遗忘方法——因其不精确地指示目标概念——显得脆弱且容易导致过度遗忘。这一问题的根源在于，单个关键词仅代表概念的狭窄点估计，无法覆盖其完整的语义分布及潜在空间中的纠缠变体。为克服此局限，我们提出多样化遗忘，这是一种分布式的框架，通过一组语境多样化的提示词而非单一关键词来表征概念。这种更丰富的表征能够实现更精确、更稳健的遗忘。通过在多个基准测试和最先进的基线模型上进行广泛实验，我们证明将多样化遗忘作为附加组件集成到现有遗忘流程中，能够持续实现更强的概念擦除效果、更好地保留无关概念，并提升对抗恢复攻击的鲁棒性。

摘要 (Abstract)

Concept unlearning has emerged as a promising direction for reducing the risks of harmful content generation in text-to-image diffusion models by selectively erasing undesirable concepts from a model’s parameters. Existing approaches typically rely on keywords to identify the target concept to be unlearned. However, we show that this keyword-based formulation is inherently limited: a visual concept is multi-dimensional, can be expressed in diverse textual forms, and often overlaps with related concepts in the latent space, making keyword-only unlearning, which imprecisely indicates the target concept, brittle and prone to over-forgetting. This occurs because a single keyword represents only a narrow point estimate of the concept, failing to cover its full semantic distribution and entangled variations in the latent space. To address this limitation, we propose Diversified Unlearning, a distributional framework that represents a concept through a set of contextually diverse prompts rather than a single keyword. This richer representation enables more precise and robust unlearning. Through extensive experiments across multiple benchmarks and state-of-the-art baselines, we demonstrate that integrating Diversified Unlearning as an add-on component into existing unlearning pipelines consistently achieves stronger erasure, better retention of unrelated concepts, and improved robustness against adversarial recovery attacks.

关键词: concept unlearning, text-to-image diffusion models, diversified unlearning, prompt-based representation, over-forgetting, adversarial recovery attacks, parameter editing
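
The abstract contrasts a single keyword with a set of contextually diverse prompts as the representation of a concept, but does not say how the prompt set is built. The sketch below uses simple template expansion to make the contrast concrete. The templates, the example concept, and the function name are all hypothetical and only illustrate the idea of covering a concept with many phrasings.

```python
from itertools import product
from typing import List


def diversified_prompts(concept: str,
                        styles: List[str],
                        contexts: List[str]) -> List[str]:
    """Expand one concept into prompts that vary style and context."""
    return [f"{style} of {concept} {context}"
            for style, context in product(styles, contexts)]


# Hypothetical templates and concept, for illustration only.
STYLES = ["a photo", "an oil painting", "a sketch"]
CONTEXTS = ["in a park", "at night"]

keyword_only = ["a red car"]
prompt_set = diversified_prompts("a red car", STYLES, CONTEXTS)

print(len(keyword_only), len(prompt_set))
for prompt in prompt_set:
    print(prompt)
```

An unlearning objective computed over such a set, rather than over the single keyword, is one way to read the "distributional" framing in the abstract.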

88. ❌ ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation

作者: Haochen Zhao, Shaoyang Cui 期刊/来源: arxiv 发布日期: 2026-03-19 arXiv链接: http://arxiv.org/abs/2603.18762v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

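Scoring rationale: The paper presents ClawTrap, a security evaluation framework for the OpenClaw autonomous web agent, covering MITM attack testing and analysis of model safety behavior. Only the keyword 'LLM Agents OR Autonomous Agents OR Agentic Workflow' is moderately relevant (5 points), because OpenClaw is an autonomous agent, but the paper does not discuss LLM technical details or agent-workflow innovations in depth. The remaining keywords, which concern LLM technical principles, training methods, inference optimization, and scientific applications, are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes ClawTrap, a MITM-based framework for evaluating the security of the OpenClaw autonomous web agent under real network threats. It finds that weaker models are more easily misled by tampered observations into producing unsafe outputs, while stronger models show better anomaly attribution and safer fallback strategies.

Abstract (Chinese translation)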

诸如 OpenClaw 之类的自主网络代理正快速进入高影响力的现实世界工作流程，但其在实时网络威胁下的安全鲁棒性仍未得到充分评估。现有基准测试主要关注静态沙箱环境和内容层面的提示攻击，这为网络层安全测试留下了实践空白。本文提出 ClawTrap，一个基于中间人攻击（MITM）的实战化红队框架，用于评估 OpenClaw 在真实环境中的安全性。ClawTrap 支持多样且可定制的攻击形式，包括静态 HTML 替换、Iframe 弹窗注入和动态内容篡改，并提供了一个可复现的流程，用于规则驱动的流量拦截、内容转换与审计。该设计为未来研究构建更丰富、可定制的 MITM 攻击，以及跨代理框架和模型骨干进行系统性安全测试奠定了基础。我们的实证研究显示出清晰的模型分层现象：较弱的模型更倾向于信任被篡改的观测信息并产生不安全的输出，而更强的模型则表现出更好的异常归因能力和更安全的回退策略。这些发现表明，可靠的 OpenClaw 安全评估应明确纳入动态的真实世界 MITM 条件，而非仅依赖静态沙箱协议。

Abstract

Autonomous web agents such as OpenClaw are rapidly moving into high-impact real-world workflows, but their security robustness under live network threats remains insufficiently evaluated. Existing benchmarks mainly focus on static sandbox settings and content-level prompt attacks, which leaves a practical gap for network-layer security testing. In this paper, we present ClawTrap, a MITM-based red-teaming framework for real-world OpenClaw security evaluation. ClawTrap supports diverse and customizable attack forms, including Static HTML Replacement, Iframe Popup Injection, and Dynamic Content Modification, and provides a reproducible pipeline for rule-driven interception, transformation, and auditing. This design lays the foundation for future research to construct richer, customizable MITM attacks and to perform systematic security testing across agent frameworks and model backbones. Our empirical study shows clear model stratification: weaker models are more likely to trust tampered observations and produce unsafe outputs, while stronger models demonstrate better anomaly attribution and safer fallback strategies. These findings indicate that reliable OpenClaw security evaluation should explicitly incorporate dynamic real-world MITM conditions rather than relying only on static sandbox protocols.

Keywords: OpenClaw, autonomous web agents, MITM-based red-teaming, security evaluation, network-layer security, model stratification, real-world threats, dynamic content modification

89. ❌ Are complicated loss functions necessary for teaching LLMs to reason?

Authors: Gabriele Carrino, Andrea Sassella, Nicolo Brunello, Federico Toschi, Mark James Carman Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18756v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

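Scoring rationale: The paper's core topic is improving the reasoning ability of LLMs, an innovation in LLM technical principles. Highly relevant keywords: LLMs (the core object of study), Post-training/SFT (it studies post-training techniques such as GRPO and RGRA), RLHF/DPO (GRPO and RGRA are reinforcement-learning alignment methods), Chain of Thought/Reasoning (it studies mathematical reasoning), and System 2 Thinking (it involves in-depth reasoning). Other keywords, such as MoE, SLMs, RAG, and quantization, are not covered.

!!! tip deepseek-chat TL;DR

The paper studies whether a simplified reinforcement-learning post-training method (RGRA, proposed as an alternative to GRPO) is effective at improving the mathematical reasoning of large language models, and finds that the simpler REINFORCE-based method may achieve stronger performance than GRPO.

Abstract (Chinese translation)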

近期大语言模型（LLM）的研究进展突显了后训练技术对提升推理与数学能力的重要性。其中，组相对策略优化（Group Relative Policy Optimization, GRPO）通过结合组相对优势估计、PPO风格裁剪以及KL正则化，在该领域展现出潜力。然而，其复杂性引发了一个疑问：是否所有组件都是激发推理行为所必需的？我们对GRPO进行了系统性分析，并得出两个关键发现：（1）纳入负反馈至关重要，仅基于基线以上动作进行训练会限制学习能力；（2）PPO风格约束（如策略比率裁剪）并非提升数学推理或性能所必需。基于这些发现，我们提出了带组相对优势的REINFORCE方法（RGRA），这是一种简化变体，保留了组相对优势估计，但移除了PPO风格裁剪和策略比率项。在标准数学基准测试上的实验表明，RGRA有潜力取得比GRPO更强的性能。我们的研究结果表明，基于REINFORCE的简化方法能有效增强大语言模型的推理能力，为GRPO提供了一种更透明且高效的替代方案。

Abstract

Recent advances in large language models (LLMs) highlight the importance of post-training techniques for improving reasoning and mathematical ability. Group Relative Policy Optimization (GRPO) has shown promise in this domain by combining group-relative advantage estimation, PPO-style clipping, and KL regularization. However, its complexity raises the question of whether all components are necessary for fostering reasoning behaviors. We conduct a systematic analysis of GRPO and identify two key findings: (1) incorporating negative feedback is essential, since training solely on actions above a baseline limits learning; and (2) PPO-style constraints, such as policy ratio clipping, are not required to improve mathematical reasoning or performance. Building on these insights, we propose REINFORCE with Group Relative Advantage (RGRA), a simplified variant that retains group-relative advantage estimation but removes PPO-style clipping and policy ratio terms. Experiments across standard mathematical benchmarks indicate that RGRA has the potential to achieve stronger performance than GRPO. Our results suggest that simpler REINFORCE-based approaches can effectively enhance reasoning in LLMs, offering a more transparent and efficient alternative to GRPO.

Keywords: Large Language Models, Reasoning, Mathematical Reasoning, Post-training, Reinforcement Learning, GRPO, RGRA, PPO
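
The difference between the two objectives described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes sequence-level log-probabilities, standardises rewards within a group (the usual GRPO convention), and omits the KL term entirely. The function names are hypothetical.

```python
import math


def group_relative_advantages(rewards):
    """Standardise rewards within one group of completions for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    eps = 1e-8  # guards against division by zero when all rewards are equal
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def reinforce_group_loss(logprobs, rewards):
    """REINFORCE-style surrogate: -mean(A_i * log pi(y_i | x)).

    No policy ratio and no clipping. Completions below the group mean get
    negative advantages, so they still contribute (negative feedback).
    """
    adv = group_relative_advantages(rewards)
    return -sum(a * lp for a, lp in zip(adv, logprobs)) / len(rewards)


def clipped_ratio_loss(logprobs, old_logprobs, rewards, clip=0.2):
    """PPO-style clipped surrogate, shown for contrast (KL term omitted)."""
    adv = group_relative_advantages(rewards)
    total = 0.0
    for a, lp, old in zip(adv, logprobs, old_logprobs):
        ratio = math.exp(lp - old)
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        total += min(ratio * a, clipped * a)
    return -total / len(rewards)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]        # hypothetical correctness rewards
    logprobs = [-1.0, -2.0, -2.0, -1.0]   # hypothetical sequence log-probs
    print(group_relative_advantages(rewards))
    print(reinforce_group_loss(logprobs, rewards))
    print(clipped_ratio_loss(logprobs, logprobs, rewards))
```

In a real training loop the log-probabilities would come from the policy model and the loss would be differentiated by an autodiff framework; plain floats are used here only to keep the arithmetic visible.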

90. ❌ NeuroGame Transformer: Gibbs-Inspired Attention Driven by Game Theory and Statistical Physics

Authors: Djamel Bouchaffra, Fayçal Ykhlef, Hanene Azzag, Mustapha Lebbah, Bilal Faye Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18761v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

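Scoring rationale: The paper proposes a new attention mechanism, the NeuroGame Transformer, which recasts Transformer attention as a combination of game theory and statistical physics: Shapley values and Banzhaf indices quantify token importance, and attention weights are computed from an Ising Hamiltonian and the Gibbs distribution. However, the paper concentrates on improving the basic Transformer architecture and does not touch on large language models (LLMs), pre-training, fine-tuning, alignment, inference optimization, agent systems, model compression, or AI for science. All keywords concern LLM techniques or applications, whereas this work is a general Transformer improvement with no direct link to them, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

To address the limitation that standard Transformer attention models only pairwise dependencies, the paper proposes the NeuroGame Transformer, grounded in game theory and statistical physics. Tokens are treated as players in a cooperative game and as spins in a statistical physics system; Shapley values and Banzhaf indices quantify their importance, and attention weights are computed from an Ising Hamiltonian and the Gibbs distribution. Experiments on SNLI and MNLI-matched show it outperforming some efficient Transformer baselines.

Abstract (Chinese translation)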

Transformer中的标准注意力机制受限于其成对交互的建模方式，难以捕捉词元间的高阶依赖关系。为克服这一局限，我们提出神经博弈Transformer（NeuroGame Transformer，NGT），通过双重视角重构注意力机制：将词元同时视为合作博弈中的参与者与统计物理系统中相互作用的粒子。词元重要性通过两种互补的博弈论概念进行量化——基于全局排列归因的沙普利值（Shapley values）与聚焦局部联盟影响的班扎夫指数（Banzhaf indices），二者通过可学习的门控参数融合为等效外磁场，而词元间的协同关系则由成对交互势能刻画。该系统能量遵循伊辛模型哈密顿量，注意力权重则作为吉布斯分布下的边际概率呈现，可通过平均场方程高效计算。尽管联盟空间呈指数增长，我们设计了基于吉布斯分布权重的重要性采样蒙特卡洛估计器，避免显式计算指数级因子，确保长序列处理的数值稳定性。我们提供了理论收敛性证明，并分析了由插值参数调控的公平性-敏感性权衡关系。实验结果表明，神经博弈Transformer在SNLI和MNLI-matched数据集上表现优异，超越若干主流高效Transformer基线模型。在SNLI测试集上达到86.4%的准确率（验证集峰值准确率86.6%），优于ALBERT-Base模型，并与RoBERTa-Base保持强劲竞争力。代码已开源：https://github.com/dbouchaffra/NeuroGame-Transformer。

Abstract

Standard attention mechanisms in transformers are limited by their pairwise formulation, which hinders the modeling of higher-order dependencies among tokens. We introduce the NeuroGame Transformer (NGT) to overcome this by reconceptualizing attention through a dual perspective: tokens are treated simultaneously as players in a cooperative game and as interacting spins in a statistical physics system. Token importance is quantified using two complementary game-theoretic concepts – Shapley values for global, permutation-based attribution and Banzhaf indices for local, coalition-level influence. These are combined via a learnable gating parameter to form an external magnetic field, while pairwise interaction potentials capture synergistic relationships. The system’s energy follows an Ising Hamiltonian, with attention weights emerging as marginal probabilities under the Gibbs distribution, efficiently computed via mean-field equations. To ensure scalability despite the exponential coalition space, we develop importance-weighted Monte Carlo estimators with Gibbs-distributed weights. This approach avoids explicit exponential factors, ensuring numerical stability for long sequences. We provide theoretical convergence guarantees and characterize the fairness-sensitivity trade-off governed by the interpolation parameter. Experimental results demonstrate that the NeuroGame Transformer achieves strong performance on SNLI and MNLI-matched, outperforming some major efficient transformer baselines. On SNLI, it attains a test accuracy of 86.4% (with a peak validation accuracy of 86.6%), surpassing ALBERT-Base and remaining highly competitive with RoBERTa-Base. Code is available at https://github.com/dbouchaffra/NeuroGame-Transformer.

Keywords: NeuroGame Transformer, attention mechanism, game theory, statistical physics, Shapley values, Banzhaf indices, Ising Hamiltonian, Gibbs distribution
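
To make the two game-theoretic quantities named in the abstract above concrete, the sketch below computes exact Shapley values and Banzhaf indices for a three-player toy game. It uses the textbook definitions with full enumeration. It is not the paper's Monte Carlo estimator, and the value function `toy_value` and token names are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: marginal contributions weighted by coalition size."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def banzhaf_indices(players, value):
    """Exact Banzhaf indices: unweighted average over all coalitions of the others."""
    n = len(players)
    beta = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += value(s | {p}) - value(s)
        beta[p] = total / 2 ** (n - 1)
    return beta


# Hypothetical game: each token has an individual worth, plus a bonus of 3
# that only appears when all three tokens are present (a higher-order effect).
WORTH = {"a": 1.0, "b": 2.0, "c": 3.0}


def toy_value(coalition):
    bonus = 3.0 if len(coalition) == 3 else 0.0
    return sum(WORTH[t] for t in coalition) + bonus


if __name__ == "__main__":
    tokens = list(WORTH)
    print(shapley_values(tokens, toy_value))
    print(banzhaf_indices(tokens, toy_value))
```

In this toy game the Shapley values sum to the value of the full coalition, while the Banzhaf indices do not. Enumeration costs grow exponentially with the number of players, which is the scalability problem the abstract says its sampling-based estimators are designed to avoid.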

91. ❌ WeNLEX: Weakly Supervised Natural Language Explanations for Multilabel Chest X-ray Classification

Authors: Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18752v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

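Scoring rationale: The paper focuses on generating natural language explanations for medical image (chest X-ray) classification. This is an application of AI in biomedicine and is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It also deals with model interpretability, so it is fairly relevant to 'Mechanistic Interpretability OR Explainable AI' (8 points). It does not examine LLM technical principles, training methods, inference optimization, or agent systems in depth, so all other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes WeNLEX, a weakly supervised model that generates natural language explanations for multilabel chest X-ray classification. Feature-space matching and distribution alignment keep the explanations faithful and plausible, and the paper shows that the model can both adapt its explanations to different audiences and improve classification performance when trained jointly with the classifier.

Abstract (Chinese translation)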

自然语言解释提供了一种本质上易于人类理解的黑箱模型解释方法，其形式与放射科医师在文本报告中传达诊断结论的方式高度契合。现有研究大多使用带有解释标注的数据集对解释生成过程进行显式监督，因此生成的解释虽看似合理，却未能忠实反映模型的实际推理过程。本研究提出弱监督自然语言解释生成模型（WeNLEX），用于多标签胸部X光分类任务的自然语言解释生成。该模型通过确保黑箱模型特征空间中由自然语言解释重构生成的图像与原始图像相匹配，从而保证解释的忠实性；同时，通过与少量临床医师标注解释数据库进行分布对齐，维持解释的合理性。我们通过多维度指标（包括忠实性、可模拟性、多样性与合理性）的广泛验证，实证表明WeNLEX仅需每个诊断类别5条真实解释标注，即可生成既忠实又合理的解释。此外，WeNLEX可在事后解释与模型内解释两种模式下运行。在模型内解释模式下（即多标签分类器与网络其余部分联合训练时），WeNLEX将独立分类器的分类AUC提升了2.21%，这表明在训练过程中引入可解释性机制能够实际提升下游任务性能。通过简单更换解释数据库，WeNLEX生成的解释可适配不同目标受众，我们通过训练面向非医疗用户的通俗版WeNLEX（其解释文本经过简化处理）展示了这一灵活性。

Abstract

Natural language explanations provide an inherently human-understandable way to explain black-box models, closely reflecting how radiologists convey their diagnoses in textual reports. Most works explicitly supervise the explanation generation process using datasets annotated with explanations. Thus, though plausible, the generated explanations are not faithful to the model’s reasoning. In this work, we propose WeNLEX, a weakly supervised model for the generation of natural language explanations for multilabel chest X-ray classification. Faithfulness is ensured by matching images generated from their corresponding natural language explanations with original images, in the black-box model’s feature space. Plausibility is maintained via distribution alignment with a small database of clinician-annotated explanations. We empirically demonstrate, through extensive validation on multiple metrics to assess faithfulness, simulatability, diversity, and plausibility, that WeNLEX is able to produce faithful and plausible explanations, using as little as 5 ground-truth explanations per diagnosis. Furthermore, WeNLEX can operate in both post-hoc and in-model settings. In the latter, i.e., when the multilabel classifier is trained together with the rest of the network, WeNLEX improves the classification AUC of the standalone classifier by 2.21%, thus showing that adding interpretability to the training process can actually increase the downstream task performance. Additionally, simply by changing the database, WeNLEX explanations are adaptable to any target audience, and we showcase this flexibility by training a layman version of WeNLEX, where explanations are simplified for non-medical users.

Keywords: natural language explanations, chest X-ray classification, weakly supervised, faithfulness, plausibility, interpretability, multilabel classification, medical imaging

92. ❌ Memento-Skills: Let Agents Design Agents

Authors: Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18743v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

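Scoring rationale: The paper's core topic is an LLM agent system that uses a memory-based reinforcement learning framework to let agents design and continually improve agents autonomously. It is highly relevant to 'LLM Agents' and 'Self-Correction' (10 points), touches on 'Tool Use' and 'Multi-agent Systems' (5 points), and does not cover other keywords such as MoE, quantization, or inference acceleration (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes Memento-Skills, a generalist LLM agent system that autonomously designs, adapts, and improves task-specific agents. It achieves continual learning through memory-based reinforcement learning and stateful prompts, and reports substantial performance gains on multiple benchmarks.

Abstract (Chinese translation)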

我们提出Memento-Skills，一个通用、可持续学习的大语言模型智能体系统，其功能相当于一个智能体设计智能体：它能够通过经验自主构建、调整并改进面向特定任务的智能体。该系统建立在一个基于记忆的强化学习框架之上，采用状态化提示，其中可复用的技能（以结构化Markdown文件形式存储）作为持久且不断演化的记忆。这些技能同时编码了行为与上下文，使得智能体能够在多次交互中持续积累知识。

系统从简单的基础技能（如网络搜索和终端操作）开始，通过Memento 2 [wang2025memento2] 中引入的读写反思学习机制持续改进。在读取阶段，一个可训练的行为技能路由器根据当前的状态化提示选择最相关的技能；在写入阶段，智能体基于新经验更新和扩展其技能库。这种闭环设计实现了无需更新大语言模型参数的持续学习，因为所有的适应都是通过外部化技能和提示的演化来实现的。

与以往依赖人工设计智能体的方法不同，Memento-Skills使一个通用智能体能够为新任务端到端地设计智能体。通过迭代式的技能生成与优化，系统逐步提升其自身能力。在通用人工智能助手基准测试和人类终极考试上的实验均显示出持续的性能增益，整体准确率分别实现了26.2%和116.2%的相对提升。代码发布于 https://github.com/Memento-Teams/Memento-Skills。

Abstract

We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read–Write Reflective Learning mechanism introduced in Memento2 [wang2025memento2]. In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants benchmark and Humanity’s Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.

Keywords: LLM agents, autonomous agents, continual learning, skill-based memory, agent-designing agent, reinforcement learning, stateful prompts, self-improvement
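
The read and write phases described in the abstract above can be pictured with a very small toy. Everything in this sketch is an assumption made for illustration: the `SkillLibrary` class, the skill names, and above all the router, which here is plain word overlap standing in for the trainable skill router the paper describes. It is not taken from the Memento-Skills code base.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Toy external memory: skill name -> markdown text (hypothetical layout)."""
    skills: dict = field(default_factory=dict)

    def read(self, prompt):
        """Read phase: return the (name, text) of the best-matching skill.

        Word overlap is a stand-in for a learned router, not the real mechanism.
        """
        words = set(prompt.lower().split())

        def overlap(item):
            return len(words & set(item[1].lower().split()))

        return max(self.skills.items(), key=overlap)

    def write(self, name, markdown):
        """Write phase: add a new skill or overwrite an existing one."""
        self.skills[name] = markdown


if __name__ == "__main__":
    library = SkillLibrary()
    library.write("web_search", "# Web search\nsearch the web for pages")
    library.write("terminal", "# Terminal\nrun shell commands in a terminal")
    print(library.read("run a shell command")[0])

    # After new experience, the library grows without touching model weights.
    library.write("git_clone", "# Git clone\nclone a git repository with git clone")
    print(library.read("clone the git repository")[0])
```

The point of the toy is only that adaptation happens in the external skill store: the second query is routed to a skill that did not exist when the first one was answered.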

93. ❌ Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review

Authors: Dimitris Mitropoulos, Nikolaos Alexopoulos, Georgios Alexopoulos, Diomidis Spinellis Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18740v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

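Scoring rationale: The paper studies the use of LLMs in security code review, focusing on LLM confirmation bias and how it can be exploited in software supply-chain attacks. Highly relevant keywords: 'Large Language Models' (the paper explicitly studies LLMs applied to code review) and 'LLM Agents' (it covers LLM systems ranging from interactive assistants to autonomous agents). Moderately relevant keywords: 'Hallucination Mitigation' (it touches on LLM reliability) and 'Mechanistic Interpretability' (it analyzes how the bias arises in LLMs). The other keywords concern technical details the paper does not address, such as model architecture, training methods, and inference optimization.

!!! tip deepseek-chat TL;DR

The paper studies how confirmation bias affects LLM-based vulnerability detection and shows that the bias can be exploited in software supply-chain attacks. Adversarial framing substantially lowers vulnerability detection rates, while debiasing measures effectively restore detection performance.

Abstract (Chinese translation)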

安全代码审查日益依赖集成大型语言模型（LLM）的系统，其应用范围涵盖交互式助手至CI/CD流水线中的自主代理。本研究探讨确认偏误（即倾向于支持与先前预期相符的解释）是否影响基于LLM的漏洞检测，以及这种失效模式是否可被用于软件供应链攻击。我们进行了两项互补研究。

研究一通过对照实验量化确认偏误：在五种提示框架条件下，使用四个前沿模型评估250个CVE漏洞/补丁对。将代码变更描述为“无缺陷”会使漏洞检测率下降16-93%，且呈现强烈不对称效应：误报率变化微小，而漏报率急剧上升。偏误效应因漏洞类型而异，注入类漏洞比内存破坏漏洞更易受影响。

研究二通过模拟对抗性拉取请求评估实际可利用性：攻击者在拉取请求元数据中将已知漏洞重新引入的变更描述为安全改进或紧急功能修复。实验结果显示，在单次攻击中，对抗性框架对GitHub Copilot（交互式助手）的成功率为35%；在真实项目配置中，攻击者可通过迭代优化框架描述提升成功率，对Claude Code（自主代理）的攻击成功率可达88%。通过元数据脱敏和明确指令进行去偏处理后，所有交互式案例及94%的自主案例均恢复了漏洞检测能力。我们的研究表明，确认偏误构成基于LLM的代码审查系统的潜在弱点，这对AI辅助开发工具的部署方式具有重要启示。

Abstract

Security code reviews increasingly rely on systems integrating Large Language Models (LLMs), ranging from interactive assistants to autonomous agents in CI/CD pipelines. We study whether confirmation bias (i.e., the tendency to favor interpretations that align with prior expectations) affects LLM-based vulnerability detection, and whether this failure mode can be exploited in software supply-chain attacks. We conduct two complementary studies. Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art models under five framing conditions for the review prompt. Framing a change as bug-free reduces vulnerability detection rates by 16-93%, with strongly asymmetric effects: false negatives increase sharply while false positive rates change little. Bias effects vary by vulnerability type, with injection flaws being more susceptible to them than memory corruption bugs. Study 2 evaluates exploitability in practice by mimicking adversarial pull requests that reintroduce known vulnerabilities while being framed as security improvements or urgent functionality fixes via their pull request metadata. Adversarial framing succeeds in 35% of cases against GitHub Copilot (interactive assistant) under one-shot attacks and in 88% of cases against Claude Code (autonomous agent) in real project configurations where adversaries can iteratively refine their framing to increase attack success. Debiasing via metadata redaction and explicit instructions restores detection in all interactive cases and 94% of autonomous cases. Our results show that confirmation bias poses a weakness in LLM-based code review, with implications for how AI-assisted development tools are deployed.

Keywords: Large Language Models, LLM-assisted security code review, confirmation bias, vulnerability detection, software supply-chain attacks, autonomous agents, adversarial framing, debiasing techniques
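
The asymmetry reported in Study 1 (false negatives rise sharply while false positive rates barely move) is easier to read with the metric definitions written out. The sketch below uses made-up confusion counts, not the paper's data. It also assumes the reduction is measured relative to the neutral-framing baseline, which the abstract does not state explicitly.

```python
def review_rates(tp, fn, fp, tn):
    """Detection, false-negative and false-positive rates from confusion counts."""
    return {
        "detection_rate": tp / (tp + fn),       # vulnerable changes that were flagged
        "false_negative_rate": fn / (tp + fn),  # vulnerable changes that were missed
        "false_positive_rate": fp / (fp + tn),  # patched changes wrongly flagged
    }


def relative_drop(baseline, framed):
    """Relative reduction of a rate with respect to its baseline value."""
    return (baseline - framed) / baseline


if __name__ == "__main__":
    # Hypothetical counts for 250 vulnerable and 250 patched changes.
    neutral = review_rates(tp=200, fn=50, fp=25, tn=225)
    bug_free = review_rates(tp=100, fn=150, fp=20, tn=230)

    print(neutral)
    print(bug_free)
    print(relative_drop(neutral["detection_rate"], bug_free["detection_rate"]))
```

With these invented counts the detection rate halves while the false positive rate shifts by only two percentage points, which is the shape of effect the abstract describes.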

94. ❌ CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Authors: Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu, Xiaoxi Li, Yuan Lu, Xinggao Liu, Haoxuan Li, Zhouchen Lin Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18736v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	15.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

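Scoring rationale: The paper's core topic is reward modeling in RLHF. It proposes the CausalRM framework for learning unbiased reward models from observational user feedback, which is highly relevant to 'RLHF' (15 points), and it also involves LLM alignment (10 points) and foundation-model applications (10 points). Other keywords, such as MoE, SLMs, and Scaling Laws, are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the reliance of RLHF reward modeling on expensive experimental data, the paper proposes the CausalRM framework, which learns unbiased reward models from noisy and biased observational user feedback using a noise-aware loss and propensity-score reweighting. It reports substantial gains on RLHF tasks across multiple benchmark datasets.

Abstract (Chinese translation)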

尽管基于人类反馈的强化学习（RLHF）在语言模型对齐方面取得了成功，但当前的奖励建模严重依赖于在受控且成本高昂条件下从人工标注者收集的实验性反馈数据。在本研究中，我们提出了观测性奖励建模——利用观测性用户反馈（如点击、复制和点赞）学习奖励模型——作为一种可扩展且经济高效的替代方案。我们识别了该设置中的两个根本性挑战：（1）观测性反馈因标注错误而产生噪声，使其偏离真实用户偏好；（2）观测性反馈受用户偏好影响而产生偏差，即用户倾向于对其感受强烈的回复提供反馈，这导致训练数据与推理数据之间出现分布偏移。为应对这些挑战，我们提出了CausalRM，一个基于因果理论的奖励建模框架，旨在从观测性反馈中学习无偏的奖励模型。针对挑战（1），CausalRM引入了一个噪声感知的代理损失项，通过显式建模标注错误生成过程，该损失项在无噪声条件下被证明等价于原始损失。针对挑战（2），CausalRM利用倾向得分——用户对给定回复提供反馈的概率——对训练样本进行重新加权，从而得到一个能消除用户偏好偏差的损失函数。在多种大语言模型骨干网络和基准数据集上的大量实验验证，CausalRM能够有效从含噪声和有偏差的观测性反馈中学习准确的奖励信号，并在下游RLHF任务上带来显著的性能提升——在WildGuardMix上获得49.2%的性能增益，在HarmBench上实现32.7%的改进。代码已在项目网站公开。

Abstract

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling – learning reward models with observational user feedback (e.g., clicks, copies, and upvotes) – as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, which makes it deviate from true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creates a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores – the probability of a user providing feedback for a given response – to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks – including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.

Keywords: RLHF, reward modeling, observational feedback, causal inference, alignment, language models, user preference, bias correction
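
The propensity-score reweighting described for challenge (2) in the CausalRM abstract can be illustrated with a generic inverse-propensity-weighted loss. This is a minimal sketch and not CausalRM's actual objective: the function name `ipw_bce_loss`, the binary feedback labels, the clipping threshold, and the self-normalised averaging are all assumptions made for illustration.

```python
import math


def ipw_bce_loss(scores, labels, propensities, clip=0.05):
    """Inverse-propensity-weighted binary cross-entropy (illustrative only).

    scores:       raw reward-model outputs (logits), one per response
    labels:       observed binary feedback (1 = positive, 0 = negative)
    propensities: estimated probability that a user gives feedback at all
    """
    total, weight_sum = 0.0, 0.0
    for s, y, p in zip(scores, labels, propensities):
        # Clip propensities away from zero so one rare event cannot dominate.
        p = min(max(p, clip), 1.0)
        prob = 1.0 / (1.0 + math.exp(-s))
        prob = min(max(prob, 1e-12), 1.0 - 1e-12)  # guard the logarithms
        bce = -(y * math.log(prob) + (1 - y) * math.log(1.0 - prob))
        w = 1.0 / p  # rarely observed samples count more
        total += w * bce
        weight_sum += w
    return total / weight_sum  # self-normalised weighted mean


# Hypothetical data: the second response rarely receives feedback.
print(ipw_bce_loss([2.0, 0.0], [1, 1], [1.0, 0.1]))
```

With equal propensities the weights cancel and the function reduces to plain mean cross-entropy; unequal propensities shift the loss toward responses that users seldom rate.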

95. ❌ Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures

Authors: Martina Ullasci, Marco Rondina, Riccardo Coppola, Flavio Giobergia, Riccardo Bellanca, Gabriele Mancari Pasi, Luca Prato, Federico Spinoso, Silvia Tagliente Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18729v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
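
The 26.6 threshold in the score line comes from the report's configured pass-mark formula, 5.0 + 0.8 × total weight, with a total weight of 27.0. The arithmetic is sketched below. The function names are hypothetical, and the per-keyword rule (weight × relevance, capped at the configured maximum of 15) is an assumption about how the Score column is derived, not something the report states.

```python
def passing_score(total_weight, base=5.0, per_weight=0.8):
    """Pass mark from the report's configuration: base + per_weight * total weight."""
    return base + per_weight * total_weight


def paper_score(rows, max_per_keyword=15.0):
    """Sum per-keyword scores for (weight, relevance) rows.

    Assumption: each keyword contributes weight * relevance, capped at
    max_per_keyword. The report does not spell this rule out.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


threshold = passing_score(27.0)  # 27 keywords, each configured with weight 1.0
rows = [(0.0, 10.0), (0.0, 0.0), (0.0, 10.0)]  # weights as printed in the table
score = paper_score(rows)
print(round(threshold, 1), score, score >= threshold)
```

Under this reading, a printed weight of 0.0 zeroes every row regardless of relevance, which would be consistent with the 0.0 totals shown for these papers.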

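Scoring rationale: The paper's core topic is dialect bias in LLMs, so it directly concerns LLMs (10 points), and it tests Chain-of-Thought prompting as a mitigation strategy (10 points). It also explores multi-agent architectures (generate-critique-revise models), which is highly relevant to LLM Agents and Multi-agent Systems (10 points each). Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and AI for Science, are not mentioned in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

The paper studies dialect-based stereotype bias (Standard American English vs. African-American English) in LLM outputs. It finds that a multi-agent architecture mitigates the bias consistently across models, whereas the effect of Chain-of-Thought prompting varies by model.

Abstract (Chinese translation)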

文献中的许多研究表明，大语言模型（LLM）的输出存在歧视性行为，会基于输入文本所使用的方言触发基于刻板印象的推断。已有研究证明，当相同的输入分别以标准美国英语（Standard American English, SAE）和非裔美国人英语（African-American English, AAE）提供给大语言模型时，这种偏见表现得尤为明显。本文复现了现有关于大语言模型输出中方言敏感性刻板印象生成的分析，并研究了多种缓解策略的效果，包括提示工程（基于角色的提示和思维链提示）以及由生成-评判-修订模型组成的多智能体架构。我们定义了八个提示模板，以分析方言偏见可能呈现的不同方式，例如为SAE或AAE使用者建议的名字、职业和形容词。我们采用大语言模型即评判员的方法来评估结果中的偏见。我们的研究结果显示，在与SAE和AAE相关的输出之间，所有模板类别中都出现了带有刻板印象的差异，其中在形容词和职业归因方面观察到的效应最强。基线差异因模型而异，其中Claude Haiku模型观察到的SAE-AAE差异最大，而Phi-4 Mini模型中的差异最小。思维链提示被证明是缓解Claude Haiku模型偏见的有效策略，而使用多智能体架构则确保了在所有模型中实现一致的缓解效果。这些发现表明，对于关注交叉性的软件工程而言，公平性评估应包括针对具体模型的缓解策略验证，并在高影响力的大语言模型部署中采用工作流层面的控制措施（例如，涉及评判模型的智能体架构）。当前结果本质上是探索性的，范围有限，但可以通过增加数据集规模以及将该流程应用于不同语言或方言，来推动后续的扩展和复现研究。

Abstract

Many works in the literature show that LLM outputs exhibit discriminatory behaviour, triggering stereotype-based inferences based on the dialect in which the inputs are written. This bias has been shown to be particularly pronounced when the same inputs are provided to LLMs in Standard American English (SAE) and African-American English (AAE). In this paper, we replicate existing analyses of dialect-sensitive stereotype generation in LLM outputs and investigate the effects of mitigation strategies, including prompt engineering (role-based and Chain-Of-Thought prompting) and multi-agent architectures composed of generate-critique-revise models. We define eight prompt templates to analyse different ways in which dialect bias can manifest, such as suggested names, jobs, and adjectives for SAE or AAE speakers. We use an LLM-as-judge approach to evaluate the bias in the results. Our results show that stereotype-bearing differences emerge between SAE- and AAE-related outputs across all template categories, with the strongest effects observed in adjective and job attribution. Baseline disparities vary substantially by model, with the largest SAE-AAE differential observed in Claude Haiku and the smallest in Phi-4 Mini. Chain-Of-Thought prompting proved to be an effective mitigation strategy for Claude Haiku, whereas the use of a multi-agent architecture ensured consistent mitigation across all the models. These findings suggest that for intersectionality-informed software engineering, fairness evaluation should include model-specific validation of mitigation strategies, and workflow-level controls (e.g., agentic architectures involving critique models) in high-impact LLM deployments. The current results are exploratory in nature and limited in scope, but can lead to extensions and replications by increasing the dataset size and applying the procedure to different languages or dialects.

Keywords: LLM bias, dialect stereotypes, African-American English, multi-agent architectures, Chain-of-Thought prompting, fairness evaluation, generate-critique-revise, LLM-as-judge

96. ❌ Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer

Authors: Mohamed Youssef, Mayar Elfares, Anna-Maria Meer, Matteo Bortoletto, Andreas Bulling Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18719v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

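Scoring rationale: The paper proposes Ontology-Guided Diffusion (OGD) for zero-shot visual simulation-to-reality (sim2real) image translation, which belongs to computer vision and generative modeling. It is unrelated to most keywords because it does not involve large language models (LLMs) or related techniques (MoE, SFT, RLHF, RAG, etc.). Only three keywords are weakly related: 1. 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points): the paper uses a pretrained instruction-guided diffusion model but does not examine pre-training or domain adaptation techniques themselves. 2. 'Mechanistic Interpretability OR Explainable AI' (5 points): the paper emphasizes interpretability through the ontology, but focuses on visual traits rather than on explaining the mechanisms of AI models. 3. 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points): the paper is an application of AI in a scientific area (visual sim2real translation) but does not touch bioinformatics or cheminformatics. Other keywords, such as reasoning, agents, and compression, are not relevant.

!!! tip deepseek-chat TL;DR

The paper addresses the sim2real image translation gap caused by scarce labelled real-world data. It introduces an ontology-guided diffusion framework (OGD) that decomposes realism into interpretable traits encoded in a knowledge graph, and reports better image translation performance in the zero-shot setting.

Abstract (Chinese translation)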

弥合仿真到现实（sim2real）的差距仍然具有挑战性，因为带标签的真实世界数据十分稀缺。现有的基于扩散模型的方法依赖于非结构化的提示或统计对齐，未能捕捉使图像看起来真实的结构化因素。我们提出了本体引导扩散（Ontology-Guided Diffusion, OGD），这是一个神经符号化的零样本sim2real图像翻译框架，它将真实感表示为结构化知识。OGD将真实感分解为一个由可解释特征（例如光照和材质属性）构成的本体，并在知识图谱中编码这些特征之间的关系。OGD从一张合成图像出发，推断特征激活状态，并使用图神经网络生成一个全局嵌入表示。同时，一个符号规划器利用本体特征来计算一系列缩小真实感差距所需的、连贯的视觉编辑操作。该图嵌入表示通过交叉注意力机制来调节一个预训练的指令引导扩散模型，而规划出的编辑操作则被转换为一个结构化的指令提示。在多个基准测试中，我们基于图的嵌入表示比基线方法能更好地区分真实与合成图像，并且OGD在sim2real图像翻译任务上超越了最先进的扩散方法。总体而言，OGD表明，显式编码真实感结构能够实现可解释、数据高效且可泛化的零样本sim2real迁移。

Abstract

Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents realism as structured knowledge. OGD decomposes realism into an ontology of interpretable traits – such as lighting and material properties – and encodes their relationships in a knowledge graph. From a synthetic image, OGD infers trait activations and uses a graph neural network to produce a global embedding. In parallel, a symbolic planner uses the ontology traits to compute a consistent sequence of visual edits needed to narrow the realism gap. The graph embedding conditions a pretrained instruction-guided diffusion model via cross-attention, while the planned edits are converted into a structured instruction prompt. Across benchmarks, our graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translations. Overall, OGD shows that explicitly encoding realism structure enables interpretable, data-efficient, and generalisable zero-shot sim2real transfer.

Keywords: sim2real transfer, diffusion models, ontology-guided, zero-shot learning, image translation, knowledge graph, neuro-symbolic, visual realism

97. ❌ MemMA: Coordinating the Memory Cycle through Multi-Agent Reasoning and In-Situ Self-Evolution

Authors: Minhua Lin, Zhiwei Zhang, Hanqing Lu, Hui Liu, Xianfeng Tang, Qi He, Xiang Zhang, Suhang Wang Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18718v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

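Scoring rationale: The paper proposes the MemMA framework, which centers on multi-agent coordination (Multi-agent Systems) and memory augmentation for LLM agents (LLM Agents), and involves retrieval-augmented generation (RAG), self-evolution (Self-Correction/Self-Improvement), and reasoning (Chain of Thought/System 2 Thinking). It is highly relevant to core LLM techniques, long context, and agentic workflows, but unrelated to keywords such as MoE, quantization, and AI for Science.

!!! tip deepseek-chat TL;DR

The paper proposes MemMA, a multi-agent framework that coordinates the forward and backward paths of the memory cycle to address strategic blindness and sparse supervision in memory-augmented LLM agents. Experiments on LoCoMo show that it outperforms existing baselines across multiple LLM backbones and improves three different storage backends.

Abstract (Chinese translation)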

记忆增强型大语言模型智能体通过维护外部记忆库来支持长程交互，但现有系统大多将记忆的构建、检索与利用视为孤立的子模块。这导致了两重相互关联的挑战：在记忆周期的前向路径上存在策略性盲区，即记忆构建与检索由局部启发式规则驱动，缺乏显式的策略推理；而在后向路径上则存在稀疏且延迟的监督，下游的失败很少能直接转化为对记忆库的修复。为应对这些挑战，我们提出了MemMA，一个即插即用的多智能体框架，该框架在记忆周期的前向与后向路径上均进行协调。在前向路径上，一个元思考者（Meta-Thinker）生成结构化指导，在构建阶段引导记忆管理器（Memory Manager），并在迭代检索过程中指导查询推理器（Query Reasoner）。在后向路径上，MemMA引入了原位自演化的记忆构建机制，该机制能合成探测性问答对，验证当前记忆状态，并在记忆最终固化前将失败案例转化为修复动作。在LoCoMo基准上的大量实验表明，MemMA在多种大语言模型基座上均持续优于现有基线，并能以即插即用的方式改进三种不同的存储后端。我们的代码已公开于https://github.com/ventr1c/memma。

Abstract

Memory-augmented LLM agents maintain external memory banks to support long-horizon interaction, yet most existing systems treat construction, retrieval, and utilization as isolated subroutines. This creates two coupled challenges: strategic blindness on the forward path of the memory cycle, where construction and retrieval are driven by local heuristics rather than explicit strategic reasoning, and sparse, delayed supervision on the backward path, where downstream failures rarely translate into direct repairs of the memory bank. To address these challenges, we propose MemMA, a plug-and-play multi-agent framework that coordinates the memory cycle along both the forward and backward paths. On the forward path, a Meta-Thinker produces structured guidance that steers a Memory Manager during construction and directs a Query Reasoner during iterative retrieval. On the backward path, MemMA introduces in-situ self-evolving memory construction, which synthesizes probe QA pairs, verifies the current memory, and converts failures into repair actions before the memory is finalized. Extensive experiments on LoCoMo show that MemMA consistently outperforms existing baselines across multiple LLM backbones and improves three different storage backends in a plug-and-play manner. Our code is publicly available at https://github.com/ventr1c/memma.

Keywords: Memory-augmented LLM agents, Multi-agent framework, Self-evolving memory, Retrieval guidance, Meta-Thinker, Memory Manager, Query Reasoner, Plug-and-play

98. ❌ Accurate and Efficient Multi-Channel Time Series Forecasting via Sparse Attention Mechanism

Authors: Lei Gao, Hengda Bao, Jingfei Fang, Guangzheng Wu, Weihua Zhou, Yun Zhou Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18712v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

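Scoring rationale: The paper focuses on multi-channel time series forecasting and proposes a new architecture, Li-Net, which integrates a sparse Top-K Softmax attention mechanism and emphasizes computational efficiency (lower memory usage and faster inference). It is therefore moderately related to 'Mixture of Experts OR MoE OR Sparse Models' (5 points), since both involve sparse modeling techniques, and to 'Speculative Decoding OR Inference Acceleration' (5 points), since the paper explicitly reports faster inference times. However, its core content is not large language models (LLMs) or related techniques (pre-training, fine-tuning, alignment, agents, etc.), nor does it involve AI-for-science applications such as bioinformatics, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Li-Net, a new architecture for multi-channel time series forecasting. Using sparse attention and multimodal information fusion, it achieves competitive forecasting performance on multiple real-world benchmark datasets while markedly reducing computational cost.

Abstract (Chinese translation)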

多通道时间序列预测任务在金融、供应链管理和能源规划等诸多领域普遍存在。有效捕捉通道内与通道间的复杂动态依赖关系对于实现精准预测至关重要。然而，传统方法较少关注学习通道间的交互作用。本文提出线性-网络（Linear-Network, Li-Net），这是一种专为多通道时间序列预测设计的新型架构，能够捕捉通道间的线性和非线性依赖关系。Li-Net 动态压缩序列维度和通道维度的表征，通过一个可配置的非线性模块处理信息，随后重建预测结果。此外，Li-Net 在多尺度投影框架内集成了稀疏Top-K Softmax注意力机制以应对这些挑战。其核心创新在于能够无缝整合与融合多模态嵌入，引导稀疏注意力过程聚焦于信息最丰富的时间步和特征通道。通过在多个真实世界基准数据集上的实验结果表明，与最先进的基线方法相比，Li-Net 取得了具有竞争力的性能。更重要的是，Li-Net 在预测精度与计算负担之间实现了更优的平衡，展现出显著更低的内存占用和更快的推理速度。详细的消融研究和参数敏感性分析验证了我们所提出架构中每个关键组件的有效性。关键词：多元时间序列预测，稀疏注意力机制，多模态信息融合，非线性关系

Abstract

The task of multi-channel time series forecasting is ubiquitous in numerous fields such as finance, supply chain management, and energy planning. It is critical to effectively capture complex dynamic dependencies within and between channels for accurate predictions. However, traditional methods have paid little attention to learning the interactions among channels. This paper proposes Linear-Network (Li-Net), a novel architecture designed for multi-channel time series forecasting that captures the linear and non-linear dependencies among channels. Li-Net dynamically compresses representations across sequence and channel dimensions, processes the information through a configurable non-linear module, and subsequently reconstructs the forecasts. Moreover, Li-Net integrates a sparse Top-K Softmax attention mechanism within a multi-scale projection framework to address these challenges. A core innovation is its ability to seamlessly incorporate and fuse multi-modal embeddings, guiding the sparse attention process to focus on the most informative time steps and feature channels. Experimental results on multiple real-world benchmark datasets demonstrate that Li-Net achieves competitive performance compared to state-of-the-art baseline methods. Furthermore, Li-Net provides a superior balance between prediction accuracy and computational burden, exhibiting significantly lower memory usage and faster inference times. Detailed ablation studies and parameter sensitivity analyses validate the effectiveness of each key component in our proposed architecture. Keywords: Multivariate Time Series Forecasting, Sparse Attention Mechanism, Multimodal Information Fusion, Non-linear relationship

Keywords: Multivariate Time Series Forecasting, Sparse Attention Mechanism, Multi-channel Time Series, Li-Net, Computational Efficiency, Multimodal Information Fusion, Non-linear Dependencies
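
The sparse Top-K Softmax attention named in the Li-Net abstract can be illustrated with a generic single-query version: only the k highest-scoring positions receive non-zero weight. This is a minimal sketch of the general technique, not Li-Net's implementation. The function names, the scaled dot-product scoring, and the plain-list data layout are assumptions.

```python
import math


def topk_softmax(scores, k):
    """Softmax over the k largest scores; every other position gets weight 0."""
    k = max(1, min(k, len(scores)))
    # Indices of the k largest scores (ties keep their original order).
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in keep)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in keep}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(scores))]


def sparse_attention(query, keys, values, k):
    """Single-query scaled dot-product attention restricted to the top-k keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    weights = topk_softmax(scores, k)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Hypothetical example: four time steps, attention restricted to the best two.
print(topk_softmax([1.0, 3.0, 2.0, 0.0], k=2))
```

Because masked positions contribute exactly zero, an implementation can skip their value vectors entirely, which is one way sparsity of this kind can reduce memory use and inference time.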

99. ❌ HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning

Authors: Zhicong Lu, Zichuan Lin, Wei Jia, Changyuan Tian, Deheng Ye, Peiguang Li, Li Jin, Nayu Liu, Guangluan Xu, Wei Feng Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18683v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

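Scoring rationale: The paper's core topic is reinforcement learning for LLMs on complex multi-turn decision-making tasks. It directly involves LLMs, LLM Agents, and RLHF/reinforcement learning techniques, so these three keywords are rated highly relevant (10 points each). Other keywords, such as MoE, SLMs, Scaling Laws, PEFT, RAG, inference acceleration, and quantization, are either not mentioned in the abstract or unrelated to the paper's topic, and score 0.

!!! tip deepseek-chat TL;DR

The paper targets sparse rewards and unreliable credit assignment for large language models on complex long-horizon multi-turn decision-making tasks. It proposes using hindsight information to modulate segmental process rewards, and reports experiments on three public benchmarks that support the method.

Abstract (Chinese translation)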

尽管大语言模型在多个领域表现出色，其在复杂长程自主决策任务中的性能仍存在局限。现有方法大多集中于设计有效的奖励模型，通过多轮强化学习提升性能。然而，这些方法常面临稀疏结果奖励的延迟传播问题，以及因可能过于细碎且缺乏聚焦的轮次级过程奖励而导致的不可靠信用分配。本文提出利用后见信息调节分段过程奖励的方法，通过将奖励与子目标紧密对齐，并突出关键任务片段以增强信用分配的可靠性。具体而言，我们提出一种分段级过程奖励模型，为任务中的每个子目标分配奖励，避免对轮次进行过度细粒度的分配。为强调轨迹中的重要片段，我们设计了一个后见模型，以反映在已知轨迹结果后执行特定动作的偏好。基于此特性，我们通过计算后见模型与策略模型之间的序列似然比来衡量动作重要性。该比值随后被用于聚合片段重要性分数，进而调节分段过程奖励，从而提升信用分配的可靠性。在三个公开基准上的大量实验结果验证了本方法的有效性。

Abstract

While large language models excel in diverse domains, their performance on complex long-horizon agentic decision-making tasks remains limited. Most existing methods concentrate on designing effective reward models (RMs) to advance performance via multi-turn reinforcement learning. However, they suffer from delayed propagation in sparse outcome rewards and unreliable credit assignment with potentially overly fine-grained and unfocused turn-level process rewards. In this paper, we propose HISR, which exploits Hindsight Information to modulate Segmental process Rewards, closely aligns rewards with sub-goals, and underscores significant segments to enhance the reliability of credit assignment. Specifically, a segment-level process RM is presented to assign rewards for each sub-goal in the task, avoiding excessively granular allocation to turns. To emphasize significant segments in the trajectory, a hindsight model is devised to reflect the preference of performing a certain action after knowing the trajectory outcome. With this characteristic, we design the ratios of sequence likelihoods between hindsight and policy model to measure action importance. The ratios are subsequently employed to aggregate segment importance scores, which in turn modulate segmental process rewards, enhancing credit assignment reliability. Extensive experimental results on three public benchmarks demonstrate the validity of our method.

Keywords: Large Language Models, Multi-turn Agentic Reinforcement Learning, Reward Models, Credit Assignment, Hindsight Information, Segmental Process Rewards, LLM Agents, Reinforcement Learning
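
The HISR abstract describes three steps: likelihood ratios between a hindsight model and the policy model measure action importance, the ratios are aggregated into segment importance scores, and those scores modulate segmental process rewards. The abstract gives no formulas, so the sketch below is only one plausible reading. The function names, the mean aggregation per segment, and the sum-to-one normalisation are all assumptions rather than the authors' method.

```python
import math


def action_importance(hindsight_logps, policy_logps):
    """Per-action likelihood ratio p_hindsight / p_policy, from log-likelihoods."""
    return [math.exp(h - p) for h, p in zip(hindsight_logps, policy_logps)]


def modulate_segment_rewards(ratios, segments, segment_rewards):
    """Scale each segment's process reward by its relative importance.

    segments: (start, end) index pairs into `ratios`, end exclusive.
    Assumed aggregation: mean ratio per segment, normalised to sum to 1.
    """
    means = [sum(ratios[s:e]) / (e - s) for s, e in segments]
    total = sum(means)
    importance = [m / total for m in means]
    return [w * r for w, r in zip(importance, segment_rewards)]


# Hypothetical trajectory: the hindsight model strongly prefers the last action.
hind = [math.log(0.5), math.log(0.5), math.log(0.9)]
pol = [math.log(0.5), math.log(0.5), math.log(0.3)]
ratios = action_importance(hind, pol)
print(modulate_segment_rewards(ratios, [(0, 2), (2, 3)], [1.0, 1.0]))
```

In this toy case the second segment receives most of the reward mass because the hindsight model, which knows the outcome, rates its action far more likely than the policy did.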

100. ❌ Cognitive Amplification vs Cognitive Delegation in Human-AI Systems: A Metric Framework

Authors: Eduardo Di Santi Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18677v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

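Scoring rationale: The paper proposes a conceptual framework and metrics for evaluating cognitive amplification versus cognitive delegation in human-AI collaboration, at the intersection of human-computer interaction, cognitive science, and AI ethics. It discusses how general AI systems (not specifically large models) interact with human decision-making and does not involve any specific large-model technique (LLM architecture, training methods, inference optimization, application domains, etc.). All keywords focus on the technical details, training methods, optimization techniques, or specific application domains of large models, and are unrelated to this paper's high-level study of human-AI interaction.

!!! tip deepseek-chat TL;DR

The paper proposes a metric framework that distinguishes cognitive amplification (AI improves hybrid human-AI performance while preserving human expertise) from cognitive delegation (reasoning is progressively outsourced to AI), and defines four metrics for assessing both the synergy of human-AI collaboration and the sustainability of human cognition.

Abstract (Chinese translation)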

人工智能日益融入人类决策过程，既可增强人类推理能力，也可能诱发过度的认知依赖。本文提出一个概念性与数学框架，用以区分认知放大与认知委托两种模式：前者指人工智能在保持人类专业能力的同时提升人机混合系统的表现，后者则意味着推理过程逐渐外包给人工智能系统。

为刻画这两种机制，我们定义了一套操作性指标：认知放大指数（CAI*）、依赖比率（D）、人类依赖指数（HRI）以及人类认知漂移率（HCDR）。这些指标共同构成一个低维度量空间，既能评估人机系统是否实现真正的协同增效，也能判断这种协同表现对人类参与者是否具有长期的认知可持续性。

该框架揭示了人机系统设计中的核心矛盾：短期混合效能的最大化未必能维持人类长期的认知能力。因此我们主张，人机系统的设计应遵循认知可持续性约束，确保混合性能的提升不以人类专业能力的退化为代价。

Abstract

Artificial intelligence is increasingly embedded in human decision-making, where it can either enhance human reasoning or induce excessive cognitive dependence. This paper introduces a conceptual and mathematical framework for distinguishing cognitive amplification, in which AI improves hybrid human-AI performance while preserving human expertise, from cognitive delegation, in which reasoning is progressively outsourced to AI systems. To characterize these regimes, we define a set of operational metrics: the Cognitive Amplification Index (CAI*), the Dependency Ratio (D), the Human Reliance Index (HRI), and the Human Cognitive Drift Rate (HCDR). Together, these quantities provide a low-dimensional metric space for evaluating not only whether human-AI systems achieve genuine synergistic performance, but also whether such performance is cognitively sustainable for the human component over time. The framework highlights a central design tension in human-AI systems: maximizing short-term hybrid capability does not necessarily preserve long-term human cognitive competence. We therefore argue that human-AI systems should be designed under a cognitive sustainability constraint, such that gains in hybrid performance do not come at the cost of degradation in human expertise.

Keywords: human-AI systems, cognitive amplification, cognitive delegation, decision-making, cognitive sustainability, hybrid performance, human expertise, metric framework

101. ❌ Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

Authors: Haokun Zhao, Wanshi Xu, Haidong Yuan, Songjun Cao, Long Ma, Yanghua Xiao Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18662v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

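Scoring rationale: The paper studies multimodal large language models (MLLMs) on geometric reasoning tasks, proposes a Visual-Text Interleaved Chain-of-Thought framework, and develops a reinforcement learning optimization method, A2PO. It is therefore highly relevant to 'Large Language Models' (10 points) and 'Chain of Thought' (10 points), and related to 'System 2 Thinking' (8 points, as it involves deliberate reasoning strategies). As an application of AI to a scientific domain (geometry), it is moderately related to 'AI for Science' (5 points). Other keywords, such as MoE, quantization, and RAG, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the inability of multimodal large language models to dynamically construct visual aids during geometric reasoning. It proposes a Visual-Text Interleaved Chain-of-Thought framework, the GeoAux-Bench benchmark, and the reinforcement learning method A2PO, which yields a 3.51% gain over strong baselines.

Abstract (Chinese translation)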

几何推理本质上需要“借助构造进行思考”——即通过动态操控视觉辅助工具来弥合问题条件与解决方案之间的鸿沟。然而，现有的多模态大语言模型（MLLMs）大多局限于对静态图表进行被动推理，缺乏关于何时以及如何构造有效视觉辅助的策略性知识。为解决这一问题，我们提出了视觉-文本交错思维链框架。我们首先引入了GeoAux-Bench，这是首个包含4,334个几何问题的基准测试集，其将文本构造步骤与真实的视觉更新对齐。我们的初步研究揭示了两个关键发现：（1）交错式视觉-文本辅助优于单一模态的辅助方式，后者无法无损地捕捉几何协同效应；（2）有效的构造行为可作为熵减器，其与推理困惑度的降低存在强相关性。基于这些发现，我们提出了动作适用性策略优化（A2PO），这是一种用于掌握策略性构造的强化学习范式。A2PO采用自适应奖励塑形技术，通过反事实采样来区分必要构造与冗余构造，从而调控视觉辅助的时机与质量。实验表明，我们的方法能使MLLMs利用选择性辅助构造，相比强基线模型获得3.51%的性能提升。代码与数据已在GitHub上开源。

Abstract

Geometric reasoning inherently requires “thinking with constructions” – the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.

Keywords: Multimodal Large Language Models, Geometric Reasoning, Chain-of-Thought, Visual-Text Interleaved, Reinforcement Learning, Benchmark, Policy Optimization, A2PO

102. ❌ MANAR: Memory-augmented Attention with Navigational Abstract Conceptual Representation

Authors: Zuher Jahshan, Ben Ben Ishay, Leonid Yavits Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18676v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

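Scoring rationale: The paper proposes the MANAR architecture, a generalization of standard multi-head attention (MHA) that improves the attention mechanism by instantiating Global Workspace Theory (GWT). Its core innovation is a trainable memory of abstract concepts and an Abstract Conceptual Representation (ACR), which together yield linear-time attention. The paper is related to 'Large Language Models' (8) because it modifies the Transformer architecture; highly related to 'KV Cache Compression OR Linear Attention OR FlashAttention' (8) because it removes the quadratic complexity of standard attention and scales linearly; related to 'Speculative Decoding OR Inference Acceleration' (8) because the architectural change enables efficient inference; somewhat related to 'Pre-training' and 'Post-training' (5 each) because it mentions knowledge transfer from pretrained Transformers; and somewhat related to 'PEFT' (5) because it is a compatible re-parameterization of MHA. Other keywords such as MoE, SLMs, RAG, and CoT are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper proposes the MANAR architecture, which improves standard multi-head attention by instantiating Global Workspace Theory, replaces quadratic complexity with linear-time scaling, and matches or exceeds baselines on language, vision, and speech tasks.

Abstract (Chinese translation)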

MANAR（基于导航式抽象概念表征的记忆增强注意力机制）的语境化层通过实例化全局工作空间理论（GWT）原理，对标准多头注意力机制（MHA）进行了泛化。尽管MHA允许无约束的全对全通信，但它缺乏认知意识模型中假设的功能瓶颈与全局整合机制。MANAR通过引入一个可训练的抽象概念记忆库及一个抽象概念表征（ACR）来实现中心工作空间，从而解决这一问题。该架构遵循直接对应GWT机制的两阶段逻辑：（i）整合阶段：检索出的记忆概念基于输入刺激汇聚形成集体“心智图像”（即ACR）；（ii）广播阶段：该全局状态导航并影响各局部标记的语境化过程。我们证明，线性时间的高效扩展是实例化GWT功能瓶颈的结构性副产品——通过恒定尺寸的ACR路由全局信息，解决了标准注意力固有的二次复杂度问题。MANAR是对MHA的兼容性重参数化，其投影矩阵保持相同的语义角色，从而可通过权重复制实现从预训练Transformer的知识迁移，克服了结构不兼容的线性时间替代方案的采用障碍。MANAR支持非凸语境化，能够合成可证明位于输入标记凸包之外的表征——这从数学层面反映了GWT所描述的创造性综合过程。实证评估表明，MANAR在语言（GLUE得分85.1）、视觉（ImageNet-1K准确率83.9%）及语音（LibriSpeech词错误率2.7%）任务上均达到或超越强基线模型，确立了其作为二次复杂度注意力机制的高效且富有表现力的替代方案。

Abstract

MANAR (Memory-augmented Attention with Navigational Abstract Conceptual Representation), contextualization layer generalizes standard multi-head attention (MHA) by instantiating the principles of Global Workspace Theory (GWT). While MHA enables unconstrained all-to-all communication, it lacks the functional bottleneck and global integration mechanisms hypothesized in cognitive models of consciousness. MANAR addresses this by implementing a central workspace through a trainable memory of abstract concepts and an Abstract Conceptual Representation (ACR). The architecture follows a two-stage logic that maps directly to GWT mechanics: (i) an integration phase, where retrieved memory concepts converge to form a collective “mental image” (the ACR) based on input stimuli; and (ii) a broadcasting phase, where this global state navigates and informs the contextualization of individual local tokens. We demonstrate that efficient linear-time scaling is a fundamental architectural byproduct of instantiating GWT functional bottleneck, as routing global information through a constant-sized ACR resolves the quadratic complexity inherent in standard attention. MANAR is a compatible re-parameterization of MHA with identical semantic roles for its projections, enabling knowledge transfer from pretrained transformers via weight-copy and thus overcoming the adoption barriers of structurally incompatible linear-time alternatives. MANAR enables non-convex contextualization, synthesizing representations that provably lie outside the convex hull of input tokens - a mathematical reflection of the creative synthesis described in GWT. Empirical evaluations confirm that MANAR matches or exceeds strong baselines across language (GLUE score of 85.1), vision (83.9% ImageNet-1K), and speech (2.7% WER on LibriSpeech), positioning it as an efficient and expressive alternative to quadratic attention.
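
The abstract attributes linear-time scaling to routing global information through a constant-sized representation. The sketch below is not MANAR's actual layer; it is a generic two-stage bottleneck attention, written under the assumption that a fixed set of `m` memory slots first reads from all `n` tokens and that each token then reads back from the `m` summaries. The names `attend`, `bottleneck_attention`, `tokens`, and `memory` are hypothetical. It only illustrates why the cost grows as `n * m` rather than `n * n`. Because it uses plain softmax averaging, its outputs stay inside the convex hull of the inputs, so it does not reproduce the non-convex contextualization the paper claims.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(queries, keys, values):
    """Scaled dot-product attention: len(queries) * len(keys) score evaluations."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs

def bottleneck_attention(tokens, memory):
    # Stage 1 (integration, assumed form): m memory slots read from n tokens.
    workspace = attend(memory, tokens, tokens)
    # Stage 2 (broadcast, assumed form): n tokens read from the m summaries.
    return attend(tokens, workspace, workspace)

# Toy inputs: 16 tokens and 3 memory slots, each of dimension 4.
tokens = [[math.sin(i + j) for j in range(4)] for i in range(16)]
memory = [[math.cos(2 * i + j) for j in range(4)] for i in range(3)]
contextualized = bottleneck_attention(tokens, memory)
```

With `m` held constant, doubling the number of tokens doubles the number of score evaluations in both stages, whereas full self-attention would quadruple it.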

Keywords: Memory-augmented Attention, Global Workspace Theory, Abstract Conceptual Representation, Linear-time scaling, Quadratic complexity, Transformer architecture, Multi-head attention, Contextualization layer

103. ❌ Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Authors: Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18656v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

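Scoring rationale: The paper's core topic is balancing the training of reasoning and answer segments in supervised fine-tuning (SFT). It proposes SCALe to improve Chain of Thought training, so it is highly related to 'SFT' and 'Chain of Thought' (10 each). It involves multimodal reasoning in vision-language models, an application of large models, so it is related to 'Large Language Models' (8); it also involves in-depth reasoning, so it is related to 'System 2 Thinking' (8). Other keywords such as MoE, quantization, and RAG are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the imbalance between reasoning and answer segments in Chain of Thought training for vision-language models. It proposes SCALe, which balances supervision over the two segments during supervised fine-tuning through dynamic weight scheduling. Experiments show that the method improves accuracy, reduces training time, and can serve as a foundation for reinforcement learning.

Abstract (Chinese translation)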

视觉语言模型（VLMs）中的多模态推理通常依赖于两阶段流程：监督微调（SFT）和强化学习（RL）。在标准SFT中，所有标记对损失函数的贡献均等，尽管推理数据本质上存在标记不平衡问题。冗长的推理轨迹会掩盖简短但任务关键的答案片段，导致推理过程冗长且答案不准确。我们提出SCALe（计划课程自适应损失），该方法通过动态的、与长度无关的加权机制，明确区分对推理段和答案段的监督。与普通SFT过度重视推理段不同，SCALe-SFT通过余弦调度策略在训练过程中逐步将焦点从推理段转移至答案段，从而鼓励简洁且有依据的推理。我们在多种基准测试和架构上评估SCALe。结果表明，SCALe持续提升普通SFT的准确性，其性能与完整的两阶段SFT + GRPO流程相当，而所需训练时间仅为后者的约七分之一，成为一种轻量级且高效的替代方案。当与GRPO结合时，SCALe实现了最佳整体性能，突显了其既可作为独立方法，也可作为强化学习优化的坚实基础的价值。

Abstract

Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long reasoning traces overshadow short but task-critical answer segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the reasoning segment, SCALe-SFT gradually shifts the focus from the reasoning segment to the answer segment throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
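
The abstract describes a length-independent weighting of reasoning and answer segments whose focus moves toward the answer on a cosine schedule. A minimal sketch of one way such a loss could be assembled follows. The exact schedule, its endpoints (0 and 1 here), and the helper names `answer_weight` and `scheduled_loss` are assumptions made for illustration, not the paper's formulation. Averaging each segment separately is what makes the weighting independent of segment length in this sketch.

```python
import math

def answer_weight(step, total_steps):
    """Cosine ramp from 0 to 1 over training (assumed schedule shape)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * progress))

def scheduled_loss(reasoning_losses, answer_losses, step, total_steps):
    """Blend per-segment mean losses so a long trace cannot swamp a short answer."""
    w = answer_weight(step, total_steps)
    mean_reasoning = sum(reasoning_losses) / len(reasoning_losses)
    mean_answer = sum(answer_losses) / len(answer_losses)
    return (1.0 - w) * mean_reasoning + w * mean_answer

# Hypothetical per-token losses: 200 reasoning tokens and 5 answer tokens.
reasoning = [1.0] * 200
answer = [3.0] * 5
early = scheduled_loss(reasoning, answer, step=0, total_steps=100)
late = scheduled_loss(reasoning, answer, step=100, total_steps=100)
```

For comparison, a plain token-level average over these 205 tokens would give the 5 answer tokens under 3% of the total weight at every step, which is the imbalance the abstract describes.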

Keywords: Vision Language Models, Chain of Thought, Supervised Fine-tuning, Reasoning, Multimodal reasoning, SCALe, Training efficiency, Answer accuracy

104. ❌ Multiscale Switch for Semi-Supervised and Contrastive Learning in Medical Ultrasound Image Segmentation

Authors: Jingguo Qu, Xinyang Han, Yao Pu, Man-Lik Chui, Simon Takadiyi Gunda, Ziman Chen, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18655v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

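Scoring rationale: The paper focuses on a semi-supervised learning framework for medical ultrasound image segmentation. Its core innovations are the multiscale switch and frequency-domain contrastive learning strategies, which place it in computer vision and medical image analysis. It does not involve large language models (LLMs), innovations in the technical principles of deep learning, or applications of large models to other domains. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since medical image analysis can be viewed as AI applied to (biomedical) science; the paper does not explicitly mention bioinformatics or cheminformatics, so it receives 5 (somewhat related). All other keywords concern large-model techniques, training methods, inference optimization, agent systems, and similar topics, and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Switch, a semi-supervised learning framework that uses multiscale switching and frequency-domain contrastive learning to address limited labeled data in medical ultrasound image segmentation. It outperforms existing methods on multiple datasets and, at a 5% labeling ratio, even exceeds fully supervised baselines.

Abstract (Chinese translation)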

医学超声图像分割因标注数据有限及斑点噪声、低对比度边界等固有成像伪影而面临显著挑战。半监督学习方法虽已出现以应对数据稀缺问题，但现有方法对未标注数据的利用欠佳，且缺乏鲁棒的特征表示机制。本文提出Switch——一种新型半监督学习框架，其具备两项关键创新：(1) 多尺度切换策略，采用分层图像块混合以实现均匀的空间覆盖；(2) 结合对比学习的频域切换方法，在傅里叶空间进行幅度切换以获取鲁棒特征表示。本框架将上述组件集成于师生架构中，以有效利用标注与未标注数据。在六个多样化超声数据集（淋巴结、乳腺病灶、甲状腺结节及前列腺）上的综合评估表明，本方法始终优于现有先进技术。在5%标注比例下，Switch取得显著提升：在LN-INT数据集上达到80.04% Dice系数，DDTI数据集上达85.52%，前列腺数据集上达83.48%，其半监督性能甚至超越全监督基线模型。该方法在保持参数高效性（180万参数）的同时实现了优越性能，验证了其在资源受限的医学影像应用中的有效性。源代码已公开于https://github.com/jinggqu/Switch。

Abstract

Medical ultrasound image segmentation faces significant challenges due to limited labeled data and characteristic imaging artifacts including speckle noise and low-contrast boundaries. While semi-supervised learning (SSL) approaches have emerged to address data scarcity, existing methods suffer from suboptimal unlabeled data utilization and lack robust feature representation mechanisms. In this paper, we propose Switch, a novel SSL framework with two key innovations: (1) Multiscale Switch (MSS) strategy that employs hierarchical patch mixing to achieve uniform spatial coverage; (2) Frequency Domain Switch (FDS) with contrastive learning that performs amplitude switching in Fourier space for robust feature representations. Our framework integrates these components within a teacher-student architecture to effectively leverage both labeled and unlabeled data. Comprehensive evaluation across six diverse ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, and prostate) demonstrates consistent superiority over state-of-the-art methods. At 5% labeling ratio, Switch achieves remarkable improvements: 80.04% Dice on LN-INT, 85.52% Dice on DDTI, and 83.48% Dice on Prostate datasets, with our semi-supervised approach even exceeding fully supervised baselines. The method maintains parameter efficiency (1.8M parameters) while delivering superior performance, validating its effectiveness for resource-constrained medical imaging applications. The source code is publicly available at https://github.com/jinggqu/Switch

Keywords: semi-supervised learning, medical ultrasound image segmentation, contrastive learning, multiscale switch, frequency domain, teacher-student architecture, parameter efficiency, limited labeled data

105. ❌ Beyond TVLA: Anderson-Darling Leakage Assessment for Neural Network Side-Channel Leakage Detection

Authors: Ján Mikulec, Jakub Breier, Xiaolu Hou Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18647v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

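Scoring rationale: The paper studies side-channel leakage detection methods (ADLA vs. TVLA) applied to the security evaluation of hardware neural network implementations, which belongs to hardware security and cryptographic engineering. All scoring keywords concern large-model or deep learning principles, training methods, inference optimization, alignment, applications, and similar topics. The paper covers none of these, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes ADLA, a side-channel leakage assessment method based on the Anderson-Darling test, for detecting side-channel leakage in hardware neural network implementations. Experiments show that on protected implementations it offers higher detection sensitivity than traditional TVLA when few traces are available.

Abstract (Chinese translation)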

基于韦尔奇$t$检验的测试向量泄漏评估（TVLA）已成为检测侧信道泄漏的标准工具。然而，当其泄漏主要通过高阶分布差异显现时，该方法基于均值的特性可能限制其检测灵敏度。如我们的实验所示，这一特性在评估神经网络实现时尤为关键。本文提出安德森-达林泄漏评估（ADLA），这是一种将双样本安德森-达林检验应用于泄漏检测的框架。与TVLA不同，ADLA检验完整累积分布函数的等价性，而非依赖于纯粹的均值偏移模型。

我们在基于MNIST数据集训练、并在ChipWhisperer-Husky评估平台上实现的多层感知机（MLP）上评估了ADLA。我们考察了采用乱序与随机抖动防护措施的受保护实现。结果表明，在受保护实现中，相较于TVLA，ADLA在较少迹线数量的情况下能提供更高的泄漏检测灵敏度。

Abstract

Test Vector Leakage Assessment (TVLA) based on Welch’s $t$-test has become a standard tool for detecting side-channel leakage. However, its mean-based nature can limit sensitivity when leakage manifests primarily through higher-order distributional differences. As our experiments show, this property becomes especially crucial when it comes to evaluating neural network implementations. In this work, we propose Anderson–Darling Leakage Assessment (ADLA), a leakage detection framework that applies the two-sample Anderson–Darling test for leakage detection. Unlike TVLA, ADLA tests equality of the full cumulative distribution functions and does not rely on a purely mean-shift model. We evaluate ADLA on a multilayer perceptron (MLP) trained on MNIST and implemented on a ChipWhisperer-Husky evaluation platform. We consider protected implementations employing shuffling and random jitter countermeasures. Our results show that ADLA can provide improved leakage-detection sensitivity in protected implementations for a low number of traces compared to TVLA.
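
The contrast between a mean-based test and a full-distribution test can be illustrated with synthetic data. The sketch below is not the authors' ADLA pipeline: it computes Welch's t statistic and a textbook rank form of the two-sample Anderson-Darling statistic (assuming no tied values) on two made-up samples that share a mean but differ in spread. The sample construction and the names `welch_t` and `anderson_darling_2samp` are assumptions for illustration, and no significance thresholds are applied.

```python
import math

def welch_t(a, b):
    """Welch's t statistic; sensitive only to a difference in means."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)

def anderson_darling_2samp(a, b):
    """Two-sample Anderson-Darling statistic in rank form (assumes no ties)."""
    m, n = len(a), len(b)
    total = m + n
    pooled = sorted([(x, 1) for x in a] + [(x, 0) for x in b])
    stat, seen_a = 0.0, 0
    # i runs over the first total-1 pooled order statistics.
    for i, (_, from_a) in enumerate(pooled[:-1], start=1):
        seen_a += from_a
        stat += (seen_a * total - m * i) ** 2 / (i * (total - i))
    return stat / (m * n)

# Synthetic "traces": same mean (zero), different spread.
grid = [-1 + (2 * i + 1) / 50 for i in range(50)]
narrow = grid
wide = [3.1 * x for x in grid]

# Control pair drawn from one distribution: interleaved halves of a finer grid.
fine = [-1 + (2 * i + 1) / 100 for i in range(100)]
same_a, same_b = fine[0::2], fine[1::2]

t_stat = welch_t(narrow, wide)
ad_diff = anderson_darling_2samp(narrow, wide)
ad_same = anderson_darling_2samp(same_a, same_b)
```

On this toy data the t statistic is essentially zero because the means coincide, while the Anderson-Darling statistic for the narrow/wide pair is far larger than for the control pair, which is the kind of higher-order difference the abstract says a mean-based test can miss.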

Keywords: side-channel leakage, neural network, Anderson-Darling test, TVLA, leakage detection, hardware security, countermeasures, ChipWhisperer

106. ❌ Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

Authors: Pius Horn, Janis Keuper Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18652v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

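Scoring rationale: The paper's core contribution is using an LLM as an evaluator (LLM-as-a-judge) to assess the semantic accuracy of PDF table extraction, a novel application of LLMs to scientific data processing. It is therefore highly related to 'Large Language Models' (10) and related to 'AI for Science' (8), since it involves scientific data mining and knowledge base construction. The other keywords mainly concern large-model principles, training methods, and inference optimization; the paper only applies off-the-shelf LLMs for evaluation and does not touch these technical details, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a benchmarking framework for PDF table extraction based on LLM semantic evaluation, shows through a human validation study that LLM-based evaluation tracks human judgment more closely than traditional rule-based metrics, and evaluates the performance differences among 21 PDF parsers.

Abstract (Chinese translation)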

从PDF中可靠地提取表格对于大规模科学数据挖掘和知识库构建至关重要，然而现有的评估方法依赖于基于规则的指标，这些指标无法捕捉表格内容的语义等价性。我们提出一个基于合成生成PDF的基准测试框架，该框架使用精确的LaTeX标注真值，并采用源自arXiv的表格以确保真实的复杂性和多样性。作为核心方法贡献，我们应用大语言模型即评判员（LLM-as-a-judge）进行语义表格评估，并将其集成到一个能兼容解析器输出不一致性的匹配流程中。通过一项包含对提取表格对超过1,500次质量判断的人工验证研究，我们表明基于大语言模型的评估与人类判断的相关性（皮尔逊相关系数r=0.93）显著高于基于树编辑距离相似度（TEDS，r=0.68）和网格表格相似度（GriTS，r=0.70）的评估。通过对21种当代PDF解析器在包含451个表格的100份合成文档上进行评估，我们揭示了显著的性能差异。我们的研究结果为选择表格数据提取的解析器提供了实用指导，并为这一关键任务建立了可复现、可扩展的评估方法。代码与数据：https://github.com/phorn1/pdf-parse-bench 指标研究与人工评估：https://github.com/phorn1/table-metric-study

Abstract

Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf-parse-bench Metric study and human evaluation: https://github.com/phorn1/table-metric-study
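
The abstract compares metrics by their Pearson correlation with human judgment. As a reminder of what that number measures, here is a small sketch of the standard Pearson r computation. The score lists are invented placeholders, not data from the paper, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Placeholder ratings for six extracted tables (illustrative only).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0]
metric_scores = [1.5, 1.8, 3.2, 3.9, 4.6, 5.0]
r = pearson_r(human_scores, metric_scores)
```

A value near 1 means the metric ranks and spaces the tables much as the human raters do; the paper's reported values (0.93 versus 0.68 and 0.70) are on this scale.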

Keywords: PDF table extraction, LLM-as-a-judge, semantic evaluation, benchmarking framework, scientific data mining, parser evaluation, human validation, arXiv tables

107. ❌ An Onto-Relational-Sophic Framework for Governing Synthetic Minds

Authors: Huansheng Ning, Jianguo Ding Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18633v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

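Scoring rationale: The paper proposes a philosophical and governance framework (ORS) for managing increasingly capable synthetic minds (including foundation models and AI agents) rather than studying large-model techniques directly. Most technical keywords (MoE, quantization, inference acceleration, and so on) are therefore entirely unrelated (0). The paper is related to 'Alignment' (8) because it discusses moving from technical alignment toward comprehensive philosophical foundations; to 'LLM Agents' (8) because it explicitly discusses autonomous research agents and agentic AI ecosystems; to 'Multi-agent Systems' (5) because it touches on agent coordination; to 'AI for Science' (5) because it mentions autonomous research agents and AI-mediated healthcare; and to 'Large Language Models' (5) because foundation models are part of the background for its discussion.

!!! tip deepseek-chat TL;DR

The paper argues that current AI governance frameworks cannot cope with increasingly capable synthetic minds such as foundation models and AI agents. It proposes the Onto-Relational-Sophic (ORS) framework, grounded in Cyberism philosophy, which defines the mode of being of synthetic minds, a spectrum of digital personhood, and a wisdom-oriented value system, and thereby offers a comprehensive philosophical foundation and adaptive governance recommendations.

Abstract (Chinese translation)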

人工智能的快速演进——从任务专用系统发展为在推理、创造性综合与社会互动中展现广泛灵活能力的基础模型——已超越了为管理其而设计的概念与治理框架。当前以工具为中心世界观为基础的监管范式，虽能处理算法偏见与透明度问题，却未能回答关于日益强大的合成心智的本质、社会应如何与之相处、以及应指导其发展的规范性原则等根本性问题。本文基于赛博主义哲学，提出本体-关系-智慧（Onto-Relational-Sophic, ORS）框架，通过三大支柱为这些挑战提供整合性解答：（1）赛博-物理-社会-思维（CPST）本体论，将合成心智的存在方式定义为不可简化的多维存在，而非纯粹计算实体；（2）数字人格的梯度谱系，提供超越二元化“人格或工具”分类的实用关系分类法；（3）赛博智慧（Cybersophy），一种融合美德伦理学、后果论与关系伦理的智慧导向价值论，用以指导治理实践。我们将该框架应用于自主研究智能体、人工智能辅助医疗以及能动的AI生态系统等新兴场景，证明其能够生成相称且自适应的治理建议。ORS框架为已存在于我们中间的合成心智，指明了一条从狭隘的技术对齐走向全面哲学基础的建设路径。

Abstract

The rapid evolution of artificial intelligence, from task-specific systems to foundation models exhibiting broad, flexible competence across reasoning, creative synthesis, and social interaction, has outpaced the conceptual and governance frameworks designed to manage it. Current regulatory paradigms, anchored in a tool-centric worldview, address algorithmic bias and transparency but leave unanswered foundational questions about what increasingly capable synthetic minds are, how societies should relate to them, and the normative principles that should guide their development. Here we introduce the Onto-Relational-Sophic (ORS) framework, grounded in Cyberism philosophy, which offers integrated answers to these challenges through three pillars: (1) a Cyber-Physical-Social-Thinking (CPST) ontology that defines the mode of being for synthetic minds as irreducibly multi-dimensional rather than purely computational; (2) a graded spectrum of digital personhood providing a pragmatic relational taxonomy beyond binary person-or-tool classifications; and (3) Cybersophy, a wisdom-oriented axiology synthesizing virtue ethics, consequentialism, and relational approaches to guide governance. We apply the framework to emergent scenarios including autonomous research agents, AI-mediated healthcare, and agentic AI ecosystems, demonstrating its capacity to generate proportionate, adaptive governance recommendations. The ORS framework charts a path from narrow technical alignment toward comprehensive philosophical foundations for the synthetic minds already among us.

Keywords: synthetic minds, governance framework, AI ethics, autonomous agents, foundation models, digital personhood, agentic AI ecosystems, philosophical foundations

108. ❌ Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation

Authors: Ping Chen, Daoxuan Zhang, Xiangming Wang, Yungeng Liu, Haijin Zeng, Yongyong Chen Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18627v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

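Scoring rationale: The paper studies precise spatial control in text-to-image generation. It proposes AFS-Search, a training-free closed-loop framework built on FLUX.1-dev that uses a vision-language model as a semantic critic for dynamic steering and formulates generation as a sequential decision-making problem. Its core innovation is applying the agentic workflow concept to T2I generation, achieving real-time feedback and optimization through parallel rollout search and flow steering. It is therefore highly related only to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10), because the paper explicitly uses the 'Agentic Flow Steering' concept and builds an agent-style workflow. No other keyword is mentioned or implied in the title or abstract, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper addresses deviations from spatial constraints in text-to-image generation that arise from the limited relational reasoning of static text encoders and from error accumulation in open-loop sampling. It proposes the AFS-Search framework, which introduces a vision-language model as a semantic critic for dynamic flow steering and parallel rollout search, substantially improving FLUX.1-dev and achieving state-of-the-art results on three benchmarks.

Abstract (Chinese translation)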

精确的文本到图像（Text-to-Image, T2I）生成已取得显著成功，但受限于静态文本编码器的关系推理能力不足以及开环采样中的误差累积问题。由于缺乏实时反馈，常微分方程轨迹中初始的语义模糊性不可避免地会演变为空间约束的随机偏差。为弥补这一差距，我们提出了AFS-Search（智能流引导与并行推演搜索），这是一个基于FLUX.1-dev构建的无训练闭环框架。AFS-Search融合了无需训练的闭环并行推演搜索与流引导机制，该机制利用视觉语言模型（Vision-Language Model, VLM）作为语义评判器，对中间隐变量进行诊断，并通过精确的空间定位动态引导速度场。互补地，我们将T2I生成建模为一个序列决策过程，通过前瞻模拟探索多条轨迹，并依据VLM引导的奖励选择最优路径。此外，我们提供了更高性能的AFS-Search-Pro版本与更快生成速度的AFS-Search-Fast版本。实验结果表明，我们的AFS-Search-Pro极大提升了原始FLUX.1-dev的性能，在三个不同基准测试中均达到了最先进水平。同时，AFS-Search-Fast在保持快速生成速度的同时，也显著提升了生成质量。

Abstract

Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities during the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.
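
The abstract frames generation as a sequential decision process in which several candidate trajectories are rolled forward and the best is chosen by a critic's reward. The toy below sketches that control loop on a one-dimensional state. Everything in it is a stand-in chosen for illustration: the scalar state replaces image latents, `critic_reward` replaces the VLM critic, the constant `velocity` replaces the learned velocity field, and candidates are scored one after another rather than in parallel. It is not the AFS-Search algorithm.

```python
import random

def critic_reward(x, target):
    """Stand-in critic: higher is better, peaks when x reaches the target."""
    return -abs(x - target)

def rollout(x, steps_left, velocity):
    """Lookahead: follow the unsteered dynamics to the end of sampling."""
    return x + steps_left * velocity

def steered_sample(x0, target, steps=10, velocity=0.1, n_candidates=4, seed=0):
    rng = random.Random(seed)
    x = x0
    for t in range(steps):
        # Candidate steering offsets; 0.0 keeps the unsteered update available.
        offsets = [0.0] + [rng.uniform(-0.2, 0.2) for _ in range(n_candidates)]

        def projected_reward(offset):
            end_state = rollout(x + velocity + offset, steps - t - 1, velocity)
            return critic_reward(end_state, target)

        best = max(offsets, key=projected_reward)
        x = x + velocity + best
    return x

target = 2.0
open_loop = rollout(0.0, 10, 0.1)          # no feedback during sampling
closed_loop = steered_sample(0.0, target)  # critic-guided steering at each step
```

Because the zero offset is always among the candidates, each step can only keep or improve the projected reward, so the closed-loop result is never worse than the open-loop one in this toy setting (up to floating-point rounding).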

关键词: Text-to-Image Generation, Agentic Flow Steering, Parallel Rollout Search, Vision-Language Model, Spatial Grounding, Closed-loop Framework, Sequential Decision-making, FLUX.1-dev

Authors: Shuqi Xiao, Maani Ghaffari, Chengzhong Xu, Hui Kong Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18624v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

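Scoring rationale: The paper proposes the REST framework, which uses an LLM as the core reasoning component for zero-shot object-goal navigation and applies chain-of-thought reasoning to path selection, an application of LLMs to robotics and autonomous agents. It is therefore highly relevant to 'Large Language Models', 'Chain of Thought', and 'LLM Agents' (10 points each), somewhat relevant to 'System 2 Thinking' (5 points), and unrelated to the remaining keywords (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses the neglected design of the option space in zero-shot object-goal navigation. It proposes REST, which builds a tree of paths and selects branches through chain-of-thought LLM reasoning, and reports a favorable balance of success rate and path efficiency on the Gibson, HM3D, and HSSD benchmarks.

Abstract (Chinese translation)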

零样本目标导向导航任务要求智能体在未经任务特定训练的情况下，于未知环境中导航并寻找目标物体。现有的分层免训练方案虽着力于场景理解（即信念）与高层决策（即策略），却忽视了选项的设计——即从动态演化的信念中提出子目标候选，并交由策略进行选择。实践中，选项常被简化为独立评分的孤立路径点：单一目的地掩盖了沿途可获取的信息价值；无序的候选集合则模糊了选项间的关联。我们的核心见解是：选项空间应构建为路径树。完整路径能揭示仅依赖目的地评分所系统性忽略的途中信息增益；而由共享片段构成的树结构，支持大型语言模型进行由粗到细的推理——在检视单个叶节点前即可排除或追踪整条分支，从而将组合爆炸的路径空间压缩为高效的层次结构。我们将此见解实例化为 REST（Receding Horizon Explorative Steiner Tree），一个免训练框架，其包含：（1）从在线RGB-D流构建显式的开放词汇3D地图；（2）通过基于采样的规划，以智能体为中心生长出安全且信息丰富的路径树作为选项空间；（3）将每个分支文本化为空间叙事，并通过思维链式的大型语言模型推理选择最优后续路径。在Gibson、HM3D与HSSD基准测试中，REST在成功率上持续位居前列，同时达到最佳或次优的路径效率，展现出优越的效率与成功率平衡。

Abstract

Zero-shot object-goal navigation (ZSON) requires navigating unknown environments to find a target object without task-specific training. Prior hierarchical training-free solutions invest in scene understanding (*belief*) and high-level decision-making (*policy*), yet overlook the design of *option*, i.e., a subgoal candidate proposed from evolving belief and presented to policy for selection. In practice, options are reduced to isolated waypoints scored independently: single destinations hide the value gathered along the journey; an unstructured collection obscures the relationships among candidates. Our insight is that the option space should be a *tree of paths*. Full paths expose en-route information gain that destination-only scoring systematically neglects; a tree of shared segments enables coarse-to-fine LLM reasoning that dismisses or pursues entire branches before examining individual leaves, compressing the combinatorial path space into an efficient hierarchy. We instantiate this insight in **REST** (Receding Horizon Explorative Steiner Tree), a training-free framework that (1) builds an explicit open-vocabulary 3D map from online RGB-D streams; (2) grows an agent-centric tree of safe and informative paths as the option space via sampling-based planning; and (3) textualizes each branch into a spatial narrative and selects the next-best path through chain-of-thought LLM reasoning. Across the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, demonstrating a favorable efficiency-success balance.

Keywords: zero-shot object-goal navigation, LLM reasoning, chain-of-thought, autonomous agents, path planning, spatial narrative, training-free framework, exploration tree
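
The "tree of paths" idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the tree is a plain prefix tree over waypoint names, the LLM is replaced by a caller-supplied scoring function (`score_branch`, a hypothetical name), and the room names in the usage note are made up. It only shows how shared segments let a selector keep or drop a whole branch before looking at individual leaves.

```python
def build_path_tree(paths):
    """Merge waypoint sequences that share prefixes into a nested-dict prefix tree."""
    tree = {}
    for path in paths:
        node = tree
        for waypoint in path:
            node = node.setdefault(waypoint, {})
    return tree


def leaves_under(node, prefix):
    """Return every root-to-leaf path that passes through `prefix` and ends below `node`."""
    if not node:
        return [prefix]
    result = []
    for waypoint, child in node.items():
        result.extend(leaves_under(child, prefix + [waypoint]))
    return result


def select_path(tree, score_branch):
    """Coarse-to-fine selection: at each level keep the best branch, drop its siblings.

    `score_branch(prefix, leaves)` stands in for the LLM judgment (an assumption);
    it receives the branch prefix and all full paths reachable through it.
    """
    path, node = [], tree
    while node:
        best = max(
            node,
            key=lambda w: score_branch(path + [w], leaves_under(node[w], path + [w])),
        )
        path.append(best)
        node = node[best]
    return path
```

With paths such as `["hall", "kitchen", "pantry"]` and `["hall", "bedroom"]`, both share the `hall` segment, so a scorer that rejects `hall` discards both candidates in one decision.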

110. ❌ OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Authors: Bin Cao, Sipeng Zheng, Hao Luo, Boyuan Li, Jing Liu, Zongqing Lu Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18623v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

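Scoring rationale: The paper focuses on text-to-motion generation; its main contributions are the open-source, large-scale, high-quality dataset OpenT2M and the pretrained model MonoFrill. It is unrelated to most keywords because it does not involve large language models, reasoning techniques, alignment, or efficient fine-tuning. Only three keywords are weakly related: (1) 'Scaling Laws AND Data Quality' (5 points), since the paper stresses the role of data quality and scale in generalization; (2) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), since the paper presents the pretrained model MonoFrill; (3) 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), since motion generation can be viewed as an application of AI to animation and robotics, which falls under a broad reading of AI for Science.

!!! tip deepseek-chat TL;DR

By building the large-scale, high-quality, open-source motion dataset OpenT2M and developing the pretrained model MonoFrill, the paper addresses the poor generalization of text-to-motion models caused by small, insufficiently diverse datasets.

Abstract (Chinese translation)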

文本驱动动作生成（Text-to-motion, T2M）旨在根据文本描述生成逼真的人体运动，在动画和机器人领域具有广阔的应用前景。尽管近期取得进展，但由于现有动作数据集规模小、多样性有限，当前T2M模型在面对未见过的文本描述时表现欠佳。为解决这一问题，我们推出了OpenT2M——一个百万级别、高质量的开源动作数据集，包含超过2800小时的人体运动数据。每个运动序列均通过物理可行性验证和多粒度筛选进行严格质量控制，并配有详细的逐秒文本标注。我们还开发了自动化流程用于生成长时序动作序列，从而支持复杂动作生成。基于OpenT2M，我们提出了MonoFrill预训练动作模型，该模型无需复杂设计或技术“修饰”即可实现出色的T2M效果。其核心组件是2D-PRQ新型运动分词器，通过将人体划分为生物力学部件来捕捉时空依赖关系。实验表明，OpenT2M显著提升了现有T2M模型的泛化能力，而2D-PRQ在运动重建和零样本性能方面均表现优异。我们期待OpenT2M与MonoFrill能够通过解决长期存在的数据质量和基准测试难题，推动T2M领域的发展。

Abstract

Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technique tricks as “frills”. Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biology parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.

Keywords: text-to-motion generation, motion dataset, OpenT2M, MonoFrill, 2D-PRQ tokenizer, motion generation, human motion, zero-shot performance

111. ❌ AutORAN: LLM-driven Natural Language Programming for Agile xApp Development

Authors: Xin Li, Shiming Yu, Leming Shen, Jianing Zhang, Yuanqing Zheng, Yaxiong Xie Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18604v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

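Scoring rationale: AutORAN proposes an LLM-driven natural language programming framework that automates the xApp development pipeline, an application of large models to a specific domain (telecom networks). Relevant keywords: (1) 'Large Language Models' is highly relevant (10 points) because the paper explicitly uses an LLM as its core technology; (2) 'LLM Agents' and 'Tool Use' are somewhat relevant (5 points each) because the framework involves LLM-driven automated workflows and function calls. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

Targeting the slow and cumbersome development of xApps for O-RAN, the paper presents AutORAN, the first LLM-driven natural language programming framework, which turns user intents into deployable xApps within minutes, greatly shortens the development cycle, and matches or exceeds hand-crafted baselines.

Abstract (Chinese translation)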

传统无线接入网（RAN）系统具有封闭式、单体化的特点，抑制了网络创新。开放无线接入网络（Open Radio Access Network, O-RAN）所实现的开放性与可编程性，有望通过控制面应用——xApps——彻底变革蜂窝网络。然而，xApp（通常由第三方开发者开发）的开发过程依然耗时且繁琐，往往需要数月的手动编码与集成工作，这在实际中阻碍了新功能的快速部署。为降低开发者和网络运营商在xApp开发方面的门槛，我们提出了AutORAN，这是首个基于大语言模型（LLM）驱动的自然语言编程框架，用于敏捷化xApp开发，并实现了整个xApp开发流程的自动化。简而言之，AutORAN能够在数分钟内将高层次用户意图转化为可快速部署的xApp，无需人工编码或测试。为此，AutORAN构建了一个全自动的xApp生成流程，该流程集成了多个功能模块（涵盖从用户需求提取、人工智能/机器学习（AI/ML）功能设计与验证，到xApp合成与部署的全过程）。我们在代表性的xApp任务上设计、实现并全面评估了AutORAN。结果表明，AutORAN生成的xApp能够达到与已知最佳手工编写基线方案相当甚至更优的性能。AutORAN极大地加速了xApp开发周期（从用户意图提取到部署上线），从而有力推动了O-RAN的创新进程。

Abstract

Traditional RAN systems are closed and monolithic, stifling innovation. The openness and programmability enabled by Open Radio Access Network (O-RAN) are envisioned to revolutionize cellular networks with control-plane applications–xApps. The development of xApps (typically by third-party developers), however, remains time-consuming and cumbersome, often requiring months of manual coding and integration, which hinders the roll-out of new functionalities in practice. To lower the barrier of xApp development for both developers and network operators, we present AutORAN, the first LLM-driven natural language programming framework for agile xApps that automates the entire xApp development pipeline. In a nutshell, AutORAN turns high-level user intents into swiftly deployable xApps within minutes, eliminating the need for manual coding or testing. To this end, AutORAN builds a fully automated xApp generation pipeline, which integrates multiple functional modules (from user requirement elicitation, AI/ML function design and validation, to xApp synthesis and deployment). We design, implement, and comprehensively evaluate AutORAN on representative xApp tasks. Results show AutORAN-generated xApps can achieve similar or even better performance than the best known hand-crafted baselines. AutORAN drastically accelerates the xApp development cycle (from user intent elicitation to roll-out), streamlining O-RAN innovation.

Keywords: LLM-driven, natural language programming, xApp development, Open RAN, automated pipeline, agile development, network automation, AI/ML function design

112. ❌ myMNIST: Benchmark of PETNN, KAN, and Classical Deep Learning Models for Burmese Handwritten Digit Recognition

Authors: Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18597v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

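Scoring rationale: The paper benchmarks Burmese handwritten digit recognition, evaluating deep learning models including CNN, LSTM, GRU, Transformer, KAN variants, JEM, and PETNN. It is unrelated to the vast majority of keywords (LLMs, MoE, Scaling Laws, RLHF, RAG, CoT, LLM Agents, and so on), which concern large language models and associated techniques (training, alignment, reasoning, agents), whereas this paper studies a conventional computer vision task (handwritten digit recognition) with classical and emerging deep learning architectures. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to a specific domain (Burmese script processing), which fits a broad reading of AI for Science but is not core bioinformatics or cheminformatics, hence 5 points (somewhat relevant).

!!! tip deepseek-chat TL;DR

The study presents the first systematic benchmark on the Burmese handwritten digit dataset myMNIST, evaluating 11 deep learning models: the CNN performs best, PETNN (GELU) follows closely, and the KAN variants trail slightly.

Abstract (Chinese translation)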

我们首次对myMNIST（原BHDD）进行了系统性基准测试，这是一个公开可用的缅甸语手写数字数据集，对缅甸自然语言处理/人工智能研究具有重要意义。我们评估了十一种架构，涵盖经典深度学习模型（多层感知机、卷积神经网络、长短期记忆网络、门控循环单元、Transformer）、近期替代模型（FastKAN、EfficientKAN）、基于能量的模型（JEM）以及受物理学启发的PETNN变体（Sigmoid、GELU、SiLU）。使用精确率、召回率、F1分数和准确率作为评估指标，我们的结果表明卷积神经网络（CNN）仍然是强大的基线模型，取得了最佳综合得分（F1 = 0.9959，准确率 = 0.9970）。PETNN（GELU）模型紧随其后（F1 = 0.9955，准确率 = 0.9966），其表现优于长短期记忆网络、门控循环单元、Transformer及KAN变体。代表基于能量建模的JEM模型表现出竞争力（F1 = 0.9944，准确率 = 0.9958）。基于KAN的模型（FastKAN、EfficientKAN）虽落后于最优模型，但提供了有意义的替代基线（准确率约0.992）。这些发现：（一）为myMNIST在不同建模范式下建立了可复现的基准；（二）凸显了PETNN相对于经典模型和Transformer模型的强劲性能；（三）量化了受能量启发的PETNN与真实基于能量的模型（JEM）之间的差距。我们发布此基准旨在促进未来缅甸语数字识别研究，并鼓励对区域性文字数据集上的新兴架构开展更广泛的评估。

Abstract

We present the first systematic benchmark on myMNIST (formerly BHDD), a publicly available Burmese handwritten digit dataset important for Myanmar NLP/AI research. We evaluate eleven architectures spanning classical deep learning models (Multi-Layer Perceptron, Convolutional Neural Network, Long Short-Term Memory, Gated Recurrent Unit, Transformer), recent alternatives (FastKAN, EfficientKAN), an energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU). Using Precision, Recall, F1-Score, and Accuracy as evaluation metrics, our results show that the CNN remains a strong baseline, achieving the best overall scores (F1 = 0.9959, Accuracy = 0.9970). The PETNN (GELU) model closely follows (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and KAN variants. JEM, representing energy-based modeling, performs competitively (F1 = 0.9944, Accuracy = 0.9958). KAN-based models (FastKAN, EfficientKAN) trail the top performers but provide a meaningful alternative baseline (Accuracy ~0.992). These findings (i) establish reproducible baselines for myMNIST across diverse modeling paradigms, (ii) highlight PETNN’s strong performance relative to classical and Transformer-based models, and (iii) quantify the gap between energy-inspired PETNNs and a true energy-based model (JEM). We release this benchmark to facilitate future research on Myanmar digit recognition and to encourage broader evaluation of emerging architectures on regional scripts.

Keywords: myMNIST, Burmese handwritten digit recognition, benchmark, PETNN, KAN, CNN, deep learning models, Myanmar NLP/AI research
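
The abstract above reports Precision, Recall, F1-Score, and Accuracy for a ten-class digit task. As a reminder of how these relate, here is a minimal sketch that computes them from label lists. It assumes macro averaging (an unweighted mean over classes); the abstract does not say which averaging the authors used, and the toy labels in the usage note are illustrative only.

```python
def classification_metrics(y_true, y_pred, labels):
    """Return macro precision, macro recall, macro F1, and accuracy."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Guard against classes that are never predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
    }
```

For example, `classification_metrics([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2], labels=[0, 1, 2])` scores a three-class toy problem with one misclassified sample. Note that macro F1 is the mean of per-class F1 values, which generally differs from the harmonic mean of macro precision and macro recall.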

113. ❌ How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Authors: Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang, Zhehuai Chen, Sung-Feng Huang, Chih-Kai Yang, Yi-Cheng Lin, Chi-Yuan Hsiao, Wenze Ren, En-Pei Hu, Yu-Han Huang, An-Yu Cheng, Cheng-Han Chiang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19195v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

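Scoring rationale: The paper studies the role of LLMs as knowledge backbones of audio language models, so it is highly relevant to 'Large Language Models' (10 points). It evaluates knowledge acquired in pre-training and applies fine-tuning, so it is somewhat relevant to 'Pre-training' and 'Post-training' (5 points each). The work applies AI to audio research and is rated as related to 'AI for Science' (5 points). Other keywords such as MoE, quantization, and inference acceleration are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The study examines how auditory knowledge that LLMs encode through text-only pre-training affects downstream audio language models, finding that auditory knowledge varies substantially across model families and that text-only evaluation results correlate strongly with audio performance.

Abstract (Chinese translation)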

大型语言模型（LLM）已被广泛用作大型音频语言模型（LALM）的知识基础，然而它们通过纯文本预训练编码了多少听觉知识，以及这如何影响下游性能，目前尚不明确。我们通过比较不同LLM在两种纯文本与一种音频接地设置下的表现来研究这一差距：（1）在AKB-2000上进行直接探测，该基准测试旨在评估听觉知识的广度与深度；（2）级联评估，即LLM基于音频描述器生成的文本描述进行推理；（3）音频接地评估，即将每个LLM与音频编码器结合微调为大型音频语言模型（LALM）。我们的研究结果表明，不同模型系列的听觉知识存在显著差异，且纯文本评估结果与音频性能高度相关。本工作为全面理解LLM在音频研究中的作用提供了实证基础。

Abstract

Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text-only and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into a Large Audio Language Model (LALM) with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.

Keywords: Large Language Models, Auditory Knowledge, Audio Language Models, Pre-training, Fine-tuning, Benchmark Evaluation, Audio Encoder, Knowledge Backbone

114. ❌ Online Learning and Equilibrium Computation with Ranking Feedback

Authors: Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman Ozdaglar, Kaiqing Zhang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19221v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

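Scoring rationale: The paper mainly studies online learning algorithms; the end of the abstract mentions applying them to an online LLM routing task, so it is rated as related to 'Large Language Models OR LLMs OR Foundation Models' and 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (8 points each). Other keywords such as MoE, SFT, RAG, and quantization are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how to design algorithms that achieve sublinear regret in online learning where only ranking feedback over actions is observed, and validates them on an LLM routing task.

Abstract (Chinese translation)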

在任意（可能具有对抗性）环境中的在线学习已在序列决策领域得到广泛研究，且与博弈论中的均衡计算密切相关。现有的大多数在线学习算法依赖于环境提供的数值化效用反馈，而在人机交互应用中此类反馈可能无法获取，或受隐私问题限制。本文研究一种在线学习模型，其中学习者仅在每个时间步观察到对一组提议行动的排序。我们考虑两种排序机制：由当前时间步的瞬时效用诱导的排序，以及由截至当前时间步的时间平均效用诱导的排序，并分别在全信息和赌博机反馈设置下进行分析。使用标准的外部遗憾度量，我们证明在一般情况下，基于瞬时效用排序反馈无法实现次线性遗憾。此外，当排序模型相对确定时（例如在温度参数足够小的普拉凯特-卢斯模型下），基于时间平均效用排序反馈同样无法实现次线性遗憾。随后，我们提出了新算法，在效用序列具有次线性总变差的附加假设下实现了次线性遗憾。值得注意的是，对于全信息时间平均效用排序反馈，这一附加假设可以被移除。由此，当正规形式博弈中的所有参与者遵循我们的算法时，重复博弈将产生近似粗相关均衡。我们还在在线大语言模型路由任务中验证了所提算法的有效性。

Abstract

Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on *numeric* utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a *ranking* over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the *instantaneous* utility at the current timestep, and rankings induced by the *time-average* utility up to the current timestep, under both *full-information* and *bandit* feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, i.e., under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.

Keywords: online learning, ranking feedback, regret minimization, game theory, coarse correlated equilibrium, LLM routing, sequential decision-making, Plackett-Luce model
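
The abstract above refers to ranking feedback drawn from a Plackett-Luce model whose temperature controls how deterministic the ranking is. The sketch below samples such rankings under one common parametrization, in which each remaining action is picked with probability proportional to exp(utility / temperature). That parametrization and all function names are assumptions for illustration; the abstract does not spell out the exact form the authors use.

```python
import math
import random


def sample_plackett_luce(utilities, temperature, rng):
    """Sample a ranking (best first) of action indices from a Plackett-Luce model."""
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        # Softmax weights over the remaining actions; subtract the max for stability.
        top = max(utilities[i] for i in remaining)
        weights = [math.exp((utilities[i] - top) / temperature) for i in remaining]
        chosen = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(chosen)
        remaining.remove(chosen)
    return ranking


def top_choice_frequency(utilities, temperature, trials=2000, seed=0):
    """Fraction of sampled rankings whose first entry is the highest-utility action."""
    rng = random.Random(seed)
    best = max(range(len(utilities)), key=lambda i: utilities[i])
    hits = sum(
        sample_plackett_luce(utilities, temperature, rng)[0] == best
        for _ in range(trials)
    )
    return hits / trials
```

With a small temperature the sampled ranking almost always matches the utility order, which is the "relatively deterministic" regime the abstract mentions; with a large temperature the ranking approaches a uniformly random permutation.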

115. ❌ Evaluating Counterfactual Strategic Reasoning in Large Language Models

Authors: Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou, Giorgos Filandrianos, Giorgos Stamou Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19167v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

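Scoring rationale: The paper evaluates the strategic reasoning of LLMs in game-theoretic settings, in particular testing generalization with counterfactual variants. It is therefore highly relevant to 'Large Language Models' (10 points), the object of study, and related to 'Chain of Thought' and 'System 2 Thinking' (8 points each) because it evaluates reasoning. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract, are unrelated to the paper's topic, and score 0.

!!! tip deepseek-chat TL;DR

Using counterfactual variants of the Prisoner's Dilemma and Rock-Paper-Scissors, the paper evaluates the strategic reasoning of large language models and finds limitations in incentive sensitivity, structural generalization, and reasoning in counterfactual environments.

Abstract (Chinese translation)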

我们在重复博弈论情境下评估大型语言模型（LLMs），以检验其策略表现反映的是真正的推理能力还是对记忆模式的依赖。我们选取了两个经典博弈——囚徒困境（Prisoner’s Dilemma, PD）和石头剪刀布（Rock-Paper-Scissors, RPS），并在此基础上引入反事实变体：通过改变收益结构和行动标签，打破原有的对称性与占优关系。我们的多指标评估框架对比了默认版本与反事实版本中的模型表现，揭示了LLMs在反事实环境中对激励的敏感性、结构泛化能力以及策略推理方面存在的局限性。

Abstract

We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner’s Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.

Keywords: Large Language Models, strategic reasoning, counterfactual variants, game theory, Prisoner’s Dilemma, Rock-Paper-Scissors, evaluation framework, generalization
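
To make "counterfactual variants that alter payoff structures and action labels" concrete, the sketch below checks for a strictly dominant action in a two-player game, then applies two changes: relabeling the actions and swapping in a payoff table where the dominance relation flips. The payoff numbers are standard textbook values and a made-up variant, not the ones used in the paper, and the helper names are hypothetical.

```python
ACTIONS = ["cooperate", "defect"]

# payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
DEFAULT_PD = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}

# Illustrative counterfactual: same labels, but cooperating now always pays more.
COUNTERFACTUAL_PD = {
    ("cooperate", "cooperate"): (5, 5),
    ("cooperate", "defect"): (3, 0),
    ("defect", "cooperate"): (0, 3),
    ("defect", "defect"): (1, 1),
}


def strictly_dominant_action(payoffs, actions):
    """Return the row player's strictly dominant action, or None if there is none."""
    for a in actions:
        rivals = [b for b in actions if b != a]
        if all(payoffs[(a, c)][0] > payoffs[(b, c)][0] for b in rivals for c in actions):
            return a
    return None


def relabel(payoffs, mapping):
    """Rename actions without touching payoffs (a label-only counterfactual)."""
    return {(mapping[r], mapping[c]): value for (r, c), value in payoffs.items()}
```

A model that has merely memorized "defect is the rational move" would answer correctly on `DEFAULT_PD` but fail on `COUNTERFACTUAL_PD`, where `strictly_dominant_action` returns `"cooperate"`, and might also stumble when the same game is presented under neutral labels via `relabel`.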

116. ❌ A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Authors: Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz, Davis Bartels, Dukyong Yoon, Brad Quitadamo, Rajiv Menghrajani, Leo Celi, Sarvesh Soni Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19082v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

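Scoring rationale: The paper's core contribution is HEALIX, a health literacy dataset, together with zero-shot and few-shot prompting tests using LLMs. It is therefore highly relevant to 'Large Language Models' (10) and counts as an 'AI for Science' application in the bioinformatics/medical domain (10). The paper mentions few-shot prompting, which has some connection to 'In-context Learning' (5). Other keywords such as MoE, SFT, and RAG are not covered and score 0.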

!!! tip deepseek-chat TL;DR

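This study creates HEALIX, the first publicly available dataset of clinical notes annotated for health literacy, and benchmarks zero-shot and few-shot prompting strategies with open-source large language models to automate the detection of patient health literacy information.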

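Abstract translation

Health literacy is a key determinant of patient outcomes, but existing screening tools are limited in feasibility and differ markedly in item count, question format, and the health literacy dimensions they cover, which makes standardized documentation in structured electronic health records difficult. Automated detection of health literacy from unstructured clinical notes offers a promising alternative, because these notes often contain richer and more contextual health literacy information, yet progress has been limited by the lack of annotated resources. We introduce HEALIX, the first publicly available annotated health literacy dataset built from real clinical notes, constructed by combining social worker note sampling, keyword-based filtering, and an LLM-based active learning pipeline. HEALIX contains 589 clinical notes across 9 note types, annotated with three health literacy labels: low, normal, and high. To demonstrate its utility, we benchmarked zero-shot and few-shot prompting strategies on four open-source large language models.

Abstract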

Health literacy is a critical determinant of patient outcomes, yet current screening tools are not always feasible and differ considerably in the number of items, question format, and dimensions of health literacy they capture, making documentation in structured electronic health records difficult to achieve. Automated detection from unstructured clinical notes offers a promising alternative, as these notes often contain richer, more contextual health literacy information, but progress has been limited by the lack of annotated resources. We introduce HEALIX, the first publicly available annotated health literacy dataset derived from real clinical notes, curated through a combination of social worker note sampling, keyword-based filtering, and LLM-based active learning. HEALIX contains 589 notes across 9 note types, annotated with three health literacy labels: low, normal, and high. To demonstrate its utility, we benchmarked zero-shot and few-shot prompting strategies across four open source large language models (LLMs).

Keywords: health literacy, clinical notes, dataset, large language models, zero-shot prompting, few-shot prompting, automated detection, electronic health records

117. ❌ Optimal Splitting of Language Models from Mixtures to Specialized Domains

Authors: Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, David Grangier Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19149v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

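Scoring rationale: The paper studies compute allocation in language model training and proposes a scaling-law-based method for optimizing the split of compute between pretraining and continued pretraining. It is highly relevant to 'Large Language Models' (10) because it explicitly studies language model training; highly relevant to 'Scaling Laws AND Data Quality' (10) because it uses scaling laws to predict model loss and optimize compute allocation; and highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10) because it studies the two-stage paradigm of pretraining followed by continued pretraining (specialization). Other keywords such as MoE, SFT, RAG, and inference acceleration are unrelated to the paper and score 0.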

!!! tip deepseek-chat TL;DR

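The paper studies how to use scaling laws to optimize the allocation of compute between pretraining and specialization for language models, proposing a method that accurately predicts model loss and improves performance on common-sense knowledge and reasoning benchmarks.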

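Abstract translation

Language models show excellent performance on knowledge, language, and reasoning tasks thanks to the scale and diversity of the available pretraining data. The standard training recipe follows a two-stage paradigm: pretraining on the full data corpus, then specialized training on a high-quality, specialized subset selected from the full corpus. In the multi-domain setting, this typically involves continued pretraining of multiple models, one per specialized domain, which is known as split model training. This paper proposes a method that pretrains multiple models independently on a general pretraining corpus and uses scaling laws to determine the optimal compute allocation between pretraining and continued pretraining. The method accurately predicts the loss of a model of size N trained with D pretraining tokens and D' specialization tokens, and extrapolates to larger model sizes and token counts. Applied to language model training, it consistently improves performance on common-sense knowledge and reasoning benchmarks across model sizes and compute budgets.

Abstract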

Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D’ specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.
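
The abstract does not state the functional form of the fitted scaling law. As a purely illustrative sketch, the code below assumes a hypothetical additive power law in N, D, and D' and grid-searches the share of a fixed token budget spent on general pretraining. Every coefficient, the functional form, and the names `predicted_loss` and `best_split` are assumptions made for illustration, not the paper's method. With the model size held fixed, tokens stand in for compute.

```python
def predicted_loss(n_params, d_pretrain, d_special,
                   e=1.7, a=400.0, alpha=0.34,
                   b=410.0, beta=0.28, c=60.0, gamma=0.30):
    """Hypothetical additive power law; every coefficient here is made up."""
    return (e
            + a / n_params ** alpha
            + b / d_pretrain ** beta
            + c / d_special ** gamma)


def best_split(n_params, total_tokens, steps=99):
    """Grid-search the fraction of tokens spent on general pretraining.

    Returns (fraction, predicted_loss) for the best grid point."""
    best = None
    for i in range(1, steps + 1):
        frac = i / (steps + 1)           # strictly between 0 and 1
        d = frac * total_tokens          # general pretraining tokens (D)
        d_prime = total_tokens - d       # specialization tokens (D')
        loss = predicted_loss(n_params, d, d_prime)
        if best is None or loss < best[1]:
            best = (frac, loss)
    return best
```

For example, `best_split(1e9, 2e10)` returns the grid fraction with the lowest predicted loss under the assumed law. With a law actually fitted to training runs, the same search would give the allocation that the law predicts to be optimal.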

Keywords: language models, pretraining, specialization, scaling laws, compute allocation, continued pretraining, multi-domain, split model training

118. ❌ RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation

Authors: Weronika Łajewska, Paul Missault, George Davidson, Saab Mansour Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19002v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

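Scoring rationale: The paper studies the use of LLMs for survey simulation and proposes the RADIUS evaluation suite to assess how well simulated results align with human preferences, so it is highly relevant to 'Large Language Models' and 'Alignment' (10 each). It does not address the technical details or applications behind the other keywords, such as MoE, SLMs, training methods, inference optimization, or agent systems, so those keywords score 0.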

!!! tip deepseek-chat TL;DR

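Existing evaluation metrics for LLM survey simulation are non-standardized, hard to compare, and neglect ranking alignment. To address this, the paper proposes RADIUS, a comprehensive evaluation suite covering ranking alignment, distribution alignment, and statistical significance testing.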

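Abstract translation

Survey simulation with large language models is emerging as a powerful application for generating human-like responses at scale. Existing studies typically evaluate survey simulation with metrics borrowed from other domains; these metrics are often ad hoc, fragmented, and non-standardized, which makes results hard to compare. In addition, existing metrics focus mainly on accuracy or distributional measures and neglect the critical dimension of ranking alignment. In practice, a simulation can reach high accuracy yet still fail to capture the option humans prefer most, a discrepancy that matters greatly in decision-making applications. We propose RADIUS, a comprehensive two-dimensional alignment evaluation suite for survey simulation that covers 1) ranking alignment and 2) distribution alignment, each complemented by statistical significance testing. RADIUS exposes the limitations of existing metrics, makes the evaluation of survey simulation more meaningful, and provides an open-source implementation to ensure reproducible and comparable evaluation.

Abstract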

Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale. Prior work evaluates survey simulation using metrics borrowed from other domains, which are often ad hoc, fragmented, and non-standardized, leading to results that are difficult to compare. Moreover, existing metrics focus mainly on accuracy or distributional measures, overlooking the critical dimension of ranking alignment. In practice, a simulation can achieve high accuracy while still failing to capture the option most preferred by humans - a distinction that is critical in decision-making applications. We introduce RADIUS, a comprehensive two-dimensional alignment suite for survey simulation that captures: 1) RAnking alignment and 2) DIstribUtion alignment, each complemented by statistical Significance testing. RADIUS highlights the limitations of existing metrics, enables more meaningful evaluation of survey simulation, and provides an open-source implementation for reproducible and comparable assessment.
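
The gap between distributional closeness and ranking alignment can be shown with a toy example. The metrics below (total variation distance and a top-choice comparison) are stand-ins chosen for illustration; the abstract does not specify which measures RADIUS uses, the response shares are invented, and significance testing is omitted.

```python
def total_variation(p, q):
    """Total variation distance between two distributions over the same options."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


def ranking(dist):
    """Options ordered from most to least preferred."""
    return sorted(dist, key=dist.get, reverse=True)


# Invented response shares for one survey question.
human = {"A": 0.40, "B": 0.38, "C": 0.22}
simulated = {"A": 0.38, "B": 0.40, "C": 0.22}

tv = total_variation(human, simulated)                  # small distributional gap
same_top = ranking(human)[0] == ranking(simulated)[0]   # top choices disagree
```

Here the two distributions differ by a total variation distance of only about 0.02, yet the simulation names "B" as the most preferred option while humans prefer "A".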

Keywords: survey simulation, LLMs, alignment, ranking alignment, distribution alignment, evaluation metrics, RADIUS, human preferences

119. ❌ A conceptual framework for ideology beyond the left and right

Authors: Kenneth Joseph, Kim Williams, David Lazer Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18945v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

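Scoring rationale: The paper studies a conceptual framework for ideology and sits at the intersection of computational social science (CSS) and natural language processing (NLP), focusing on social cognition, discourse analysis, and theoretical framework construction. The abstract mentions no concrete technical content such as large models, deep learning principles, model training methods, inference optimization, AI agents, or AI-for-science applications. All scoring keywords concern aspects of the large-model technology stack (architecture, training, inference, applications, and so on), and the paper touches none of these topics, so every keyword has a relevance of 0.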

!!! tip deepseek-chat TL;DR

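The paper proposes a framework that understands ideology as a multi-level socio-cognitive concept network, going beyond the traditional left-right political spectrum, and shows how the framework clarifies overlaps between existing NLP tasks (such as stance detection) and reveals new research directions.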

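Abstract translation

Existing research in natural language processing and computational social science has operationalized ideology almost exclusively as a left/right partisan dimension. This approach obscures the fact that people hold interpretations of many different, complex, and more specific ideologies on issues such as race, climate, and gender. We propose a framework that understands ideology as an attributed, multi-level socio-cognitive concept network, and explain how ideology manifests in discourse together with other related social processes such as framing. We show how the framework clarifies overlaps between existing NLP tasks (such as stance detection and natural language inference) and reveals new research directions. This work builds a unique and important bridge between computational methods and ideology theory, enabling richer analysis of social discourse in a way that advances both fields.

Abstract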

NLP+CSS work has operationalized ideology almost exclusively on a left/right partisan axis. This approach obscures the fact that people hold interpretations of many different complex and more specific ideologies on issues like race, climate, and gender. We introduce a framework that understands ideology as an attributed, multi-level socio-cognitive concept network, and explains how ideology manifests in discourse in relation to other relevant social processes like framing. We demonstrate how this framework can clarify overlaps between existing NLP tasks (e.g. stance detection and natural language inference) and also how it reveals new research directions. Our work provides a unique and important bridge between computational methods and ideology theory, enabling richer analysis of social discourse in a way that benefits both fields.

Keywords: ideology, computational social science, natural language processing, conceptual framework, stance detection, discourse analysis, socio-cognitive networks

120. ❌ Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Authors: Xinghao Zhao Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18940v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

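Scoring rationale: The paper studies the reliability of LLM Chain-of-Thought reasoning, predicting answer correctness by analyzing uncertainty dynamics (the shape of the entropy trajectory) across reasoning steps. It is therefore highly relevant to 'Chain of Thought' (10) and to 'Large Language Models' (10). Because it involves uncertainty analysis and reliability prediction, it has some connection to 'System 2 Thinking', 'Self-Correction', 'Hallucination Mitigation', and 'Mechanistic Interpretability' (5 each). Other keywords such as MoE, SFT, and RAG are not covered in the paper (0).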

!!! tip deepseek-chat TL;DR

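By analyzing whether the entropy trajectory is monotone during LLM Chain-of-Thought reasoning, this study finds that the shape of the uncertainty dynamics predicts answer correctness more reliably than total entropy reduction. The result is shown on GSM8K with Qwen2.5-7B-Instruct and replicates on Mistral-7B.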

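Abstract translation

Chain-of-thought reasoning improves the accuracy of large language models, but detecting reasoning failures cheaply remains difficult. This study examines whether the shape of the uncertainty dynamics across reasoning steps, captured by sampling a small number of answer completions at each step, can predict correctness.
We introduce entropy-trajectory monotonicity: a reasoning chain is monotone if the entropy of its answer distribution decreases at every step. In experiments on GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains reach 68.8% accuracy versus 46.8% for non-monotone chains (+21.9 percentage points; Fisher's test p=0.0005; odds ratio OR=2.50). A key finding is that total entropy reduction has no predictive power ($\rho = -0.06$, p=0.31), revealing a dissociation between shape and magnitude: what matters is whether entropy decreases at every step, not by how much. Chains with 0, 1, and 2 monotonicity violations have accuracies of 68.8%, 50.8%, and 28.6%, respectively.
As reasoning steps deepen, the calibration of confidence based on token log-probabilities worsens (expected calibration error, ECE: 0.186 → 0.312), whereas monotonicity detection yields a gain of 5.8 percentage points at 73.7% coverage at a cost of roughly 1,500 tokens per question, 1/8 the cost of 40-chain self-consistency, and outperforms scalar baselines. The results replicate on Mistral-7B (n=300): monotone chains reach 72.3% accuracy versus 37.6% for non-monotone chains (+34.7 percentage points; OR=4.33). Structural properties of uncertainty trajectories are therefore more informative than aggregate measures.

Abstract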

Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps, captured by sampling a few answer completions per step, predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher’s p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive ($\rho = -0.06$, p=0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186 → 0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approximately 1,500 tokens per question, 1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.
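
The monotonicity check can be sketched in a few lines. This is a minimal illustration of the definition in the abstract, not the author's implementation: it assumes the sampled answers for each step are already available as strings, uses Shannon entropy in bits, and treats a tie (entropy unchanged) as a violation. The function names are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution of sampled answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def violation_count(step_samples):
    """Number of steps whose entropy does not decrease relative to the previous step.

    Assumption: an unchanged entropy counts as a violation (strict decrease required)."""
    entropies = [answer_entropy(s) for s in step_samples]
    return sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur >= prev)


def is_monotone(step_samples):
    """A chain is monotone if entropy decreases at every step."""
    return violation_count(step_samples) == 0


# Toy chain: four sampled answers after each of three reasoning steps.
chain = [
    ["12", "15", "12", "18"],   # step 1: answers disagree
    ["12", "12", "12", "15"],   # step 2: mostly agree
    ["12", "12", "12", "12"],   # step 3: full agreement
]
```

In this toy chain the entropy falls from 1.5 bits to about 0.81 bits to 0, so `is_monotone(chain)` holds; swapping the first two steps produces one violation.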

Keywords: Chain-of-Thought, LLM reasoning, uncertainty dynamics, entropy trajectory, reliability prediction, monotonicity, GSM8K, Qwen2.5

121. ❌ Why Better Cross-Lingual Alignment Fails for Better Cross-Lingual Transfer: Case of Encoders

Authors: Yana Veitsman, Yihong Liu, Hinrich Schütze Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18863v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

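Scoring rationale: The paper studies the relationship between cross-lingual alignment and downstream task performance, mainly involving alignment techniques and supervised fine-tuning. It is highly relevant to 'Instruction Tuning OR Alignment OR Value Alignment' (8) and has some connection to 'Post-training OR Supervised Fine-tuning OR SFT' (5). However, the paper focuses on XLM-R encoder models and does not involve innovations in large-model techniques, scientific applications, or the other keywords, so all other keywords score 0.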

!!! tip deepseek-chat TL;DR

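The study finds that cross-lingual alignment objectives are largely orthogonal to downstream task objectives, so better alignment does not always yield better cross-lingual transfer, and it offers practical guidelines for combining alignment with task fine-tuning.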

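Abstract translation

It is commonly assumed that better cross-lingual alignment leads to better cross-lingual transfer. However, explicit alignment techniques, although they increase embedding similarity, often fail to improve token-level downstream task performance. This study finds that the mismatch arises because alignment objectives and downstream task objectives are largely orthogonal, and because the downstream gains from alignment vary substantially across languages and task types. We analyze four XLM-R encoder models aligned on different language pairs and fine-tuned for POS tagging or sentence classification. Through representational analyses (embedding distances, gradient similarities, and gradient magnitudes of the task and alignment losses), we find that (1) embedding distance alone does not reliably predict gains (or losses) in task performance, and (2) alignment gradients and task gradients are often close to orthogonal, indicating that optimizing one objective contributes little to the other. Taken together, these findings explain why “better” alignment often fails to translate into “better” cross-lingual transfer. Based on these insights, we offer practical guidelines for combining cross-lingual alignment with task-specific fine-tuning and stress the importance of careful loss selection.

Abstract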

Better cross-lingual alignment is often assumed to yield better cross-lingual transfer. However, explicit alignment techniques – despite increasing embedding similarity – frequently fail to improve token-level downstream performance. In this work, we show that this mismatch arises because alignment and downstream task objectives are largely orthogonal, and because the downstream benefits from alignment vary substantially across languages and task types. We analyze four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS Tagging or Sentence Classification. Using representational analyses, including embedding distances, gradient similarities, and gradient magnitudes for both task and alignment losses, we find that: (1) embedding distances alone are unreliable predictors of improvements (or degradations) in task performance and (2) alignment and task gradients are often close to orthogonal, indicating that optimizing one objective may contribute little to optimizing the other. Taken together, our findings explain why “better” alignment often fails to translate into “better” cross-lingual transfer. Based on these insights, we provide practical guidelines for combining cross-lingual alignment with task-specific fine-tuning, highlighting the importance of careful loss selection.
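
One common way to quantify gradient similarity is the cosine between the two gradient vectors, where a value near zero indicates near-orthogonality. The sketch below assumes the gradients of the task loss and the alignment loss have already been flattened into plain lists; the numbers are invented, and the abstract does not say which similarity measure the authors use.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    return dot / (norm1 * norm2)


# Toy gradients (made-up numbers) that point in nearly orthogonal directions.
task_grad = [0.9, 0.1, 0.0, -0.2]
align_grad = [0.05, -0.4, 0.8, 0.1]

similarity = cosine_similarity(task_grad, align_grad)
```

A similarity this close to zero means a step along the alignment gradient barely changes the task loss to first order, which matches the abstract's claim that optimizing one objective may contribute little to the other.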

Keywords: cross-lingual alignment, cross-lingual transfer, XLM-R encoder, fine-tuning, embedding distances, gradient similarities, task performance, loss selection

122. ❌ A Human-in/on-the-Loop Framework for Accessible Text Generation

Authors: Lourdes Moreno, Paloma Martínez Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18879v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

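Scoring rationale: The paper explicitly uses LLMs for text generation, so it is highly relevant to 'Large Language Models' (10). It emphasizes explainability and ethical accountability, which has some connection to 'Explainable AI' (5). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract, and the paper focuses on a human-participation framework rather than underlying techniques, so they score 0.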

!!! tip deepseek-chat TL;DR

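The paper proposes a hybrid framework that integrates human participation (Human-in/on-the-Loop) into LLM-based accessible text generation, improving the traceability and inclusiveness of text generation through structured feedback and evaluation mechanisms.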

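Abstract translation

Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain highly automated and metric-driven, and fail to reflect user comprehension or normative standards. This paper proposes a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) mechanisms guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) standards-compliant checklists, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible text. In doing so, it treats explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.

Abstract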

Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain largely automated, metric-driven, and fail to reflect user comprehension or normative standards. This paper introduces a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) contributions guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts. In doing so, it integrates explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.
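
An Event-Condition-Action rule pairs an event type with a predicate over the event's data and an action to trigger when the predicate holds. The sketch below shows that general pattern only: the rule contents, thresholds, payload fields, and names such as `EcaRule` and `triggered_actions` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EcaRule:
    event: str                          # event type that wakes the rule
    condition: Callable[[dict], bool]   # predicate over the event payload
    action: str                         # label of the oversight action to trigger


# Hypothetical rules; thresholds and field names are illustrative only.
RULES = [
    EcaRule("draft_generated",
            lambda p: p["avg_sentence_length"] > 20,
            "request_expert_review"),
    EcaRule("draft_generated",
            lambda p: p["checklist_failures"] > 0,
            "return_to_generation"),
]


def triggered_actions(event, payload, rules=RULES):
    """Return the actions of all rules whose event matches and whose condition holds."""
    return [r.action for r in rules if r.event == event and r.condition(payload)]
```

For instance, a draft with an average sentence length of 25 words and no checklist failures would trigger only `request_expert_review` under these assumed rules. Keeping the rules as data also makes each escalation easy to log, which fits the abstract's emphasis on traceability and auditability.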

Keywords: Human-in-the-Loop, Human-on-the-Loop, LLM-based text generation, accessible text, text simplification, explainability, ethical accountability, NLP systems

Authors: Maria Milkova, Maksim Rudnev Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18822v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

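Scoring rationale: The paper mainly studies the use of an LLM (GPT) for text annotation and value detection, an application of large models in the social sciences, so it is highly relevant to 'Large Language Models' (8). It involves value detection and alignment with human judgment, which has some connection to 'Instruction Tuning OR Alignment OR Value Alignment' (5). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not mentioned in the abstract and are unrelated to the paper (0).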

!!! tip deepseek-chat TL;DR

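The paper proposes a multi-stage classification framework that uses LLM annotation and Transformer models to detect basic human values in Russian-language social media text. The best model reaches a macro F1 of 0.83 on held-out test data, and the analysis reveals patterns of value expression along with a systematic deviation from human judgments.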

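Abstract translation

This study proposes a multi-stage classification framework for detecting human values in noisy Russian-language social media data, validated on a random sample of 7.5 million public text posts. Building on Schwartz's theory of basic human values, we design a multi-stage pipeline comprising filtering of spam and non-personal content, targeted selection of value-relevant and politically relevant posts, LLM-based annotation, and multi-label classification. Particular attention is paid to verifying the quality of LLM annotations and model predictions against human experts. We treat expert annotations not as an absolute standard but as an interpretive benchmark with its own uncertainty. To handle annotation subjectivity, we aggregate multiple LLM-generated judgments into soft labels that reflect varying degrees of agreement, and use these labels to train Transformer-based models that predict the probability of each of the ten basic values. The best-performing model, XLM-RoBERTa-large, achieves a macro F1 of 0.83 and an F1 of 0.71 on held-out test data. By treating value detection as a multi-perspective interpretive task, in which expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we find that the model generally agrees with human judgments but systematically overestimates the Openness to Change value domain. The empirical analysis reveals distinct patterns of value expression and co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value-based interpretation in digital environments. All models are publicly released.

Abstract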

This study presents a multi-stage classification framework for detecting human values in noisy Russian-language social media, validated on a random sample of 7.5 million public text posts. Drawing on Schwartz’s theory of basic human values, we design a multi-stage pipeline that includes spam and non-personal content filtering, targeted selection of value-relevant and politically relevant posts, LLM-based annotation, and multi-label classification. Particular attention is given to verifying the quality of LLM annotations and model predictions against human experts. We treat human expert annotations not as ground truth but as an interpretative benchmark with its own uncertainty. To account for annotation subjectivity, we aggregate multiple LLM-generated judgments into soft labels that reflect varying levels of agreement. These labels are then used to train transformer-based models capable of predicting the probability of each of the ten basic values. The best-performing model, XLM-RoBERTa-large, achieves an F1 macro of 0.83 and an F1 of 0.71 on held-out test data. By treating value detection as a multi-perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with human judgments but systematically overestimates the Openness to Change value domain. Empirically, the study reveals distinct patterns of value expression and their co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value-based interpretation in digital environments. All models are released publicly.
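
One simple way to turn several judgments into soft labels is to take, for each value, the fraction of judgments that assigned it. The sketch below assumes that rule; the abstract does not state the aggregation formula, so this is an illustration and not the authors' procedure. Only three of Schwartz's ten values are listed for brevity, and the example judgments are invented.

```python
# Subset of Schwartz's ten basic values, for brevity.
VALUES = ["security", "benevolence", "achievement"]


def soft_labels(annotations, values=VALUES):
    """Fraction of judgments that assigned each value to a post.

    `annotations` is a list of label sets, one per LLM judgment (assumed format)."""
    n = len(annotations)
    return {v: sum(v in labels for labels in annotations) / n for v in values}


# Four invented judgments for a single post.
runs = [{"security"}, {"security", "benevolence"}, {"security"}, set()]
labels = soft_labels(runs)
```

Here `labels` is 0.75 for security, 0.25 for benevolence, and 0.0 for achievement. Training a multi-label classifier against such targets, rather than hard 0/1 labels, lets the disagreement between judgments carry through to the predicted probabilities.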

Keywords: human values detection, Russian social media, multi-stage classification, LLM annotation, transformer models, value alignment, Schwartz’s theory, XLM-RoBERTa

124. ❌ Automatic detection of Gen-AI texts: A comparative framework of neural models

Authors: Cristian Buttaro, Irene Amerini Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18750v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

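Scoring rationale: The paper's core topic is the detection of text generated by large language models, which bears directly on the 'Large Language Models' keyword, so it receives a relevance of 10. The paper does not address the technical principles or applications behind the other keywords, such as MoE, SLMs, training methods, inference optimization, or agent systems. These have no direct connection to its detector design or evaluation framework, so each is rated 0.

!!! tip deepseek-chat TL;DR

The paper studies automatic detection of LLM-generated text. By designing and comparing several neural detectors, it finds that supervised methods are more stable and robust than commercial tools across languages and domains.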

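Abstract translation

The rapid spread of large language models has made it significantly harder to distinguish human-written text from AI-generated text, raising critical issues in academic, publishing, and social domains. This paper studies AI-generated text detection by designing, implementing, and comparatively evaluating several machine-learning-based detectors. We develop and analyze four neural architectures: a multilayer perceptron, a one-dimensional convolutional neural network, a MobileNet-based convolutional neural network, and a Transformer model. The proposed models are benchmarked against widely used online detection tools, including ZeroGPT, GPTZero, QuillBot, Originality.AI, Sapling, IsGen, Rephrase, and Writer. Experiments are run on the COLING multilingual dataset, covering English and Italian configurations, and on an original thematic dataset focused on art and mental health. The results show that supervised detectors achieve more stable and robust performance than commercial tools across languages and domains, highlighting the main strengths and limitations of current detection strategies.

Abstract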

The rapid proliferation of Large Language Models has significantly increased the difficulty of distinguishing between human-written and AI generated texts, raising critical issues across academic, editorial, and social domains. This paper investigates the problem of AI generated text detection through the design, implementation, and comparative evaluation of multiple machine learning based detectors. Four neural architectures are developed and analyzed: a Multilayer Perceptron, a one-dimensional Convolutional Neural Network, a MobileNet-based CNN, and a Transformer model. The proposed models are benchmarked against widely used online detectors, including ZeroGPT, GPTZero, QuillBot, Originality.AI, Sapling, IsGen, Rephrase, and Writer. Experiments are conducted on the COLING Multilingual Dataset, considering both English and Italian configurations, as well as on an original thematic dataset focused on Art and Mental Health. Results show that supervised detectors achieve more stable and robust performance than commercial tools across different languages and domains, highlighting key strengths and limitations of current detection strategies.

Keywords: Large Language Models, AI generated text detection, neural models, comparative evaluation, supervised detectors, multilingual dataset, transformer model, machine learning

125. ❌ Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks

Authors: Rudra Jadhav, Janhavi Danve, Sonalika Shaw Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18765v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

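Scoring rationale: The paper's core topic is fairness and bias when LLMs act as automated graders in educational settings, which bears directly on the 'Large Language Models' keyword (relevance 10). The other keywords, such as MoE, SLMs, Scaling Laws, the various training methods, reasoning techniques, compression methods, and AI for Science, are not covered or discussed in the paper and are therefore rated 0.

!!! tip deepseek-chat TL;DR

The study shows that large language models used as automated graders, even when explicitly told to assess content correctness only, exhibit significant grading bias on Essay/Writing tasks driven by writing style (grammar errors, informal language, non-native phrasing), whereas bias on math and programming tasks is small.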

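Abstract translation

As large language models (LLMs) are increasingly deployed as automated grading tools in education, the fairness and bias of their evaluations have become critical. This study examines whether LLMs show implicit grading bias based on writing style when the underlying content correctness is held constant. We built a controlled dataset of 180 student responses across three subjects (mathematics, programming, and essay/writing), each with three types of surface-level perturbation: grammar errors, informal language, and non-native phrasing. Two advanced open-source LLMs, LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba), were prompted to grade the responses on a 1-10 scale, with explicit instructions to assess content correctness only and to ignore writing style. Our results show statistically significant grading bias (p < 0.05) on essay/writing tasks for both models and all perturbation types, with effect sizes ranging from medium (Cohen’s d = 0.64) to very large (d = 4.25). Informal language was penalized most heavily: on the 10-point scale, LLaMA deducted 1.90 points on average and Qwen 1.20 points, a penalty comparable to the gap between a B+ and a C+ letter grade. Non-native phrasing was penalized by 1.35 and 0.90 points, respectively. In sharp contrast, mathematics and programming tasks showed minimal bias, with most conditions not reaching statistical significance. These findings indicate that LLM grading bias is subject-dependent and style-sensitive, and that it persists even when the grading prompt contains explicit counter-bias instructions. We discuss the implications for equitable deployment of LLM-based grading systems and recommend bias-auditing protocols before institutional adoption.

Abstract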

As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs – LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) – were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen’s d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale – penalties comparable to the difference between a B+ and C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.

Keywords: Large Language Models, automated grading, grading bias, writing style, educational assessment, fairness, LLaMA, Qwen
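
The effect sizes quoted in the abstract are Cohen's d values. As a minimal sketch of how such a value is obtained, the snippet below uses the independent-samples form with a pooled standard deviation; this is an assumption, since the abstract does not say which variant the authors used (a paired variant is also plausible for perturbed copies of the same responses). The grade lists are invented for illustration and are not data from the paper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical grades on a 1-10 scale: clean essays vs. the same essays
# rewritten in informal language. Not data from the paper.
clean = [8, 9, 7, 8, 8]
informal = [6, 7, 6, 7, 6]

d = cohens_d(clean, informal)
print(f"mean penalty = {mean(clean) - mean(informal):.2f}, d = {d:.2f}")
```

By the usual rule of thumb, values around 0.5 are read as medium effects and values above 0.8 as large, which is why the reported range of 0.64 to 4.25 spans "medium" to "very large".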

126. ❌ STEP: Scientific Time-Series Encoder Pretraining via Cross-Domain Distillation

Authors: Chen Zhang, Liwei Liu, Jun Tao, Xiaoyu Yang, Xuenan Xu, Kai Chen, Bowen Zhou, Wen Wu, Chao Zhang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18688v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

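Scoring rationale: The paper studies representation learning for scientific time series and proposes the STEP framework, which integrates knowledge from multiple foundation models through cross-domain distillation. Core relevant keywords: 1) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10): the paper is centered on a pretraining framework and involves cross-domain knowledge transfer and adaptation; 2) 'AI for Science OR Bioinformatics OR Cheminformatics' (10): the paper clearly belongs to scientific AI and targets scientific time-series tasks; 3) 'Large Language Models OR LLMs OR Foundation Models' (5): the paper involves knowledge transfer from foundation models, but is not specific to LLMs. The other keywords, such as MoE, SFT, RAG, and reasoning methods, have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

To address the sparsity and heterogeneity of scientific time-series data, the paper proposes the STEP pretraining framework, which integrates knowledge from multiple foundation models via cross-domain distillation. Experiments on seven scientific time-series tasks show that it learns general-purpose, transferable feature representations.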

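Abstract translation

Scientific time series are central to scientific AI, but they are typically sparse, highly heterogeneous, and limited in scale, which makes unified representation learning especially difficult. Meanwhile, foundation models pretrained on related temporal domains such as audio, general time series, and brain signals hold rich knowledge, yet their applicability to scientific signals remains underexplored. This paper studies the transferability and complementarity of foundation models from related temporal domains and examines how to use them effectively to build a unified encoder for scientific time series. We first systematically evaluate the relevant foundation models, demonstrating that knowledge transfers effectively to scientific tasks and that the models have complementary strengths. Building on this observation, we propose STEP, a Scientific Time-Series Encoder Pretraining framework based on cross-domain distillation. STEP introduces adaptive patching to handle extremely long sequences and a statistics compensation scheme to accommodate different numerical scales. It further uses cross-domain distillation to integrate knowledge from multiple foundation models into a single unified encoder. By fusing complementary representations from different domains, STEP learns general-purpose, transferable features tailored to scientific signals. Experiments on seven scientific time-series tasks show that STEP provides both an effective architecture and an effective pretraining paradigm, marking an important step for scientific time-series representation learning.

Abstract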

Scientific time series are central to scientific AI but are typically sparse, highly heterogeneous, and limited in scale, making unified representation learning particularly challenging. Meanwhile, foundation models pretrained on relevant time series domains such as audio, general time series, and brain signals contain rich knowledge, but their applicability to scientific signals remains underexplored. In this paper, we investigate the transferability and complementarity of foundation models from relevant time series domains, and study how to effectively leverage them to build a unified encoder for scientific time series. We first systematically evaluate relevant foundation models, showing the effectiveness of knowledge transfer to scientific tasks and their complementary strengths. Based on this observation, we propose STEP, a Scientific Time Series Encoder Pretraining framework via cross domain distillation. STEP introduces adaptive patching to handle extreme-length sequences and a statistics compensation scheme to accommodate diverse numerical scales. It further leverages cross-domain distillation to integrate knowledge from multiple foundation models into a unified encoder. By combining complementary representations across different domains, STEP learns general-purpose and transferable features tailored for scientific signals. Experiments on seven scientific time series tasks demonstrate that STEP provides both an effective structure and an effective pretraining paradigm, taking a STEP toward scientific time series representation learning.

Keywords: scientific time series, foundation models, cross-domain distillation, pretraining, representation learning, knowledge transfer, adaptive patching, unified encoder
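
The abstract names two preprocessing ideas, adaptive patching and statistics compensation, without giving details. The sketch below is one plausible reading and not the STEP implementation: it assumes that "adaptive" means the patch length grows with the sequence so the number of patches stays bounded, and that "statistics compensation" means keeping each series' mean and standard deviation alongside the normalized values. The function names and the `max_patches` and `min_patch_len` parameters are hypothetical.

```python
from math import ceil
from statistics import mean, pstdev


def adaptive_patches(series, max_patches=4, min_patch_len=2):
    """Split a series into at most max_patches chunks by scaling the patch length."""
    patch_len = max(min_patch_len, ceil(len(series) / max_patches))
    return [series[i:i + patch_len] for i in range(0, len(series), patch_len)]


def normalize_with_stats(series):
    """Z-normalize a series and return the removed statistics for later use."""
    mu = mean(series)
    sigma = pstdev(series) or 1.0  # guard against constant series
    return [(x - mu) / sigma for x in series], (mu, sigma)


short = adaptive_patches(list(range(10)))        # patch length 3
long_ = adaptive_patches(list(range(1000)))      # patch length 250
normed, stats = normalize_with_stats([2.0, 4.0, 6.0])
print(len(short), len(long_), stats)
```

Under these assumptions a 10-sample and a 1000-sample signal both produce four patches, so an encoder sees a bounded number of tokens regardless of length, while the stored `(mu, sigma)` pair preserves scale information that normalization would otherwise discard.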

127. ❌ Words at Play: Benchmarking Audio Pun Understanding in Large Audio-Language Models

Authors: Yuchen Su, Shaoxin Zhong, Yonghua Zhu, Ruofan Wang, Zijian Huang, Qiqi Wang, Na Zhao, Diana Benavides-Prado, Michael Witbrock Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18678v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

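Scoring rationale: The paper focuses on evaluating large audio language models (LALMs) on audio pun understanding, an application study of large models in a specific domain (audio-language understanding). It is therefore highly relevant only to the keyword 'Large Language Models OR LLMs OR Foundation Models' (rated 8), since LALMs extend LLMs to the audio modality and the paper's core is the evaluation of such models. The other keywords mainly concern specific large-model techniques (such as MoE, quantization, and inference acceleration), training methods (such as SFT and RLHF), or particular application areas (such as AI for Science). The paper touches on none of these, so each is rated 0.

!!! tip deepseek-chat TL;DR

The paper introduces APUN-Bench, the first benchmark for evaluating large audio language models on audio pun understanding. A systematic evaluation of 10 advanced models reveals substantial performance gaps and key challenges in recognizing, locating, and interpreting audio puns.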

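Abstract translation

Puns are a typical linguistic phenomenon that exploits polysemy and phonetic ambiguity to create humor, posing a unique challenge for natural language understanding. In pun research, audio is a core carrier of human communication alongside text and images, yet research resources for it are severely lacking: datasets and systematic resources for spoken puns remain scarce, leaving this key modality largely unexplored. This paper presents APUN-Bench, the first benchmark dedicated to evaluating Large Audio Language Models (LALMs) on audio pun understanding. The benchmark contains 4,434 audio samples annotated for tasks at three levels: pun recognition, pun word location, and pun meaning inference. We analyze APUN-Bench in depth by systematically evaluating 10 state-of-the-art LALMs, revealing substantial performance gaps in recognizing, locating, and interpreting audio puns. The analysis identifies several key challenges, such as positional bias in audio pun word location and error cases in meaning inference, and offers actionable insights for advancing humor-aware audio intelligence.

Abstract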

Puns represent a typical linguistic phenomenon that exploits polysemy and phonetic ambiguity to generate humour, posing unique challenges for natural language understanding. Within pun research, audio plays a central role in human communication alongside text and images, yet datasets and systematic resources for spoken puns remain scarce, leaving this crucial modality largely underexplored. In this paper, we present APUN-Bench, the first benchmark dedicated to evaluating large audio language models (LALMs) on audio pun understanding. Our benchmark contains 4,434 audio samples annotated across three stages: pun recognition, pun word location and pun meaning inference. We conduct a deep analysis of APUN-Bench by systematically evaluating 10 state-of-the-art LALMs, uncovering substantial performance gaps in recognizing, localizing, and interpreting audio puns. This analysis reveals key challenges, such as positional biases in audio pun location and error cases in meaning inference, offering actionable insights for advancing humour-aware audio intelligence.

Keywords: audio pun understanding, large audio language models, benchmark evaluation, humour-aware audio intelligence, natural language understanding, multimodal AI, audio-language models

128. ❌ A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems

Authors: Aram Abrahamyan, Sachin Kumar Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18641v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

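Scoring rationale: The paper studies catastrophic forgetting mitigation in continual learning, mainly with conventional neural architectures (ANN, GRU, Transformer) and continual learning strategies (MIR, LwF, HAT), and is unrelated to most keywords. It has some connection only to 'Post-training OR Supervised Fine-tuning OR SFT' (5), because it uses sequential fine-tuning as a baseline, though this is not its core contribution. Other keywords such as large models, MoE, quantization, and RAG are not covered.

!!! tip deepseek-chat TL;DR

By comparing three neural architectures and several continual learning strategies on sequential task adaptation, the paper finds that replay is key to mitigating catastrophic forgetting and that the best configuration depends on the architecture.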

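Abstract translation

Neural language models deployed in real-world applications must keep adapting to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task scenario with disjoint label spaces and evaluate three backbone architectures under a range of continual learning strategies: a feed-forward artificial neural network (ANN), a gated recurrent unit (GRU), and a Transformer encoder. We select one representative method from each major continual learning family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter isolation via Hard Attention to Task (HAT), and test them individually, in pairs, and as a triple combination. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off over the task sequence. The results show that naive sequential fine-tuning suffers severe forgetting on all architectures and that no single continual learning method fully prevents it. Replay proves to be a key ingredient: MIR is the most reliable standalone strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently reach high final performance, with backward transfer near zero or mildly positive. The best configuration varies by architecture: MIR+HAT works best for the ANN and the Transformer, while MIR+LwF+HAT works best for the GRU. In several cases, continual learning methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly choosing the backbone architecture and the continual learning mechanism when designing continual intent-classification systems.

Abstract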

Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent: MIR+HAT yields the best result for ANN and Transformer, whereas MIR+LwF+HAT works best for GRU. In several cases CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.

Keywords: catastrophic forgetting, continual learning, sequential task adaptation, replay methods, MIR, LwF, HAT, intent classification
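
Average accuracy and backward transfer, two of the metrics named in the abstract, can be computed from a matrix of per-task accuracies. The sketch below assumes the definitions commonly used in the continual learning literature, where `R[t][i]` is the accuracy on task `i` after training on tasks `0..t`; the paper may define them slightly differently. The three-task matrix is invented for illustration.

```python
def average_accuracy(R):
    """Mean accuracy over all tasks after training on the final task."""
    T = len(R)
    return sum(R[T - 1][i] for i in range(T)) / T


def backward_transfer(R):
    """Mean change in accuracy on earlier tasks between learning them and the end.

    Negative values indicate forgetting; values near zero indicate stability.
    """
    T = len(R)
    if T < 2:
        raise ValueError("backward transfer needs at least two tasks")
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


# Hypothetical accuracies: row t is the model after task t, column i is task i.
R = [
    [0.90, 0.00, 0.00],
    [0.70, 0.88, 0.00],
    [0.60, 0.80, 0.92],
]

print(f"ACC = {average_accuracy(R):.3f}, BWT = {backward_transfer(R):+.3f}")
```

In this toy matrix the first task drops from 0.90 to 0.60 and the second from 0.88 to 0.80, giving a clearly negative backward transfer. The "near-zero or mildly positive" values reported for the replay-based combinations would correspond to rows in which earlier-task columns stay flat or rise slightly.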

129. ❌ MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment

Authors: Yipu Dou, Wang Yang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18637v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

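Scoring rationale: The paper's core topic is how to allocate a supervised fine-tuning (SFT) budget in LLM alignment. It proposes the MOSAIC framework to balance multiple alignment objectives (safety alignment, low over-refusal, instruction following). It is therefore highly relevant to 'Supervised Fine-tuning' and 'Alignment' (10 each) and to 'Large Language Models' (10). The paper compares against a LoRA baseline, so it has some connection to 'LoRA' (5). Other keywords such as MoE, Scaling Laws, and RLHF are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies how to balance multiple alignment objectives for large language models under a fixed supervised fine-tuning budget. It proposes the MOSAIC framework, which uses structured failure diagnosis to optimize data construction and achieves notable improvements in safety alignment, over-refusal, and instruction following.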

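Abstract translation

This study examines how, under a fixed supervised fine-tuning budget, to balance three objectives at once: safety alignment in multi-turn dialogue, low over-refusal on benign boundary queries, and instruction following under verifiable constraints. We propose MOSAIC (Multi-Objective Slice-Aware Iterative Curation for Alignment), a multi-objective framework for closed-loop data mixture search built on a unified L1-L3 evaluation interface. MOSAIC converts slice-level failure modes into executable data actions, including dataset-level mixture ratios, bucket-level weights, and focus criteria. With a fixed budget of 1M tokens and five rounds of independent fine-tuning from the same base model, MOSAIC raises the internal XGuard score from 2.76 to 4.67 while holding the OrBench score at 4.41 and the IFEval score at 3.65. The final Pareto solution also outperforms a random static LoRA baseline on independent attack, over-refusal, and capability tests, suggesting that structured failure diagnosis can serve as an effective control signal for budget-constrained data construction. Code is available at https://github.com/douyipu/mosaic.

Abstract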

We study how to allocate a fixed supervised fine-tuning budget when three objectives must be balanced at once: multi-turn safety alignment, low over-refusal on benign boundary queries, and instruction following under verifiable constraints. We propose MOSAIC (Multi-Objective Slice-Aware Iterative Curation for Alignment), a multi-objective framework for closed-loop data mixture search built on a unified L1-L3 evaluation interface. MOSAIC turns slice-level failure profiles into executable data actions, including dataset-level mixture ratios, bucket-level weights, and focus criteria. Under a fixed 1M-token budget and five rounds of independent fine-tuning from the same base model, MOSAIC improves internal XGuard from 2.76 to 4.67 while keeping OrBench at 4.41 and IFEval at 3.65. The final Pareto solution also generalizes better than a random static LoRA baseline on independent attack, over-refusal, and capability tests, suggesting that structured failure diagnosis can serve as a practical control signal for budgeted data construction. Code is available at https://github.com/douyipu/mosaic.

Keywords: alignment, supervised fine-tuning, multi-objective optimization, data mixture, safety alignment, over-refusal, instruction following, LoRA
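
The abstract refers to a "final Pareto solution" across three objectives. A minimal sketch of Pareto filtering is shown below; it is not MOSAIC's search procedure. It assumes that all three scores are "higher is better", and the per-round score tuples are hypothetical rather than results from the paper.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical (safety, over-refusal, instruction-following) scores per round.
candidates = {
    "round_1": (2.8, 4.5, 3.6),
    "round_2": (3.9, 4.4, 3.7),
    "round_3": (3.5, 4.2, 3.5),
    "round_4": (4.6, 4.3, 3.6),
}

front = pareto_front(candidates)
print(front)
```

Here `round_3` is dropped because `round_2` is at least as good on every objective, while the remaining rounds each win on at least one axis. Choosing a single "final" point from such a front still requires a preference over the objectives, which the sketch leaves open.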

130. ❌ DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

Authors: Maxime Poli, Manel Khentout, Angelo Ortiz Tandazo, Ewan Dunbar, Emmanuel Chemla, Emmanuel Dupoux Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18612v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

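Scoring rationale: DiscoPhon focuses on speech processing, specifically unsupervised phoneme discovery, and has no direct connection to most large-model keywords (such as LLM, MoE, RLHF, and RAG). It has some connection to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5), because the paper uses pretrained multilingual HuBERT and SpidR models as baselines. It also has some connection to 'AI for Science OR Bioinformatics OR Cheminformatics' (5), since the work applies AI to a scientific field (phonetics), though not to core bioinformatics or cheminformatics. All other keywords are unrelated (0).

!!! tip deepseek-chat TL;DR

The paper introduces the DiscoPhon benchmark for evaluating unsupervised phoneme discovery from discrete speech units, and shows that the phonemic information in current pretrained models is sufficient for derived units to correlate well with phonemes, though with cross-language variation.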

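Abstract translation

We present DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 development languages and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in an unseen language, a system must map the discrete units it produces to a predefined phoneme inventory through either a many-to-one or a one-to-one mapping. The resulting sequences are evaluated along three dimensions: unit quality, recognition accuracy, and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines. Experiments show that current models capture enough phonemic information for their derived units to correlate well with phonemes, although the correlation varies across languages.

Abstract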

We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that phonemic information is available enough in current models for derived units to correlate well with phonemes, though with variations across languages.

Keywords: unsupervised phoneme discovery, discrete speech units, multilingual benchmark, HuBERT, SpidR, phonemic contrasts, speech processing, AI for science
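
The many-to-one assignment mentioned in the abstract can be sketched as follows. The snippet assumes frame-aligned sequences of discrete unit IDs and reference phoneme labels, and maps each unit to the phoneme it co-occurs with most often; this is a common way to build such a mapping, but the benchmark's actual procedure (and its one-to-one variant) is not specified in the abstract. The unit IDs and phoneme labels are made up.

```python
from collections import Counter, defaultdict


def many_to_one_map(units, phonemes):
    """Map each discrete unit to its most frequent co-occurring phoneme."""
    cooccurrence = defaultdict(Counter)
    for unit, phoneme in zip(units, phonemes):
        cooccurrence[unit][phoneme] += 1
    return {u: counts.most_common(1)[0][0] for u, counts in cooccurrence.items()}


# Hypothetical frame-level data: unit IDs from a quantizer, aligned phonemes.
units = [3, 3, 7, 7, 7, 1, 1, 3]
phonemes = ["a", "a", "t", "t", "a", "k", "k", "a"]

mapping = many_to_one_map(units, phonemes)
predicted = [mapping[u] for u in units]
accuracy = sum(p == g for p, g in zip(predicted, phonemes)) / len(phonemes)
print(mapping, accuracy)
```

Several units may map to the same phoneme under this scheme, which is what distinguishes it from a one-to-one assignment, where each phoneme can be claimed by at most one unit.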

Authors: Thi Huyen Nguyen, Koustav Rudra, Wolfgang Nejdl Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18611v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

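Scoring rationale: The paper studies interpretability methods for humanitarian classification on social media and proposes a cross-modal rationale transfer framework. Its core is a visual language Transformer model and explainable AI techniques, which makes it highly relevant to 'Mechanistic Interpretability OR Explainable AI' (8), because it focuses on extracting text and image rationales to make classification decisions more transparent. The other keywords mainly concern large-model techniques, training methods, inference optimization, and agent systems. Although the paper uses a Transformer model, its focus is interpretability rather than large-model technology itself, so all other keywords are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes an interpretable multimodal classification framework based on cross-modal rationale transfer for humanitarian classification on social media. By mapping text rationales to image rationales, it improves classification performance and reduces annotation cost, achieving a 2-35% Macro-F1 improvement on the CrisisMMD dataset and 80% accuracy on new, unseen datasets in zero-shot mode.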

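Abstract translation

Advances in the dissemination of social media data make it possible to provide real-time information during a crisis. This information falls into different categories, such as infrastructure damage or people missing or stranded in the affected area. Existing methods attempt to classify text and images into various humanitarian categories, but their decision processes remain largely opaque, which hinders deployment in real-world applications. Recent work has tried to improve transparency by extracting textual rationales from tweets to explain the predicted class. However, such explainable classification methods have focused mainly on text rather than on crisis-related images. This paper proposes an interpretable-by-design multimodal classification framework. Our method first uses a visual language Transformer model to learn a joint representation of text and image and extracts text rationales. It then extracts image rationales through a mapping from the text rationales. Our approach shows how rationales in one modality can be learned from another through cross-modal rationale transfer, which saves annotation cost. Finally, tweets are classified based on the extracted rationales. In experiments on the CrisisMMD benchmark dataset, the proposed method improves classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation supports this conclusion: the method retrieves better image rationale patches (a 12% improvement), which help identify humanitarian categories. The method adapts well to new, unseen datasets in zero-shot mode, reaching 80% accuracy.

Abstract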

Advances in social media data dissemination enable the provision of real-time information during a crisis. The information comes from different classes, such as infrastructure damages, persons missing or stranded in the affected zone, etc. Existing methods attempted to classify text and images into various humanitarian categories, but their decision-making process remains largely opaque, which affects their deployment in real-life applications. Recent work has sought to improve transparency by extracting textual rationales from tweets to explain predicted classes. However, such explainable classification methods have mostly focused on text, rather than crisis-related images. In this paper, we propose an interpretable-by-design multimodal classification framework. Our method first learns the joint representation of text and image using a visual language transformer model and extracts text rationales. Next, it extracts the image rationales via the mapping with text rationales. Our approach demonstrates how to learn rationales in one modality from another through cross-modal rationale transfer, which saves annotation effort. Finally, tweets are classified based on extracted rationales. Experiments are conducted over CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation also supports the claim that our proposed method is able to retrieve better image rationale patches (12%) that help to identify humanitarian classes. Our method adapts well to new, unseen datasets in zero-shot mode, achieving an accuracy of 80%.

Keywords: explainable AI, multimodal classification, cross-modal rationale transfer, humanitarian classification, visual language transformer, social media analysis, crisis management, zero-shot adaptation

132. ❌ Language Model Maps for Prompt-Response Distributions via Log-Likelihood Vectors

Authors: Yusuke Takase, Momose Oyama, Hidetoshi Shimodaira Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18593v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

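Scoring rationale: The paper proposes representing language models by log-likelihood vectors and building model maps to compare the conditional distributions of different language models. Its core concern is methods for representing, comparing, and analyzing language models, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), since it explicitly studies a collection of language models. It has some relevance to "Mechanistic Interpretability OR Explainable AI" (8 points), because its analysis framework supports understanding model behavior, which falls under interpretability. Other keywords such as MoE, SFT, RAG, reasoning methods, compression techniques, and scientific applications are not mentioned in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes representing language models by log-likelihood vectors and building model maps to compare and analyze their conditional distributions. It shows that the approach captures model attributes, task performance, and systematic shifts caused by prompt modifications.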

Abstract

We propose a method that represents language models by log-likelihood vectors over prompt-response pairs and constructs model maps for comparing their conditional distributions. In this space, distances between models approximate the KL divergence between the corresponding conditional distributions. Experiments on a large collection of publicly available language models show that the maps capture meaningful global structure, including relationships to model attributes and task performance. The method also captures systematic shifts induced by prompt modifications and their approximate additive compositionality, suggesting a way to analyze and predict the effects of composite prompt operations. We further introduce pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions; in some cases, PMI-based model maps better reflect training-data-related differences. Overall, the framework supports the analysis of input-dependent model behavior.

Keywords: language models, log-likelihood vectors, model maps, conditional distributions, KL divergence, prompt-response pairs, pointwise mutual information, input-dependent behavior
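
The representation described in the abstract above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the toy log-likelihood values, the function names `model_map_distances` and `pmi_vectors`, and the use of unnormalized squared Euclidean distance are all assumptions made for the example. Each row holds one model's log-likelihoods over the same prompt-response pairs, and the PMI variant subtracts the unconditional log-likelihood of each response.

```python
import numpy as np

def model_map_distances(vectors):
    """Pairwise squared Euclidean distances between per-model vectors.

    vectors: array of shape (n_models, n_pairs); entry [i, s] is a
    log-likelihood (or PMI) value of model i on prompt-response pair s.
    """
    diffs = vectors[:, None, :] - vectors[None, :, :]
    return (diffs ** 2).sum(axis=-1)

def pmi_vectors(cond_ll, uncond_ll):
    """PMI(x, y) = log p(y | x) - log p(y), computed element-wise."""
    return cond_ll - uncond_ll

# Toy numbers (assumed, not from the paper): 3 models, 3 prompt-response pairs.
cond_ll = np.array([[-1.0, -2.0, -3.0],
                    [-1.5, -2.5, -2.0],
                    [-4.0, -1.0, -3.5]])
uncond_ll = np.array([[-2.0, -2.5, -3.0],
                      [-2.0, -3.0, -2.5],
                      [-4.5, -2.0, -4.0]])

dist = model_map_distances(cond_ll)   # map built from conditional log-likelihoods
pmi = pmi_vectors(cond_ll, uncond_ll)
pmi_dist = model_map_distances(pmi)   # map built from PMI vectors
```

In practice the rows would come from scoring a shared set of prompt-response pairs with each model. The abstract states that distances in this space approximate KL divergence; the exact scaling and any centering used in the paper are not reproduced here.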

133. ❌ ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

Authors: Abhinaba Basu, Pavan Chakraborty Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18579v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

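Scoring rationale: The paper's core topic is evaluating the faithfulness of LLM explanations. It is highly relevant to "Large Language Models" (10 points) because it directly evaluates 7 LLMs, and highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points) because it focuses on evaluating explanation faithfulness, a central problem in explainable AI. Other keywords such as MoE, SFT, RAG, and quantization are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the ICE framework, which evaluates the faithfulness of LLM explanations using multiple intervention operators and randomization tests. It finds that faithfulness is operator-dependent and essentially uncorrelated with human plausibility, and it reveals model-language interaction effects in multilingual evaluation.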

Abstract

Evaluating whether explanations faithfully reflect a model’s reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points, with deletion typically inflating estimates on short text but the pattern reversing on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and ICEBench benchmark.

Keywords: LLM explanation faithfulness, intervention operators, randomization tests, multilingual evaluation, ICE framework, attribution methods, model-language interactions
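
The idea in the abstract above, comparing an explanation against size-matched random baselines under more than one intervention operator and reporting a win rate with a confidence interval, can be sketched as follows. This is a hedged toy version, not the ICE codebase: `toy_score` stands in for a real model, and `delete_op`, `mask_op`, `win_rate`, and `bootstrap_ci` are hypothetical names chosen for the example. Ties are counted as losses here, which is an assumption.

```python
import random

def delete_op(tokens, idx):
    """Intervention operator: remove the tokens at the given positions."""
    return [t for i, t in enumerate(tokens) if i not in idx]

def mask_op(tokens, idx):
    """Intervention operator: replace the tokens at the given positions."""
    return [("[MASK]" if i in idx else t) for i, t in enumerate(tokens)]

def score_drop(score_fn, tokens, idx, op):
    """How much the model score falls after intervening on positions idx."""
    return score_fn(tokens) - score_fn(op(tokens, idx))

def win_rate(score_fn, tokens, attributed_idx, op, n_random=500, seed=0):
    """Fraction of size-matched random token sets that the explanation beats."""
    rng = random.Random(seed)
    k = len(attributed_idx)
    observed = score_drop(score_fn, tokens, set(attributed_idx), op)
    wins = 0
    for _ in range(n_random):
        baseline = set(rng.sample(range(len(tokens)), k))
        if observed > score_drop(score_fn, tokens, baseline, op):
            wins += 1
    return wins / n_random

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

def toy_score(tokens):
    """Toy stand-in for a model score: counts one sentiment word."""
    return sum(t == "great" for t in tokens)

tokens = "the movie was great and truly great fun".split()
good_expl = [3, 6]  # positions of "great"
bad_expl = [0, 1]   # positions of "the" and "movie"

rates = {name: win_rate(toy_score, tokens, good_expl, op)
         for name, op in [("delete", delete_op), ("mask", mask_op)]}
bad_rate = win_rate(toy_score, tokens, bad_expl, delete_op)
ci_low, ci_high = bootstrap_ci([0.9, 1.0, 0.95, 0.85])  # per-example win rates (made up)
```

A win rate well above 0.5 suggests the attributed tokens matter more than random ones under that operator, while a rate below 0.5 corresponds to what the abstract calls anti-faithfulness. Reporting the rate per operator, rather than averaging, mirrors the paper's point that faithfulness is operator-dependent.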

134. ❌ SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding

Authors: Shenggui Li, Chao Wang, Yikai Zhu, Yubo Wang, Fan Yin, Shuai Shi, Yefei Chen, Xiaomin Dong, Qiaoling Chen, Jin Pan, Ji Li, Laixin Xie, Yineng Zhang, Lei Yu, Yonggang Wen, Ivor Tsang, Tianwei Zhang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18567v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	15.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

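Scoring rationale: The paper's core topic is speculative decoding, a key technique for accelerating LLM inference, so it is highly relevant to "Speculative Decoding OR Inference Acceleration" (15 points). It concerns LLM inference optimization and is relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). Other keywords such as MoE, SLMs, training methods, alignment, RAG, inference techniques (other than speculative decoding), agents, quantization, and AI for science are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address the high inference latency of large language models, the paper proposes SpecForge, a framework for efficiently training speculative decoding models. It reports up to 9.9x faster training and up to 4.48x end-to-end inference speedup.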

Abstract

Large language models incur high inference latency due to sequential autoregressive decoding. Speculative decoding alleviates this bottleneck by using a lightweight draft model to propose multiple tokens for batched verification. However, its adoption has been limited by the lack of high-quality draft models and scalable training infrastructure. We introduce SpecForge, an open-source, production-oriented framework for training speculative decoding models with full support for EAGLE-3. SpecForge incorporates target-draft decoupling, hybrid parallelism, optimized training kernels, and integration with production-grade inference engines, enabling up to 9.9x faster EAGLE-3 training for Qwen3-235B-A22B. In addition, we release SpecBundle, a suite of production-grade EAGLE-3 draft models trained with SpecForge for mainstream open-source LLMs. Through a systematic study of speculative decoding training recipes, SpecBundle addresses the scarcity of high-quality drafts in the community, and our draft models achieve up to 4.48x end-to-end inference speedup on SGLang, establishing SpecForge as a practical foundation for real-world speculative decoding deployment.

Keywords: speculative decoding, inference acceleration, large language models, training framework, EAGLE-3, draft models, open-source, inference latency
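
The draft-then-verify loop that the abstract above summarizes can be illustrated with a small greedy variant. This sketch makes several assumptions: `target_next` and `draft_next` are toy next-token functions rather than real models, verification uses exact greedy matching instead of probabilistic acceptance, and nothing here reflects the SpecForge API or EAGLE-3's drafting scheme. It only shows why the output matches plain greedy decoding while needing fewer target-model verification rounds.

```python
def greedy_decode(target_next, prompt, n_new):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(out) < goal:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        draft = []
        for _ in range(min(k, goal - len(out))):
            draft.append(draft_next(out + draft))
        # 2) The target checks every proposed position. A real engine does
        #    this in a single batched forward pass, counted here as one round.
        rounds += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # fix the first mismatch, drop the rest
                break
        else:
            # Every draft token was accepted: the target contributes one more.
            if len(out) + len(accepted) < goal:
                accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out, rounds

# Toy models (assumed): the target counts upward modulo 10; the draft agrees
# except that it errs whenever the last token is 2.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10

baseline = greedy_decode(target_next, [0], 8)
spec_out, spec_rounds = speculative_decode(target_next, draft_next, [0], 8, k=4)
```

The speedup in a real system depends on how often the draft agrees with the target, which is why the abstract stresses the quality of draft models.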

135. ❌ Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition

Authors: Ivaxi Sheth, Zeno Jonke, Amin Mantrach, Saab Mansour Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18557v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

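Scoring rationale: The paper's core topic is automated evaluation of large language models (LLMs) in multilingual settings, so it is highly relevant to "Large Language Models" (10 points). It proposes a decomposition-based evaluation framework built on a Universal Criteria Set (UCS), which involves evaluation dimensions and interpretability, giving it some relevance to "Mechanistic Interpretability OR Explainable AI" (5 points). Its experiments cover multiple faithfulness tasks, giving it some relevance to "Hallucination Mitigation OR Factuality OR Truthfulness" (5 points). Other keywords such as MoE, SLMs, training methods, inference acceleration, and AI for Science are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the challenge of automated evaluation for large language models in multilingual applications. It proposes a decomposition-based evaluation framework built on a Universal Criteria Set that enables effective cross-lingual transfer of evaluation, and it outperforms baselines on multilingual faithfulness tasks.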

Abstract

As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.

Keywords: large language models, cross-lingual evaluation, automated evaluation, faithfulness tasks, evaluation framework, Universal Criteria Set, interpretable representation, minimal supervision

136. ❌ Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning

Authors: Yinan Xia, Haotian Zhang, Huiming Wang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18533v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

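Scoring rationale: The paper studies reasoning optimization for Large Reasoning Models (LRMs). It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points) because LRMs are a kind of large model. Its core method, DDPO, is a reinforcement-learning optimization algorithm, so it is highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10 points). The paper targets overthinking and overconfidence during reasoning, making it highly relevant to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" (10 points) and relevant to "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (8 points). "Self-Correction OR Self-Improvement OR Self-Reflection" (5 points) and "Hallucination Mitigation OR Factuality OR Truthfulness" (5 points) are somewhat relevant because the paper involves model self-optimization and error reduction. Other keywords such as MoE, SLMs, Scaling Laws, and Pre-training are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper targets overthinking and overconfidence in Large Reasoning Models (LRMs) and proposes DDPO, a difficulty-differentiated policy optimization algorithm. By optimizing simple and complex tasks separately, it reduces answer length while improving accuracy, achieving a better balance between efficiency and performance.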

Abstract

Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model’s capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to closely approximate the optimal length and be as concentrated as possible. Based on these conditions, we propose using the difficulty-level average as a well-founded reference for length optimization. Extensive experiments on both in-domain and out-of-domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length. The code is available at https://github.com/Yinan-Xia/DDPO.

Keywords: Large Reasoning Models, Reinforcement Learning, Policy Optimization, Reasoning Efficiency, Overthinking, Overconfidence, Difficulty-Differentiated, Length Redistribution
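
To make the abstract's idea of treating easy and hard prompts differently more concrete, here is an illustrative difficulty-conditioned length penalty. It is not DDPO's actual objective, which the abstract does not spell out. The difficulty proxy (group accuracy against `easy_threshold`), the linear penalty, and every name below are assumptions. The only element taken from the abstract is the use of a difficulty-level average length as the reference.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def shaped_rewards(groups, easy_threshold=0.5, penalty=0.5):
    """groups: one list of rollouts per prompt; a rollout is (is_correct, length)."""
    # Difficulty proxy (assumption): a prompt is "easy" if its rollouts are
    # correct at least easy_threshold of the time.
    is_easy = [mean(int(c) for c, _ in g) >= easy_threshold for g in groups]

    # Reference length per difficulty level: the average rollout length
    # over all prompts at that level.
    ref = {}
    for level in (True, False):
        lengths = [n for g, e in zip(groups, is_easy) if e == level for _, n in g]
        ref[level] = mean(lengths) if lengths else 0.0

    rewards = []
    for g, easy in zip(groups, is_easy):
        row = []
        for correct, length in g:
            r = float(correct)
            if easy and correct and length > ref[True]:
                # Easy prompts: discourage correct answers longer than the reference.
                r -= penalty * (length - ref[True]) / ref[True]
            # Hard prompts: no length penalty, so longer exploration is not punished.
            row.append(r)
        rewards.append(row)
    return rewards

# Toy rollouts (made up): one easy prompt, one hard prompt.
groups = [
    [(True, 100), (True, 300), (True, 200)],
    [(False, 50), (True, 900), (False, 60)],
]
rewards = shaped_rewards(groups)
```

These shaped rewards would then feed a group-relative policy-gradient update such as GRPO, the baseline named in the abstract. That update step is omitted here.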

137. ❌ When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

Authors: Abhinaba Basu, Pavan Chakraborty Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18530v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

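Scoring rationale: The paper's core topic is detecting and mitigating systematic bias in LLM decision-making. It is highly relevant to "Large Language Models" (10 points) because it directly evaluates LLMs. It has some relevance to "Hallucination Mitigation" and "Mechanistic Interpretability" (5 points each), because it studies bias detection (similar to factuality and truthfulness evaluation) and explains model behavior through intervention tests. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies systematic bias in LLMs that rely on spurious features (such as authority, framing, and demographics) in high-stakes decisions. Using the ICE-Guard framework, it finds that authority bias is the most pronounced and that structured decomposition and iterative prompt patching substantially reduce bias.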

Abstract

Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field’s narrow focus on demographics; (2) bias concentrates in specific domains – finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.

Keywords: Large Language Models, bias detection, intervention consistency, high-stakes decisions, spurious features, authority bias, framing bias, demographic bias
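
The flip-rate measurement and the structured-decomposition mitigation described in the abstract above can be sketched with toy stand-ins. Everything here is hypothetical: the vignette template, the income threshold, and the functions `biased_decide`, `extract_income`, and `rubric_decide` are invented for illustration and do not come from ICE-Guard. The point is only that a decision routed through extracted features and a deterministic rubric cannot react to a swapped credential.

```python
import re

def extract_income(text):
    """Stand-in for the LLM feature-extraction step (assumed regex parser)."""
    return int(re.search(r"income (\d+)", text).group(1))

def biased_decide(text):
    """Toy end-to-end decider that is swayed by an authority cue."""
    return "approve" if "Dr." in text or extract_income(text) >= 50000 else "deny"

def rubric_decide(text):
    """Structured decomposition: extracted features, then a fixed rule."""
    return "approve" if extract_income(text) >= 50000 else "deny"

def make_pair(income):
    """Original vignette and its authority-swapped counterpart."""
    template = "{title} Lee, income {income}, requests a loan."
    return (template.format(title="Dr.", income=income),
            template.format(title="Mr.", income=income))

def flip_rate(decide, pairs):
    """Share of vignette pairs whose decision changes under the intervention."""
    flips = sum(decide(a) != decide(b) for a, b in pairs)
    return flips / len(pairs)

pairs = [make_pair(income) for income in (40000, 60000, 45000, 30000)]
biased_flip = flip_rate(biased_decide, pairs)
rubric_flip = flip_rate(rubric_decide, pairs)
```

With a real LLM as the extractor, the rubric's flip rate depends on whether the extracted features themselves stay stable under the swap, which is what such a framework would need to verify.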

Authors: Esteban Garces Arias, Nurzhan Sapargali, Christian Heumann, Matthias Aßenmacher Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18482v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

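Scoring rationale: The paper studies how decoding strategies (such as top-k, nucleus sampling, and contrastive search) affect text generation quality. Analyzing 8 language models, 5 decoding strategies, and 53 hyperparameter configurations, it finds that likelihood-based decoding systematically excludes rare but appropriate tokens that humans might choose, which makes machine-generated text more detectable. The paper's core concern is decoding strategies and generation quality for language models, so it is highly relevant only to "Large Language Models OR LLMs OR Foundation Models" (10 points), since it analyzes multiple language models and their decoding strategies. The other keywords concern model architecture, training methods, inference optimization, alignment, and application areas, which the paper does not cover, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper finds that likelihood-based decoding strategies (such as top-k and nucleus sampling) systematically exclude rare but appropriate tokens that humans might choose, which makes machine-generated text more detectable. Detectability is driven mainly by truncation parameters rather than by model scale or architecture.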

Abstract

Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood, restricting selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by likelihood-based decoding. We hypothesize this contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity achieve remarkable detection rates. Crucially, neither model scale nor architecture correlates strongly with detectability; truncation parameters account for most variance. Configurations achieving low detectability often produce incoherent text, indicating that evading detection and producing natural text are distinct objectives. These findings suggest detectability is enhanced by likelihood-based token selection, not merely a matter of model capability.

Keywords: decoding strategies, text generation, truncation blind spot, machine-generated text detectability, likelihood-based token selection, human language production, top-k sampling, nucleus sampling
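
The "truncation blind spot" in the abstract above amounts to asking how often the token a human actually wrote falls outside the set a truncation rule keeps. The sketch below computes that rate for top-k and nucleus (top-p) truncation on made-up next-token distributions. The probabilities, the chosen tokens, and the helper names are assumptions for illustration, not the paper's data or code.

```python
import numpy as np

def in_top_k(probs, token, k):
    """True if token is among the k most probable tokens."""
    top = np.argsort(probs)[::-1][:k]
    return int(token) in top

def in_nucleus(probs, token, p):
    """True if token lies in the smallest set whose total probability reaches p."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # size of the kept prefix
    return int(token) in order[:cutoff]

def blind_spot_rate(prob_rows, human_tokens, keep_fn):
    """Fraction of positions where the human token was truncated away."""
    misses = [not keep_fn(p, t) for p, t in zip(prob_rows, human_tokens)]
    return sum(misses) / len(misses)

# Toy next-token distributions over a 4-token vocabulary (assumed values),
# one row per position, with the token a human chose at each position.
prob_rows = np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.10, 0.60, 0.25, 0.05],
    [0.25, 0.05, 0.30, 0.40],
    [0.65, 0.20, 0.10, 0.05],
])
human_tokens = [0, 3, 0, 1]

top2_rate = blind_spot_rate(prob_rows, human_tokens, lambda p, t: in_top_k(p, t, 2))
nucleus_rate = blind_spot_rate(prob_rows, human_tokens, lambda p, t: in_nucleus(p, t, 0.9))
```

On real data the distributions would come from a language model scored on human-written text, and the rate would be tracked across truncation settings, since the abstract attributes most of the variance in detectability to those parameters.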

139. ❌ WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

Authors: Haonan Yu, Junhao Liu, Zhenyu Yan, Haoran Lin, Xin Zhang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18474v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

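Scoring rationale: The paper's core topic is explaining and controlling LLM behavior by identifying neuron-activation conditions that guarantee the output, which counts as an innovation in the underlying principles of large models. Highly relevant keywords: (1) “Large Language Models” (the paper explicitly studies LLM behavior control, which is its core content); (2) “Mechanistic Interpretability” (the paper proposes the WASD framework to explain model behavior, which falls under explainable AI). Other keywords such as MoE, SFT, RAG, and quantization are not covered and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes the WASD framework, which explains and controls LLM behavior by identifying sufficient conditions on neuron activations. Experiments show that it is more stable, accurate, and concise than conventional attribution graphs.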

Abstract

Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validated the practical effectiveness of WASD in controlling model behavior.

Keywords: LLM behavior control, neuron activation, sufficient conditions, model interpretability, WASD framework, token generation, explainable AI, behavioral control
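
The search for a minimal sufficient predicate set mentioned in the abstract can be sketched as a greedy backward elimination. This is an assumption made for illustration, not the paper's actual search procedure: `is_sufficient` is a hypothetical oracle standing in for "the output is unchanged under the sampled input perturbations when these neuron-activation predicates are enforced", and the predicate strings are invented.

```python
from typing import Callable, FrozenSet, List


def minimal_sufficient_set(
    candidates: List[str],
    is_sufficient: Callable[[FrozenSet[str]], bool],
) -> FrozenSet[str]:
    """Greedy backward elimination: drop every predicate that is not needed."""
    current = frozenset(candidates)
    if not is_sufficient(current):
        raise ValueError("the full candidate set does not guarantee the output")
    for pred in candidates:
        trial = current - {pred}
        # Keep the smaller set only if it still guarantees the output.
        if is_sufficient(trial):
            current = trial
    return current


# Hypothetical oracle: the output is guaranteed once both predicates hold.
def toy_oracle(preds: FrozenSet[str]) -> bool:
    return {"n12>0.5", "n40>1.0"} <= preds


result = minimal_sufficient_set(
    ["n3>0.1", "n12>0.5", "n40>1.0", "n77>0.2"], toy_oracle
)
```

If the oracle is monotone (adding predicates never breaks sufficiency), the returned set is minimal with respect to inclusion, although it is not necessarily the smallest possible set.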

140. ❌ GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms

Authors: Masayuki Kawarada, Kodai Watanabe, Soichiro Murakami Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18469v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

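Scoring rationale: The paper's core is evaluating how well large language models (LLMs) balance norm adherence against goal achievement in business scenarios, so it is highly relevant to “Large Language Models” (10 points). The study touches on aligning model decisions under pressure, which relates to “Instruction Tuning OR Alignment OR Value Alignment” (8 points), although the paper focuses on evaluation rather than on alignment methods themselves. Other keywords (such as MoE, SLMs, training techniques, inference optimization, and AI for science) are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper introduces the GAIN benchmark for evaluating how LLMs make decisions when norms and goals conflict in business scenarios. Experiments find that advanced LLMs usually mirror human decision-making patterns but lean more strongly toward strict norm adherence under Personal Incentive pressure.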

Abstract

We introduce GAIN (Goal-Aligned Decision-Making under Imperfect Norms), a benchmark designed to evaluate how large language models (LLMs) balance adherence to norms against business goals. Existing benchmarks typically focus on abstract scenarios rather than real-world business applications. Furthermore, they provide limited insights into the factors influencing LLM decision-making. This restricts their ability to measure models’ adaptability to complex, real-world norm-goal conflicts. In GAIN, models receive a goal, a specific situation, a norm, and additional contextual pressures. These pressures, explicitly designed to encourage potential norm deviations, are a unique feature that differentiates GAIN from other benchmarks, enabling a systematic evaluation of the factors influencing decision-making. We define five types of pressures: Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, and Personal Incentive. The benchmark comprises 1,200 scenarios across four domains: hiring, customer support, advertising and finance. Our experiments show that advanced LLMs frequently mirror human decision-making patterns. However, when Personal Incentive pressure is present, they diverge significantly, showing a strong tendency to adhere to norms rather than deviate from them.

Keywords: Large Language Models, benchmark, decision-making, norms, goal alignment, business applications, pressure scenarios, evaluation

141. ❌ UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference

Authors: Lang Zhou, Shuxuan Li, Zhuohao Li, Shi Liu, Zhilin Zhao, Wei-Shi Zheng Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18446v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

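Scoring rationale: UT-ACA focuses on optimizing long-context LLM inference; its core contribution is an inference-time framework that dynamically adjusts the context window. It is highly relevant to “Large Language Models” and “Context Window Extension” (10 points) because it directly addresses the inference challenges of long-context LLMs. It is relevant to “KV Cache Compression” and “Speculative Decoding” (8 points) because it involves KV cache management and inference acceleration techniques. It has some connection to “Self-Correction” and “Hallucination Mitigation” (5 points), since it improves generation quality through uncertainty detection and a rollback mechanism. Other keywords such as MoE, SLMs, training methods, and agent systems are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address attention dilution and out-of-distribution degradation in long-context LLM inference, the paper proposes an uncertainty-triggered adaptive context allocation framework that substantially reduces average context usage while preserving generation quality.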

Abstract

Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.

Keywords: Long-context inference, Large language models, Context window allocation, Uncertainty detection, KV cache, Inference-time framework, Attention dilution, Adaptive context selection
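
The roll-back-and-expand loop described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions: the paper uses a learned uncertainty detector that also accounts for accumulated uncertainty, whereas this sketch uses a fixed threshold, a doubling schedule, and a hypothetical `step` callback that returns a token and an uncertainty value for a given window size.

```python
from typing import Callable, List, Tuple


def adaptive_decode(
    step: Callable[[int, int], Tuple[str, float]],
    num_tokens: int,
    base_window: int,
    max_window: int,
    threshold: float,
) -> Tuple[List[str], List[int]]:
    """Decode token by token, widening the context only when uncertainty is high."""
    tokens, windows = [], []
    for pos in range(num_tokens):
        window = base_window
        token, uncertainty = step(pos, window)
        # Roll back and retry with a larger window while evidence looks insufficient.
        while uncertainty > threshold and window < max_window:
            window = min(window * 2, max_window)
            token, uncertainty = step(pos, window)
        tokens.append(token)
        windows.append(window)
    return tokens, windows


# Hypothetical model step: position 2 needs a wide window to be confident.
def toy_step(pos: int, window: int) -> Tuple[str, float]:
    needed = 512 if pos == 2 else 64
    uncertainty = 0.1 if window >= needed else 0.9
    return f"tok{pos}@{window}", uncertainty


tokens, windows = adaptive_decode(
    toy_step, num_tokens=4, base_window=128, max_window=1024, threshold=0.5
)
```

Only the uncertain position pays for a larger window; every other position keeps the base budget, which is how average context usage can drop without a fixed global limit.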

142. ❌ SODIUM: From Open Web Data to Queryable Databases

Authors: Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18447v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

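Scoring rationale: The paper proposes the SODIUM task and SODIUM-Agent, a multi-agent system for building queryable databases from open web data; its core innovations are the multi-agent system design and the web exploration algorithm. It is highly relevant to “LLM Agents/Autonomous Agents” and “Multi-agent Systems” (10 points) because it explicitly develops a multi-agent system. It has some connection to “Tool Use/Function Calling” and “Retrieval-Augmented Generation” (5 points) through web tool use and information retrieval. It is weakly related to “Large Language Models” and “AI for Science” (5 points), since it likely uses AI techniques and is applied to academic research. Other keywords such as MoE, quantization, and alignment are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes the SODIUM task, which aims to automatically build queryable databases from open web data, and develops the SODIUM-Agent multi-agent system, which reaches 91.1% accuracy on the SODIUM-Bench benchmark, roughly 2 times the accuracy of the strongest baseline.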

Abstract

During research, domain experts often ask analytical questions whose answers require integrating data from a wide range of web sources. Thus, they must spend substantial effort searching, extracting, and organizing raw data before analysis can begin. We formalize this process as the SODIUM task, where we conceptualize open domains such as the web as latent databases that must be systematically instantiated to support downstream querying. Solving SODIUM requires (1) conducting in-depth and specialized exploration of the open web, which is further strengthened by (2) exploiting structural correlations for systematic information extraction and (3) integrating collected information into coherent, queryable database instances. To quantify the challenges in automating SODIUM, we construct SODIUM-Bench, a benchmark of 105 tasks derived from published academic papers across 6 domains, where systems are tasked with exploring the open web to collect and aggregate data from diverse sources into structured tables. Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy. To bridge this gap, we develop SODIUM-Agent, a multi-agent system composed of a web explorer and a cache manager. Powered by our proposed ATP-BFS algorithm and optimized through principled management of cached sources and navigation paths, SODIUM-Agent conducts deep and comprehensive web exploration and performs structurally coherent information extraction. SODIUM-Agent achieves 91.1% accuracy on SODIUM-Bench, outperforming the strongest baseline by approximately 2 times and the weakest by up to 73 times.

Keywords: SODIUM, open web data, queryable databases, multi-agent system, web exploration, information extraction, SODIUM-Bench, SODIUM-Agent

143. ❌ Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

Authors: Asmita Bhardwaj, Yuya Jeremy Ong, Eelaaf Zahid, Basel Shbita Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18428v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

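Scoring rationale: The paper's core topic is optimizing decoding strategies, which directly involves LLMs (keyword 1) and Self-Improvement (keyword 16). Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and scientific applications are not covered.

!!! tip deepseek-chat TL;DR

To address the inconsistent LLM generation quality caused by static decoding strategies, the paper proposes a reinforcement-learning-based test-time policy learning method that dynamically adjusts sampling parameters while keeping model weights frozen. It substantially outperforms greedy and static baselines on several summarization datasets.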

Abstract

Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test-time while keeping LLM weights frozen. We evaluated summarization datasets including BookSum, arXiv, and WikiHow using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.

Keywords: Large Language Models, Decoding Strategies, Reinforcement Learning, Test-time Adaptation, Policy Learning, Self-Improving Generation, Summarization, Domain-aware Generation
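
The idea of adjusting sampling parameters at each decoding step while the model stays frozen can be sketched as below. This is a minimal illustration, not the paper's sampler: the policy in the paper is learned with reinforcement learning, whereas `toy_policy` here is a hand-written placeholder, and the logit tables are invented.

```python
import math
import random
from typing import Callable, Dict, List, Tuple


def sample_token(logits: Dict[str, float], temperature: float, top_p: float,
                 rng: random.Random) -> str:
    """Temperature-scaled nucleus sampling over a small logit table."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    # Unnormalized softmax weights, shifted by the peak for numerical stability.
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(weights, key=weights.get, reverse=True)
    kept, mass = [], 0.0
    for tok in ranked:
        kept.append(tok)
        mass += weights[tok] / total
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[weights[tok] for tok in kept], k=1)[0]


def decode(logit_steps: List[Dict[str, float]],
           policy: Callable[[int, List[str]], Tuple[float, float]],
           seed: int = 0) -> List[str]:
    """Ask the policy for (temperature, top_p) before sampling each token."""
    rng = random.Random(seed)
    out: List[str] = []
    for step, logits in enumerate(logit_steps):
        temperature, top_p = policy(step, out)
        out.append(sample_token(logits, temperature, top_p, rng))
    return out


def toy_policy(step: int, history: List[str]) -> Tuple[float, float]:
    # Placeholder for a learned policy: near-greedy first, more exploratory later.
    return (0.7, 0.05) if step == 0 else (1.2, 0.95)


toy_logits = [{"A": 3.0, "B": 1.0, "C": 0.5}, {"x": 2.0, "y": 1.9, "z": -1.0}]
summary = decode(toy_logits, toy_policy, seed=0)
```

Because the policy only returns two scalars per step, it can be small and trained separately from the language model, which matches the frozen-weights setting the abstract describes.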

144. ❌ Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

Authors: Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18425v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

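Scoring rationale: The paper studies task interference in multimodal large language models (Multimodal LLMs) and is highly relevant to the keyword “Large Language Models OR LLMs OR Foundation Models” (10 points), because it explicitly studies multimodal LLMs and tests both open-weights and proprietary models. It does not involve the specific techniques behind the other keywords (such as MoE, SFT, or RAG), nor AI applications in science, so the other keywords are scored 0. Its main contribution is a benchmark and an analysis of task-interference patterns rather than a technical innovation, so its novelty score is low.

!!! tip deepseek-chat TL;DR

The paper studies the performance degradation caused by task switching in multimodal LLMs. It finds that interference is most severe when switching from text-only to image-based tasks and that it is driven mainly by modality differences.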

Abstract

Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.

Keywords: Multimodal LLMs, Task Interference, Benchmark, Modality Mismatch, Performance Degradation, History-Target Mismatch, Vision and Text, Directional Interference

145. ❌ From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory

Authors: Jason Dury Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18420v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

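Scoring rationale: The paper studies an unsupervised concept discovery method that uses a contrastive learning model to analyze transition-structure patterns in text, and it has no direct connection to any of the given large-model and deep-learning keywords. It does not cover LLMs, MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agents, quantization, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for science.

!!! tip deepseek-chat TL;DR

The paper proposes an unsupervised method based on Predictive Associative Memory that analyzes temporal co-occurrence patterns in text to discover corpus-scale transition-structure concepts rather than semantic topics.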

Abstract

Embedding models group text by semantic content, what text is about. We show that temporal co-occurrence within texts discovers a different kind of structure: recurrent transition-structure concepts or what text does. We train a 29.4M-parameter contrastive model on 373 million co-occurrence pairs from 9,766 Project Gutenberg texts (24.96 million passages), mapping pre-trained embeddings into an association space where passages with similar transition structure cluster together. Under capacity constraint (42.75% accuracy), the model must compress across recurring patterns rather than memorise individual co-occurrences. Clustering at six granularities (k=50 to k=2,000) produces a multi-resolution concept map; from broad modes like “direct confrontation” and “lyrical meditation” to precise registers and scene templates like “sailor dialect” and “courtroom cross-examination.” At k=100, clusters average 4,508 books each (of 9,766), confirming corpus-wide patterns. Direct comparison with embedding-similarity clustering shows that raw embeddings group by topic while association-space clusters group by function, register, and literary tradition. Unseen novels are assigned to existing clusters without retraining; the association model concentrates each novel into a selective subset of coherent clusters, while raw embedding assignment saturates nearly all clusters. Validation controls address positional, length, and book-concentration confounds. The method extends Predictive Associative Memory (PAM, arXiv:2602.11322) from episodic recall to concept formation: where PAM recalls specific associations, multi-epoch contrastive training under compression extracts structural patterns that transfer to unseen texts, the same framework producing qualitatively different behaviour in a different regime.

Keywords: unsupervised concept discovery, transition structure, predictive associative memory, contrastive learning, corpus-scale analysis, text clustering, temporal co-occurrence, literary patterns

146. ❌ TopoChunker: Topology-Aware Agentic Document Chunking Framework

Authors: Xiaoyu Liu Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18409v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

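Scoring rationale: The paper's core contribution is the TopoChunker framework, which improves document chunking for RAG. The relevant keywords are: (1) “Retrieval-Augmented Generation OR RAG OR Retrieval-Generation” (10 points): the paper directly optimizes RAG systems; (2) “LLM Agents OR Autonomous Agents OR Agentic Workflow” (10 points): the framework uses a dual-agent architecture (Inspector Agent and Refiner Agent); (3) “Multi-agent Systems OR Agent Coordination” (5 points): it involves multi-agent coordination; (4) “Large Language Models OR LLMs OR Foundation Models” (5 points): the paper mentions an LLM-based baseline and applies its method in a RAG context. Other keywords have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

To address the semantic fragmentation caused by linearized document chunking in RAG systems, the paper proposes the TopoChunker framework, which preserves a document's topological hierarchy through a dual-agent architecture and a structured intermediate representation. It achieves higher generation accuracy and recall on multiple datasets while lowering token overhead.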

Abstract

Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating “semantic fragmentation” that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.

Keywords: Retrieval-Augmented Generation, RAG, document chunking, agentic framework, topological hierarchy, dual-agent architecture, semantic fragmentation, structured intermediate representation
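
The abstract's central point is that linear chunking throws away a document's hierarchy. The Python sketch below illustrates the general idea of keeping a heading lineage attached to every chunk. It is a toy illustration only, not TopoChunker's SIR or its agents: the `Chunk` type, the `chunk_with_lineage` function, and the assumption of Markdown-style `#` headings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    lineage: tuple[str, ...]  # heading path from the document root to this chunk
    text: str


def chunk_with_lineage(document: str) -> list[Chunk]:
    """Split on Markdown-style headings, keeping each chunk's heading path."""
    stack: list[tuple[int, str]] = []  # (heading level, heading title)
    body: list[str] = []
    chunks: list[Chunk] = []

    def flush() -> None:
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(tuple(title for _, title in stack), text))
        body.clear()

    for line in document.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            # Drop headings at the same or a deeper level before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, line.lstrip("#").strip()))
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical three-section document.
doc = "# Report\nIntro.\n## Budget\nCosts rose.\n## Staffing\nHeadcount fell."
for c in chunk_with_lineage(doc):
    print(" > ".join(c.lineage), "|", c.text)
```

Because each chunk carries its lineage, a retriever can index "Report > Budget" together with "Costs rose." instead of the bare sentence, which is the kind of cross-segment context the abstract says linearization loses.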

147. ❌ AutoScreen-FW: An LLM-based Framework for Resume Screening

Authors: Zhelin Xu, Shuhei Yamamoto, Atsuyuki Morishima Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18390v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
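
The report header gives the pass mark as 5.0 + 0.8 × total weight and caps each keyword at 15 points. A minimal Python sketch of that arithmetic follows. The per-keyword formula (weight × cap × relevance / 10) is an assumption made for illustration, since the report does not state it, and both function names are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report header


def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED formula: scale relevance (0-10) to the cap, then apply the weight."""
    return weight * MAX_PER_KEYWORD * relevance / 10.0


# Three rows copied from the table above as (weight, relevance out of 10).
rows = [(0.0, 10.0), (0.0, 0.0), (0.0, 10.0)]
total = sum(keyword_score(w, r) for w, r in rows)
print(round(pass_mark(27.0), 1), total)
```

Under this assumed formula, a listed weight of 0.0 forces every row to 0.0 regardless of relevance, which is consistent with the 0.0 / 26.6 total shown for this entry.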

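Scoring rationale: The paper presents a framework that uses open-source LLMs for resume screening, centered on LLM application and in-context learning. It explicitly uses open-source LLMs and in-context learning, so these two keywords are rated highly relevant (10 each). Other keywords, such as MoE, quantization, inference acceleration, and alignment, are either not mentioned in the abstract or unrelated to the paper, so they receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes AutoScreen-FW, an LLM-based framework for local, automated resume screening. It selects representative resume samples for in-context learning so that open-source LLMs can act as a career advisor and evaluate resumes. Experiments indicate that, while preserving privacy, the framework performs comparably to commercial models and processes each resume faster.

Abstract (Chinese translation)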

企业招聘人员经常需要在有限时间内筛选大量简历，这增加了他们的工作负担，并可能导致合适的候选人被忽略。为应对这些挑战，先前的研究已探索基于大语言模型（LLM）的自动化简历筛选方法。然而，部分方法依赖商用LLM，可能带来数据隐私风险。此外，由于企业通常不会公开带有评估结果的简历样本，目前尚不清楚在学习过程中应使用哪些简历样本来提升LLM的判断性能。为解决这些问题，我们提出了AutoScreen-FW——一个基于LLM的本地化自动简历筛选框架。AutoScreen-FW采用多种方法选取少量具有代表性的简历样本。这些样本与角色描述及评估标准共同用于上下文学习，使开源LLM能够扮演职业顾问的角色，并对未见过的新简历进行评估。基于多组真实标注数据的实验表明，开源LLM评审员的性能持续优于GPT-5-nano。在某一组真实标注设定下，其表现也超越了GPT-5-mini。尽管在其他真实标注设定中略逊于GPT-5-mini，但该框架处理每份简历的速度显著快于商用GPT模型。这些发现表明，在企业内部本地部署AutoScreen-FW具有潜力，可在支持高效筛选的同时减轻招聘人员的负担。

Abstract

Corporate recruiters often need to screen many resumes within a limited time, which increases their burden and may cause suitable candidates to be overlooked. To address these challenges, prior work has explored LLM-based automated resume screening. However, some methods rely on commercial LLMs, which may pose data privacy risks. Moreover, since companies typically do not make resumes with evaluation results publicly available, it remains unclear which resume samples should be used during learning to improve an LLM’s judgment performance. To address these problems, we propose AutoScreen-FW, an LLM-based framework for local, automated resume screening. AutoScreen-FW uses several methods to select a small set of representative resume samples. These samples are used for in-context learning together with a persona description and evaluation criteria, enabling open-source LLMs to act as a career advisor and evaluate unseen resumes. Experiments with multiple ground truths show that the open-source LLM judges consistently outperform GPT-5-nano. Under one ground-truth setting, they also surpass GPT-5-mini. Although they are slightly weaker than GPT-5-mini under other ground-truth settings, they run substantially faster per resume than commercial GPT models. These findings indicate the potential for deploying AutoScreen-FW locally in companies to support efficient screening while reducing recruiters’ burden.

Keywords: LLM-based framework, resume screening, in-context learning, open-source LLMs, privacy preservation, automated evaluation, career advisor, local deployment

148. ❌ From Noise to Signal: When Outliers Seed New Topics

Authors: Evangelia Zve, Gauvain Bourgne, Benjamin Icard, Jean-Gabriel Ganascia Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18358v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

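Scoring rationale: The paper studies how outliers in dynamic topic modeling can act as early signals of emerging topics, using document embeddings from eleven state-of-the-art language models for clustering. It is therefore highly relevant to ‘Large Language Models OR LLMs OR Foundation Models’ (8), since it explicitly uses several language models as technical tools. It has some relevance to ‘AI for Science OR Bioinformatics OR Cheminformatics’ (5), since it is applied to a news corpus on the hydrogen economy, an application of AI to a specific domain (energy/economics). The other keywords mainly concern the technical foundations, training methods, inference optimization, and alignment of large models, none of which the paper addresses, so they all receive 0.

!!! tip deepseek-chat TL;DR

The paper studies how outliers in dynamic topic modeling can serve as early signals of emerging topics. It proposes a temporal taxonomy and clusters embeddings from multiple language models, identifying a high-consensus set of anticipatory outliers in a news corpus on the hydrogen economy.

Abstract (Chinese translation)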

动态主题建模中的离群值通常被视为噪声，但我们证明其中部分可成为新兴主题的早期信号。本文提出一种新闻文档轨迹的时间分类法，用以定义文档随时间推移与主题形成的关系。该分类法区分了先导型离群值（即在其最终归属主题出现前已存在的文档）与那些仅强化现有主题或保持孤立的文档。通过捕捉这些轨迹，该分类法将弱信号检测与时间主题建模相连接，并阐明单篇文章如何在演化中的主题簇内实现预兆、发起或漂移。我们在累积聚类框架中实现了该方法，采用来自十一种前沿语言模型的文档嵌入向量，并以氢能经济领域的法语新闻语料库HydroNewsFr进行回顾性评估。模型间一致性分析揭示了一个规模较小但共识度高的先导型离群值子集，增强了此类标签的可信度。定性案例研究通过具体主题演化过程进一步阐释了这些轨迹。

Abstract

Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.

Keywords: dynamic topic modeling, outliers, emerging topics, document embeddings, language models, temporal taxonomy, anticipatory outliers, HydroNewsFr
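
The taxonomy in the abstract separates anticipatory outliers (documents that precede the topic they later join) from documents that reinforce existing topics or stay isolated. One way to picture that rule is the Python sketch below. It is an illustrative reading of the abstract, not the authors' implementation: the function name, the data layout, and the extra "late-joiner" bucket are hypothetical.

```python
def classify_trajectory(
    arrival: int,
    assignments: dict[int, int],
    topic_birth: dict[int, int],
) -> str:
    """Label one document's trajectory.

    arrival      time step at which the document enters the corpus
    assignments  time step -> topic id, with -1 meaning "outlier"
    topic_birth  topic id -> time step at which that topic first appears
    """
    joined = [(t, topic) for t, topic in sorted(assignments.items()) if topic != -1]
    if not joined:
        return "isolated"  # never absorbed by any topic
    first_join_time, topic = joined[0]
    if topic_birth[topic] > arrival:
        return "anticipatory"  # the document predates the topic it joins
    if first_join_time == arrival:
        return "reinforcing"  # joins an already existing topic immediately
    return "late-joiner"  # outlier first, then absorbed by an older topic


# Hypothetical data: topic 7 appears at step 3, topic 2 exists from step 0.
topic_birth = {7: 3, 2: 0}
print(classify_trajectory(1, {1: -1, 2: -1, 3: 7}, topic_birth))
```

Running the same rule over cluster assignments produced from several embedding models and keeping only documents that every model labels "anticipatory" would give a high-consensus subset of the kind the abstract describes.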

149. ❌ Large-Scale Analysis of Political Propaganda on Moltbook

Authors: Julia Jose, Meghna Manoj Nair, Rachel Greenstadt Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18349v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

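Scoring rationale: The paper builds LLM-based classifiers to analyze political propaganda on a platform for AI agents. It is highly relevant to ‘Large Language Models’ (8), because LLMs are the core analysis tool, and to ‘LLM Agents’ (10), because the platform studied, Moltbook, is designed for AI agents and the objects of analysis are agent behaviors. Other keywords, such as MoE, SLMs, training methods, inference optimization, and scientific AI applications, are not addressed and receive 0.

!!! tip deepseek-chat TL;DR

The study uses LLM-based classifiers to analyze political propaganda on Moltbook, a platform for AI agents. It finds that political propaganda makes up 1% of all posts and is concentrated in a small number of communities and agents, and it finds only limited evidence that comments amplify it.

Abstract (Chinese translation)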

本研究基于自然语言处理技术，对人工智能体交流平台Moltbook（一种类Reddit平台）上的政治宣传内容展开分析。为实现大规模分析，我们开发了基于大语言模型的分类器来检测政治宣传内容，并依据专家标注数据进行了验证（Cohen’s $κ$系数为0.64-0.74）。通过对673,127条帖子和879,606条评论的数据集进行分析，我们发现政治宣传内容占帖子总量的1%，占所有政治类内容的42%。这些帖子高度集中于少数社区，其中70%的此类帖子仅分布于五个社区内。4%的智能体生成了51%的政治宣传帖子。我们进一步发现，少数智能体在社区内部及跨社区间反复发布高度相似的内容。尽管如此，研究并未发现充分证据表明评论环节会放大政治宣传的影响。

Abstract

We present an NLP-based study of political propaganda on Moltbook, a Reddit-style platform for AI agents. To enable large-scale analysis, we develop LLM-based classifiers to detect political propaganda, validated against expert annotation (Cohen’s $\kappa$ = 0.64-0.74). Using a dataset of 673,127 posts and 879,606 comments, we find that political propaganda accounts for 1% of all posts and 42% of all political content. These posts are concentrated in a small set of communities, with 70% of such posts falling into five of them. 4% of agents produced 51% of these posts. We further find that a minority of these agents repeatedly post highly similar content within and across communities. Despite this, we find limited evidence that comments amplify political propaganda.

Keywords: political propaganda, LLM-based classifiers, AI agents, Moltbook, large-scale analysis, NLP study, social media analysis, content detection
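
The abstract validates its classifiers against expert annotation using Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal Python sketch of that statistic follows. The label lists are made-up examples, not data from the paper.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two raters who labeled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations: "prop" = propaganda, "none" = not propaganda.
expert = ["prop", "prop", "none", "none", "none", "prop", "none", "none", "none", "none"]
model = ["prop", "none", "none", "none", "none", "prop", "none", "none", "prop", "none"]
print(round(cohens_kappa(expert, model), 3))
```

Here the raters agree on 8 of 10 items, but because "none" dominates both label sets, chance agreement is already 0.58, so kappa is noticeably lower than the raw 0.8 agreement.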

150. ❌ Retrieval-Augmented LLM Agents: Learning to Learn from Experience

Authors: Thomas Palmeira Ferraz, Romain Deffayet, Vassilina Nikoulina, Hervé Déjean, Stéphane Clinchant Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18272v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

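Scoring rationale: The paper studies retrieval-augmented LLM agents, combining supervised fine-tuning (SFT) with retrieval-augmented generation (RAG) to improve generalization to unseen tasks. It is therefore highly relevant to ‘Large Language Models’, ‘Post-training/SFT’, ‘PEFT/LoRA’, ‘Retrieval-Augmented Generation’, and ‘LLM Agents’ (10 each). ‘In-context Learning’ is relevant because the paper uses retrieved trajectories in context, but it is not the core focus (5). Other keywords, such as MoE, quantization, and alignment, are not addressed in the paper (0).

!!! tip deepseek-chat TL;DR

The paper proposes a training framework for LLM agents that combines supervised fine-tuning (using LoRA) with retrieval-augmented generation, significantly improving generalization to unseen tasks.

Abstract (Chinese translation)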

尽管大语言模型（LLM）推动了通用智能体的发展，但如何实现对新任务的稳健泛化仍是一个重大挑战。当前方法通常依赖于微调或使用检索经验的免训练记忆增强生成；然而两者均存在局限：微调往往难以外推至新任务，而经验检索的性能通常弱于有监督基线。在本研究中，我们提出将这两种方法相结合，并系统性地研究如何训练检索增强的LLM智能体，以有效利用上下文中的检索轨迹。首先，我们建立了基于LoRA的稳健监督微调（SFT）方案，其性能优于多种最先进的智能体训练流程。其次，我们对经验检索的关键设计选择进行了详细分析，确定了存储、查询和轨迹选择的最佳策略。最后，我们提出了一种将经验检索整合到微调流程中的方法。实验结果表明，这种结合策略显著提升了对未见任务的泛化能力，为构建能够从经验中学习的智能体提供了一个可扩展且有效的框架。

Abstract

While large language models (LLMs) have advanced the development of general-purpose agents, achieving robust generalization to unseen tasks remains a significant challenge. Current approaches typically rely on either fine-tuning or training-free memory-augmented generation using retrieved experience; yet both have limitations: fine-tuning often fails to extrapolate to new tasks, while experience retrieval often underperforms compared to supervised baselines. In this work, we propose to combine these approaches and systematically study how to train retrieval-augmented LLM agents to effectively leverage retrieved trajectories in-context. First, we establish a robust supervised fine-tuning (SFT) recipe using LoRA that outperforms several state-of-the-art agent training pipelines. Second, we provide a detailed analysis of key design choices for experience retrieval, identifying optimal strategies for storage, querying, and trajectory selection. Finally, we propose a pipeline that integrates experience retrieval into the fine-tuning process. Our results demonstrate that this combined approach significantly improves generalization to unseen tasks, providing a scalable and effective framework for building agents that learn to learn from experience.

Keywords: LLM agents, retrieval-augmented generation, supervised fine-tuning, LoRA, experience retrieval, generalization, in-context learning, trajectory selection
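
The abstract describes storing past trajectories, querying them for a new task, and placing the retrieved ones in context. The Python sketch below shows the general shape of such experience retrieval. It is an assumption-laden illustration, not the paper's pipeline: the bag-of-words similarity stands in for a learned embedding, and the memory entries, field names, and prompt layout are hypothetical.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(task: str, memory: list[dict], k: int = 1) -> list[dict]:
    """Return the k stored trajectories whose task description is closest to `task`."""
    query = bow(task)
    return sorted(memory, key=lambda m: cosine(query, bow(m["task"])), reverse=True)[:k]


def build_prompt(task: str, memory: list[dict], k: int = 1) -> str:
    """Place the retrieved trajectories in context ahead of the new task."""
    shots = "\n\n".join(
        f"Task: {m['task']}\nTrajectory: {m['trajectory']}" for m in retrieve(task, memory, k)
    )
    return f"{shots}\n\nTask: {task}\nTrajectory:"


# Hypothetical experience store.
memory = [
    {"task": "heat the mug in the microwave", "trajectory": "goto mug; take mug; goto microwave; heat mug"},
    {"task": "put the book on the shelf", "trajectory": "goto book; take book; goto shelf; put book"},
]
print(build_prompt("heat the egg in the microwave", memory))
```

In the combined approach the abstract proposes, prompts of this form would also be used during fine-tuning, so that the model learns to make use of retrieved trajectories rather than only seeing them at inference time.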

151. ❌ Impact of automatic speech recognition quality on Alzheimer’s disease detection from spontaneous speech: a reproducible benchmark study with lexical modeling and statistical validation

Authors: Himadri Samanta Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18239v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

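Scoring rationale: The paper studies how automatic speech recognition (ASR) quality affects Alzheimer's disease detection, using Whisper ASR transcripts and traditional machine-learning models (Logistic Regression and Linear Support Vector Machines). It is unrelated to most keywords, which concern LLM techniques, training methods, inference optimization, and agent systems. The paper does not use or discuss LLMs: its downstream classifiers are linear models on TF-IDF features, and Whisper serves only to produce the transcripts. The one relevant keyword is ‘AI for Science OR Bioinformatics OR Cheminformatics’, because the paper applies AI to a biomedical problem (Alzheimer's disease detection), which falls under AI for Science. Since this is not the core contribution, it is rated 5 (some relevance). All other keywords receive 0 (unrelated).

!!! tip deepseek-chat TL;DR

The paper studies how ASR transcript quality affects Alzheimer's disease detection from spontaneous speech, finding that higher-quality ASR (Whisper-small rather than Whisper-base) significantly improves the classification accuracy of simple, interpretable lexical models.

Abstract (Chinese translation)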

通过自发语音实现阿尔茨海默病的早期检测已成为一种前景广阔的非侵入性筛查方法。然而，自动语音识别（ASR）质量对下游临床语言建模的影响尚未得到充分理解。本研究利用ADReSSo 2021诊断数据集，基于Whisper ASR转录文本提取词汇特征进行阿尔茨海默病检测。我们采用可解释的机器学习模型（包括逻辑回归和线性支持向量机），在重复5x5分层交叉验证框架下使用TF-IDF文本表示进行评估。

结果表明，转录质量对分类性能具有统计学上的显著影响。基于Whisper-small转录文本训练的模型始终优于使用Whisper-base转录文本的模型，其中线性支持向量机模型的平衡准确率超过0.7850。配对统计检验证实观察到的性能提升具有显著性。重要的是，与ASR转录质量相比，分类器复杂度对性能变异的影响较小。特征分析显示，认知正常者会产生更多语义精确的对象与场景描述性语言，而阿尔茨海默病患者的语音则表现出模糊性、话语标记增多以及犹豫模式增强的特点。

这些发现表明，高质量的ASR能够使简单可解释的词汇模型在不依赖显式声学建模的情况下，实现具有竞争力的阿尔茨海默病检测性能。本研究提供了一个可复现的基准流程，并强调ASR选择是基于临床语音的人工智能系统中关键建模决策。

Abstract

Early detection of Alzheimer’s disease from spontaneous speech has emerged as a promising non-invasive screening approach. However, the influence of automatic speech recognition (ASR) quality on downstream clinical language modeling remains insufficiently understood. In this study, we investigate Alzheimer’s disease detection using lexical features derived from Whisper ASR transcripts on the ADReSSo 2021 diagnosis dataset. We evaluate interpretable machine-learning models, including Logistic Regression and Linear Support Vector Machines, using TF-IDF text representations under repeated 5x5 stratified cross-validation. Our results demonstrate that transcript quality has a statistically significant impact on classification performance. Models trained on Whisper-small transcripts consistently outperform those using Whisper-base transcripts, achieving balanced accuracy above 0.7850 with Linear SVM. Paired statistical testing confirms that the observed improvements are significant. Importantly, classifier complexity contributes less to performance variation than ASR transcription quality. Feature analysis reveals that cognitively normal speakers produce more semantically precise object- and scene-descriptive language, whereas Alzheimer’s speech is characterized by vagueness, discourse markers, and increased hesitation patterns. These findings suggest that high-quality ASR can enable simple, interpretable lexical models to achieve competitive Alzheimer’s detection performance without explicit acoustic modeling. The study provides a reproducible benchmark pipeline and highlights ASR selection as a critical modeling decision in clinical speech-based artificial intelligence systems.

Keywords: Alzheimer’s disease detection, automatic speech recognition, Whisper ASR, lexical features, interpretable machine learning, clinical speech analysis, reproducible benchmark, statistical validation
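
The evaluation protocol in the abstract rests on two pieces: repeated stratified cross-validation and balanced accuracy. The dependency-free Python sketch below shows both. Reading "5x5" as 5 folds repeated 5 times is an assumption, the label set is made up, and the function names are hypothetical. In practice scikit-learn's `RepeatedStratifiedKFold` and `balanced_accuracy_score` cover the same ground.

```python
import random
from collections import defaultdict


def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Mean of per-class recall, so each class counts equally regardless of size."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def repeated_stratified_folds(labels: list[int], n_splits: int = 5, n_repeats: int = 5, seed: int = 0):
    """Yield (train_idx, test_idx) pairs; each repeat reshuffles within every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for idx in by_class.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            for j, i in enumerate(shuffled):
                folds[j % n_splits].append(i)  # deal indices round-robin per class
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for f in folds[:k] + folds[k + 1:] for i in f)
            yield train, test


labels = [0] * 20 + [1] * 20  # hypothetical balanced label set
splits = list(repeated_stratified_folds(labels))
print(len(splits), balanced_accuracy([0, 0, 1, 1], [0, 1, 1, 1]))
```

Evaluating both ASR variants on the same 25 splits gives paired scores per split, which is the setup that paired statistical tests such as those mentioned in the abstract require.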

152. ❌ How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence

Authors: Alex Anvi Eponon, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18203v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

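Scoring rationale: The paper examines how psychological learning paradigms (behaviorism, cognitivism, constructivism) shaped and constrained AI paradigms (reinforcement learning, deep learning, compositional approaches) and proposes the trimodular ReSynth framework. It is a philosophical and psychological study of AI methodology and foundations that does not address specific large-model techniques, training methods, optimization techniques, or scientific applications, so it is unrelated to all of the technical keywords.

!!! tip deepseek-chat TL;DR

The paper examines how psychological learning paradigms shaped and constrained the development of artificial intelligence, and proposes the trimodular ReSynth framework to address the structural limitations of existing AI paradigms, with adaptability framed as the central challenge of artificial general intelligence.

Abstract (Chinese translation)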

人工智能的主导范式由心理学学习理论所塑造：行为主义启发了强化学习，认知主义催生了深度学习与记忆增强架构，而建构主义则影响了课程学习与组合式方法。本文认为，每种人工智能范式不仅继承了其所源心理学理论的优势，也承袭了其结构性局限。强化学习无法解释知识的内部结构，深度学习将表征压缩至不透明的参数空间且难以进行原则性更新，而当前的整合方法缺乏关于如何从现有组件构建新理解的形式化阐释。本文进一步探讨了关于机械式学习解读的东西方差异，指出东方将记忆视为通向理解的结构化、多阶段前导过程，这一观念为心理学理论与人工智能方法论之间提供了一座尚未充分开发的桥梁。借鉴系统性争论及艾泽瓦对经典主义与联结主义的批判，本文提出ReSynth——一个将推理（智识）、目的（身份）与知识（记忆）分离为架构独立组件的三模块框架。本文追溯了从心理学范式到人工智能方法的谱系，诊断了各阶段所继承的局限，并论证了适应性作为人工通用智能核心挑战，需要一种使系统化行为成为必然结果而非偶然属性的表征架构。

Abstract

The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism influenced curriculum learning and compositional approaches. This paper argues that each AI paradigm inherited not only the strengths but also the structural limitations of the psychological theory that inspired it. Reinforcement learning cannot account for the internal structure of knowledge, deep learning compresses representations into opaque parameter spaces resistant to principled update, and current integrative approaches lack a formal account of how new understanding is constructed from existing components. The paper further examines a cross-cultural divergence in the interpretation of rote learning, arguing that the Eastern conception of memorization as a structured, multi-phase precursor to understanding offers an underexploited bridge between psychological theory and AI methodology. Drawing on the systematicity debate and Aizawa's critique of both classicism and connectionism, this paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. The paper traces the genealogy from psychological paradigm to AI method, diagnoses the inherited limitations at each stage, and argues that adaptability, the central challenge of artificial general intelligence, requires a representational architecture in which systematic behavior is a necessary consequence rather than an accidental property.

Keywords: psychological learning paradigms, artificial intelligence, reinforcement learning, deep learning, ReSynth framework, systematicity, artificial general intelligence, representational architecture

153. ❌ CWoMP: Morpheme Representation Learning for Interlinear Glossing

Authors: Morris Alper, Enora Rice, Bhargav Shandilya, Alexis Palmer, Lori Levin Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18184v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

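Scoring rationale: CWoMP focuses on morpheme representation learning in linguistics and natural language processing, for automatically generating interlinear glossed text (IGT). It involves pretraining (contrastive pretraining) and retrieval-augmented generation (retrieving morphemes from a mutable lexicon), so it is relevant to ‘Pre-training OR Continual Pre-training OR Domain Adaptation’ (5) and ‘Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’ (5). The paper emphasizes interpretable predictions, which relates to ‘Mechanistic Interpretability OR Explainable AI’ (5). As an application of AI to linguistics (a scientific field), it relates to ‘AI for Science OR Bioinformatics OR Cheminformatics’ (5). The other keywords mainly concern large-model techniques, reasoning, alignment, and optimization, which have no direct connection to this specific NLP task, so they all receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes CWoMP, which uses contrastive word-morpheme pretraining and retrieval-augmented generation to produce interlinear glosses automatically, outperforming existing methods on low-resource languages while being more efficient.

Abstract (Chinese translation)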

行间注释文本（Interlinear Glossed Text，简称IGT）是语言记录的一种标准标注方式，其语言学信息丰富但人工制作费时费力。现有的自动化IGT方法将注释内容视为字符序列处理，忽略了其组合结构。我们提出CWoMP（对比式词素预训练模型），该模型将语素视为具有习得表征的原子化形式-意义单元。通过对比训练的编码器，将上下文中的词语与其构成语素对齐到共享嵌入空间中；随后，自回归解码器通过从可动态更新的嵌入词典中检索条目来生成语素序列。该模型的预测结果具有可解释性——其基础源于词典条目——且用户可在推理阶段通过扩展词典来提升效果，无需重新训练模型。我们在多种低资源语言上进行评估，结果表明CWoMP在显著提升效率的同时优于现有方法，在极低资源场景下表现尤为突出。

Abstract

Interlinear glossed text (IGT) is a standard notation for language documentation which is linguistically rich but laborious to produce manually. Recent automated IGT methods treat glosses as character sequences, neglecting their compositional structure. We propose CWoMP (Contrastive Word-Morpheme Pretraining), which instead treats morphemes as atomic form-meaning units with learned representations. A contrastively trained encoder aligns words-in-context with their constituent morphemes in a shared embedding space; an autoregressive decoder then generates the morpheme sequence by retrieving entries from a mutable lexicon of these embeddings. Predictions are interpretable (grounded in lexicon entries), and users can improve results at inference time by expanding the lexicon without retraining. We evaluate on diverse low-resource languages, showing that CWoMP outperforms existing methods while being significantly more efficient, with particularly strong gains in extremely low-resource settings.

Keywords: morpheme representation learning, interlinear glossing, contrastive pretraining, retrieval-augmented generation, low-resource languages, lexicon-based decoding, interpretable predictions, language documentation
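
The abstract notes that predictions are grounded in lexicon entries and that users can improve results at inference time by expanding the lexicon without retraining. The Python sketch below illustrates why a retrieval-based decoder has that property. It is a toy nearest-neighbor lookup, not CWoMP's encoder or decoder: the `Lexicon` class, the three-dimensional vectors, and the gloss labels are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class Lexicon:
    """Mutable store of morpheme embeddings queried by nearest neighbor."""

    def __init__(self) -> None:
        self.entries: dict[str, list[float]] = {}

    def add(self, gloss: str, embedding: list[float]) -> None:
        self.entries[gloss] = embedding

    def nearest(self, query: list[float]) -> str:
        # The prediction is always an existing entry, so it can be traced back.
        return max(self.entries, key=lambda g: cosine(query, self.entries[g]))


lex = Lexicon()
lex.add("dog", [1.0, 0.0, 0.0])
lex.add("PL", [0.0, 1.0, 0.0])

query = [0.1, 0.2, 0.9]  # hypothetical encoder output for a morpheme not yet in the lexicon
before = lex.nearest(query)
lex.add("PST", [0.0, 0.0, 1.0])  # extend the lexicon at inference time, no retraining
after = lex.nearest(query)
print(before, "->", after)
```

The model weights never change between the two lookups. Only the set of retrievable entries does, which is what makes lexicon expansion a cheap way to correct a prediction.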

154. ❌ GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation

Authors: Ja Young Lee, Mírian Silva, Mohamed Nasr, Shonda Witherspoon, Enzo Bozzani, Veronique Demers, Radha Ratnaparkhi, Hui Wu, Sara Rosenthal Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18173v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

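Scoring rationale: GRAFITE is an LLM evaluation platform. Its core content covers LLM performance evaluation, benchmark contamination, and regression detection, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The other keywords concern specific techniques (such as MoE, quantization, and inference acceleration), training methods (such as pre-training, fine-tuning, and alignment), or particular application areas (such as AI for science). The paper does not address any of these directly, so all are scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses distorted performance evaluation of large language models (LLMs) caused by benchmark data exposure during training. It proposes GRAFITE, a continuous evaluation platform that builds an issue repository from user feedback and runs quality assurance tests with an LLM as judge, supporting side-by-side comparison of multiple models and regression detection across releases.

Abstract (Chinese translation)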

大型语言模型（LLM）的研发在很大程度上受其发布时在热门话题和基准测试上的表现所驱动。然而，随着时间的推移，由于训练过程中基准数据的大量暴露，会导致数据污染问题。如果测试执行不严谨，这将带来模型性能虚高的风险。为应对这一挑战，我们提出了GRAFITE，这是一个持续性的LLM评估平台，通过一个用于维护和评估模型问题的综合系统来实现。我们的方法能够基于用户随时间推移的反馈构建模型问题库，并提供一个评估流程，利用LLM即评判员（LLM-as-a-judge）进行质量保证（QA）测试，以针对这些问题评估LLM。该平台支持多个模型的并行比较，便于在不同版本间进行回归检测。该平台可在 https://github.com/IBM/grafite 获取。演示视频请访问 www.youtube.com/watch?v=XFZyoleN56k。

Abstract

Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform through a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.

Keywords: Large language models, LLM evaluation, benchmark contamination, regression detection, quality assurance, LLM-as-a-judge, continuous evaluation, model issues
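
The regression detection mentioned in the GRAFITE abstract can be pictured as comparing per-issue QA verdicts between two releases. This is a minimal sketch, assuming the judge verdicts are already available as a mapping from issue ID to pass/fail. The function name `find_regressions` and the issue IDs are hypothetical and are not taken from the GRAFITE codebase, and the LLM-as-a-judge step itself is out of scope here.

```python
def find_regressions(old_results, new_results):
    """Return issue IDs that passed in the old release but fail in the new one."""
    return sorted(
        issue
        for issue, passed in old_results.items()
        if passed and new_results.get(issue) is False
    )

# Hypothetical judge verdicts (True = QA test passed) for two model releases.
release_v1 = {"issue-001": True, "issue-002": False, "issue-003": True}
release_v2 = {"issue-001": True, "issue-002": True, "issue-003": False}

regressions = find_regressions(release_v1, release_v2)
print(regressions)
```

Issues missing from the newer run are deliberately not counted as regressions in this sketch; a real pipeline might want to flag them separately.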

155. ❌ How LLMs Distort Our Written Language

Authors: Marwa Abdulhai, Isadora White, Yanming Wan, Ibrahim Qureshi, Joel Leibo, Max Kleiman-Weiner, Natasha Jaques Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18161v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

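Scoring rationale: The paper studies how LLMs affect human writing, including changes in meaning, loss of creativity and personal voice, and bias in AI-generated scientific reviews, so it is highly relevant to 'Large Language Models' (10 points). Because it shows LLMs altering the intent of human writing, it has some connection to 'Alignment' (5 points). Because it finds that LLMs change semantics and factual content, it has some connection to 'Hallucination Mitigation' (5 points). The remaining keywords, such as MoE, SLMs, Scaling Laws, and Pre-training, are scored 0 because the paper does not cover their technical details or applications.

!!! tip deepseek-chat TL;DR

The paper finds that widely used large language models (LLMs), when assisting with writing, substantially alter the intended meaning and creativity of human text, cause writing to lose its personal voice, and introduce scoring bias and skewed content weighting in AI-generated scientific peer reviews.

Abstract (Chinese translation)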

大型语言模型（LLM）在全球有超过十亿用户使用，最常用于辅助写作。本研究证明，LLM不仅会改变人类写作的风格与语气，还会持续改变其原意。首先，我们进行了一项人类用户研究，以了解人们在使用LLM进行写作时的实际互动方式。研究发现，重度使用LLM使在回答主题问题时保持中立立场的文章数量增加了近70%。显著更多的高频LLM用户反映，其写作的创造性降低且失去了个人风格。其次，我们利用一个在2021年LLM广泛发布前收集的人工撰写文章数据集，研究了当要求LLM根据数据集中的人工反馈修改文章时，其如何引发内容与含义的重大改变。我们发现，即使向LLM提供专家反馈并仅要求进行语法修改，它仍会以显著改变文本语义含义的方式改动内容。随后，我们考察了实际场景中的LLM生成文本，特别聚焦于近期一次顶级人工智能会议中占比21%的AI生成科学同行评审意见。分析表明，LLM生成的评审意见对研究清晰度与重要性的关注度显著降低，且给出的评分平均高出整整一分。这些发现揭示了AI使用的感知效益与其对人类写作语义产生的隐性、持续性影响之间的错位，这促使未来研究需关注广泛使用AI写作将如何影响我们的文化与科学体系。

Abstract

Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Significantly more heavy LLM users reported that the writing was less creative and not in their voice. Next, using a dataset of human-written essays that was collected in 2021 before the widespread release of LLMs, we study how asking an LLM to revise the essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of AI-generated scientific peer reviews at a recent top AI conference. We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that, on average, are a full point higher. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.

Keywords: Large Language Models, LLMs, writing assistance, semantic meaning alteration, human-AI interaction, AI-generated scientific reviews, creativity loss, peer review bias

156. ❌ Evaluating FrameNet-Based Semantic Modeling for Gender-Based Violence Detection in Clinical Records

Authors: Lívia Dutra, Arthur Lorenzi, Frederico Belcavello, Ely Matos, Marcelo Viridiano, Lorena Larré, Olívia Guaranha, Erik Santos, Sofia Reinach, Pedro de Paula, Tiago Torrent Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18124v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

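Scoring rationale: The paper studies FrameNet-based semantic modeling for detecting gender-based violence in clinical records, an application of AI in biomedicine and public health. It does not involve any large-model or deep-learning technical principles or novel methods (such as MoE, Scaling Laws, fine-tuning, inference optimization, or agents), and it does not use LLMs or related techniques. It has some connection only to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), since it applies AI in a scientific (public health and clinical) setting without involving specific bioinformatics or cheminformatics techniques. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The study examines whether FrameNet-based semantic annotation improves the detection of gender-based violence cases in electronic medical records. Models that incorporate semantic annotation improve the F1 score by more than 0.3 over models that use structured data alone, supporting the value of semantic analysis for early identification.

Abstract (Chinese translation)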

基于性别的暴力（Gender-based violence, GBV）是一个重大的公共卫生问题，世界卫生组织估计，全球有三分之一的女性在其一生中会遭受亲密伴侣的身体或性暴力。在巴西，尽管法律要求医疗保健专业人员报告此类案件，但由于难以识别虐待行为以及公共信息系统之间的整合有限，漏报现象仍然十分严重。本研究探讨了基于框架网络（FrameNet）的电子病历开放文本字段语义标注是否能够支持识别GBV的模式。我们比较了支持向量机（SVM）分类器在三种不同训练数据上的性能：（1）框架标注文本，（2）标注文本与参数化数据结合，以及（3）仅使用参数化数据。定量和定性分析表明，融合了语义标注的模型优于仅使用分类数据的模型，其F1分数提高了0.3以上，并证明特定领域的语义表征能够提供超越结构化人口统计数据的有意义信号。这些发现支持了以下假设：对临床叙述文本进行语义分析可以加强早期识别策略，并为更明智的公共卫生干预措施提供支持。

Abstract

Gender-based violence (GBV) is a major public health issue, with the World Health Organization estimating that one in three women experiences physical or sexual violence by an intimate partner during her lifetime. In Brazil, although healthcare professionals are legally required to report such cases, underreporting remains significant due to difficulties in identifying abuse and limited integration between public information systems. This study investigates whether FrameNet-based semantic annotation of open-text fields in electronic medical records can support the identification of patterns of GBV. We compare the performance of an SVM classifier for GBV cases trained on (1) frame-annotated text, (2) annotated text combined with parameterized data, and (3) parameterized data alone. Quantitative and qualitative analyses show that models incorporating semantic annotation outperform categorical models, achieving over 0.3 improvement in F1 score and demonstrating that domain-specific semantic representations provide meaningful signals beyond structured demographic data. The findings support the hypothesis that semantic analysis of clinical narratives can enhance early identification strategies and support more informed public health interventions.

Keywords: Gender-based violence detection, FrameNet semantic modeling, Electronic medical records, SVM classifier, Clinical narratives, Semantic annotation, Public health interventions, Brazil healthcare

157. ❌ Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor

Authors: Ahmed Sharshar, Hosam Elgendy, Saad El Dine Ahmed, Yasser Rohaim, Yuxia Wang Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17759v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

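Scoring rationale: The paper studies harmful humor detection with a multimodal, multilingual benchmark. Keyword relevance: 1) it evaluates SOTA models, both closed-source and open-source, which are typically LLMs, so it has some connection to 'Large Language Models' (5 points). 2) It stresses the need to distinguish explicit from implicit harmful humor, which requires deep reasoning and contextual understanding, so it is highly relevant to 'Chain of Thought' and 'System 2 Thinking' (8 points each). 3) It mentions safety alignment, which relates to 'Instruction Tuning OR Alignment OR Value Alignment' (8 points). Other keywords, such as MoE, SLMs, Scaling Laws, and Pre-training, mainly concern large-model techniques or specific application areas that the paper does not address directly, so they are scored 0. The paper does not mention any of the designated expert authors.

!!! tip deepseek-chat TL;DR

The paper introduces a multimodal, multilingual benchmark for detecting harmful humor, finds that closed-source models outperform open-source ones, and highlights the importance of culturally grounded deep reasoning for safety alignment.

Abstract (Chinese translation)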

黑色幽默通常依赖微妙的文化意蕴和隐性线索，其理解需要结合语境进行推理，这带来了现有静态基准测试未能涵盖的安全挑战。为此，我们引入了一个新颖的多模态、多语言基准，用于检测和理解有害及冒犯性幽默。我们人工构建的数据集包含英语和阿拉伯语的3000条文本与6000张图像，以及1200个涵盖英语、阿拉伯语及语言无关（通用）情境的视频。与标准毒性数据集不同，我们执行严格的标注准则：区分安全笑话与有害笑话，并将后者进一步分类为显性（公开）和隐性（隐蔽）两类，以探究深层推理能力。我们系统评估了所有模态下最先进的开源与闭源模型。研究结果表明，闭源模型显著优于开源模型，且两类模型在英语与阿拉伯语之间均存在明显的性能差距，这凸显了基于文化背景、具备推理意识的安全对齐机制的迫切需求。警告：本文包含可能具有冒犯性、有害性或偏见性的示例数据。

Abstract

Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing Safe jokes from Harmful ones, with the latter further classified into Explicit (overt) and Implicit (Covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. Warning: this paper contains example data that may be offensive, harmful, or biased.

Keywords: harmful humor detection, multimodal benchmark, multilingual dataset, safety alignment, contextual reasoning, cultural nuances, explicit and implicit harm, model evaluation

158. ❌ Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Authors: Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19235v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

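Scoring rationale: The paper proposes VEGA-3D, a framework that uses a pre-trained video generation model as an implicit world simulator to strengthen the spatial understanding of multimodal large language models. Core relevant keywords: 1) 'Large Language Models' (10 points): the paper directly targets improvements to MLLMs; 2) 'World Models' (10 points): it repurposes a video diffusion model as a 'Latent World Simulator' that has learned 3D structure and physical laws; 3) 'Pre-training' (8 points): it extracts features from a pre-trained video diffusion model. Other keywords, such as MoE, SLMs, SFT, and RAG, are unrelated to the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses the limitations of multimodal large language models in spatial reasoning and geometric understanding. It proposes VEGA-3D, a framework that extracts implicit 3D priors from a pre-trained video generation model, and reports results that surpass state-of-the-art baselines on 3D scene understanding and spatial reasoning tasks.

Abstract (Chinese translation)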

尽管多模态大语言模型展现出令人印象深刻的语义理解能力，但其常存在空间盲区，难以进行细粒度的几何推理与物理动态理解。现有解决方案通常依赖于显式的三维模态或复杂的几何支架，这些方法受限于数据稀缺性与泛化挑战。在本研究中，我们提出一种范式转变，通过利用大规模视频生成模型中的隐式空间先验知识。我们认为，为合成时序连贯的视频，这些模型已内在地学习了鲁棒的三维结构先验与物理规律。我们提出了VEGA-3D（视频提取生成感知），一种即插即用框架，可将预训练的视频扩散模型重新用作潜在世界模拟器。通过从中间噪声层级提取时空特征，并借助令牌级自适应门控融合机制将其与语义表征相结合，我们为多模态大语言模型注入了密集的几何线索，而无需显式的三维监督。在三维场景理解、空间推理与具身操作基准测试上的大量实验表明，我们的方法超越了现有先进基线，验证了生成式先验知识为物理世界理解提供了可扩展的基础。代码公开于https://github.com/H-EmbodVis/VEGA-3D。

Abstract

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.

Keywords: Multimodal Large Language Models, 3D scene understanding, video generation models, spatial reasoning, generative priors, latent world simulator, physical dynamics, geometric reasoning
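
The abstract names a token-level adaptive gated fusion mechanism but does not spell out its form. The sketch below shows one common gating pattern as an assumption, not the VEGA-3D design: a scalar gate per token, computed from the concatenated semantic and geometric features, blends the two as a convex combination. The function names, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(semantic, geometric, gate_weights, gate_bias=0.0):
    """Blend per-token features with one scalar gate per token.

    semantic, geometric: lists of tokens, each a list of d floats.
    gate_weights: 2*d weights applied to the concatenated features (assumed form).
    """
    fused = []
    for sem, geo in zip(semantic, geometric):
        logit = sum(w * x for w, x in zip(gate_weights, sem + geo)) + gate_bias
        g = sigmoid(logit)  # gate in (0, 1), different for every token
        fused.append([(1.0 - g) * s + g * v for s, v in zip(sem, geo)])
    return fused

# With zero gate weights the gate is 0.5, so the output is the plain average.
semantic = [[1.0, 0.0]]
geometric = [[0.0, 1.0]]
fused = gated_fusion(semantic, geometric, gate_weights=[0.0, 0.0, 0.0, 0.0])
print(fused)
```

In a trained model the gate weights would be learned, letting each token decide how much geometric signal to admit.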

159. ❌ Matryoshka Gaussian Splatting

Authors: Zhilin Guo, Boqiao Zhang, Hakan Aktas, Kyle Fogarty, Jeffrey Hu, Nursena Koprucu Aslan, Wenzhao Li, Canberk Baykal, Albert Miao, Josef Bengtson, Chenliang Zhou, Weihao Xia, Cristina Nader Vasconcelos, Cengiz Oztireli Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19234v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

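Scoring rationale: The paper focuses on continuous level-of-detail (LoD) rendering for 3D Gaussian Splatting and proposes the Matryoshka Gaussian Splatting training framework. All scoring keywords relate directly to large language models, deep-learning technical principles, or AI-for-science applications, whereas this paper studies 3D rendering optimization in computer graphics and does not involve large models, deep-learning techniques, or AI applications in science, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the quality-efficiency trade-off of continuous level-of-detail rendering in 3D Gaussian Splatting. It proposes Matryoshka Gaussian Splatting, a training framework that provides continuous speed-quality control while preserving full-capacity rendering quality.

Abstract (Chinese translation)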

实现单一模型可调保真度的场景渲染能力——即细节层次（Level of Detail, LoD）——对于三维高斯泼溅（3D Gaussian Splatting, 3DGS）的实际部署至关重要。现有的离散LoD方法仅提供有限的预设操作点，而同期提出的连续LoD方法虽能实现更平滑的缩放，却常在满容量渲染时出现明显的质量下降，使得LoD成为一种代价高昂的设计选择。我们提出了套娃式高斯泼溅（Matryoshka Gaussian Splatting, MGS），一种训练框架，可在不牺牲满容量渲染质量的前提下，为标准3DGS管线实现连续LoD。MGS学习一个有序的高斯集合，使得渲染其任意前缀（即前k个泼溅单元）都能产生连贯的重建结果，其保真度随预算增加而平滑提升。我们的核心思想是随机预算训练：每次迭代采样一个随机的泼溅预算，并同时优化对应的前缀和完整高斯集合。此策略仅需两次前向传播，且无需修改模型架构。在四个基准数据集和六个基线方法上的实验表明，MGS在保持其主干模型满容量性能的同时，能够通过单一模型实现连续的速度-质量权衡。对排序策略、训练目标和模型容量的广泛消融实验进一步验证了设计有效性。

Abstract

The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.

Keywords: 3D Gaussian Splatting, Level of Detail, Continuous LoD, Matryoshka Gaussian Splatting, Stochastic Budget Training, Rendering Quality, Speed-Quality Trade-off, Splat Budget
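
The stochastic budget training idea in the abstract (sample a random splat budget each iteration, then optimise both that prefix and the full set) can be illustrated with a toy loss. This is a sketch under stated assumptions, not the MGS training code: `render` is a stand-in that just sums scalar "splats", the squared-error loss and the equal weighting of the two terms are assumptions, and all names are hypothetical.

```python
import random

def render(splats, k):
    """Toy stand-in for rendering: sum the contributions of the first k splats."""
    return sum(splats[:k])

def stochastic_budget_loss(splats, target, rng):
    """Loss for one iteration: a randomly sampled prefix plus the full set.

    This mirrors the two forward passes mentioned in the abstract.
    """
    k = rng.randint(1, len(splats))  # sampled splat budget, inclusive bounds
    prefix_loss = (render(splats, k) - target) ** 2
    full_loss = (render(splats, len(splats)) - target) ** 2
    return k, prefix_loss + full_loss

# Splats ordered from most to least important; the full set matches the target.
splats = [4.0, 2.0, 1.0]
k, loss = stochastic_budget_loss(splats, target=7.0, rng=random.Random(0))
print(k, loss)
```

Because every prefix length is sampled over the course of training, the ordered set is pushed toward giving a usable result at any budget, which is the property the abstract describes.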

160. ❌ MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Authors: Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19231v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

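Scoring rationale: MonoArt focuses on articulated 3D reconstruction from a single image, which belongs to computer vision and 3D reconstruction. Although it uses deep learning, its content is unrelated to all scoring keywords, which center on large language model techniques, training methods, inference optimization, and AI agents. The paper does not involve language models, MoE, scaling laws, training or tuning methods, inference techniques, AI agents, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes MonoArt, a framework that uses progressive structural reasoning to stably reconstruct articulated 3D objects from a single image, achieving state-of-the-art accuracy and inference speed on PartNet-Mobility.

Abstract (Chinese translation)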

从单幅图像重建铰接式三维物体需要从有限的视觉证据中联合推断物体几何、部件结构及运动参数。核心难点在于运动线索与物体结构之间的耦合关系，这导致直接进行铰接参数回归具有不稳定性。现有方法通过多视角监督、基于检索的装配或辅助视频生成来应对这一挑战，但往往以牺牲可扩展性或效率为代价。我们提出MonoArt——一个基于渐进式结构推理的统一框架。该方法并非直接从图像特征预测铰接参数，而是在单一架构内逐步将视觉观测转化为规范几何、结构化部件表征和运动感知嵌入。这种结构化推理过程实现了稳定且可解释的铰接推断，无需依赖外部运动模板或多阶段流程。在PartNet-Mobility数据集上的大量实验表明，该方法在重建精度与推理速度上均达到最先进水平。该框架可进一步泛化至机器人操作与铰接式场景重建任务。

Abstract

Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

Keywords: monocular 3D reconstruction, articulated objects, progressive structural reasoning, canonical geometry, part structure, motion parameters, PartNet-Mobility, robotic manipulation

161. ❌ Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Authors: Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19232v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	3.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	2.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	1.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

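Scoring rationale: The paper proposes Cubic Discrete Diffusion (CubiD), a discrete generative model for high-dimensional representations (768-1024 dims), in the area of visual generation. Keyword relevance: 1) weak relevance to 'Large Language Models' (3 points), because the paper notes that discrete visual generation can share a unified token prediction paradigm with language models but does not explore LLM techniques in depth; 2) slight relevance to 'Scaling Laws AND Data Quality' (2 points), because the paper reports strong scaling behavior from 900M to 3.7B parameters but does not explicitly discuss data quality; 3) slight relevance to 'Pre-training' (1 point), because high-dimensional pretrained representations are involved but are not the core topic; 4) the other keywords (such as MoE, SFT, and RAG) are unrelated to the paper's visual generation and diffusion-model topic and are scored 0. The paper's main contribution lies in discrete diffusion modeling and visual generation, not in large-model technical principles or scientific applications.

!!! tip deepseek-chat TL;DR

The paper addresses the challenge of using high-dimensional discrete representations (768-1024 dims) for visual generation. It proposes Cubic Discrete Diffusion (CubiD), which achieves state-of-the-art discrete generation on ImageNet-256, and validates that the same discrete tokens serve both understanding and generation tasks.

Abstract (Chinese translation)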

基于离散标记的视觉生成技术因其能与语言模型共享统一的标记预测范式，并有望实现无缝的多模态架构而受到广泛关注。然而，当前的离散生成方法仍局限于低维潜在标记（通常为8-32维），牺牲了理解任务所必需的语义丰富性。虽然高维预训练表示（768-1024维）可能弥合这一差距，但其离散生成过程存在根本性挑战。本文提出立方离散扩散模型（Cubic Discrete Diffusion, CubiD），这是首个面向高维表示的离散生成模型。CubiD在高维离散表示中进行细粒度掩码——任何位置上的任何维度均可被掩码，并基于部分观测进行预测。这使得模型能够学习空间位置内部及位置之间的丰富关联，且生成步骤数固定为$T$，与特征维度无关（其中$T \ll hwd$）。在ImageNet-256数据集上，CubiD实现了最先进的离散生成性能，其参数量从9亿扩展到37亿时展现出强劲的缩放特性。关键的是，我们验证了这些离散化标记能够保持原始表示能力，证明同一离散标记可同时有效服务于理解与生成任务。我们希望这项工作能启发未来面向统一多模态架构的研究。代码发布于：https://github.com/YuqingWang1029/CubiD。

Abstract

Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation – any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.

Keywords: discrete diffusion, high-dimensional representations, visual generation, token prediction, multimodal architectures, ImageNet-256, scaling behavior, unified understanding and generation
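
The fine-grained masking described in the CubiD abstract (any dimension at any position can be masked) can be illustrated with a small sketch. This is not the authors' implementation: the `MASK` sentinel, the function names, and the even split of entries across the $T$ steps are all assumptions made for illustration. It only shows that entries of an $h \times w \times d$ grid are masked individually and that the step count is chosen independently of $d$.

```python
import random

MASK = -1  # hypothetical sentinel id for a masked entry


def random_cubic_mask(tokens, mask_ratio, rng):
    """Mask individual (row, col, dim) entries of an h x w x d token grid.

    Every entry is an independent candidate, so a single spatial position
    can end up partly observed and partly masked.
    """
    h, w, d = len(tokens), len(tokens[0]), len(tokens[0][0])
    coords = [(i, j, k) for i in range(h) for j in range(w) for k in range(d)]
    chosen = set(rng.sample(coords, round(mask_ratio * len(coords))))
    masked = [[[MASK if (i, j, k) in chosen else tokens[i][j][k]
                for k in range(d)] for j in range(w)] for i in range(h)]
    return masked, chosen


def entries_per_step(total_entries, num_steps):
    """Split all entries across a fixed number of steps (assumed even split)."""
    base, extra = divmod(total_entries, num_steps)
    return [base + (1 if s < extra else 0) for s in range(num_steps)]


rng = random.Random(0)
h, w, d = 2, 2, 8
tokens = [[[rng.randrange(16) for _ in range(d)] for _ in range(w)] for _ in range(h)]
masked, chosen = random_cubic_mask(tokens, mask_ratio=0.5, rng=rng)
steps = entries_per_step(h * w * d, num_steps=4)  # T = 4, far fewer than h*w*d = 32
```

Here `steps` lists how many entries would be predicted at each of the $T$ steps; raising `d` changes the per-step counts but not the number of steps.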

162. ❌ Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Authors: Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, Ziwei Liu Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19227v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

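Scoring rationale: The paper studies diffusion-based motion generation, focusing on human motion generation in computer vision and graphics. Although it involves generative models and conditional control, every keyword targets large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on), and the paper does not involve language models or text processing at all. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a diffusion-based discrete motion tokenization framework that combines semantic and kinematic conditioning, substantially improving the accuracy and controllability of human motion generation.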

Abstract

Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.

Keywords: motion generation, diffusion models, discrete tokenizer, semantic conditioning, kinematic control, HumanML3D, controllability, fidelity

163. ❌ SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Authors: Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19228v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

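Scoring rationale: SAMA focuses on instruction-guided video editing. Its core innovation is to factorize video editing into two modules, semantic anchoring and motion alignment, trained in two stages (factorized pre-training followed by supervised fine-tuning). Relevance to the keywords: 1) highly relevant (10 points): 'Pre-training' and 'Supervised Fine-tuning' are the paper's core training methods; 2) moderately relevant (5 points): 'Instruction Tuning' relates to the paper's instruction-guided editing, but the paper concentrates on a vision task rather than language model alignment; 3) unrelated (0 points): the remaining keywords mainly concern LLM techniques, reasoning methods, agent systems, and scientific AI applications, none of which bear directly on the paper's computer vision video editing topic.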
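
The table above lists a relevance of 10.0, 10.0, and 5.0 for three keywords, yet the total is 0.0. A small sketch shows why under one plausible reading of the columns. The report does not state the scoring formula, so the per-keyword rule `weight * relevance` and the pass test `total >= PASS_SCORE` are assumptions; only the weights, relevance values, and the 26.6 threshold are taken from the table and score line.

```python
# Assumption: per-keyword score = weight * relevance (not stated in the report).
PASS_SCORE = 26.6  # threshold shown in the score line

# (keyword, weight, relevance) copied from the three non-zero relevance rows
rows = [
    ("Pre-training OR Continual Pre-training OR Domain Adaptation", 0.0, 10.0),
    ("Post-training OR Supervised Fine-tuning OR SFT", 0.0, 10.0),
    ("Instruction Tuning OR Alignment OR Value Alignment", 0.0, 5.0),
]


def total_score(rows):
    """Sum the per-keyword scores under the assumed multiplicative rule."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = total_score(rows)
passed = total >= PASS_SCORE  # assumed pass criterion
```

With every weight recorded as 0.0, any multiplicative rule gives a total of 0.0 regardless of the relevance values, which matches the score line for this paper.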

!!! tip deepseek-chat TL;DR

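The paper proposes the SAMA framework, which uses factorized semantic anchoring and motion alignment to balance semantic modification against motion preservation in instruction-guided video editing. Its pre-training requires no paired editing data and yields strong zero-shot editing ability.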

Abstract

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.

Keywords: instruction-guided video editing, semantic anchoring, motion alignment, factorized pre-training, supervised fine-tuning, zero-shot video editing, video restoration, temporal dynamics

164. ❌ Under One Sun: Multi-Object Generative Perception of Materials and Illumination

Authors: Nobuo Yoshii, Xinran Nicole Han, Ryo Kawahara, Todd Zickler, Ko Nishino Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19226v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

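Scoring rationale: The paper studies MultiGP, a generative inverse rendering method in computer vision that samples radiometric constituents (reflectance, texture, and illumination) from a single image. The work belongs to computer graphics and computer vision; its core contributions include a cascaded architecture, Coordinated Guidance, Axial Attention, and a Texture Extraction ControlNet. The keywords all concern large language models (LLMs), the technical principles of deep learning, or AI applications in science, and the paper does not involve LLMs, MoE, inference acceleration, alignment, RAG, or similar techniques. The only point of contact is 'AI for Science', since inverse rendering can be viewed as a scientific application within computer vision, although it is not a typical scientific AI field such as bioinformatics or cheminformatics; it therefore receives 5 points (some relevance). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes MultiGP, a multi-object generative perception method that resolves the ambiguity of radiometric disentanglement by sampling reflectance, texture, and illumination from a single image, exploiting the fact that objects in a scene share the same illumination to recover each constituent.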

Abstract

We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents – reflectance, texture, and illumination – underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate "cross-talk" between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.

Keywords: generative inverse rendering, radiometric disentanglement, reflectance, texture, illumination, single image, diffusion models, ControlNet

165. ❌ EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

Authors: Yang Fu, Yike Zheng, Ziyun Dai, Henghui Ding Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19224v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

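Scoring rationale: The paper focuses on the computer vision tasks of video object removal and insertion, proposing EffectErase, a diffusion-based method, and VOR, a large-scale dataset. Although it involves deep learning (diffusion models), every keyword concerns either large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, reasoning, alignment, agents, and so on) or AI applications in specific scientific fields (such as bioinformatics). The paper does not touch on LLMs, their training methods, optimization techniques, reasoning mechanisms, alignment, agent systems, or scientific AI applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

Video object removal often fails to erase an object's visual effects, such as shadows and reflections. The paper proposes EffectErase, which learns object removal and insertion jointly in a reciprocal learning scheme, and builds the large-scale VOR dataset, achieving high-quality video effect erasing.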

Abstract

Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effects types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. Then, an insertion-removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.

Keywords: video object removal, effect erasing, diffusion-based inpainting, reciprocal learning, VOR dataset, object insertion, effect-aware, video synthesis

166. ❌ Spectrally-Guided Diffusion Noise Schedules

Authors: Carlos Esteves, Ameesh Makadia Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19222v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

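Scoring rationale: The paper focuses on optimizing diffusion noise schedules for image generation, a specific technical area within computer vision and generative modeling. Every scoring keyword centers on large language models (LLMs) and related techniques (training, inference, alignment, applications, and so on), whereas this paper studies pixel-level diffusion models and has no direct connection to LLMs. It does not involve any LLM technique, scientific AI application, or large-model innovation, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes a method for designing noise schedules for pixel diffusion models from an image's spectral properties. Using theoretical derivation and conditional sampling, it improves the generative quality of single-stage pixel diffusion models in the low-step regime.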

Abstract

Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image’s spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.

Keywords: denoising diffusion models, noise schedules, spectral properties, pixel diffusion, generative quality, low-step regime, image generation
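
The idea of a "tight" schedule that avoids redundant steps can be illustrated with a generic sketch. It does not reproduce the paper's spectral bounds: the minimum and maximum noise levels are simply passed in, the log-uniform spacing is an assumption, and all names and numeric values below are hypothetical. The sketch only compares how many of a fixed number of sampling steps fall inside a given useful interval.

```python
import math


def log_spaced_schedule(sigma_min, sigma_max, num_steps):
    """Return num_steps noise levels from sigma_max down to sigma_min,
    evenly spaced in log-space (an assumed spacing, not the paper's)."""
    if not (0 < sigma_min < sigma_max) or num_steps < 2:
        raise ValueError("need 0 < sigma_min < sigma_max and num_steps >= 2")
    lo, hi = math.log(sigma_min), math.log(sigma_max)
    raw = [math.exp(hi + (lo - hi) * i / (num_steps - 1)) for i in range(num_steps)]
    # Clamp to absorb floating-point rounding at the two endpoints.
    return [min(max(s, sigma_min), sigma_max) for s in raw]


def steps_inside(schedule, sigma_min, sigma_max):
    """Count schedule entries that lie within [sigma_min, sigma_max]."""
    return sum(1 for s in schedule if sigma_min <= s <= sigma_max)


# Hypothetical numbers: a wide handcrafted range versus per-image bounds.
useful_min, useful_max = 0.05, 10.0
wide = log_spaced_schedule(0.002, 80.0, 8)
tight = log_spaced_schedule(useful_min, useful_max, 8)
useful_wide = steps_inside(wide, useful_min, useful_max)
useful_tight = steps_inside(tight, useful_min, useful_max)
```

With the same step budget, the tight schedule places every step inside the useful interval, while the wide one spends some steps outside it. This is the sense in which tighter bounds matter most in the low-step regime.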

167. ❌ DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Authors: Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19219v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

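Scoring rationale: DriveTok focuses on computer vision and autonomous driving, proposing a visual tokenization method for multi-view 3D driving scenes. It mainly involves vision foundation models, 3D deformable cross-attention, multi-view reconstruction, and 3D semantic occupancy prediction. It is entirely unrelated to most keywords in the scoring list, which mainly concern LLM techniques, training methods, inference optimization, alignment, and agents. The only relevant keyword is 'World Models AND General World Models', because the abstract mentions 'vision-language-action models and world models in autonomous driving systems' and the tokenization method is intended as the visual interface for world models; it therefore receives 10 points (highly relevant, core content). No other keyword is addressed, and each scores 0.

!!! tip deepseek-chat TL;DR

Tokenizing high-resolution multi-view driving scenes for autonomous driving is inefficient and inconsistent across views. The paper proposes DriveTok, an efficient 3D driving scene tokenizer that produces unified scene tokens with 3D deformable cross-attention and supports multi-view image reconstruction, depth prediction, semantic segmentation, and 3D semantic occupancy prediction, performing well on the nuScenes dataset.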

Abstract

With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.

Keywords: 3D driving scene tokenization, multi-view reconstruction, vision foundation models, 3D deformable cross-attention, semantic occupancy prediction, autonomous driving, scene tokens, nuScenes dataset

168. ❌ Rethinking Vector Field Learning for Generative Segmentation

Authors: Chaoyang Wang, Yaobo Liang, Boci Peng, Fan Duan, Jingdong Wang, Yunhai Tong Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19218v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

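Scoring rationale: The paper focuses on applying diffusion models to generative segmentation, proposing a vector field reshaping strategy and a category encoding scheme. Every scoring keyword relates to large language models (LLMs), whereas this paper studies diffusion-based segmentation in computer vision and does not involve LLMs, MoE, SLMs, alignment, reasoning, agents, compression, hallucination mitigation, scientific AI, or any other LLM-related technique. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper studies the gradient vanishing and trajectory traversing problems that diffusion models face in generative segmentation. It proposes a vector field reshaping strategy and a category encoding scheme that substantially improve segmentation performance and narrow the gap with discriminative methods.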

Abstract

Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.

Keywords: diffusion models, generative segmentation, vector field learning, flow matching, pixel neural field, Kronecker sequences, semantic alignment, class separation
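
The abstract mentions a quasi-random category encoding "inspired by Kronecker sequences" without giving its construction. As background, the sketch below generates the classic Kronecker sequence, whose $n$-th point has coordinates $\{n\,\alpha_k\}$ (the fractional part) for irrational generators $\alpha_k$. Using $\alpha_k = \sqrt{p_k}$ for small primes is a common textbook choice and an assumption here; the function names are hypothetical and this is not the paper's encoding.

```python
import math

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]  # sqrt of a prime is irrational


def kronecker_codes(num_classes, dim):
    """Return one code in [0, 1)^dim per class.

    Class n gets coordinates frac((n + 1) * sqrt(p_k)), i.e. consecutive
    points of a Kronecker sequence with generators sqrt(p_k).
    """
    if dim > len(PRIMES):
        raise ValueError("extend PRIMES to support more dimensions")
    alphas = [math.sqrt(p) for p in PRIMES[:dim]]
    return [[((n + 1) * a) % 1.0 for a in alphas] for n in range(num_classes)]


def min_pairwise_distance(codes):
    """Smallest Euclidean distance between any two distinct codes."""
    return min(math.dist(a, b)
               for i, a in enumerate(codes) for b in codes[i + 1:])


codes = kronecker_codes(num_classes=20, dim=3)
spread = min_pairwise_distance(codes)
```

The codes are deterministic and cost one multiplication per coordinate, so they need no learned embedding table; `spread` gives a quick check that no two classes collapse onto the same point.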

169. ❌ LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Authors: Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19217v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

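Scoring rationale: The paper's core contribution is an evaluation of omnimodal LLMs (OmniLLMs) on long audio-video understanding. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) because it directly studies multimodal extensions of large language models. It is also highly relevant to 'Context Window Extension OR Long Context LLMs' (10 points) because it specifically targets understanding of long videos (10 to 90 minutes), which is essentially a long-context processing challenge. The remaining keywords, such as MoE, SLMs, training techniques, reasoning methods, agent systems, and AI for science applications, score 0 because the paper does not address those techniques or application areas.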
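The arithmetic behind the score line can be sketched as follows. The pass mark uses the formula stated in the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The per-row rule (score = weight × relevance) is an assumption inferred from the table columns, and the per-keyword cap of 15 is modeled as a simple clip because the report does not say how it is applied. The names `pass_mark` and `paper_score` are hypothetical; the report does not show its scoring code.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report configuration."""
    return base + factor * total_weight


def paper_score(rows, cap=15.0):
    """Sum weight * relevance over keywords, clipping each term at the cap (assumed)."""
    return sum(min(weight * relevance, cap) for weight, relevance in rows)


# Rows for this paper as listed in the table: two keywords at relevance 10.0,
# the other 25 at 0.0, and every listed weight equal to 0.0.
rows = [(0.0, 10.0), (0.0, 10.0)] + [(0.0, 0.0)] * 25

threshold = pass_mark(27.0)
score = paper_score(rows)
print(round(threshold, 1), score, score >= threshold)
```

Because every weight listed in the table is 0.0, the total is 0.0 regardless of the relevance values, which matches the 0.0 / 26.6 shown above.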

!!! tip deepseek-chat TL;DR

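The paper observes that existing evaluations of omnimodal LLMs focus on short audio and video clips (10 seconds to 5 minutes) and so fail to reflect real-world videos, which typically run for tens of minutes. It introduces LVOmniBench, a benchmark for long audio-video understanding, and finds that current models struggle with long audio-video inputs: open-source models score below 35% accuracy, and Gemini 3 Pro peaks at about 65%.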

摘要翻译

近期，全模态大语言模型（OmniLLMs）在音频与视频输入理解方面取得了显著进展。然而，当前的评估主要集中于10秒至5分钟的短音频和视频片段，未能反映实际应用的需求——现实场景中的视频通常长达数十分钟。为填补这一关键空白，我们提出了LVOmniBench，这是一个专门为长时音频与视频的跨模态理解而设计的新基准。该数据集包含来自开放平台的高质量视频，具有丰富的视听动态特征。通过严格的人工筛选与标注，LVOmniBench包含275段时长介于10至90分钟的视频，以及1,014组问答对。该基准旨在严格评估OmniLLMs在多个领域的能力，包括长期记忆、时序定位、细粒度理解和多模态感知。我们的大规模评估表明，当前OmniLLMs在处理长时视听输入时面临显著挑战：开源模型的准确率普遍低于35%，而Gemini 3 Pro的最高准确率约为65%。我们期望该数据集及其实验结果能够推动进一步研究，促进能够解决长时视听场景中复杂跨模态理解问题的先进模型的发展。

摘要 (Abstract)

Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

Keywords: Omnimodal LLMs, long audio-video understanding, benchmark evaluation, cross-modal comprehension, long-form video, temporal localization, multimodal perception, LVOmniBench

170. ❌ Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Authors: Shang-Jui Ray Kuo, Paola Cascante-Bonilla Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19209v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

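Scoring rationale: The paper studies vision encoder architectures for large vision-language models (VLMs); its core is a comparison of Transformers and state space models (SSMs) as vision backbones. Relevance to 'Large Language Models' is 8 because VLMs fall within the large-model family, although the paper focuses on the vision encoder rather than a pure language model. 'Pre-training' and 'Post-training' each receive 5 because the paper involves ImageNet initialization (akin to pre-training) and dense-task tuning on detection and segmentation (akin to post-training). Other keywords, such as MoE, SLMs, Scaling Laws, Alignment, RAG, CoT, Agents, and Compression, are not addressed and score 0. The paper involves no scientific AI application, so 'AI for Science' also scores 0.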

!!! tip deepseek-chat TL;DR

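The paper asks whether state space model (SSM) vision backbones can be a strong alternative to standard Vision Transformers (ViTs) as vision encoders in large vision-language models. Its systematic evaluation finds that, under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance on VQA and grounding/localization, and that it remains competitive at a substantially smaller model scale after dense-task adaptation.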

摘要翻译

大型视觉-语言模型通常采用冻结的视觉主干网络，其图像特征通过轻量级连接器映射至大型语言模型。尽管基于Transformer的编码器是标准的视觉主干，我们探讨状态空间模型视觉主干是否可成为强有力的替代方案。我们在受控环境下系统评估了视觉-语言模型中SSM视觉主干的性能。在匹配的ImageNet-1K初始化条件下，SSM主干在视觉问答与定位/接地任务中均展现出最优的综合性能。我们进一步通过检测或分割训练对SSM和ViT系列主干进行适配，发现密集任务调优普遍能提升各系列模型的性能；经过此适配后，SSM主干在显著更小的模型规模下仍保持竞争力。我们还观察到：（一）更高的ImageNet准确率或更大的主干网络并不能稳定转化为更好的视觉-语言模型性能；（二）部分视觉主干在定位任务中存在不稳定性。基于这些发现，我们提出了提升两种主干家族鲁棒性的稳定化策略，并强调SSM主干可作为视觉-语言模型中基于Transformer的视觉编码器的有力替代方案。

摘要 (Abstract)

Large vision–language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.

Keywords: Vision-Language Models, State Space Models, Vision Transformers, Vision Encoders, VQA, Grounding, Localization, Model Scale

171. ❌ RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Authors: Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, Lijun Zhang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19206v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

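Scoring rationale: The paper addresses image generation and editing in computer vision, proposing a representation-based autoencoder (RPiAE) as a visual tokenizer to improve diffusion models. Its core contributions are Representation-Pivot Regularization, a variational bridge, and a stage-wise training strategy, which together raise reconstruction fidelity and generation quality. The paper is unrelated to the vast majority of keywords (large language models, reasoning, alignment, agents, AI for science, and so on). It has only some connection to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because the method initializes from a pretrained visual representation model and fine-tunes it for reconstruction, but this is not the paper's core novelty.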

!!! tip deepseek-chat TL;DR

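The paper proposes the Representation-Pivoted AutoEncoder (RPiAE), which uses Representation-Pivot Regularization and a variational bridge to address the low reconstruction fidelity and high latent dimensionality of existing representation-based visual tokenizers in image generation and editing. It outperforms other visual tokenizers on text-to-image generation and image editing.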

摘要翻译

扩散模型已成为图像生成与编辑的主导范式，其中潜在扩散模型将去噪过程转移至紧凑的潜在空间，以实现高效性和可扩展性。近期研究尝试利用预训练的视觉表征模型作为分词器先验，其方法要么将扩散特征与表征特征对齐，要么直接复用表征编码器作为冻结的分词器。尽管此类方法能够提升生成指标，但由于编码器被冻结，它们往往受到重建保真度的限制，进而影响编辑质量；同时，其潜在空间维度通常过高，导致扩散建模困难。为应对这些局限，我们提出表征轴心自动编码器，这是一种基于表征的分词器，可同时提升生成与编辑性能。我们引入了表征轴心正则化，该训练策略使基于表征初始化的编码器能够在保持预训练表征空间语义结构的同时，针对重建任务进行微调；随后通过一个变分桥接模块将潜在空间压缩至更紧凑的形态，以优化扩散建模。我们采用目标解耦的分阶段训练策略，依次优化生成可处理性与重建保真度目标。这些组件共同构成的分词器能够保持强语义性、实现精准重建，并生成可降低扩散建模复杂度的潜在表示。实验表明，RPiAE在文本到图像生成和图像编辑任务上优于其他视觉分词器，同时在基于表征的分词器中实现了最佳的重建保真度。

摘要 (Abstract)

Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, as well as overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose Representation-Pivoted AutoEncoder, a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.

Keywords: Representation-Pivoted AutoEncoder, visual tokenizer, image generation, image editing, diffusion models, reconstruction fidelity, latent space compression, representation-based tokenizer

172. ❌ FASTER: Rethinking Real-Time Flow VLAs

Authors: Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou, Junyi Li, Kaixin Ding, Hengshuang Zhao Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19199v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

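Scoring rationale: The paper focuses on real-time execution of Vision-Language-Action (VLA) models, specifically reaction latency in robot control. Its core contribution is FASTER, which uses a Horizon-Aware Schedule to optimize action sampling and reduce reaction latency. Although VLA models combine vision and language modalities, the paper's emphasis is on action generation and real-time execution rather than the techniques or applications of large language models (LLMs) themselves. The scoring keywords all relate directly to large language models, their training methods, inference optimization, alignment techniques, agent systems, and similar topics, none of which the paper covers. All keywords therefore receive a relevance of 0.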

!!! tip deepseek-chat TL;DR

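The paper targets reaction latency when Vision-Language-Action (VLA) models are deployed on robots. It proposes FASTER, whose Horizon-Aware Schedule compresses the denoising of the immediate reaction into a single step, substantially reducing effective reaction latency on real robots, including in a highly dynamic table tennis task.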

摘要翻译

实时执行对于在物理世界中部署视觉-语言-动作（Vision-Language-Action, VLA）模型至关重要。现有的异步推理方法主要优化轨迹平滑度，却忽视了响应环境变化的关键延迟问题。本文通过重新思考动作分块策略中的反应概念，对影响反应时间的因素进行了系统性分析。我们发现反应时间遵循由首次动作时间（Time to First Action, TTFA）与执行视野共同决定的均匀分布。此外，我们揭示了在基于流的VLA模型中应用恒定调度方案的标准做法可能效率低下，并迫使系统在开始任何动作之前必须完成所有采样步骤，从而形成了反应延迟的瓶颈。为克服这一问题，我们提出了即时反应快速动作采样（Fast Action Sampling for ImmediaTE Reaction, FASTER）。通过引入视野感知调度，FASTER在流采样过程中自适应地优先处理近期动作，将即时反应的去噪过程压缩十倍（例如在$π_{0.5}$和X-VLA中）至单步完成，同时保持长视野轨迹的质量。结合流式客户端-服务器流水线架构，FASTER显著降低了真实机器人系统的有效反应延迟，尤其在消费级GPU部署场景中表现突出。包括高动态乒乓球任务在内的真实世界实验证明，FASTER为通用策略提供了前所未有的实时响应能力，能够快速生成精确且平滑的运动轨迹。

摘要 (Abstract)

Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $\pi_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectories. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.

Keywords: Vision-Language-Action (VLA), real-time execution, reaction latency, action sampling, Horizon-Aware Schedule, trajectory generation, robotics, FASTER
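The abstract states that reaction time follows a uniform distribution determined by the Time to First Action (TTFA) and the execution horizon, without giving the bounds. The simulation below is a sketch of one simple reading: an environmental change lands at a uniformly random moment inside the current execution window, the policy must wait for that window to end, and then TTFA elapses before the first new action. Under that assumption the reaction time is uniform on [TTFA, TTFA + horizon]. The bounds, the numeric values, and the name `sample_reaction_times` are illustrative assumptions, not taken from the paper.

```python
import random


def sample_reaction_times(ttfa, horizon, n=100_000, seed=0):
    """Draw reaction times under the assumed model: ttfa + U(0, horizon)."""
    rng = random.Random(seed)
    # Remaining time in the current execution window is uniform on [0, horizon].
    return [ttfa + rng.uniform(0.0, horizon) for _ in range(n)]


# Hypothetical values in seconds, chosen only to make the effect visible.
times = sample_reaction_times(ttfa=0.1, horizon=0.5)
mean_reaction = sum(times) / len(times)
print(f"mean reaction time ~ {mean_reaction:.3f} s (model mean: {0.1 + 0.5 / 2:.3f} s)")
```

In this reading, lowering TTFA shifts the whole distribution down while shortening the execution horizon narrows it, which is consistent with the abstract's focus on producing the first action sooner.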

173. ❌ Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting

Authors: Yiren Lu, Xin Ye, Burhaneddin Yaman, Jingru Luo, Zhexiao Xiong, Liu Ren, Yu Yin Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19193v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

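Scoring rationale: The paper focuses on BEV perception for autonomous driving and proposes Splat2BEV, a framework that incorporates 3D Gaussian Splatting to obtain geometry-aligned BEV representations through explicit 3D reconstruction. It is unrelated to most large-model keywords because its core is computer vision and 3D reconstruction rather than language models. It is weakly related to two keywords: 1) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points): the paper pre-trains a Gaussian generator, but this is pre-training for 3D scene reconstruction, not for large language models; 2) 'Mechanistic Interpretability OR Explainable AI' (5 points): the paper criticizes existing BEV frameworks for lacking interpretability and aims to improve geometric understanding through an explicit 3D representation, which has some connection to explainable AI but is not the core focus. No other keyword is addressed.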

!!! tip deepseek-chat TL;DR

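The paper addresses the lack of explicit 3D geometric understanding in BEV perception for autonomous driving. It proposes Splat2BEV, a framework that pre-trains a 3D Gaussian Splatting generator to obtain geometry-aligned BEV representations, and reports state-of-the-art performance on the nuScenes and Argoverse datasets.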

摘要翻译

鸟瞰图（Bird’s-Eye-View，简称BEV）感知是自动驾驶的基石，它通过融合环视图像提供统一的空间表征，从而支持语义分割、三维目标检测和运动预测等多种下游任务的推理。然而，现有的大多数BEV感知框架采用端到端的训练范式，将图像特征直接转换到BEV空间，并仅通过下游任务的监督进行优化。这种构建方式将整个感知过程视为黑箱，通常缺乏明确的三维几何理解与可解释性，导致性能未能达到最优。本文主张，显式的三维表征对于实现精确的BEV感知至关重要，并提出了Splat2BEV——一种用于BEV任务的高斯泼溅（Gaussian Splatting）辅助框架。Splat2BEV旨在学习兼具语义丰富性与几何精确性的BEV特征表征。我们首先预训练一个高斯生成器，该生成器能够从多视角输入中显式地重建三维场景，从而生成几何对齐的特征表征。随后，这些表征被投影到BEV空间中，作为下游任务的输入。在nuScenes和argoverse数据集上的大量实验表明，Splat2BEV实现了最先进的性能，并验证了将显式三维重建融入BEV感知的有效性。

摘要 (Abstract)

Bird’s-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

Keywords: BEV perception, 3D Gaussian Splatting, autonomous driving, 3D reconstruction, geometry-aligned representation, multi-view inputs, Splat2BEV, nuScenes dataset

174. ❌ Few-shot Acoustic Synthesis with Multimodal Flow Matching

Authors: Amandine Brunetto Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19176v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

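Scoring rationale: The paper studies audio synthesis and acoustic modeling, using a diffusion transformer and flow matching to generate room impulse responses. All keywords center on large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on), whereas this paper involves no language models or text processing. It belongs to audio signal processing and generative modeling and is unrelated to LLM techniques.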

!!! tip deepseek-chat TL;DR

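The paper proposes FLAC, a probabilistic method for few-shot acoustic synthesis based on flow matching and a diffusion transformer. FLAC generates plausible room impulse responses from minimal scene context, and its one-shot results outperform state-of-the-art eight-shot baselines on the AcousticRooms and Hearing Anything Anywhere datasets.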

摘要翻译

生成与场景声学特性一致的音频对于构建沉浸式虚拟环境至关重要。现有的神经声场方法能够实现空间连续的声音渲染，但通常局限于特定场景，需要密集的音频测量数据并为每个环境进行耗时的训练。少样本学习方法提升了跨房间的可扩展性，但仍依赖多次录音，且因其确定性建模方式，无法在稀疏场景上下文条件下捕捉声学固有的不确定性。本文提出流匹配声学生成方法，这是一种用于少样本声学合成的概率性方法，能够在给定极少场景上下文的情况下，对可能的房间脉冲响应分布进行建模。该方法利用基于流匹配目标训练的扩散变换器，依据空间、几何和声学线索，在新场景的任意位置生成房间脉冲响应。在AcousticRooms和Hearing Anything Anywhere数据集上，该方法仅需单次样本即超越了当前最优的八次样本基线性能。为补充传统感知评估指标，我们进一步提出了AGREE联合声学-几何嵌入方法，通过检索和分布度量实现对生成房间脉冲响应的几何一致性评估。本研究首次将生成式流匹配应用于显式房间脉冲响应合成，为鲁棒且数据高效的声学合成开辟了新方向。

摘要 (Abstract)

Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.

Keywords: acoustic synthesis, flow matching, diffusion transformer, few-shot learning, room impulse responses, probabilistic modeling, audio generation, multimodal conditioning
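For readers unfamiliar with the flow-matching objective the abstract refers to, a minimal sketch of the commonly used linear-path formulation is given below: sample noise x0 and data x1, interpolate x_t = (1 - t)·x0 + t·x1, and regress the model's velocity onto x1 - x0. Whether FLAC uses exactly this path is an assumption. The diffusion transformer is replaced here by a placeholder callable, and the array shapes and the names `flow_matching_loss` and `zero_velocity` are hypothetical.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Mean squared error between predicted and target velocity on a linear path."""
    x0 = rng.standard_normal(x1.shape)            # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))        # one time per example, in [0, 1)
    xt = (1.0 - t) * x0 + t * x1                  # point on the straight path
    target = x1 - x0                              # velocity of that path
    pred = velocity_fn(xt, t, cond)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


def zero_velocity(xt, t, cond):
    """Placeholder model that always predicts zero velocity."""
    return np.zeros_like(xt)


rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 16))    # toy stand-ins for room impulse responses
cond = rng.standard_normal((8, 4))   # toy stand-ins for scene conditioning
loss = flow_matching_loss(zero_velocity, x1, cond, rng)
print(loss > 0.0)  # True
```

In a real setup `velocity_fn` would be a trained network that takes the conditioning into account; sampling then integrates the learned velocity from noise at t = 0 to a generated response at t = 1.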

175. ❌ Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

Authors: Kwanyoung Lee, SeungJu Cha, Yebin Ahn, Hyunwoo Oh, Sungho Koh, Dong-Jin Kim Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19158v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

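Scoring rationale: The paper focuses on text-to-image generation with diffusion models, in particular concept generation and image editing in low-density regions, and proposes the Adaptive Auxiliary Prompt Blending (AAPB) framework. Its core techniques are diffusion models, prompt engineering, and a derivation based on Tweedie's identity. The scoring keywords all relate directly to large language models (LLMs), innovations in deep learning methodology, or AI applications in science, whereas this paper studies diffusion models in computer vision and has no direct connection to the listed topics (LLMs, MoE, Scaling Laws, Alignment, RAG, CoT, Agents, Quantization, and so on). All keywords therefore receive a relevance of 0.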

!!! tip deepseek-chat TL;DR

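The paper addresses the semantic misalignment and structural inconsistency that arise when diffusion models generate concepts from low-density regions. It proposes the Adaptive Auxiliary Prompt Blending (AAPB) framework, which blends prompts using an adaptive coefficient derived from Tweedie's identity, and reports better semantic accuracy and structural fidelity on the RareBench and FlowEdit datasets.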

摘要翻译

基于扩散的文本到图像（T2I）模型在生成具有照片级真实感和丰富语义的图像方面取得了显著进展。然而，当目标概念位于训练分布的低密度区域时，这些模型往往会产生语义错位或结构不一致的结果。这一局限性源于文本-图像数据集的长尾特性，其中罕见概念或编辑指令的表示不足。为解决此问题，我们提出了自适应辅助提示融合（AAPB）——一个在低密度区域稳定扩散过程的统一框架。AAPB利用辅助锚点提示，在罕见概念生成中提供语义支持，在图像编辑中提供结构支持，确保对目标提示的忠实引导。与先前启发式的提示交替方法不同，AAPB推导出一个封闭形式的自适应系数，该系数在扩散过程的每一步都能最优地平衡辅助锚点提示与目标提示之间的影响力。基于Tweedie恒等式，我们的公式为自适应提示融合提供了一个原则性的、无需训练的框架，确保了稳定且忠实于目标的生成。我们通过对照实验证明了自适应插值相较于固定插值的有效性，并在RareBench和FlowEdit数据集上实证展示了一致的性能提升，与先前无需训练的基线方法相比，实现了更优的语义准确性和结构保真度。

摘要 (Abstract)

Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie’s identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.

Keywords: Diffusion models, Text-to-image generation, Low-density regions, Adaptive prompt blending, Tweedie’s identity, Semantic alignment, Structural fidelity, Training-free framework
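The difference between fixed and per-step interpolation that the abstract contrasts can be sketched as below. The sketch assumes the blend is a convex combination of the model's predictions under the target prompt and under the auxiliary anchor prompt (it might instead be applied to prompt embeddings). The paper's closed-form coefficient is not given in the abstract and is not reproduced here; the per-step schedule is a hypothetical placeholder, as are the names `blend` and `blended_trajectory`.

```python
import numpy as np


def blend(pred_target, pred_anchor, lam):
    """Convex combination of the two predictions; lam is clipped to [0, 1]."""
    lam = float(np.clip(lam, 0.0, 1.0))
    return (1.0 - lam) * pred_target + lam * pred_anchor


def blended_trajectory(preds_target, preds_anchor, lams):
    """Apply one blending coefficient per diffusion step."""
    return [blend(t, a, l) for t, a, l in zip(preds_target, preds_anchor, lams)]


steps = 4
rng = np.random.default_rng(0)
preds_target = [rng.standard_normal(8) for _ in range(steps)]  # toy predictions
preds_anchor = [rng.standard_normal(8) for _ in range(steps)]

# Fixed interpolation: the same weight at every step.
fixed = blended_trajectory(preds_target, preds_anchor, [0.5] * steps)
# Hypothetical per-step schedule: lean on the anchor early, on the target late.
adaptive = blended_trajectory(preds_target, preds_anchor, [0.9, 0.6, 0.3, 0.0])
```

An adaptive method would compute each coefficient from the current state of the diffusion process rather than reading it from a hand-written list as done here.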

176. ❌ ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation

Authors: Kwanyoung Lee, Hyunwoo Oh, SeungJu Cha, Sungho Koh, Dong-Jin Kim Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19157v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

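Scoring rationale: The paper addresses rare-concept synthesis in text-to-image generation and proposes the ADAPT framework, which uses attention mechanisms and orthogonal components to improve prompt scheduling. It is only partly related to "Large Language Models OR LLMs OR Foundation Models" (relevance 5), because it cites the related work R2F, which uses an LLM for prompt scheduling; ADAPT itself does not study LLM techniques. No other keyword is covered, so each receives 0.

!!! tip deepseek-chat TL;DR

The paper tackles rare-concept synthesis in text-to-image generation. It proposes ADAPT, a framework that plans prompt schedules deterministically and aligns them semantically, and reports significantly better compositional generation of rare concepts on the RareBench benchmark without additional training or fine-tuning.

Abstract (Chinese translation)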

在文本到图像合成中生成罕见组合概念对扩散模型而言仍具挑战性，尤其对于训练数据中不常见的属性。尽管近期方法（如R2F）通过利用大语言模型（LLM）进行提示调度以应对此挑战，但由于语言模型的随机性及迭代式文本嵌入切换的次优引导，这些方法存在固有方差问题。为解决上述问题，我们提出ADAPT框架——一种无需训练即可确定性规划并语义对齐提示调度的框架，通过提供一致性引导来增强罕见概念的组合生成。通过利用注意力分数与正交分量，ADAPT在RareBench基准测试中显著提升了罕见概念的组合生成能力，且无需额外训练或微调。综合实验表明，ADAPT在RareBench中实现了优越性能，准确反映了罕见属性的语义信息，在保持视觉完整性的同时，为罕见组合的生成提供了确定性且精准的控制。

Abstract

Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.

Keywords: text-to-image synthesis, rare concepts generation, prompt scheduling, attention scores, orthogonal components, diffusion models, RareBench benchmark, training-free framework
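
The abstract mentions "orthogonal components" without giving the formulation. As a generic illustration only (not ADAPT's actual procedure), the sketch below computes the part of one embedding vector that is orthogonal to another and adds a scaled copy of it to the base vector. The function names, the toy vectors, and the weight `alpha` are all hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_component(v: Vector, u: Vector) -> Vector:
    """Return the part of v that is orthogonal to u: v - proj_u(v)."""
    uu = dot(u, u)
    if uu == 0.0:
        # Projection onto the zero vector is undefined; return v unchanged.
        return list(v)
    scale = dot(v, u) / uu
    return [vi - scale * ui for vi, ui in zip(v, u)]


def blend(base: Vector, rare: Vector, alpha: float) -> Vector:
    """Add alpha times the component of `rare` that `base` does not already contain."""
    perp = orthogonal_component(rare, base)
    return [b + alpha * p for b, p in zip(base, perp)]


# Toy embeddings (hypothetical values).
base = [1.0, 0.0, 0.0]
rare = [0.5, 2.0, 0.0]

perp = orthogonal_component(rare, base)
mixed = blend(base, rare, alpha=0.5)
```

The design point is that the orthogonal part carries only the information the base vector lacks, so adding it does not rescale the direction the base already encodes.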

177. ❌ Revisiting Autoregressive Models for Generative Image Classification

Authors: Ilia Sudakov, Artem Babenko, Dmitry Baranchuk | Source: arXiv | Published: 2026-03-19 | arXiv link: http://arxiv.org/abs/2603.19122v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

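Scoring rationale: The paper studies visual generative classifiers, specifically autoregressive models applied to image classification, and belongs to computer vision. All scoring keywords target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on), whereas this paper focuses on visual generative models (AR and diffusion models) and involves no language model or related technique. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper revisits generative image classification with autoregressive models. It uses any-order autoregressive models to estimate order-marginalized predictions, which removes the fixed-token-order restriction, outperforms diffusion-based classifiers on multiple image classification benchmarks, and is up to 25x more efficient.

Abstract (Chinese translation)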

类别条件生成模型已成为准确且稳健的分类器，其中扩散模型相较于其他视觉生成范式（包括自回归模型）展现出明显优势。本研究重新审视了基于视觉自回归的生成式分类器，并指出了先前方法的一个重要局限：其对固定标记顺序的依赖，这为图像理解施加了限制性归纳偏置。我们观察到，单一顺序预测更依赖于部分判别性线索，而对多种标记顺序取平均则能提供更全面的信号。基于这一洞见，我们利用最新的任意顺序自回归模型来估计顺序边缘化预测，从而释放了自回归模型的高分类潜力。我们的方法在多种图像分类基准测试中持续超越基于扩散的分类器，同时效率提升高达25倍。与最先进的自监督判别模型相比，本方法实现了具有竞争力的分类性能——这对生成式分类器而言是一项显著成就。

Abstract

Class-conditional generative models have emerged as accurate and robust classifiers, with diffusion models demonstrating clear advantages over other visual generative paradigms, including autoregressive (AR) models. In this work, we revisit visual AR-based generative classifiers and identify an important limitation of prior approaches: their reliance on a fixed token order, which imposes a restrictive inductive bias for image understanding. We observe that single-order predictions rely more on partial discriminative cues, while averaging over multiple token orders provides a more comprehensive signal. Based on this insight, we leverage recent any-order AR models to estimate order-marginalized predictions, unlocking the high classification potential of AR models. Our approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks, while being up to 25x more efficient. Compared to state-of-the-art self-supervised discriminative models, our method delivers competitive classification performance - a notable achievement for generative classifiers.

Keywords: autoregressive models, generative image classification, any-order AR models, order-marginalized predictions, diffusion models, visual generative classifiers, classification benchmarks, efficiency improvement
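
The idea of order-marginalized prediction can be sketched as follows. This is a minimal illustration, assuming that class-conditional likelihoods are averaged over token orders in probability space and that the class prior is uniform; the abstract does not specify either detail. The log-likelihood values are invented.

```python
import math
from typing import List


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def order_marginalized_posterior(log_liks: List[List[float]]) -> List[float]:
    """log_liks[k][c] is log p(x | class c) under token order k.

    Averages the likelihoods over orders, then applies Bayes' rule
    with a uniform class prior (both are assumptions of this sketch).
    """
    num_orders = len(log_liks)
    num_classes = len(log_liks[0])
    marginal = [
        logsumexp([log_liks[k][c] for k in range(num_orders)]) - math.log(num_orders)
        for c in range(num_classes)
    ]
    norm = logsumexp(marginal)
    return [math.exp(m - norm) for m in marginal]


# Three token orders, three classes (hypothetical numbers).
log_liks = [
    [-10.0, -11.0, -12.0],  # order 0 alone favors class 0
    [-12.0, -9.0, -12.0],
    [-11.0, -9.5, -13.0],
]

posterior = order_marginalized_posterior(log_liks)
predicted = max(range(len(posterior)), key=posterior.__getitem__)
single_order_pick = max(range(3), key=log_liks[0].__getitem__)
```

In this toy case the first order on its own picks class 0, while the average over all three orders picks class 1, which mirrors the abstract's observation that single-order predictions rely on partial cues.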

178. ❌ GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning

Authors: Yiren Lu, Yi Du, Disheng Liu, Yunlai Zhou, Chen Wang, Yu Yin | Source: arXiv | Published: 2026-03-19 | arXiv link: http://arxiv.org/abs/2603.19137v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

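Scoring rationale: The paper "GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning" studies spatial exploration and reasoning in embodied AI and proposes a zero-shot framework based on 3D Gaussian Splatting (3DGS). Its core contribution is using 3DGS as a persistent spatial memory, combined with a vision-language model (VLM) for reasoning, together with a hybrid exploration strategy. None of the scoring keywords is directly relevant, however. The paper does not cover LLM techniques, model architectures (e.g. MoE, SLMs), training methods (e.g. pre-training, fine-tuning, alignment, RLHF, PEFT), inference optimization (e.g. RAG, context extension, attention optimization, quantization, decoding acceleration), reasoning techniques (e.g. chain of thought, System 2 thinking, MCTS, self-correction), agent techniques (e.g. LLM agents, tool use, multi-agent systems), model interpretability, world models, model merging, or in-context learning, nor scientific AI applications such as bioinformatics or cheminformatics. Although the paper uses a VLM, it does not examine VLM internals or propose VLM innovations; it focuses on the framework design for 3D scene representation and embodied exploration. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

Existing scene representations in embodied AI lack post-hoc re-observability, which leads to memory omissions. The paper proposes GSMem, a framework that uses 3D Gaussian Splatting as a persistent spatial memory and combines it with a vision-language model for zero-shot exploration and reasoning. Experiments on embodied question answering and lifelong navigation indicate that the framework is robust and effective.

Abstract (Chinese translation)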

有效的具身探索要求智能体能够随时间积累并保持空间知识。然而，现有的场景表示方法（如离散场景图或基于静态视角的快照）缺乏事后可重观测性。若初始观测遗漏了目标，由此产生的记忆缺失往往无法弥补。为弥合这一差距，我们提出了GSMem——一个基于3D高斯溅射（3D Gaussian Splatting, 3DGS）构建的零样本具身探索与推理框架。通过显式参数化连续几何与稠密外观，3DGS可作为持久性空间记忆，赋予智能体空间回溯能力：即能够从先前未占据的最优视点渲染逼真新视图。为实现这一能力，GSMem采用了一种检索机制，同时利用并行的物体级场景图与语义级语言场。这种互补设计能稳健定位目标区域，使智能体能够“幻构”出适用于高保真视觉语言模型（Vision-Language Model, VLM）推理的最优视图。此外，我们提出了一种混合探索策略，将VLM驱动的语义评分与基于3DGS的覆盖目标相结合，从而平衡任务感知探索与几何覆盖。在具身问答与终身导航任务上的大量实验验证了本框架的鲁棒性与有效性。

Abstract

Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to “hallucinate” optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.

Keywords: 3D Gaussian Splatting, embodied exploration, spatial memory, zero-shot reasoning, Vision-Language Model, scene representation, novel view rendering, hybrid exploration strategy
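
The hybrid exploration strategy combines a VLM-driven semantic score with a 3DGS-based coverage objective, but the abstract does not say how the two terms are combined. The sketch below assumes a simple convex combination with a hypothetical `weight` parameter; the candidate names and scores are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    semantic_score: float  # hypothetical VLM relevance in [0, 1]
    coverage_gain: float   # hypothetical fraction of unseen area in [0, 1]


def hybrid_score(c: Candidate, weight: float = 0.5) -> float:
    """Convex combination of task relevance and geometric coverage (assumed form)."""
    return weight * c.semantic_score + (1.0 - weight) * c.coverage_gain


def pick_next(candidates: List[Candidate], weight: float = 0.5) -> Candidate:
    """Choose the candidate viewpoint with the highest hybrid score."""
    return max(candidates, key=lambda c: hybrid_score(c, weight))


candidates = [
    Candidate("kitchen door", semantic_score=0.9, coverage_gain=0.2),
    Candidate("hallway", semantic_score=0.3, coverage_gain=0.9),
    Candidate("closet", semantic_score=0.1, coverage_gain=0.1),
]

balanced_choice = pick_next(candidates, weight=0.5).name
task_focused_choice = pick_next(candidates, weight=0.8).name
```

With equal weighting the agent prefers the hallway because of its coverage gain; raising the weight on the semantic term shifts the choice to the kitchen door.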

179. ❌ TAU-R1: Visual Language Model for Traffic Anomaly Understanding

Authors: Yuqiang Lin, Kehua Chen, Sam Lockyer, Arjun Yadav, Mingxuan Sui, Shucheng Zhang, Yan Shi, Bingzhang Wang, Yuang Zhang, Markus Zarbock, Florain Stanek, Adrian Evans, Wenbin Li, Yinhai Wang, Nic Zhang | Source: arXiv | Published: 2026-03-19 | arXiv link: http://arxiv.org/abs/2603.19098v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

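Scoring rationale: The paper proposes TAU-R1, a two-layer vision-language model framework for traffic anomaly understanding. Its core innovation is a two-stage training strategy: the first stage is supervised fine-tuning (SFT) enhanced with decomposed QA, and the second stage is a GRPO-based post-training method (TAU-GRPO). It is therefore highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (relevance 10). The paper applies large models to the traffic domain, so it is partly related to "Large Language Models OR LLMs OR Foundation Models" (relevance 5), and it was rated as an "AI for Science" application in the traffic domain (relevance 8). Other keywords such as MoE, quantization, inference acceleration, and RAG are not covered and receive 0.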
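
The relevance values in this entry (10, 5 and 8) do not raise the total because every weight in its table is 0.0. As a rough sketch only: the pass mark formula below is the one stated in the report configuration (5.0 + 0.8 × total weight), while the per-keyword rule `weight * relevance` is an assumption, since the report does not state how individual scores are derived. Function names are hypothetical.

```python
from typing import List, Tuple


def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows: List[Tuple[float, float]]) -> float:
    """ASSUMPTION: per-keyword score = weight * relevance (not confirmed by the report)."""
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) for the three non-zero-relevance rows of the TAU-R1 table;
# all remaining rows have relevance 0 and contribute nothing.
tau_r1_rows = [(0.0, 5.0), (0.0, 10.0), (0.0, 8.0)]

threshold = round(pass_mark(27.0), 1)  # 27 keywords at weight 1.0 in the configuration
score = total_score(tau_r1_rows)
passed = score >= threshold
```

Under this assumed rule, a weight of 0.0 cancels any relevance value, which is consistent with the 0.0 total shown for this entry.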

!!! tip deepseek-chat TL;DR

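For traffic anomaly understanding, the paper proposes TAU-R1, a two-layer vision-language model framework, and introduces a new dataset, Roundabout-TAU, together with a two-stage training strategy. Experiments show good performance on both anomaly classification and reasoning tasks.

Abstract (Chinese translation)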

交通异常理解（Traffic Anomaly Understanding, TAU）对于智能交通系统中的交通安全至关重要。近期的视觉-语言模型（Vision-Language Models, VLMs）在视频理解方面展现出强大能力。然而，由于缺乏专门的基准数据集和针对性的任务方法，TAU的研究进展仍然有限。为应对这一局限，我们引入了Roundabout-TAU数据集，该数据集基于与美国印第安纳州卡梅尔市合作采集的真实世界环岛监控视频构建而成。数据集包含342个视频片段，并标注了超过2000个问答对，涵盖了交通异常理解的多个方面。基于此基准，我们提出了TAU-R1——一个用于TAU的双层视觉-语言框架。第一层是轻量级异常分类器，负责进行粗粒度的异常分类；第二层是规模更大的异常推理器，用于生成详细的事件摘要。为提升任务特定推理能力，我们引入了一种两阶段训练策略：首先进行基于分解式问答增强的监督微调，随后采用TAU-GRPO——一种基于GRPO的后训练方法，并配备了针对TAU设计的奖励函数。实验结果表明，TAU-R1在异常分类和推理任务上均表现出色，同时保持了部署效率。数据集与代码已公开于：https://github.com/siri-rouser/TAU-R1

Abstract

Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1

Keywords: Traffic Anomaly Understanding, Vision-Language Model, Roundabout-TAU dataset, Supervised Fine-tuning, GRPO-based post-training, Anomaly classification, Anomaly reasoning, Intelligent Transportation Systems
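
TAU-GRPO is described only as a GRPO-based method with TAU-specific reward functions. The sketch below shows the group-relative advantage normalization commonly used in GRPO, where each sampled response's reward is standardized against the other responses to the same prompt. The reward values are invented, the TAU-specific rewards are not modeled, and the use of the population standard deviation is a choice made for this sketch.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (sketch assumption)
    # eps guards against division by zero when all rewards are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.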

180. ❌ DROID-SLAM in the Wild

Authors: Moyang Li, Zihan Zhu, Marc Pollefeys, Daniel Barath | Source: arXiv | Published: 2026-03-19 | arXiv link: http://arxiv.org/abs/2603.19076v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

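Scoring rationale: The paper "DROID-SLAM in the Wild" focuses on SLAM (simultaneous localization and mapping) in computer vision, specifically a robust real-time RGB SLAM system for dynamic environments. Its core contribution is differentiable, uncertainty-aware bundle adjustment that handles dynamic objects and cluttered scenes for real-time tracking and reconstruction. The scoring keywords all concern large language models, deep learning technical principles, AI for Science, and similar topics, whereas the paper's subject matter (SLAM, dynamic environments, uncertainty estimation, real-time systems) has no direct connection to them. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents a robust real-time RGB SLAM system that handles dynamic environments with differentiable uncertainty-aware bundle adjustment. It addresses the tracking failures of traditional SLAM in dynamic scenes and achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenes while running in real time at around 10 FPS.

Abstract (Chinese translation)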

本文提出一种鲁棒的实时RGB SLAM系统，该系统通过利用可微分的不确定性感知光束法平差（Differentiable Uncertainty-aware Bundle Adjustment）处理动态环境。传统SLAM方法通常假设场景静态，在存在运动时会导致跟踪失败。近期的动态SLAM方法尝试使用预定义的动态先验或不确定性感知建图来应对这一挑战，但在面对未知动态物体或几何建图不可靠的高度杂乱场景时仍存在局限。相比之下，我们的方法通过利用多视角视觉特征不一致性来估计逐像素不确定性，从而即使在真实世界环境中也能实现鲁棒的跟踪与重建。所提出的系统在杂乱动态场景中实现了最先进的相机位姿与场景几何重建效果，同时以约10帧/秒的速度实时运行。代码与数据集已发布于https://github.com/MoyangLi00/DROID-W.git。

Abstract

We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at https://github.com/MoyangLi00/DROID-W.git.

Keywords: SLAM, dynamic environments, real-time, uncertainty-aware bundle adjustment, robust tracking, scene reconstruction, RGB SLAM, multi-view visual feature inconsistency
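
The abstract describes per-pixel uncertainty that lets the system stay robust when parts of the scene move. The toy sketch below is not the DROID-W optimizer. It only illustrates, under the assumption of inverse-variance weighting, how a large uncertainty suppresses an outlier in a one-dimensional least-squares estimate. All names and numbers are hypothetical.

```python
from typing import List


def weighted_estimate(observations: List[float], sigmas: List[float]) -> float:
    """Minimize sum_i ((x - obs_i) / sigma_i)^2; the solution is a weighted mean."""
    weights = [1.0 / (s * s) for s in sigmas]
    return sum(w * o for w, o in zip(weights, observations)) / sum(weights)


# Three consistent measurements and one outlier (e.g. a pixel on a moving object).
observations = [1.0, 1.1, 0.9, 5.0]

# Equal uncertainty: the outlier drags the estimate away from 1.0.
uniform = weighted_estimate(observations, [1.0, 1.0, 1.0, 1.0])

# High uncertainty on the outlier: its influence is almost removed.
robust = weighted_estimate(observations, [0.1, 0.1, 0.1, 10.0])
```

The same principle carries over to bundle adjustment, where each reprojection residual would be scaled by its estimated uncertainty before the poses and geometry are optimized.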

Authors: Ye Wang, Wei Lu, Zhihui You, Keyan Chen, Tongfei Liu, Kaiyu Li, Hongruixuan Chen, Qingling Shu, Sibao Chen | Source: arXiv | Published: 2026-03-19 | arXiv link: http://arxiv.org/abs/2603.19077v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	2.0/10	0.0

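Scoring rationale: The paper focuses on multi-modal remote sensing change detection in computer vision, specifically detecting small building changes in RGB-NIR imagery. Its core contributions are a new dataset, LSMD, and a multi-modal feature fusion network, MSCNet. The keywords all relate directly to large language models, deep learning technical principles, or AI applications in science, and the paper's content is essentially unrelated to them. The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because remote sensing image analysis can be viewed as an application of AI in science (earth science and environmental monitoring). The paper does not explicitly adopt an AI for Science framing and is a specific technical application, so it receives a relevance of 2 (weak association). The remaining keywords concern large-model techniques, training methods, inference optimization, agent systems, and so on, and are entirely unrelated; the paper involves no language model and no innovation in the related deep learning principles.

!!! tip deepseek-chat TL;DR

To address small building change detection in optical remote sensing imagery, the paper proposes LSMD, a large-scale multi-modal dataset, and MSCNet, a Multi-modal Spectral Complementarity Network, whose cross-modal feature fusion significantly improves detection accuracy.

Abstract (Chinese translation)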

光学遥感影像中的变化检测易受光照波动、季节变化及地表覆盖物材质差异的影响。仅依赖RGB影像常产生伪变化并导致特征语义模糊。引入近红外信息可提供与可见光互补的异质物理线索，从而增强建筑材料与微小结构的可区分性，同时提升检测精度。然而，现有多模态数据集普遍缺乏高分辨率且精确配准的双时相影像，且当前方法往往未能充分利用模态间的固有异质性。针对这些问题，我们提出了大规模小变化多模态数据集（LSMD），这是一个面向现实场景中小尺度变化检测的双时相RGB-NIR建筑变化检测基准数据集，为评估复杂环境下的多模态变化检测方法提供了严谨的测试平台。基于LSMD，我们进一步提出了多模态光谱互补网络（MSCNet）以实现有效的跨模态特征融合。MSCNet包含三个核心组件：用于增强局部空间细节的邻域上下文增强模块（NCEM）、实现RGB与NIR特征深度交互的跨模态对齐与交互模块（CAIM），以及逐步优化融合特征的显著性感知多源细化模块（SMRM）。大量实验表明，MSCNet能有效利用多模态信息，在多种输入配置下均稳定优于现有方法，验证了其在细粒度建筑变化检测中的有效性。源代码将公开于：https://github.com/AeroVILab-AHU/LSMD

Abstract

Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: https://github.com/AeroVILab-AHU/LSMD

Keywords: change detection, multi-modal, remote sensing, RGB-NIR, building change, small changes, feature fusion, MSCNet

182. ❌ Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos

Authors: Weijia Dou, Wenzhao Zheng, Weiliang Chen, Yu Zheng, Jie Zhou, Jiwen Lu | Source: arXiv | Published: 2026-03-19 | arXiv link: http://arxiv.org/abs/2603.19048v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

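Scoring rationale: The paper focuses on evaluating the 3D spatial geometric consistency of video generation models and proposes a new metric, SGC. All scoring keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, applications, and so on), whereas this paper studies the evaluation of video generation models, which belongs to computer vision and generative modeling and has no direct connection to LLM techniques. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address 3D spatial geometric inconsistency in dynamically generated videos, the paper proposes a new evaluation metric, SGC. It quantifies geometric consistency by separating static from dynamic regions, predicting depth, estimating local camera poses, and computing the divergence among those poses. Experiments show that the metric identifies critical failures missed by existing methods.

Abstract (Chinese translation)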

近期生成模型虽能产出高保真度视频，却常表现出三维空间几何不一致性。现有评估方法难以准确刻画此类不一致：以保真度为核心的指标（如FVD）对几何失真不敏感，而侧重一致性的基准测试又常对有效的前景动态变化施加不当惩罚。为弥补这一空白，我们提出SGC——一种用于评估动态生成视频中三维空间几何一致性的度量标准。我们通过测量从不同局部区域估计出的多个相机位姿之间的差异来量化几何一致性。该方法首先分离静态与动态区域，随后将静态背景划分为空间连贯的子区域。我们预测每个像素的深度，为每个子区域估计局部相机位姿，并通过计算这些位姿间的离散度来量化几何一致性。在真实视频与生成视频上的实验表明，SGC能稳健地量化几何不一致性，有效识别出现有指标遗漏的关键缺陷。

Abstract

Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each sub-region, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generative videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.

Keywords: 3D spatial geometric consistency, video generation evaluation, camera pose estimation, depth prediction, generative models, metric design, static-dynamic separation, geometric distortion
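
The final step of the pipeline, computing the divergence among per-region camera poses, can be sketched as follows. The abstract does not define the divergence, so this sketch assumes a rotation-only measure: the mean pairwise geodesic angle between rotation matrices, with translation ignored. The helper names and toy poses are hypothetical.

```python
import math
from itertools import combinations
from typing import List

Matrix = List[List[float]]


def rot_z(theta: float) -> Matrix:
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_angle_between(a: Matrix, b: Matrix) -> float:
    """Geodesic angle in radians between two rotation matrices."""
    # trace(a^T b) equals the sum of element-wise products of a and b.
    trace = sum(a[i][j] * b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def mean_pairwise_divergence(rotations: List[Matrix]) -> float:
    """Average angle over all pairs of per-region rotation estimates."""
    pairs = list(combinations(rotations, 2))
    return sum(rotation_angle_between(a, b) for a, b in pairs) / len(pairs)


# Sub-regions of a geometrically consistent frame pair agree on the camera motion.
consistent = [rot_z(0.10), rot_z(0.10), rot_z(0.10)]
# Inconsistent geometry yields conflicting per-region estimates.
inconsistent = [rot_z(0.10), rot_z(0.30), rot_z(-0.20)]

low = mean_pairwise_divergence(consistent)
high = mean_pairwise_divergence(inconsistent)
```

A static background seen by a single camera should give the same pose from every sub-region, so a larger spread signals geometric inconsistency.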

183. ❌ SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation

Authors: Phuc Pham, Uy Dieu Tran, Binh-Son Hua, Phong Nguyen | Source: arXiv | Published: 2026-03-19 | arXiv link: http://arxiv.org/abs/2603.19053v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

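Scoring rationale: SwiftTailor studies 3D garment generation, which belongs to computer vision and digital fashion. The paper uses a vision-language model (PatternMaker), but this is an application of a vision-language model to a specific domain rather than core LLM technology. Its core innovations are the geometry image representation and an efficient inference framework, which are unrelated to most large-model keywords (MoE, RLHF, RAG, and so on). The only relevant keyword is "Speculative Decoding OR Inference Acceleration", because the paper targets the slow inference of existing methods, which take 30 seconds to a minute. This is acceleration for a specific 3D generation task, however, not a general inference acceleration technique for large models. Other keywords, such as AI for Science, do not match the paper's computer vision application area.

!!! tip deepseek-chat TL;DR

The paper addresses the slow inference of existing 3D garment generation methods. It proposes SwiftTailor, a framework that uses a geometry image representation and lightweight modules to generate 3D garments efficiently and with high quality.

Abstract (Chinese translation)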

逼真且高效的三维服装生成始终是计算机视觉与数字时尚领域长期存在的挑战。现有方法通常依赖大型视觉语言模型生成二维缝纫图案的序列化表示，再通过如GarmentCode等服装建模框架将其转换为可用于仿真的三维网格。尽管这些方法能产生高质量结果，但其推理速度往往较慢，通常需要30秒至一分钟。本研究提出了SwiftTailor——一种通过紧凑几何图像表示将缝纫图案推理与基于几何的网格合成相统一的新型两阶段框架。SwiftTailor包含两个轻量级模块：PatternMaker（一种能从多模态输入高效预测缝纫图案的视觉语言模型）和GarmentSewer（一种高效密集预测变换器，可将缝纫图案转换为新型的Garment Geometry Image，在统一UV空间中编码所有服装裁片的三维表面）。最终三维网格通过高效逆映射过程重建，该过程结合了重网格划分与动态缝合算法直接组装服装，从而分摊了物理仿真的计算成本。在Multimodal GarmentCodeData数据集上的大量实验表明，SwiftTailor在显著降低推理时间的同时，实现了最先进的精度与视觉保真度。本研究为新一代三维服装生成提供了可扩展、可解释且高性能的解决方案。

Abstract

Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

Keywords: 3D garment generation, geometry image representation, sewing pattern reasoning, vision-language model, inference acceleration, digital fashion, Garment Geometry Image, efficient mesh synthesis

184. ❌ FUMO: Prior-Modulated Diffusion for Single Image Reflection Removal

Authors: Telang Xu, Chaoyang Zhang, Guangtao Zhai, Xiaohong Liu Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19036v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
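
The pass/fail decision behind a table like the one above can be sketched as follows. The threshold formula (5.0 + 0.8 × total weight, with 27 keywords and a total weight of 27.0) comes from the report's configuration header. The per-keyword rule `weight * relevance` is an assumption made for illustration, since the report does not state how the Score column is computed, and the function names are hypothetical.

```python
# Sketch of the pass/fail check; the per-keyword product rule is an assumption.

def passing_score(total_weight, base=5.0, factor=0.8):
    """Threshold from the report's configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum of weight * relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# 27 keywords, each listed with weight 0.0 and relevance 0.0 in the table.
rows = [(0.0, 0.0)] * 27
threshold = passing_score(total_weight=27.0)
score = paper_score(rows)
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```

Under this assumed rule, any row with weight 0.0 contributes nothing regardless of its relevance, which is consistent with the 0.0 totals shown for papers that have non-zero relevance entries.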

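Scoring rationale: The paper addresses single image reflection removal, a computer vision task, and proposes a diffusion model with a prior-modulation framework. Although it uses deep learning (diffusion models), every scoring keyword targets large language models (LLMs) and their related techniques, applications, and optimization methods. The paper has no connection to LLMs, MoE, SLMs, alignment, reasoning, agents, AI for science, or the other keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes FUMO, a diffusion framework with prior modulation for single image reflection removal. It extracts an intensity prior and a high-frequency prior to improve spatial controllability and structural faithfulness, and reports competitive quantitative results and consistently better perceptual quality on standard benchmarks and challenging in-the-wild images.

Abstract (Chinese translation)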

单图像反射去除（SIRR）在真实场景中具有挑战性，其中反射强度在空间上变化，且反射模式与透射结构紧密交织。本文提出一种带有先验调制框架的扩散模型（FUMO），该框架引入显式引导信号以提升空间可控性和结构保真度。两种先验直接从混合图像中提取：一种强度先验用于估计空间反射强度，另一种高频先验通过多尺度残差聚合捕捉细节敏感响应。我们提出一种由粗到精的训练范式。在第一阶段，这些线索被结合以门控条件残差注入，使条件聚焦于反射主导且结构敏感的区域。在第二阶段，一个细粒度优化网络在图像空间中校正局部错位并锐化精细细节。在标准基准测试和真实场景中的挑战性图像上进行的实验表明，该方法取得了具有竞争力的定量结果，并持续提升了感知质量。代码发布于 https://github.com/Lucious-Desmon/FUMO。

Abstract

Single image reflection removal (SIRR) is challenging in real scenes, where reflection strength varies spatially and reflection patterns are tightly entangled with transmission structures. This paper presents a diffusion model with prior modulation framework (FUMO) that introduces explicit guidance signals to improve spatial controllability and structural faithfulness. Two priors are extracted directly from the mixed image, an intensity prior that estimates spatial reflection severity and a high-frequency prior that captures detail-sensitive responses via multi-scale residual aggregation. We propose a coarse-to-fine training paradigm. In the first stage, these cues are combined to gate the conditional residual injections, focusing the conditioning on regions that are both reflection-dominant and structure-sensitive. In the second stage, a fine-grained refinement network corrects local misalignment and sharpens fine details in the image space. Experiments conducted on both standard benchmarks and challenging images in the wild demonstrate competitive quantitative results and consistently improved perceptual quality. The code is released at https://github.com/Lucious-Desmon/FUMO.

Keywords: Single image reflection removal, Diffusion model, Prior modulation, Intensity prior, High-frequency prior, Coarse-to-fine training, Spatial controllability, Structural faithfulness

185. ❌ Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

Authors: Anqi Zhang, Xiaokang Ji, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19026v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

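Scoring rationale: The paper applies multi-modal large language models (MLLMs) to image segmentation. Its core contribution is decoder-free segmentation with a single segmentation embedding, together with improvements to feature resolution and the attention mechanism. It is therefore highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (relevance 10), since MLLMs are a multimodal extension of LLMs and fall within large-model technology. The remaining keywords are not addressed or mentioned in the paper and score 0: MoE, SLMs, scaling laws, training methods (pre-training, SFT, RLHF, and so on), inference optimization (RAG, context window, KV cache), reasoning (CoT, System 2), agents (agents, tool use), model efficiency (quantization, speculative decoding), reliability (hallucination mitigation), interpretability, world models, model merging, in-context learning, and AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper studies how to achieve decoder-free image segmentation in multi-modal large language models using a single segmentation embedding. By improving feature resolution and the attention mechanism, it reaches performance competitive with methods based on specialist mask decoders across multiple segmentation tasks.

Abstract (Chinese translation)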

近期利用多模态大语言模型（MLLMs）的分割方法已展现出可靠的对象级分割能力与增强的空间感知性能。然而，几乎所有现有方法主要依赖专用掩码解码器来解析由生成的分割相关嵌入与视觉特征衍生的掩码，或引入多个额外辅助标记。本文旨在探究是否以及如何仅通过1个分割嵌入（SELF1E）从MLLM自身实现分割，同时取得具有竞争力的结果，从而消除对外部解码器的依赖。为此，我们的方法针对MLLM中像素重排图像特征分辨率降低这一根本性局限展开研究。首先，我们保留图像特征在原始未压缩分辨率下的状态，并通过从MLLM处理后的压缩特征中提取残差特征对其进行补充，从而提升特征精度。随后，我们分别对经过LLM处理与未处理的图像特征进行像素反重排操作，以释放压缩特征的细节并在未压缩分辨率下增强残差特征，从而进一步提升补充后特征的分辨率。此外，我们重新设计了具有双感知路径（即图像到图像与图像到分割）的注意力掩码，实现了像素与分割标记之间丰富的特征交互。在多个分割任务上的综合实验验证了SELF1E能够取得与基于专用掩码解码器方法相竞争的性能，证明了在MLLM中实现无解码器分割的可行性。项目页面：https://github.com/ANDYZAQ/SELF1E。

Abstract

Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: https://github.com/ANDYZAQ/SELF1E.

Keywords: Multi-modal Large Language Models, MLLMs, segmentation, decoder-free, single segmentation token, feature resolution, attention mask, pixel-unshuffle
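
The abstract above refers to pixel-shuffle and pixel-unshuffle rearrangements of image features. As a minimal sketch of what such a rearrangement does, the pure-Python functions below trade spatial resolution for channels and back. The naming follows the PyTorch convention (unshuffle reduces spatial size, shuffle restores it), which may be the reverse of how the paper uses the terms; the function names and the toy input are illustrative and are not taken from the SELF1E code.

```python
def pixel_unshuffle(x, r):
    """Rearrange a (C, H*r, W*r) nested list into (C*r*r, H, W)."""
    channels, height, width = len(x), len(x[0]) // r, len(x[0][0]) // r
    out = []
    for c in range(channels):
        for i in range(r):
            for j in range(r):
                # Each (i, j) offset inside an r x r block becomes its own channel.
                out.append([[x[c][h * r + i][w * r + j] for w in range(width)]
                            for h in range(height)])
    return out


def pixel_shuffle(x, r):
    """Inverse rearrangement: (C*r*r, H, W) back to (C, H*r, W*r)."""
    channels, height, width = len(x) // (r * r), len(x[0]), len(x[0][0])
    out = [[[None] * (width * r) for _ in range(height * r)]
           for _ in range(channels)]
    for c in range(channels):
        for i in range(r):
            for j in range(r):
                plane = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = plane[h][w]
    return out


# Toy 1 x 4 x 4 feature map whose values encode their own position.
image = [[[row * 4 + col for col in range(4)] for row in range(4)]]
packed = pixel_unshuffle(image, 2)    # 4 x 2 x 2: lower resolution, more channels
restored = pixel_shuffle(packed, 2)   # back to 1 x 4 x 4
```

The rearrangement is lossless: no values are discarded, so the round trip recovers the input exactly. What changes is only whether neighbouring pixels sit side by side spatially or are stacked along the channel axis.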

186. ❌ Generalized Hand-Object Pose Estimation with Occlusion Awareness

Authors: Hui Yang, Wei Sun, Jian Liu, Jian Xiao Tao Xie, Hossein Rahmani, Ajmal Saeed Mian, Nicu Sebe, Gim Hee Lee Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19013v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

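Scoring rationale: The paper addresses 3D hand-object pose estimation in computer vision and proposes the GenHOI framework to improve generalization under occlusion. It uses multi-modal inputs (RGB images, point clouds, textual descriptions) and a hierarchical semantic prompt, but all of these serve a deep learning model for a specific vision task; the paper does not involve large language models (LLMs), architectural innovations (such as MoE or quantization), training methods (such as RLHF or PEFT), inference optimization, or AI agents. The only keyword judged relevant is 'AI for Science OR Bioinformatics OR Cheminformatics', on the grounds that the paper applies AI in a scientific field (computer vision); since it is not core bioinformatics or cheminformatics, it receives a relevance of 5 (some connection). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes GenHOI, a framework that integrates hierarchical semantic knowledge with hand priors for generalized 3D hand-object pose estimation from a single RGB image under occlusion, and reports state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks.

Abstract (Chinese translation)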

从单张RGB图像进行广义三维手-物姿态估计仍面临巨大挑战，这主要源于物体外观与交互模式的显著差异，尤其在严重遮挡条件下。我们提出GenHOI——一个具备遮挡感知能力的广义手-物姿态估计框架。该框架通过整合层次化语义知识与手部先验信息，增强了模型在挑战性遮挡条件下的泛化能力。具体而言，我们设计了一种层次化语义提示机制，通过文本描述编码物体状态、手部构型及交互模式。这使得模型能够学习手物交互的抽象高层表征，从而泛化至未见物体及新型交互场景，同时补偿缺失或模糊的视觉线索。为实现鲁棒的遮挡推理，我们在RGB图像、预测点云及文本描述上采用多模态掩码建模策略。此外，我们利用手部先验作为稳定的空间参照，以提取隐式交互约束。这使得即使在物体形状与交互模式存在显著差异时，仍能实现可靠的姿态推断。在DexYCB和HO3Dv2等挑战性基准测试上的大量实验表明，我们的方法在手-物姿态估计任务中达到了最先进的性能水平。

Abstract

Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions. Moreover, we leverage hand priors as stable spatial references to extract implicit interaction constraints. This allows reliable pose inference even under significant variations in object shapes and interaction patterns. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that our method achieves state-of-the-art performance in hand-object pose estimation.

Keywords: 3D hand-object pose estimation, occlusion awareness, hierarchical semantic prompt, multi-modal masked modeling, generalization, RGB image, point clouds, DexYCB benchmark

187. ❌ Unleashing the Power of Simplicity: A Minimalist Strategy for State-of-the-Art Fingerprint Enhancement

Authors: Raffaele Cappelli Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19004v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

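Scoring rationale: The paper concerns fingerprint image enhancement, a computer vision technique, and proposes a minimalist contextual filtering method and a learning-based method. Although it is an AI application, every keyword targets specific topics such as large models, the underlying techniques of deep learning, or AI for Science, none of which the paper covers. It does not mention language models, model training techniques, inference optimization, alignment methods, agent systems, or AI-for-science applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes a minimalist approach to fingerprint image enhancement, comprising a contextual filtering method and a learning-based method, that outperforms more complex existing methods on a challenging dataset and yields clearer, more accurate, and less noisy images.

Abstract (Chinese translation)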

指纹识别系统依赖于人类指纹的独有特征，在现代安全与验证应用中至关重要。精确的细节特征提取作为该系统的关键步骤，其准确性取决于指纹图像的质量。尽管近期指纹增强技术有所改进，但现有先进方法在处理低质量指纹时仍面临困难，且计算需求较高。本文提出了一种极简主义的指纹增强方法，优先考虑简洁性与有效性。研究引入了两种创新方法：上下文滤波方法与基于学习的方法。这些技术持续优于复杂的现有先进方法，能生成更清晰、更准确且噪声更低的图像。这些方法的有效性通过一个具有挑战性的潜指纹数据库得到了验证。这些技术的开源实现不仅促进了可重复性，也推动了该领域的进一步发展。研究结果强调了简洁性对于实现高质量指纹增强的重要性，并表明未来研究应在复杂性与实际效益之间取得平衡。

Abstract

Fingerprint recognition systems, which rely on the unique characteristics of human fingerprints, are essential in modern security and verification applications. Accurate minutiae extraction, a critical step in these systems, depends on the quality of fingerprint images. Despite recent improvements in fingerprint enhancement techniques, state-of-the-art methods often struggle with low-quality fingerprints and can be computationally demanding. This paper presents a minimalist approach to fingerprint enhancement, prioritizing simplicity and effectiveness. Two novel methods are introduced: a contextual filtering method and a learning-based method. These techniques consistently outperform complex state-of-the-art methods, producing clearer, more accurate, and less noisy images. The effectiveness of these methods is validated using a challenging latent fingerprint database. The open-source implementation of these techniques not only fosters reproducibility but also encourages further advancements in the field. The findings underscore the importance of simplicity in achieving high-quality fingerprint enhancement and suggest that future research should balance complexity and practical benefits.

Keywords: fingerprint enhancement, minimalist approach, contextual filtering, learning-based method, minutiae extraction, latent fingerprint, image quality, open-source implementation

188. ❌ CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think

Authors: Zening Sun, Zhengpeng Xie, Lichen Bai, Shitong Shao, Shuo Yang, Zeke Xie Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18991v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

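Scoring rationale: The paper focuses on fine-tuning-based alignment of diffusion models and does not involve language models, so most keywords score 0. The relevant keywords are: 1) 'Post-training OR Supervised Fine-tuning OR SFT' receives 10, because the paper proposes CRAFT as an enhanced variant of SFT; 2) 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' receives 8, because the paper compares against and improves on DPO-style methods; 3) 'Instruction Tuning OR Alignment OR Value Alignment' receives 5, because the paper aligns models with human preferences but does not perform instruction tuning of language models.

!!! tip deepseek-chat TL;DR

The paper proposes CRAFT, which builds a high-quality dataset through Composite Reward Filtering and then applies an enhanced SFT variant, to address the data dependency and computational inefficiency of diffusion model alignment. With only 100 samples it outperforms existing methods and converges 11-220× faster.

Abstract (Chinese translation)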

扩散模型对齐技术在生成符合人类偏好的高质量图像方面取得了显著突破。现有方法如监督微调（SFT）和基于DPO的偏好优化，已成为微调扩散模型的原则性工具。然而，SFT依赖于获取成本高昂的高质量图像，而DPO类方法则需要大规模偏好数据集，且数据质量常存在不一致性。除数据依赖外，这些方法还受限于计算效率低下的问题。为应对这两大挑战，我们提出复合奖励辅助微调（Composite Reward Assisted Fine-Tuning, CRAFT），这是一种轻量级且强大的微调范式，能在保持计算效率的同时大幅减少训练数据需求。该方法首先利用复合奖励过滤（Composite Reward Filtering, CRF）技术构建高质量且一致性强的训练数据集，随后执行增强型SFT变体。我们还从理论上证明了CRAFT实际上优化了基于群体的强化学习下界，从而在基于筛选数据的SFT与强化学习之间建立了原理性关联。大量实验结果表明，仅使用100个样本的CRAFT即可轻松超越近期需要数千个偏好配对样本的先进偏好优化方法。此外，CRAFT甚至能比基线偏好优化方法实现11-220倍的收敛速度提升，彰显了其极高的效率优势。

Abstract

Aligning Diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220× faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.

Keywords: Diffusion Models, Alignment, Supervised Fine-tuning, DPO, Composite Reward Filtering, Preference Optimization, Data Efficiency, Convergence Acceleration
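
The abstract describes filtering candidate samples with a composite reward before fine-tuning, but it does not say how the rewards are combined or how the cut is made. The sketch below is one plausible reading, not the CRAFT implementation: it assumes a weighted sum of several reward scores followed by top-k selection. The reward names (`aesthetic`, `alignment`), the weights, and the sample records are all hypothetical.

```python
def composite_reward(scores, weights):
    """Weighted sum of several reward-model scores for one sample (assumed rule)."""
    return sum(weights[name] * value for name, value in scores.items())


def filter_top_k(samples, weights, k):
    """Keep the k samples with the highest composite reward."""
    ranked = sorted(
        samples,
        key=lambda s: composite_reward(s["rewards"], weights),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical candidates, each scored by two hypothetical reward models.
samples = [
    {"id": "a", "rewards": {"aesthetic": 0.9, "alignment": 0.2}},
    {"id": "b", "rewards": {"aesthetic": 0.6, "alignment": 0.8}},
    {"id": "c", "rewards": {"aesthetic": 0.3, "alignment": 0.4}},
]
weights = {"aesthetic": 0.5, "alignment": 0.5}

# The retained subset would then serve as the fine-tuning dataset.
selected = filter_top_k(samples, weights, k=2)
print([s["id"] for s in selected])
```

Combining rewards before ranking means a sample that excels on one reward but fails another (like `a` here) can rank below a balanced one, which is one way a filter could favour consistent data.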

189. ❌ Unsupervised Contrastive Learning for Efficient and Robust Spectral Shape Matching

Authors: Feifan Luo, Hongyang Chen Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18924v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

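Scoring rationale: The paper studies 3D shape matching, a computer vision problem, using unsupervised contrastive learning and a simplified functional map architecture; its focus is geometry processing and graphics. Every scoring keyword relates to large language models, the underlying techniques of deep learning, or scientific applications of AI, none of which the paper covers, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an efficient and robust 3D shape matching method based on unsupervised contrastive learning; with a simplified functional map architecture, it achieves state-of-the-art performance across a range of challenging scenarios.

Abstract (Chinese translation)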

估计非刚性可变形三维形状之间的对应关系，仍然是计算机视觉与图形学领域的一项重大挑战。尽管深度函数映射方法已成为解决该问题的主流方案，但它们主要侧重于单独或联合优化逐点映射与函数映射，而非直接在嵌入空间中增强特征表示，这往往导致特征质量不足和匹配性能欠佳。此外，这些方法严重依赖传统的函数映射技术，例如耗时的函数映射求解器，带来了巨大的计算开销。在本研究中，我们首次提出了一种基于无监督对比学习的新方法，用于高效且鲁棒的三维形状匹配。我们首先提出一个无监督对比学习框架，通过最大化正相似对之间的一致性并最小化负相似对之间的一致性来促进特征学习，从而提升所学特征的一致性和可区分性。随后，我们设计了一个极大简化的函数映射学习架构，该架构无需计算昂贵的函数映射求解器和多个辅助函数映射损失函数，显著提升了计算效率。通过将这两个组件整合到一个统一的双分支流程中，我们的方法在准确性和效率上均达到了最先进的性能。大量实验表明，我们的方法不仅计算高效，而且在包括近等距、非等距和拓扑不一致在内的多种具有挑战性的基准测试中，均超越了当前最先进的方法，甚至优于有监督技术。

Abstract

Estimating correspondences between pairs of non-rigid deformable 3D shapes remains a significant challenge in computer vision and graphics. While deep functional map methods have become the go-to solution for addressing this problem, they primarily focus on optimizing pointwise and functional maps either individually or jointly, rather than directly enhancing feature representations in the embedding space, which often results in inadequate feature quality and suboptimal matching performance. Furthermore, these approaches heavily rely on traditional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational costs. In this work, we introduce, for the first time, a novel unsupervised contrastive learning-based approach for efficient and robust 3D shape matching. We begin by presenting an unsupervised contrastive learning framework that promotes feature learning by maximizing consistency within positive similarity pairs and minimizing it within negative similarity pairs, thereby improving both the consistency and discriminability of the learned features. We then design a significantly simplified functional map learning architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary functional map losses, greatly enhancing computational efficiency. By integrating these two components into a unified two-branch pipeline, our method achieves state-of-the-art performance in both accuracy and efficiency. Extensive experiments demonstrate that our approach is not only computationally efficient but also outperforms current state-of-the-art methods across various challenging benchmarks, including near-isometric, non-isometric, and topologically inconsistent scenarios, even surpassing supervised techniques.

Keywords: 3D shape matching, unsupervised contrastive learning, functional maps, non-rigid deformation, computational efficiency, feature representation, spectral matching, computer vision
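
The abstract describes a contrastive objective that raises agreement for positive pairs and lowers it for negative pairs. The abstract does not name the loss, so the sketch below uses InfoNCE, a common choice for this kind of objective, purely as an illustration; the paper's actual loss may differ. The feature vectors and the temperature value are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max before exponentiating for stability
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]
# Positive close to the anchor, negatives far away: the loss should be small.
good = info_nce(anchor, positive=[0.9, 0.1], negatives=[[0.0, 1.0], [-1.0, 0.0]])
# Positive far from the anchor, a negative close to it: the loss should be large.
bad = info_nce(anchor, positive=[0.0, 1.0], negatives=[[0.9, 0.1], [-1.0, 0.0]])
print(good < bad)
```

Minimizing this loss pushes the positive pair's similarity above every negative's, which matches the "consistency and discriminability" goal stated in the abstract.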

190. ❌ VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Authors: Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18943v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

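Scoring rationale: The paper's core contribution is VGGT-360, a framework for zero-shot panoramic depth estimation. It explicitly builds on VGGT-like foundation models, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and receives a relevance of 10. However, the paper focuses on depth estimation in computer vision, specifically 3D reconstruction, panoramic projection, and geometric consistency, and does not involve the large-model techniques named by the other keywords (MoE, quantization, alignment, inference acceleration, and so on) or applications in specific scientific fields (such as bioinformatics). All other keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes VGGT-360, a training-free, geometry-consistent framework for panoramic depth estimation. By exploiting the 3D consistency of VGGT-like foundation models, it unifies fragmented per-view reasoning into a coherent panoramic understanding and outperforms existing methods on multiple datasets.

Abstract (Chinese translation)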

本文提出了VGGT-360，一种新颖的免训练、零样本且几何一致的全景深度估计框架。与先前独立于视角的免训练方法不同，VGGT-360通过利用类VGGT基础模型固有的三维一致性，将任务重新定义为基于多视图重建三维模型的全景重投影，从而将碎片化的单视图推理统一为连贯的全景理解。为实现稳健而精确的估计，VGGT-360集成了三个即插即用模块，共同构成一个统一的全景到三维到深度框架：（i）不确定性引导的自适应投影将全景图切片为透视图，以弥合全景输入与VGGT的透视先验之间的领域差距。该模块通过基于梯度的不确定性估计，为几何信息贫乏区域分配更密集的视图，从而为VGGT生成几何信息丰富的输入。（ii）结构显著性增强注意力通过将结构感知置信度注入VGGT的注意力层，增强其在三维重建过程中的鲁棒性，引导模型聚焦于几何可靠的区域并提升跨视图一致性。（iii）相关性加权的三维模型校正利用注意力推断的相关性分数对重叠点进行重新加权，从而优化重建的三维模型，为精确的全景重投影提供一致的几何基础。大量实验表明，VGGT-360在多种分辨率及多样化的室内外数据集上，均优于经过训练的和免训练的现有先进方法。

Abstract

This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT’s perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT’s robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.

Keywords: panoramic depth estimation, zero-shot, training-free, geometry-consistent, VGGT-like foundation models, 3D reconstruction, multi-view, uncertainty-guided projection

191. ❌ GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting

Authors: Ahmed Tawfik Aboukhadra, Marcel Rogge, Nadia Robertini, Abdalla Arafa, Jameel Malik, Ahmed Elhayek, Didier Stricker Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18912v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

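Scoring rationale: GHOST is a computer vision and graphics paper that reconstructs 3D hand-object interactions from RGB videos using Gaussian Splatting. The scoring keywords all concern large language models, the technical principles of deep learning, or AI applications in science, whereas this paper centers on visual reconstruction and geometric modeling and involves no language models, model training, inference optimization, alignment, agent systems, or scientific AI applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

GHOST proposes a fast, category-agnostic framework based on Gaussian Splatting for reconstructing physically consistent, animatable dynamic hand-object interactions from monocular RGB video, outperforming existing methods in both speed and accuracy.

Abstract (Chinese translation)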

从单目RGB视频中理解真实的手-物交互对于增强现实/虚拟现实、机器人技术和具身人工智能至关重要。现有方法依赖于特定类别的模板或大量计算，但在三维空间中仍会产生物理不一致的手-物对齐结果。我们提出GHOST（高斯手-物溅射），一种快速、类别无关的框架，利用二维高斯溅射技术重建动态手-物交互。GHOST将手和物体均表示为密集且视角一致的高斯圆盘，并引入三项关键创新：(1) 几何先验检索与一致性损失，用于补全被遮挡的物体区域；(2) 抓取感知对齐机制，优化手部平移与物体尺度以确保真实的接触；(3) 手部感知背景损失，避免对因手部遮挡而不可见的物体区域施加惩罚。GHOST仅需单段RGB视频即可实现完整、物理一致且可动画化的重建，其运行速度比现有类别无关方法快一个数量级。在ARCTIC、HO3D及野外数据集上的大量实验表明，该方法在三维重建与二维渲染质量上均达到最先进的精度，确立了GHOST作为真实手-物交互建模的高效鲁棒解决方案。代码发布于https://github.com/ATAboukhadra/GHOST。

Abstract

Understanding realistic hand-object interactions from monocular RGB videos is essential for AR/VR, robotics, and embodied AI. Existing methods rely on category-specific templates or heavy computation, yet still produce physically inconsistent hand-object alignment in 3D. We introduce GHOST (Gaussian Hand-Object Splatting), a fast, category-agnostic framework for reconstructing dynamic hand-object interactions using 2D Gaussian Splatting. GHOST represents both hands and objects as dense, view-consistent Gaussian discs and introduces three key innovations: (1) a geometric-prior retrieval and consistency loss that completes occluded object regions, (2) a grasp-aware alignment that refines hand translations and object scale to ensure realistic contact, and (3) a hand-aware background loss that prevents penalizing hand-occluded object regions. GHOST achieves complete, physically consistent, and animatable reconstructions from a single RGB video while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality, establishing GHOST as an efficient and robust solution for realistic hand-object interaction modeling. Code is available at https://github.com/ATAboukhadra/GHOST.

Keywords: hand-object interaction, 3D reconstruction, Gaussian Splatting, RGB video, category-agnostic, physical consistency, AR/VR, robotics

192. ❌ PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment

Authors: Tianci Luo, Jinpeng Wang, Shiyu Qin, Niu Lian, Yan Feng, Bin Chen, Chun Yuan, Shu-Tao Xia Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18891v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

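Scoring rationale: The paper focuses on Visual In-Context Learning (VICL), a specific direction within computer vision that is unrelated to most of the keywords, which mainly concern large language models, training techniques, reasoning methods, and model optimization. The only relevant keyword is "In-context Learning OR Many-shot Learning": the paper's core topic is visual in-context learning, an application of in-context learning to vision, so it receives 10 points (highly relevant, core content). None of the other keywords is addressed in the paper, so each is scored 0.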
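The report header gives the pass mark as 5.0 + 0.8 × total weight, which is 26.6 for a total weight of 27.0. A minimal sketch of that check follows, assuming a paper's total is the sum of the Score column. This section does not state how each per-keyword score is derived from weight and relevance, so that step is left out. The function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, per_weight=0.8):
    """Pass mark from the report header: 5.0 + 0.8 * total weight."""
    return base + per_weight * total_weight


def evaluate(keyword_scores, total_weight):
    """Sum the Score column (assumed aggregation) and compare to the pass mark."""
    total = sum(keyword_scores)
    threshold = pass_threshold(total_weight)
    return total, threshold, total >= threshold


# Score column of the table above: 27 keywords, each with a score of 0.0.
scores = [0.0] * 27
total, threshold, passed = evaluate(scores, total_weight=27.0)
print(f"{total:.1f} / {threshold:.1f} {'pass' if passed else 'fail'}")
# prints "0.0 / 26.6 fail"
```

Under this reading, a keyword with a relevance of 10.0/10 still contributes nothing when its Score entry is 0.0, which matches the total shown for this paper.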

!!! tip deepseek-chat TL;DR

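The paper proposes PromptHub, a framework that strengthens multi-prompt visual in-context learning through locality-aware fusion, concentration, and alignment. It addresses the performance limits that patch-wise fusion and model-agnostic supervision impose on existing methods, and validates its superiority, generality, and robustness on three fundamental vision tasks.

Abstract (Chinese translation)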

视觉上下文学习旨在通过模仿像素级演示完成视觉任务。近期研究开创了提示融合方法，结合了多种演示的优势，为扩展视觉上下文学习提供了可行路径。然而，现有的分块式融合框架与模型无关的监督机制阻碍了信息线索的充分挖掘，从而限制了性能提升。为克服这一缺陷，我们提出PromptHub框架，该框架通过局部感知融合、集中与对齐机制，系统性强化多提示学习。PromptHub利用空间先验捕捉更丰富的上下文信息，采用互补的集中、对齐与预测目标相互指导训练，并结合数据增强进一步强化监督。在三个基础视觉任务上的大量实验证明了PromptHub的优越性。此外，我们在分布外场景及多种检索场景中验证了其普适性、可迁移性与鲁棒性。本研究为提示融合建立了可靠的局部感知范式，超越了先前的分块式方法。代码发布于https://github.com/luotc-why/ICLR26-PromptHub。

Abstract

Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings, and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.

Keywords: Visual In-Context Learning, Prompt Fusion, Locality-Aware Fusion, Multi-Prompting, Vision Tasks, Data Augmentation, Out-of-Distribution, Retrieval Scenarios

193. ❌ HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Authors: Xiangyu Bai, Bishoy Galoaa, Sarah Ostadabbas Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18850v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

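Scoring rationale: The paper studies frame selection for video question answering, using reinforcement learning (GRPO) to train a lightweight policy that selects input frames for a vision-language model (VLM). Although VLMs are involved, the keywords all target large language models (LLMs) and related techniques (such as MoE, scaling laws, RLHF, and PEFT) or domain-specific scientific applications (such as bioinformatics). The paper does not address LLMs, MoE, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning techniques, agent systems, model compression, hallucination mitigation, interpretability, world models, model merging, in-context learning, or scientific AI applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes HORNet, a lightweight frame selection policy trained with GRPO that optimizes the input frames given to a vision-language model for video question answering, improving answer quality while cutting input frames by up to 99% and processing time by up to 93%.

Abstract (Chinese translation)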

基于视觉语言模型（VLM）的视频问答（VQA）性能在很大程度上取决于从输入视频中选择哪些帧，然而大多数系统依赖于均匀或启发式采样方法，这些方法无法针对下游回答质量进行优化。本文提出 HORNet，一种轻量级的帧选择策略，通过组相对策略优化（GRPO）进行训练，以学习冻结的VLM需要哪些帧才能正确回答问题。HORNet的可训练参数少于100万个，能将输入帧减少高达99%，VLM处理时间降低高达93%，同时在短格式基准测试上提升回答质量（在MSVD-QA上F1分数提高1.7%），并在时序推理任务上取得强劲性能（在NExT-QA上较均匀采样提升7.3分）。我们将此形式化为“选择任意帧”（Select Any Frames, SAF）任务，该任务将视觉输入筛选与VLM推理解耦，并证明GRPO训练的选择策略在分布外泛化能力上优于监督学习和PPO替代方法。HORNet的策略无需重新训练即可在不同VLM回答器间迁移，当与更强模型结合时能带来额外8.5%的相对增益。在涵盖341,877个问答对和114.2小时视频的六个基准测试上进行评估，结果表明：优化VLM“看到的内容”是一种实用且互补的替代方案，可在提升效率的同时，避免仅优化其生成内容。代码发布于https://github.com/ostadabbas/HORNet。

Abstract

Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet’s policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.

Keywords: Video Question Answering, Vision-Language Models, Frame Selection, Group Relative Policy Optimization, Efficiency Improvement, Temporal Reasoning, Select Any Frames, Generalization
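The abstract says the frame selection policy is trained with Group Relative Policy Optimization (GRPO). As a rough illustration of the group-relative step that gives GRPO its name, the sketch below turns a group of rewards sampled for the same question into normalized advantages. The reward values, the function name, and the epsilon are assumptions for illustration, and implementations differ on details such as which standard deviation they use. This is not HORNet's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread.

    Uses the population standard deviation; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four frame selections sampled for one question,
# e.g. 1.0 if the frozen VLM answered correctly and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Selections that beat the group average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.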

194. ❌ Towards Interpretable Foundation Models for Retinal Fundus Images

Authors: Samuel Ofosu Mensah, Maria Camila Roa Carvajal, Kerol Djoumessi, Philipp Berens Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18846v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

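Scoring rationale: The paper proposes Dual-IFM, an interpretable foundation model for retinal fundus images, whose core contribution is combining large-scale self-supervised pre-training with interpretable-by-design architecture. It is therefore highly relevant (10 points) to "Foundation Models" (the paper directly studies foundation models), "Pre-training" (it pre-trains with self-supervised learning), "Explainable AI" (the design emphasizes local and global interpretability), and "AI for Science" (it is applied to medical imaging). The other keywords mainly concern technical details of large models (such as MoE, quantization, and inference acceleration), training methods (such as RLHF and instruction tuning), or application scenarios (such as agents and tool calling), none of which the paper addresses, so they are scored 0.

!!! tip deepseek-chat TL;DR

To address the limited interpretability of foundation models in medical imaging, the study proposes Dual-IFM, an interpretable-by-design foundation model. Self-supervised pre-training on more than 800,000 retinal fundus images yields performance comparable to that of much larger models while providing interpretable predictions on out-of-distribution data.

Abstract (Chinese translation)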

基础模型通常通过自监督学习（SSL）从大量无标签数据中提取可迁移的表征。然而，许多此类模型依赖的架构可解释性有限，这在医学影像等高风险领域是一个关键问题。我们提出Dual-IFM，一种在设计上具备双重可解释性的基础模型：首先，它通过忠实于决策过程的类别证据图为单张图像提供局部可解释性；其次，它通过一个允许直接可视化模型表征空间的二维投影层，为整个数据集提供全局可解释性。我们利用来自不同来源的超过80万张彩色眼底摄影图像训练模型，以学习适用于不同下游任务的、可泛化的可解释表征。结果表明，我们的模型达到了与最先进基础模型相当的性能范围（后者参数量高达本模型的16倍），同时能在分布外数据上提供可解释的预测。我们的研究证明，大规模自监督预训练与内在可解释性相结合，能够为视网膜影像生成鲁棒的表征。

Abstract

Foundation models are used to extract transferable representations from large amounts of unlabeled data, typically via self-supervised learning (SSL). However, many of these models rely on architectures that offer limited interpretability, which is a critical issue in high-stakes domains such as medical imaging. We propose Dual-IFM, a foundation model that is interpretable-by-design in two ways: First, it provides local interpretability for individual images through class evidence maps that are faithful to the decision-making process. Second, it provides global interpretability for entire datasets through a 2D projection layer that allows for direct visualization of the model’s representation space. We trained our model on over 800,000 color fundus photography from various sources to learn generalizable, interpretable representations for different downstream tasks. Our results show that our model reaches a performance range similar to that of state-of-the-art foundation models with up to 16× the number of parameters, while providing interpretable predictions on out-of-distribution data. Our results suggest that large-scale SSL pretraining paired with inherent interpretability can lead to robust representations for retinal imaging.

Keywords: Foundation Models, Interpretability, Medical Imaging, Self-supervised Learning, Retinal Fundus Images, Dual-IFM, Representation Learning, Out-of-distribution

195. ❌ Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging

Authors: Hesong Li, Ziqi Wu, Ruiwen Shao, Ying Fu Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18834v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

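Scoring rationale: The paper focuses on denoising High-Resolution Transmission Electron Microscopy (HRTEM) images and proposes a statistical characteristic-guided denoising network, placing it in computer vision and image processing. Its content is unrelated to the vast majority of the keywords (such as large language models, fine-tuning techniques, reasoning methods, and agents), which mainly concern natural language processing and general-purpose AI techniques. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the study applies deep learning to scientific imaging in materials science, which falls under AI for Science, but its focus is not innovation in large models or deep learning principles, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

To tackle the severe noise in rapid High-Resolution Transmission Electron Microscopy (HRTEM) imaging, the paper proposes a statistical characteristic-guided denoising network that uses statistical characteristics to guide denoising in both the spatial and frequency domains, outperforming existing methods on synthetic and real data.

Abstract (Chinese translation)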

高分辨率透射电子显微镜（HRTEM）能够实现原子尺度的成核动力学观测，从而推动先进固体材料的研究。然而，由于成核过程具有毫秒级的快速变化特性，需要采用短曝光快速成像，这导致图像产生严重噪声，从而掩盖了原子位置。本研究提出了一种统计特征引导的去噪网络，该网络利用统计特征在空间域和频域同时引导去噪过程。在空间域中，我们提出了空间偏差引导加权方法，根据偏差特征为每个空间位置选择合适的卷积操作。在频域中，我们提出了频段引导加权方法，基于频带特性增强信号并抑制噪声。我们还开发了一种针对HRTEM的噪声校准方法，生成了包含无序结构和真实HRTEM图像噪声的数据集。这能确保模型在用于成核观测的真实图像上具有稳定的去噪性能。在合成数据和真实数据上的实验表明，我们的方法在HRTEM图像去噪任务上优于现有先进方法，并在原子定位下游任务中表现出有效性。代码将在https://github.com/HeasonLee/SCGN公开。

Abstract

High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, which boosts the studies of advanced solid materials. Nonetheless, due to the millisecond-scale rapid change of nucleation, it requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on deviation characteristic. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noises. It can ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downstream task. Code will be available at https://github.com/HeasonLee/SCGN.

Keywords: High-Resolution Transmission Electron Microscopy, HRTEM image denoising, statistical characteristic-guided denoising, spatial deviation-guided weighting, frequency band-guided weighting, nucleation observation, noise calibration, deep learning for scientific imaging

196. ❌ VesselTok: Tokenizing Vessel-like 3D Biomedical Graph Representations for Reconstruction and Generation

Authors: Chinmay Prabhakar, Bastian Wittmann, Tamaz Amiranashvili, Paul Büschl, Ezequiel de la Rosa, Julian McGinnis, Benedikt Wiestler, Bjoern Menze, Suprosanna Shit Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18797v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

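Scoring rationale: The paper belongs to biomedical image analysis and proposes VesselTok, a framework for tokenizing vessel-like 3D biomedical graph representations for reconstruction and generation. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant (10 points), because the paper explicitly concerns AI applications in bioinformatics and biomedical research. The other keywords concern large language models, training techniques, inference optimization, agent systems, and similar topics that the paper does not address, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes VesselTok, a framework that learns latent representations (tokens) of vessel-like 3D biomedical graphs from a parametric shape perspective, addressing the computational challenges of high-spatial-resolution networks. Across anatomies including lung airways, lung vessels, and brain vessels, it is shown to encode complex topologies, support generation of anatomical graphs, and transfer to downstream inverse problems.

Abstract (Chinese translation)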

空间图为血管、肺气道及神经网络等曲线解剖结构提供了一种轻量化且优雅的表征方式。精确建模此类图结构在临床与（生物）医学研究中至关重要。然而，大型网络的高空间分辨率极大增加了其复杂性，导致显著的计算挑战。本研究旨在通过提出VesselTok框架应对这些挑战，该框架从参数化形状的视角处理空间密集图，以学习潜在表征（token）。VesselTok利用带有伪半径的中心线点来高效编码管状几何结构。具体而言，我们学习一种以中心线点为条件的新型潜在表征，用以编码类血管管状结构的神经隐式表征。我们在包括肺气道、肺血管和脑血管在内的多种解剖结构中验证了VesselTok的性能，突显其稳健编码复杂拓扑结构的能力。为证明VesselTok所学潜在表征的有效性，我们展示了其能够：（i）泛化至未见解剖结构，（ii）支持合理解剖图的生成式建模，以及（iii）有效迁移至下游逆问题（如链接预测）。

Abstract

Spatial graphs provide a lightweight and elegant representation of curvilinear anatomical structures such as blood vessels, lung airways, and neuronal networks. Accurately modeling these graphs is crucial in clinical and (bio-)medical research. However, the high spatial resolution of large networks drastically increases their complexity, resulting in significant computational challenges. In this work, we aim to tackle these challenges by proposing VesselTok, a framework that approaches spatially dense graphs from a parametric shape perspective to learn latent representations (tokens). VesselTok leverages centerline points with a pseudo radius to effectively encode tubular geometry. Specifically, we learn a novel latent representation conditioned on centerline points to encode neural implicit representations of vessel-like, tubular structures. We demonstrate VesselTok’s performance across diverse anatomies, including lung airways, lung vessels, and brain vessels, highlighting its ability to robustly encode complex topologies. To prove the effectiveness of VesselTok’s learnt latent representations, we show that they (i) generalize to unseen anatomies, (ii) support generative modeling of plausible anatomical graphs, and (iii) transfer effectively to downstream inverse problems, such as link prediction.

Keywords: VesselTok, 3D biomedical graphs, vessel-like structures, latent representations, neural implicit representations, generative modeling, link prediction, computational challenges

197. ❌ Rethinking Uncertainty Quantification and Entanglement in Image Segmentation

Authors: Jakob Lønborg Christensen, Vedrana Andersen Dahl, Morten Rieger Hannemose, Anders Bjorholm Dahl, Christian F. Baumgartner Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18792v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

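Scoring rationale: The paper addresses uncertainty quantification (UQ) in medical image segmentation, studying the decomposition, interaction, and entanglement of aleatoric uncertainty (AU) and epistemic uncertainty (EU). The keywords all relate directly to large-model (LLM) techniques, training methods, inference optimization, agent systems, and the like, whereas this paper studies conventional deep learning models (such as UNet, diffusion models, and ensembles) applied to a specific domain (medical images) and does not involve large-model principles or innovations. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since medical image analysis can be viewed as an application of AI in (biomedical) science; but the paper does not emphasize innovation in large-model or deep learning principles and only evaluates existing methods on specific tasks, so it receives 5 points (some relevance). The remaining keywords are entirely unrelated to the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies the interaction and entanglement of the aleatoric and epistemic components of uncertainty quantification (UQ) in medical image segmentation. A comprehensive empirical evaluation finds that ensembles perform best and are less entangled for out-of-distribution detection, while the best model otherwise varies by task and dataset; the paper also analyzes sources of entanglement and directions for mitigating it.

Abstract (Chinese translation)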

不确定性量化在医学图像分割等安全关键应用中至关重要。总不确定性通常被分解为与数据相关的偶然不确定性（AU）和与模型相关的认知不确定性（EU）。目前存在多种建模AU（如概率UNet、扩散模型）和EU（如集成方法、蒙特卡洛Dropout）的方法，但当这些方法结合使用时，其相互作用尚不明确。此外，近期研究揭示了AU与EU之间存在显著的纠缠现象，这削弱了不确定性分解的可解释性与实际应用价值。我们开展了一项全面的实证研究，涵盖了广泛的AU-EU模型组合，提出了一种量化不确定性纠缠的指标，并在下游不确定性量化任务中对各类方法进行了评估。对于分布外检测，集成方法始终表现出较低的纠缠度和更优的性能。在模糊性建模和校准方面，最佳模型因数据集而异，其中基于softmax/SSN的方法表现良好，而概率UNet的纠缠度较低。值得注意的是，softmax集成方法在所有任务中均表现优异。最后，我们分析了不确定性纠缠的潜在来源，并提出了缓解这一效应的研究方向。

Abstract

Uncertainty quantification (UQ) is crucial in safety-critical applications such as medical image segmentation. Total uncertainty is typically decomposed into data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU). Many methods exist for modeling AU (such as Probabilistic UNet, Diffusion) and EU (such as ensembles, MC Dropout), but it is unclear how they interact when combined. Additionally, recent work has revealed substantial entanglement between AU and EU, undermining the interpretability and practical usefulness of the decomposition. We present a comprehensive empirical study covering a broad range of AU-EU model combinations, propose a metric to quantify uncertainty entanglement, and evaluate both across downstream UQ tasks. For out-of-distribution detection, ensembles exhibit consistently lower entanglement and superior performance. For ambiguity modeling and calibration the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble fares remarkably well on all tasks. Finally, we analyze potential sources of uncertainty entanglement and outline directions for mitigating this effect.

Keywords: Uncertainty Quantification, Medical Image Segmentation, Aleatoric Uncertainty, Epistemic Uncertainty, Uncertainty Entanglement, Ensemble Methods, Out-of-distribution Detection, Probabilistic UNet
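
The AU/EU split described in the abstract can be illustrated for an ensemble with a common entropy-based decomposition. This is a minimal sketch, not the paper's code: the abstract does not say which decomposition or entanglement metric the authors use, and the function names and toy probabilities below are assumptions made for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def decompose_uncertainty(member_probs):
    """Split the predictive uncertainty of one pixel into AU and EU.

    member_probs: one class distribution per ensemble member (each sums to 1).
    Returns (total, aleatoric, epistemic).
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(m[c] for m in member_probs) / n for c in range(k)]
    total = entropy(mean_p)                                 # entropy of the averaged prediction
    aleatoric = sum(entropy(m) for m in member_probs) / n   # average per-member entropy
    epistemic = total - aleatoric                           # mutual information (member disagreement)
    return total, aleatoric, epistemic

# Hypothetical toy inputs: members agree on an ambiguous pixel ...
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# ... versus members that are individually confident but disagree.
disagree = [[0.99, 0.01], [0.01, 0.99]]

total_a, au_a, eu_a = decompose_uncertainty(agree)
total_d, au_d, eu_d = decompose_uncertainty(disagree)
```

When the members agree on a 50/50 pixel, all of the uncertainty lands in the aleatoric term. When confident members disagree, most of it lands in the epistemic term.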

198. ❌ ProCal: Probability Calibration for Neighborhood-Guided Source-Free Domain Adaptation

Authors: Ying Zheng, Yiyi Zhang, Yi Wang, Lap-Pui Chau Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18764v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

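Scoring rationale: The paper focuses on source-free domain adaptation (SFDA), a transfer learning technique, and is highly relevant to the keyword 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10), because its core problem is adapting a pre-trained model to a target domain without access to source data. The other keywords mainly cover large-model techniques, reasoning methods, alignment, compression, and so on. They have no direct connection to the paper's general deep learning domain adaptation research and all receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes ProCal, a probability calibration method that uses a dual-model collaborative prediction mechanism to counter the knowledge forgetting and overfitting to local noise caused by over-reliance on prediction similarity among neighbors in source-free domain adaptation. Its effectiveness is validated on 31 cross-domain tasks.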

Abstract

Source-Free Domain Adaptation (SFDA) adapts pre-trained models to unlabeled target domains without requiring access to source data. Although state-of-the-art methods leveraging local neighborhood structures show promise for SFDA, they tend to over-rely on prediction similarity among neighbors. This over-reliance accelerates the forgetting of source knowledge and increases susceptibility to local noise overfitting. To address these issues, we introduce ProCal, a probability calibration method that dynamically calibrates neighborhood-based predictions through a dual-model collaborative prediction mechanism. ProCal integrates the source model’s initial predictions with the current model’s online outputs to effectively calibrate neighbor probabilities. This strategy not only mitigates the interference of local noise but also preserves the discriminative information from the source model, thereby achieving a balance between knowledge retention and domain adaptation. Furthermore, we design a joint optimization objective that combines a soft supervision loss with a diversity loss to guide the target model. Our theoretical analysis shows that ProCal converges to an equilibrium where source knowledge and target information are effectively fused, reducing both knowledge forgetting and overfitting. We validate the effectiveness of our approach through extensive experiments on 31 cross-domain tasks across four public datasets. Our code is available at: https://github.com/zhengyinghit/ProCal.

Keywords: Source-Free Domain Adaptation, Probability Calibration, Neighborhood-Guided, Dual-model Collaboration, Knowledge Retention, Domain Adaptation, Cross-domain Tasks, SFDA
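
The idea of calibrating neighbor probabilities with two models can be sketched as follows. The abstract does not give ProCal's actual calibration rule, so the convex blend, the mixing weight `alpha`, and all names below are assumptions made for illustration, not the authors' method.

```python
def blend(source_p, current_p, alpha):
    """Convex combination of two class distributions (assumed calibration rule)."""
    return [alpha * s + (1.0 - alpha) * c for s, c in zip(source_p, current_p)]

def soft_target(neighbor_ids, source_preds, current_preds, alpha=0.5):
    """Average the calibrated distributions of a sample's neighbors.

    alpha is a hypothetical weight on the frozen source model's prediction;
    alpha = 0 falls back to using only the current model's neighbor outputs.
    """
    calibrated = [blend(source_preds[i], current_preds[i], alpha) for i in neighbor_ids]
    k = len(calibrated[0])
    return [sum(p[c] for p in calibrated) / len(calibrated) for c in range(k)]

# Hypothetical toy predictions for two neighbors of one target sample.
source_preds = {0: [0.9, 0.1], 1: [0.8, 0.2]}    # frozen source model, computed once
current_preds = {0: [0.3, 0.7], 1: [0.4, 0.6]}   # adapting model, drifting in a noisy neighborhood

target = soft_target([0, 1], source_preds, current_preds, alpha=0.5)
uncalibrated = soft_target([0, 1], source_preds, current_preds, alpha=0.0)
```

In this toy case the uncalibrated neighbor average favors the second class, while the blended target still favors the class the source model predicted. This mirrors the knowledge-retention effect the abstract describes.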

199. ❌ SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

Authors: Vsevolod Skorokhodov, Chenghao Xu, Shuo Sun, Olga Fink, Malcolm Mielle Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18774v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

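Scoring rationale: The paper studies how to adapt visual geometry transformers to RGB-thermal multimodal 3D reconstruction. Its core contribution is the SEAR fine-tuning strategy. Keyword relevance: (1) 'Post-training OR Supervised Fine-tuning OR SFT' is highly relevant (10), because the core of the paper is a fine-tuning strategy; (2) 'Pre-training OR Continual Pre-training OR Domain Adaptation' has some relevance (8), since the work involves domain adaptation of a pre-trained model; (3) 'AI for Science OR Bioinformatics OR Cheminformatics' is weakly relevant (5), as an application of AI in a scientific field. The other keywords (LLM, MoE, RLHF, and so on) are entirely unrelated to the paper's visual geometry transformer and multimodal 3D reconstruction topics and receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty that pre-trained visual geometry transformers have in aligning modalities for RGB-thermal multimodal 3D reconstruction. It proposes the SEAR fine-tuning strategy, which significantly improves 3D reconstruction and camera pose estimation and achieves reliable cross-modal reconstruction under challenging conditions such as low light and dense smoke.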

Abstract

Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving significant improvements over all metrics (e.g., over 29% in AUC@30) and delivering higher detail and consistency between modalities with negligible overhead in inference time compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.

Keywords: visual geometry transformer, multimodal RGB-thermal, 3D reconstruction, camera pose estimation, fine-tuning strategy, domain adaptation, scene reconstruction, modality alignment

200. ❌ Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning

Authors: Hung-Yue Suen, Kuo-En Hung, Fan-Hsun Tseng Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18758v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

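Scoring rationale: The paper studies a machine-learning-based Emotion AI approach for predicting audience affective engagement and vocal attractiveness in video learning, mainly using regression models over the speaker's multimodal features (facial dynamics, oculomotor features, prosody, cognitive semantics, and acoustic features). All scoring keywords relate directly to large language models (LLMs), deep learning principles, or specific applications of AI in science (such as bioinformatics), whereas this paper applies conventional machine learning to affective computing and educational technology. It does not involve large models, deep learning innovation, or a specific AI for Science subfield, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an Emotion AI approach based on the speaker's multimodal features that uses two regression models to predict audience affective engagement and vocal attractiveness in asynchronous video learning. The empirical study shows that speaker-side features alone can effectively stand in for aggregated audience feedback.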

Abstract

This paper outlines a machine learning-enabled speaker-centric Emotion AI approach capable of predicting audience-affective engagement and vocal attractiveness in asynchronous video-based learning, relying solely on speaker-side affective expressions. Inspired by the demand for scalable, privacy-preserving affective computing applications, this speaker-centric Emotion AI approach incorporates two distinct regression models that leverage a massive corpus developed within Massive Open Online Courses (MOOCs) to enable affectively engaging experiences. The regression model predicting affective engagement is developed by assimilating emotional expressions emanating from facial dynamics, oculomotor features, prosody, and cognitive semantics, while incorporating a second regression model to predict vocal attractiveness based exclusively on speaker-side acoustic features. Notably, on speaker-independent test sets, both regression models yielded impressive predictive performance ($R^2$ = 0.85 for affective engagement and $R^2$ = 0.88 for vocal attractiveness), confirming that speaker-side affect can functionally represent aggregated audience feedback. This paper provides a speaker-centric Emotion AI approach substantiated by an empirical study discovering that speaker-side multimodal features, including acoustics, can prospectively forecast audience feedback without necessarily employing audience-side input information.

Keywords: Emotion AI, affective engagement, vocal attractiveness, speaker-centric, multimodal features, regression models, video-based learning, MOOCs

201. ❌ DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection

Authors: Haochen Li, Rui Zhang, Hantao Yao, Xin Zhang, Yifan Hao, Shaohui Peng, Yongwei Zhao, Ling Li Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18757v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

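Scoring rationale: The paper focuses on domain adaptive object detection (DAOD) in computer vision and proposes DA-Mamba, a hybrid architecture combining CNNs and state space models (SSMs) to efficiently capture global and local domain-invariant features. The work is unrelated to most keywords, which mainly concern large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on). The only relevant keyword is 'Pre-training OR Continual Pre-training OR Domain Adaptation', because the paper involves domain adaptation, a subfield of transfer learning with some connection to pre-training and continual pre-training. The paper does not directly study pre-training techniques, so this keyword receives 8 (some relevance). The other keywords do not involve LLMs, AI for Science, or the other listed techniques and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes DA-Mamba, a hybrid architecture combining CNNs and state space models for domain adaptive object detection, which efficiently achieves global-local alignment and improves cross-domain detection performance.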

Abstract

Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain. Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations. However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features. Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Models (SSMs) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of State Space Models (SSMs) to capture both global and local domain-invariant features. Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment. OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment. Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.

Keywords: Domain Adaptive Object Detection, State Space Models, CNN-SSM hybrid architecture, global-local alignment, cross-domain performance, Image-Aware SSM, Object-Aware SSM, efficient long-range modeling
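
The linear-time, long-range behavior that the abstract attributes to state space models can be seen in a scalar linear recurrence. This is a generic sketch with fixed, hand-picked parameters `a`, `b`, and `c` (all assumptions). It is not the IA-SSM or OA-SSM module, whose internals the abstract does not describe.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state space model over a sequence in one pass.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    h = 0.0
    ys = []
    for x in xs:             # one state update per element: O(len(xs)) time, O(1) state
        h = a * h + b * x
        ys.append(c * h)
    return ys

# A single impulse at position 0 still influences the output nine steps later,
# decaying as a**t, without any pairwise (quadratic) comparison of positions.
impulse = [1.0] + [0.0] * 9
ys = ssm_scan(impulse)
```

Attention compares every pair of positions, which is where the quadratic cost mentioned in the abstract comes from. The recurrence instead carries context forward through a single running state.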

202. ❌ 6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

Authors: Rundong Su, Jintao Zhang, Zhihang Yuan, Haojie Duanmu, Jianfei Chen, Jun Zhu Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18742v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

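Scoring rationale: The paper focuses on efficient inference for video diffusion models. Its core contributions are a mixed-precision quantization framework and a computation-skipping technique. Keyword relevance: (1) 'Quantization OR Model Compression OR Low-bit Weights' is highly relevant (10), since the core of the paper is NVFP4/INT8 mixed-precision quantization; (2) 'Speculative Decoding OR Inference Acceleration' is highly relevant (10), since the method achieves a 1.92× end-to-end speedup; (3) 'Post-training OR Supervised Fine-tuning OR SFT' has some relevance (5), since the work involves post-training quantization; (4) the other keywords mainly concern large language models, alignment, reasoning, agents, and so on, are not directly related to video diffusion model techniques, and receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses the high memory footprint and computational cost of video diffusion transformer inference. It proposes a mixed-precision quantization framework driven by a predictor based on block input-output differences, together with a computation-skipping technique that exploits temporal redundancy, achieving a 1.92× speedup and a 3.32× memory reduction.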

Abstract

Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking the quantization difficulty of activations across diffusion timesteps, leading to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block’s input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. Besides this, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computations for these invariant blocks, further reducing the computational cost. Extensive experiments demonstrate that our method achieves 1.92$\times$ end-to-end acceleration and 3.32$\times$ memory reduction, setting a new baseline for efficient inference in Video DiTs.

Keywords: Video Diffusion Models, Mixed-Precision Quantization, Inference Acceleration, Memory Reduction, Transformer Blocks, Temporal Delta Cache, Post-Training Quantization, Efficient Inference
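
The two mechanisms in the abstract, precision selection from a block's input-output difference and reuse of a cached residual, can be sketched in a few lines. Everything concrete here is an assumption for illustration: the relative-L2 metric, the `threshold` value, the direction of the rule (small change maps to NVFP4), the fixed `refresh_every` schedule, and the toy block. The abstract does not specify the predictor or the skip criterion, and no real quantization is performed.

```python
import math

def rel_diff(x_in, x_out):
    """Relative L2 change that a block applies to its input."""
    num = math.sqrt(sum((o - i) ** 2 for i, o in zip(x_in, x_out)))
    den = math.sqrt(sum(i * i for i in x_in)) or 1.0
    return num / den

def pick_precision(x_in, x_out, threshold=0.1):
    """Assumed rule: small input-output change -> NVFP4, large change -> INT8."""
    return "NVFP4" if rel_diff(x_in, x_out) < threshold else "INT8"

class DeltaCache:
    """Reuse a block's residual (output - input) instead of recomputing the block."""

    def __init__(self, block, refresh_every=2):
        self.block = block
        self.refresh_every = refresh_every   # hypothetical fixed refresh schedule
        self.delta = None
        self.step = 0
        self.calls = 0                       # number of times the real block ran

    def __call__(self, x):
        if self.delta is None or self.step % self.refresh_every == 0:
            y = self.block(x)                # full computation, refresh the cached residual
            self.calls += 1
            self.delta = [o - i for i, o in zip(x, y)]
        else:
            y = [i + d for i, d in zip(x, self.delta)]   # skip the block, add cached residual
        self.step += 1
        return y

# Toy stand-in for a transformer block whose residual is constant across timesteps.
toy_block = lambda x: [v + 0.01 for v in x]
cached = DeltaCache(toy_block, refresh_every=2)
outputs = [cached([1.0, 2.0]) for _ in range(4)]
```

In this sketch the block runs on only two of the four timesteps, and the skipped steps reproduce its output because the residual does not change. A real system would need a criterion for deciding when the cached residual is stale.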

203. ❌ EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

Authors: Longfei Liu, Yongjie Hou, Yang Li, Qirui Wang, Youyang Sha, Yongjun Yu, Yinzhi Wang, Peizhe Ru, Xuanlong Yu, Xi Shen Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18739v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

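Scoring rationale: The paper belongs to computer vision and studies compact vision transformers (ViTs) for dense prediction on edge devices, covering model distillation, lightweight architecture design, and edge deployment optimization. All scoring keywords relate to large language models (LLMs) and associated techniques (training methods, inference optimization, alignment, agents, and so on) or to specific scientific AI applications (such as bioinformatics). The paper does not involve LLMs, NLP, or related techniques, nor the listed scientific AI subfields, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

Targeting resource-constrained edge devices, the paper proposes EdgeCrafter, a unified compact vision transformer framework. Through task-specialized distillation and edge-aware design, it achieves performance competitive with CNN baselines (such as YOLO) on dense prediction tasks including object detection, instance segmentation, and pose estimation, with fewer parameters.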

Abstract

Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy-efficiency tradeoff, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter’s reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust-ai-lab.github.io/projects/EdgeCrafter/

Keywords: Vision Transformers (ViTs), edge computing, dense prediction, model distillation, compact models, object detection, instance segmentation, pose estimation

204. ❌ From ex(p) to poly: Gaussian Splatting with Polynomial Kernels

Authors: Joerg H. Mueller, Martin Winter, Markus Steinberger Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18707v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

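Scoring rationale: The paper focuses on kernel optimization for 3D Gaussian Splatting (3DGS), replacing the original exponential kernel with a polynomial approximation to improve computational efficiency while remaining compatible with existing datasets. All scoring keywords relate directly to large language models (LLMs), deep learning principles, or applications of AI in science, whereas this paper studies a 3D reconstruction technique from computer graphics, an entirely different field. It does not involve large models, deep learning innovation, or applications of AI in scientific fields such as bioinformatics or cheminformatics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the incompatibility between modified kernels in 3D Gaussian Splatting (3DGS) and existing datasets by proposing a polynomial approximation of the kernel, achieving a 4-15% performance improvement while preserving image quality.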

Abstract

Recent advancements in Gaussian Splatting (3DGS) have introduced various modifications to the original kernel, resulting in significant performance improvements. However, many of these kernel changes are incompatible with existing datasets optimized for the original Gaussian kernel, presenting a challenge for widespread adoption. In this work, we address this challenge by proposing an alternative kernel that maintains compatibility with existing datasets while improving computational efficiency. Specifically, we replace the original exponential kernel with a polynomial approximation combined with a ReLU function. This modification allows for more aggressive culling of Gaussians, leading to enhanced performance across different 3DGS implementations. Our results show a notable performance improvement of 4 to 15% with negligible impact on image quality. We also provide a detailed mathematical analysis of the new kernel and discuss its potential benefits for 3DGS implementations on NPU hardware.

Keywords: Gaussian Splatting, 3DGS, polynomial kernel, exponential kernel, computational efficiency, kernel approximation, 3D reconstruction, NPU hardware
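
To see why a polynomial combined with a ReLU permits more aggressive culling, consider one possible stand-in for the exponential falloff. The form `relu(1 - d2 / (2n)) ** n` and the degree `n = 4` are assumptions chosen because `(1 - x/n)^n` approaches `exp(-x)` as `n` grows. The abstract does not state the paper's actual polynomial.

```python
import math

def gaussian_kernel(d2):
    """Exponential falloff exp(-d2 / 2) for a squared Mahalanobis distance d2."""
    return math.exp(-0.5 * d2)

def poly_kernel(d2, n=4):
    """Assumed polynomial stand-in: relu(1 - d2 / (2n)) ** n."""
    return max(0.0, 1.0 - d2 / (2.0 * n)) ** n

def support_radius_sq(n=4):
    """Squared distance at and beyond which poly_kernel is exactly zero."""
    return 2.0 * n

# The exponential never reaches zero, so a renderer must pick a cutoff.
# The clamped polynomial is exactly zero outside a finite radius, so any
# Gaussian farther away than that radius can be culled with no error.
samples = [0.0, 1.0, 2.0, 4.0, 6.0, 8.0, 9.0]
pairs = [(d2, gaussian_kernel(d2), poly_kernel(d2)) for d2 in samples]
```

The polynomial needs only multiplications and a clamp, which may suit hardware without a fast exponential unit. This is consistent with the abstract's remark about NPU implementations.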

205. ❌ Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels

Authors: Juan Miguel Valverde, Dim P. Papadopoulos, Rasmus Larsen, Anders Bjorholm Dahl Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18671v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

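Scoring rationale: The paper addresses image segmentation in computer vision and proposes SCNP, a method that improves topology accuracy by penalizing neighboring pixels. Its content centers on deep learning for image segmentation, covering loss function design, training improvements, and topology accuracy evaluation. The scoring keywords all concern large-model topics such as large language models, model training techniques, inference optimization, alignment methods, and agent systems, whereas this paper studies conventional computer vision deep learning models (such as CNNs) and involves no LLM techniques, large-model training methods, or large-model applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes SCNP, an efficient method that improves topology accuracy in image segmentation by penalizing neighboring pixels, and validates it on 13 datasets and in three segmentation frameworks.

Abstract (Chinese translation)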

用于图像分割的标准深度学习模型无法保证拓扑准确性，难以维持正确的连通分量数量或结构形态。这一问题会降低分割质量，并影响后续量化分析的可靠性。先前的研究尝试通过专用框架、网络结构和损失函数来提升拓扑准确性，但这些方法往往难以集成到现有训练流程中，计算成本高昂，或仅限于管状形态结构。本文提出SCNP（Same-Class Neighbor Penalization，同类别邻域惩罚）这一高效方法，该方法用每个像素邻域内分类效果最差的像素来惩罚该像素的logits，迫使模型先改善邻域像素的预测，再改善该像素自身的预测，从而提升拓扑准确性。我们在涵盖不同结构形态和图像模态的13个数据集上验证了SCNP的有效性，并将其集成到三种用于语义分割与实例分割的框架中。此外，我们证明SCNP可融入多种损失函数，使其具备提升拓扑准确性的能力。代码发布于 https://jmlipman.github.io/SCNP-SameClassNeighborPenalization 。

Abstract

Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels’ neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at https://jmlipman.github.io/SCNP-SameClassNeighborPenalization.

Keywords: image segmentation, topology accuracy, deep learning, loss function, neighbor penalization, semantic segmentation, instance segmentation, SCNP
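The abstract above describes SCNP only in one sentence: logits are penalized "with their poorest-classified neighbor". The sketch below is one possible reading of that sentence for a binary, single-channel case, not the authors' implementation (their code is at the URL given in the abstract). The function name, the 3×3 window, and the min/max rule are all assumptions: each pixel's logit is replaced by the worst logit among same-class pixels in its neighborhood before the loss is computed.

```python
from typing import List

Grid = List[List[float]]


def scnp_like_logits(logits: Grid, labels: List[List[int]], radius: int = 1) -> Grid:
    """Replace each logit by the worst same-class logit in its window.

    Assumed convention: a high logit means "foreground" (label 1), so the
    worst foreground logit is the minimum and the worst background logit
    (label 0) is the maximum. The window includes the pixel itself.
    """
    h, w = len(logits), len(logits[0])
    out = [row[:] for row in logits]
    for i in range(h):
        for j in range(w):
            cls = labels[i][j]
            worst = logits[i][j]
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    # Only neighbors inside the image with the same label count.
                    if 0 <= ni < h and 0 <= nj < w and labels[ni][nj] == cls:
                        v = logits[ni][nj]
                        worst = min(worst, v) if cls == 1 else max(worst, v)
            out[i][j] = worst
    return out


if __name__ == "__main__":
    logits = [[4.0, 3.0, -2.0], [5.0, -1.0, -3.0]]
    labels = [[1, 1, 0], [1, 1, 0]]
    print(scnp_like_logits(logits, labels))
```

Under this reading, a well-classified pixel still incurs loss while any same-class neighbor is poorly classified, which matches the abstract's claim that the model must improve a pixel's neighbors before the pixel itself.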

206. ❌ Multimodal Model for Computational Pathology: Representation Learning and Image Compression

Authors: Peihang Wu, Zehong Chen, Lijian Xu Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
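The table above lists several keywords with a relevance of 8.0 to 10.0 yet a score of 0.0. The sketch below shows one way that outcome can arise. The passing threshold (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration. The per-keyword rule `weight × relevance` is an assumption, since the report does not state it, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

BASE_SCORE = 5.0          # from the report's passing-score formula
PER_WEIGHT = 0.8          # from the report's passing-score formula
MAX_PER_KEYWORD = 15.0    # from the report's scoring settings


def passing_score(total_weight: float) -> float:
    """Passing threshold: 5.0 + 0.8 * total weight."""
    return BASE_SCORE + PER_WEIGHT * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight times relevance, capped per keyword."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def total_score(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(w, r) for w, r in rows)


if __name__ == "__main__":
    # (weight, relevance) pairs for the nonzero-relevance rows above.
    rows = [(0.0, 8.0), (0.0, 5.0), (0.0, 5.0), (0.0, 8.0), (0.0, 10.0),
            (0.0, 10.0), (0.0, 10.0), (0.0, 8.0), (0.0, 5.0), (0.0, 10.0)]
    print(round(passing_score(27.0), 1), total_score(rows))
```

Under this assumed rule, a weight of 0.0 zeroes every row regardless of relevance, which would explain a total of 0.0 against the 26.6 threshold even though the configuration lists each keyword with a weight of 1.0.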

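Scoring rationale: The paper surveys multimodal models in computational pathology, with emphasis on foundation model applications, parameter-efficient fine-tuning, multi-agent collaborative reasoning (simulating a pathologist's chain of thought), explainable AI, and AI applications in science (bioinformatics). These keywords are highly relevant (8-10 points). Other keywords such as MoE, quantization, and hallucination mitigation are not mentioned in the abstract and receive 0. Pre-training/fine-tuning and few-shot learning are indirectly related and receive 5.

!!! tip deepseek-chat TL;DR

This survey systematically analyzes recent advances in multimodal models for computational pathology, focusing on self-supervised representation learning, multimodal data generation, parameter-efficient adaptation, and multi-agent collaborative reasoning. It targets the challenges of whole slide image processing, limited annotated data, and clinical interpretability, with the aim of supporting interpretable and safe AI-assisted diagnosis.

Abstract (Chinese translation)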

全切片成像技术通过实现对千兆像素级组织病理学图像的计算分析，彻底改变了数字病理学领域。近期基础模型的进展加速了计算病理学的发展，促进了病理图像、临床报告与结构化数据的联合推理。尽管取得这些进展，挑战依然存在：全切片图像的极高分辨率给视觉学习带来计算障碍；有限的专家标注制约了监督学习方法的应用；在保持生物学可解释性的同时整合多模态信息仍存在困难；以及对超长视觉序列建模的不透明性阻碍了临床透明度。本综述全面审视了多模态计算病理学的最新进展。我们系统分析了四个研究方向：（1）面向全切片图像的自监督表征学习与结构感知的令牌压缩；（2）多模态数据生成与增强；（3）参数高效适配与推理增强的小样本学习；（4）面向可信诊断的多智能体协同推理。我们重点探讨了令牌压缩如何实现跨尺度建模，以及多智能体机制如何模拟病理学家跨放大倍率的“思维链”以实现不确定性感知的证据融合。最后，我们讨论了当前面临的开放挑战，并指出未来的进展取决于能够整合高分辨率视觉数据与临床及生物医学知识的统一多模态框架，以支持可解释且安全的人工智能辅助诊断。

Abstract

Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist’s “Chain of Thought” across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.

Keywords: computational pathology, multimodal model, representation learning, parameter-efficient adaptation, multi-agent reasoning, Chain of Thought, whole slide imaging, AI-assisted diagnosis

207. ❌ Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA

Authors: Ruizhi Yu, Keyang Zhong, Peng Liu, Qi Wu, Haoran Zhang, Yanhao Zhang, Chen Chen, Haonan Lu Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18649v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

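Scoring rationale: The paper presents Click-to-Ask, an AI assistant for live streaming commerce with offline copywriting generation and online interactive QA. The system processes multimodal product information to produce structured data and promotional copy, and answers viewer queries in real time during a broadcast when the streamer clicks on a question. As an application of large models to a specific domain (live streaming commerce), it has some relevance to 'Large Language Models' (5 points), since copywriting and QA are very likely built on LLM technology. Responding to viewer queries in real time reflects an agent-style workflow, giving some relevance to 'LLM Agents' (5 points). Generating responses from structured product information and historical memory resembles retrieval-augmented generation, giving some relevance to 'Retrieval-Augmented Generation' (5 points). The remaining keywords concern the technical principles, training methods, or optimization of large models, or applications in specific scientific fields, and have no direct connection to this e-commerce scenario, so they receive 0.

!!! tip deepseek-chat TL;DR

The study presents Click-to-Ask, an AI assistant for live streaming commerce that processes product information offline to generate promotional copy and answers viewer questions online in real time. It significantly reduces promotional preparation time and improves engagement, reaching a question recognition accuracy of 0.913 and a response quality score of 0.876 on a dataset of TikTok live stream frames.

Abstract (Chinese translation)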

直播电商已成为当代一种重要的传播形式。为帮助主播更高效便捷地进行产品推广，我们提出了“即点即问”——一个融合线下与线上组件的AI驱动直播电商助手。离线模块处理多样化的多模态产品信息，将复杂输入转化为结构化产品数据，并生成合规的促销文案。在直播过程中，在线模块通过允许主播点击观众问题，并综合利用离线模块生成的结构化产品信息及流式架构中维护的事件级历史记忆，实现对观众咨询的实时响应。该系统显著缩短了促销准备时间，增强了内容吸引力，并能及时回应用户询问，最终提升了直播电商的运营效果。在我们收集的TikTok直播帧数据集上，所提方法实现了0.913的问题识别准确率和0.876的响应质量评分，展现出可观的实际应用潜力。视频演示可在此处查看：https://www.youtube.com/shorts/mWIXK-SWhiE。

Abstract

Live streaming commerce has become a prominent form of broadcasting in the modern era. To facilitate more efficient and convenient product promotions for streamers, we present Click-to-Ask, an AI-driven assistant for live streaming commerce with complementary offline and online components. The offline module processes diverse multimodal product information, transforming complex inputs into structured product data and generating compliant promotional copywriting. During live broadcasts, the online module enables real-time responses to viewer inquiries by allowing streamers to click on questions and leveraging both the structured product information generated by the offline module and an event-level historical memory maintained in a streaming architecture. This system significantly reduces the time needed for promotional preparation, enhances content engagement, and enables prompt interaction with audience inquiries, ultimately improving the effectiveness of live streaming commerce. On our collected dataset of TikTok live stream frames, the proposed method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, demonstrating considerable potential for practical application. The video demonstration can be viewed here: https://www.youtube.com/shorts/mWIXK-SWhiE.

Keywords: AI live streaming assistant, offline copywriting, online interactive QA, multimodal product information, structured product data, real-time response, streaming architecture, live streaming commerce

208. ❌ MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration

Authors: Teer Song, Yue Zhang, Yu Tian, Ziyang Wang, Xianlin Zhang, Guixuan Zhang, Xuan Liu, Xueming Li, Yasen Zhang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18645v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

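Scoring rationale: MeInTime addresses face restoration in computer vision, using a diffusion model for cross-age face restoration. Although it is an AI application, the scoring keywords all target large language models (LLMs) and related techniques (MoE, RLHF, RAG, and so on), whereas this paper applies diffusion models to image generation and restoration, which is unrelated to LLM principles, training methods, inference optimization, or alignment. The keyword 'AI for Science' covers scientific applications, but this paper belongs to computer vision and image processing rather than scientific computing fields such as bioinformatics or cheminformatics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes MeInTime, a diffusion-based cross-age face restoration method that decouples the modeling of identity and age conditions. It addresses the loss of identity fidelity in existing methods when the reference images and the degraded input differ in age, and outperforms existing methods in identity preservation and age consistency.

Abstract (Chinese translation)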

为更好地保留个体身份特征，人脸修复技术已从无参考方法演进至基于参考图像的方法，后者利用同一身份的高质量参考图像来增强修复结果的身份保真度。然而，现有方法大多隐式假设参考图像与退化输入在年龄上对齐，这限制了其在现实场景（如历史照片修复）中仅能获取跨年龄参考图像时的有效性。本文提出MeInTime，一种基于扩散模型的人脸修复方法，将基于参考的修复从同年龄场景扩展至跨年龄场景。在给定一张或多张参考图像及对应退化输入的年龄提示（age prompt）条件下，MeInTime能够实现兼具身份保真度与年龄一致性的精准修复。具体而言，我们解耦了身份条件与年龄条件的建模过程：在训练阶段，我们专注于通过新引入的注意力机制有效注入身份特征，并采用门控残差融合模块促进退化特征与身份表征的融合；在推理阶段，我们提出无需训练的采样策略——年龄感知梯度引导，利用年龄驱动的方向迭代地将身份感知的去噪隐变量推向目标年龄语义流形。大量实验表明，MeInTime在身份保持与年龄一致性方面均优于现有人脸修复方法。代码发布于：https://github.com/teer4/MeInTime

Abstract

To better preserve an individual’s identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy, using an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: https://github.com/teer4/MeInTime

Keywords: face restoration, diffusion models, identity preservation, cross-age restoration, reference-based methods, age consistency, gated residual fusion, age-aware gradient guidance

209. ❌ PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance

Authors: Cong Wang, Hanxin Zhu, Xiao Tang, Jiayi Luo, Xin Jin, Long Chen, Fei-Yue Wang, Zhibo Chen Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18639v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

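Scoring rationale: "PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance" focuses on video generation, specifically on improving physical consistency and spatio-temporal coherence through physics awareness and geometric guidance. It touches on computer vision, video synthesis, and physical simulation, but does not mention or apply large language models (LLMs), innovations in deep learning techniques (such as MoE, scaling laws, or RLHF), large-model applications in other domains, or specific AI for Science subfields (such as bioinformatics). The scoring keywords all concern large models, deep learning techniques, or AI for science, and the paper's core content (physical video generation, cross-view geometry, attention mechanisms) has no direct connection to them, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the PhysVideo framework, which tackles physically consistent motion in video generation through physics-aware orthogonal foreground video generation and geometry-guided cross-view attention, significantly improving the physical realism and spatio-temporal coherence of generated videos.

Abstract (Chinese translation)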

视频生成领域的最新进展已显著提升了视觉保真度，然而确保物理一致的运动仍是一项根本性挑战。直观而言，这一局限可归因于现实世界中的物体运动是在三维空间中展开的，而视频观测仅能提供此类动态的部分、视角依赖的投影。为解决这些问题，我们提出了PhysVideo，一个两阶段框架：首先生成具备物理感知的正交前景视频，随后合成包含背景的完整视频。在第一阶段，Phys4View利用物理感知注意力机制捕捉物理属性对运动动态的影响，并通过结合几何增强的跨视角注意力与时序注意力来提升时空一致性。在第二阶段，VideoSyn以生成的前景视频为指导，学习前景动态与背景上下文之间的交互，以实现可控的视频合成。为支持训练，我们构建了PhysMV数据集，其中包含4万个场景，每个场景由四个正交视角组成，共计16万条视频序列。大量实验表明，与现有视频生成方法相比，PhysVideo在物理真实感与时空连贯性方面均有显著提升。项目主页：https://anonymous.4open.science/w/Phys4D/。

Abstract

Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose PhysVideo, a two-stage framework that first generates physics-aware orthogonal foreground videos and then synthesizes full videos with background. In the first stage, Phys4View leverages physics-aware attention to capture the influence of physical attributes on motion dynamics, and enhances spatio-temporal consistency by incorporating geometry-enhanced cross-view attention and temporal attention. In the second stage, VideoSyn uses the generated foreground videos as guidance and learns the interactions between foreground dynamics and background context for controllable video synthesis. To support training, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that PhysVideo significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Home page: https://anonymous.4open.science/w/Phys4D/.

Keywords: video generation, physically plausible motion, cross-view geometry, physics-aware attention, spatio-temporal consistency, orthogonal viewpoints, foreground-background interaction, physical realism

210. ❌ Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering

Authors: Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen, Jianxin Li Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18636v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

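Scoring rationale: The paper focuses on sparse attention for video generation. It is rated highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (10 points) because its core is a sparse attention method, and highly relevant to 'Speculative Decoding OR Inference Acceleration' (10 points) because its goal is faster inference. The other keywords mainly concern large language models, alignment, agents, and similar topics, which are unrelated to the paper's subject of Diffusion Transformer video generation, so they receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes SVOO, a training-free sparse attention framework that combines offline layer-wise sparsity profiling with an online bidirectional co-clustering algorithm. It addresses two issues that existing sparse attention methods for video generation ignore, layer heterogeneity and query-key coupling. In experiments on seven video generation models it delivers up to 1.93× speedup while maintaining a PSNR of up to 29 dB on Wan2.1.

Abstract (Chinese translation)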

扩散变换器（Diffusion Transformers，DiTs）在视频生成质量上表现优异，但由于密集的三维注意力机制导致推理成本高昂，这推动了稀疏注意力技术的发展以提升效率。然而，现有视频生成中无需训练的稀疏注意力方法仍面临两个未解决的局限：忽视了注意力剪枝中的层间异质性，以及忽略了块划分中的查询-键耦合问题，这阻碍了更优的质量-加速权衡。在本研究中，我们发现一个关键见解：每一层的注意力稀疏性是其固有属性，受不同输入的影响很小。受此启发，我们提出了SVOO，一种无需训练的稀疏注意力框架，通过离线逐层稀疏度分析与在线双向协同聚类实现快速视频生成。具体而言，SVOO采用两阶段范式：（i）离线逐层敏感性分析以推导每层固有的剪枝水平；（ii）通过新颖的双向协同聚类算法实现在线块级稀疏注意力。在七个广泛使用的视频生成模型上进行的大量实验表明，SVOO在质量-加速权衡上优于现有先进方法，在Wan2.1上实现了高达$1.93\times$的加速，同时保持峰值信噪比（PSNR）高达29 dB。

Abstract

Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to $1.93\times$ speedup while maintaining a PSNR of up to 29 dB on Wan2.1.

Keywords: Sparse Attention, Video Generation, Diffusion Transformers, Inference Acceleration, Training-Free, Layer-Wise Sparsity, Bidirectional Co-Clustering, Quality-Speedup Trade-off
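For readers unfamiliar with the quality metric quoted in the abstract above, PSNR is the standard logarithmic measure of error against a reference signal. The abstract does not say which reference is used; the worked reading below assumes only the textbook definition, with $\mathrm{MAX}$ the peak signal value and $\mathrm{MSE}$ the mean squared error.

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)\ \text{dB},
\qquad
\mathrm{PSNR} = 29\ \text{dB}
\;\Longrightarrow\;
\frac{\mathrm{MSE}}{\mathrm{MAX}^2} = 10^{-2.9} \approx 1.26 \times 10^{-3}.
```

So a PSNR of 29 dB corresponds to a mean squared error of roughly 0.13% of the squared peak value.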

211. ❌ SwiftGS: Episodic Priors for Immediate Satellite Surface Recovery

Authors: Rong Fu, Jiekai Wu, Haiyun Wei, Xiaowen Ma, Shiyin Lin, Kangan Qian, Chuang Liu, Jianyuan Ni, Simon James Fong Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18634v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

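Scoring rationale: SwiftGS addresses 3D reconstruction from satellite imagery, a computer vision task, using meta-learning, Gaussian primitives, an SDF, and physics-aware rendering; it is an AI application in remote sensing and earth science. The keywords are almost all tied directly to large language models (LLMs) and related techniques (training, alignment, inference optimization, agents, and so on), while this paper involves no LLMs or natural language processing. Only the last keyword, 'AI for Science', has a weak connection through the paper's scientific (remote sensing) application; because the paper is not bioinformatics or cheminformatics, it receives 5 points (some relevance). The other 26 keywords are strongly tied to LLM techniques and unrelated to this paper, so they receive 0.

!!! tip deepseek-chat TL;DR

The paper tackles fast, large-scale 3D reconstruction from multi-date satellite imagery with SwiftGS, a meta-learned system that predicts geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF. It performs zero-shot inference in a single forward pass, significantly reducing computational cost while achieving accurate digital surface model reconstruction and view-consistent rendering.

Abstract (Chinese translation)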

利用多时相卫星影像进行快速、大规模三维重建对于环境监测、城市规划与灾害响应至关重要，但由于光照变化、传感器异质性以及逐场景优化成本高昂，该任务仍面临挑战。我们提出了SwiftGS，一种元学习系统，通过联合预测几何-辐射解耦的高斯基元与轻量级符号距离场，在单次前向传播中完成三维表面重建，从而以能够捕获可迁移先验知识的片段式训练替代了昂贵的逐场景拟合。该模型将用于投影、光照与传感器响应的可微分物理图，与融合稀疏高斯细节和全局SDF结构的空间门控机制相耦合，并整合了语义-几何融合、条件轻量化任务头，以及在不确定性感知多任务损失下从冻结几何教师模型获取的多视角监督。在推理阶段，SwiftGS以零样本方式运行，支持可选的紧凑型校准，能够以显著降低的计算成本实现精确的数字表面模型重建与视角一致的渲染，消融实验验证了混合表征、物理感知渲染与片段式元训练策略的有效性。

Abstract

Rapid, large-scale 3D reconstruction from multi-date satellite imagery is vital for environmental monitoring, urban planning, and disaster response, yet remains difficult due to illumination changes, sensor heterogeneity, and the cost of per-scene optimization. We introduce SwiftGS, a meta-learned system that reconstructs 3D surfaces in a single forward pass by predicting geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF, replacing expensive per-scene fitting with episodic training that captures transferable priors. The model couples a differentiable physics graph for projection, illumination, and sensor response with spatial gating that blends sparse Gaussian detail and global SDF structure, and incorporates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher under an uncertainty-aware multi-task loss. At inference, SwiftGS operates zero-shot with optional compact calibration and achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost, with ablations highlighting the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training.

Keywords: 3D reconstruction, satellite imagery, meta-learning, Gaussian primitives, signed distance field (SDF), physics-aware rendering, zero-shot inference, episodic training

212. ❌ GEAR: Geography-knowledge Enhanced Analog Recognition Framework in Extreme Environments

Authors: Zelin Liu, Bocheng Li, Yuling Zhou, Xuanting Li, Yixuan Yang, Jing Wang, Weishu Zhao, Xiaofeng Gao Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18626v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

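Scoring rationale: GEAR is a geography-knowledge-enhanced framework for topographic similarity recognition, used to identify terrestrial analogs of the Mariana Trench on the Qinghai-Tibet Plateau. The work covers geographic information processing, terrain analysis, a graph neural network (MSG-Net), and a biogeographic application. The keyword list targets large models, their underlying techniques, and AI applications in science, whereas the paper involves no LLMs, no large-model training techniques (MoE, SFT, RLHF, PEFT, RAG), no reasoning methods (CoT, MCTS), no agent systems, no model optimization (quantization, inference acceleration), and no interpretability techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI (a graph neural network with terrain analysis) to a scientific domain (biogeography and geology), but it is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes GEAR, a geography-knowledge-enhanced analog recognition framework that efficiently retrieves terrain on the Qinghai-Tibet Plateau that is structurally similar to the Mariana Trench. Its three-stage pipeline ends in the MSG-Net graph neural network, which improves F1 by 1.38 percentage points over the SOTA baseline, and the extracted features correlate significantly with biological data.

Abstract (Chinese translation)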

马里亚纳海沟与青藏高原在地质起源和微生物代谢功能方面表现出显著相似性。鉴于深海生物采样成本极高，识别青藏高原上与马里亚纳海沟结构同源的陆地相似区具有重要意义。然而，现有模型均无法充分处理跨域地形相似性检索问题，或忽视地理知识，或牺牲计算效率。为应对这些挑战，我们提出地理知识增强的相似区识别框架（GEAR），这是一个三阶段流程，旨在从青藏高原250万平方公里范围内高效检索相似区：（1）骨架引导的筛选与裁剪：基于尺寸和线性形态标准识别候选谷地并进行初步筛选。（2）物理感知的过滤：地形波形比较器（TWC）与形态纹理模块（MTM）评估波形和纹理特征，过滤不一致的候选谷地。（3）基于图结构的精细识别：我们设计了一种融合地貌指标的孪生图网络（MSG-Net）。相应地，我们发布了针对构造碰撞带的专家标注地形相似性数据集。实验验证了各阶段的有效性。此外，MSG-Net的F1分数比当前最佳基线模型高出1.38个百分点。利用MSG-Net提取的特征，我们发现了其与生物数据的显著相关性，为未来生物分析提供了依据。

Abstract

The Mariana Trench and the Qinghai-Tibet Plateau exhibit significant similarities in geological origins and microbial metabolic functions. Given that deep-sea biological sampling faces prohibitive costs, recognizing structurally homologous terrestrial analogs of the Mariana Trench on the Qinghai-Tibet Plateau is of great significance. Yet, no existing model adequately addresses cross-domain topographic similarity retrieval, either neglecting geographical knowledge or sacrificing computational efficiency. To address these challenges, we present Geography-knowledge Enhanced Analog Recognition (GEAR) Framework, a three-stage pipeline designed to efficiently retrieve analogs from 2.5 million square kilometers of the Qinghai-Tibet Plateau: (1) Skeleton guided Screening and Clipping: Recognition of candidate valleys and initial screening based on size and linear morphological criteria. (2) Physics aware Filtering: The Topographic Waveform Comparator (TWC) and Morphological Texture Module (MTM) evaluate the waveform and texture and filter out inconsistent candidate valleys. (3) Graph based Fine Recognition: We design a Morphology-integrated Siamese Graph Network (MSG-Net) based on geomorphological metrics. Correspondingly, we release an expert-annotated topographic similarity dataset targeting tectonic collision zones. Experiments demonstrate the effectiveness of every stage. Besides, MSG-Net achieved an F1-Score 1.38 percentage points higher than the SOTA baseline. Using features extracted by MSG-Net, we discovered a significant correlation with biological data, providing evidence for future biological analysis.

Keywords: GEAR, topographic similarity, analog recognition, graph neural network, MSG-Net, Mariana Trench, Qinghai-Tibet Plateau, biological correlation

213. ❌ GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Authors: Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18625v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

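Scoring rationale: The paper evaluates LVLMs on AI-generated video detection. It relates to 'Large Language Models' (LVLMs are vision-language variants of large models), has some connection to 'Chain of Thought' and 'System 2 Thinking' (it analyzes reasoning ability), and is rated highly relevant to 'Hallucination Mitigation' and 'Mechanistic Interpretability' (detecting AI-generated content is treated as identifying fake or hallucinated content, and the benchmark offers diagnostic explanations of model behavior). The remaining keywords, covering MoE, SLMs, training techniques, inference optimization, and agents, are not addressed.

!!! tip deepseek-chat TL;DR

The paper builds GenVideoLens, a fine-grained benchmark for evaluating LVLMs on AI-generated video detection. LVLMs perform relatively well on perceptual cues but fall clearly short on optical consistency, physical interactions, and temporal-causal reasoning, and they make limited use of temporal information.

Abstract (Chinese translation)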

近年来，人工智能生成的视频日益逼真且复杂。与此同时，大型视觉语言模型在检测此类内容方面展现出强大潜力。然而，现有评估方案大多将该任务视为二元分类问题，并依赖整体准确率等粗粒度指标，难以深入揭示大型视觉语言模型在哪些方面成功或失败。为突破这一局限，我们提出了GenVideoLens——一个细粒度基准测试框架，支持从多维度评估大型视觉语言模型在AI生成视频检测中的能力。该基准包含400个高度迷惑性的人工智能生成视频与100个真实视频，所有视频均由专家根据15个真实性维度进行标注，涵盖感知特征、光学一致性、物理交互与时间因果线索。我们在该基准上评估了11个具有代表性的大型视觉语言模型。分析结果表明，模型存在显著的维度能力失衡现象：虽然在感知线索上表现相对较好，但在光学一致性、物理交互和时间因果推理方面存在明显困难。不同维度间的模型性能差异显著，部分开源小模型在特定真实性线索上的表现甚至优于更强的闭源模型。时间扰动实验进一步表明，当前大型视觉语言模型对时序信息的利用能力有限。总体而言，GenVideoLens为大型视觉语言模型的行为提供了诊断性洞察，揭示了关键能力缺口，并为改进未来AI生成视频检测系统提供了指导方向。

Abstract

In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.

Keywords: AI-generated video detection, Large Vision-Language Models, fine-grained benchmark, dimensional imbalance, temporal-causal reasoning, model evaluation, authenticity dimensions, diagnostic insights
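
The dimension-wise evaluation described in the abstract can be pictured as grouping each model verdict by authenticity dimension and scoring each group separately, instead of reporting one overall accuracy. The following is a minimal sketch under assumptions: the record layout `(dimension, is_correct)`, the function name `per_dimension_accuracy`, and the sample values are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict


def per_dimension_accuracy(records):
    """Return accuracy per dimension from (dimension, is_correct) pairs.

    The record layout is a hypothetical one chosen for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for dimension, is_correct in records:
        totals[dimension] += 1
        hits[dimension] += int(is_correct)
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Toy records (made up): two perceptual judgments, two temporal judgments.
toy_records = [
    ("perceptual", True),
    ("perceptual", True),
    ("temporal", True),
    ("temporal", False),
]
by_dimension = per_dimension_accuracy(toy_records)
overall = sum(ok for _, ok in toy_records) / len(toy_records)
```

An overall accuracy of 0.75 on these toy records would hide the gap between the two dimensions (1.0 versus 0.5), which is the kind of imbalance the benchmark is designed to expose.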

214. ❌ Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset

Authors: Lukas Bayer, Sheethal Bhat, Andreas Maier Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18616v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

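Scoring rationale: The paper addresses medical image segmentation, comparing CNN and Transformer architectures on abdominal multi-organ segmentation. It is unrelated to nearly all keywords (large-model principles, training methods, inference optimization, alignment), which mainly target large language models in natural language processing. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the study applies AI in the biomedical domain; however, it benchmarks existing models rather than contributing a new core technique, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study systematically compares CNN and Transformer architectures for multi-organ segmentation in abdominal CT and finds that the CNN model SegResNet outperforms all hybrid Transformer models (UNETR, SwinUNETR, UNETR++) on the heterogeneous RATIC dataset.

Abstract (Chinese translation)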

腹部CT扫描中的精准多器官分割对于计算机辅助诊断与治疗至关重要。尽管卷积神经网络（CNN）长期以来一直是医学图像分割的标准方法，但基于Transformer的架构因其建模长程依赖关系的能力而近期受到关注。本研究在异构的RATIC数据集上，系统性地将三种基于混合Transformer的模型——UNETR、SwinUNETR和UNETR++——与强大的CNN基线模型SegResNet进行了体积多器官分割的性能对比。该数据集包含来自全球23个机构的206例标注CT扫描，涵盖五个腹部器官。所有模型均在相同的预处理和训练条件下，以戴斯相似系数（Dice Similarity Coefficient, DSC）作为主要评估指标进行训练和评估。结果表明，基于CNN的SegResNet取得了最高的整体性能，在所有器官分割任务上均优于所有基于混合Transformer的模型。在基于Transformer的方法中，UNETR++取得了最具竞争力的结果，而UNETR则展现出显著更快的收敛速度，所需训练迭代次数更少。这些发现表明，对于中小规模的异构数据集，经过充分优化的CNN架构仍具有高度竞争力，其性能可能优于基于混合Transformer的设计。

Abstract

Accurate multi-organ segmentation in abdominal CT scans is essential for computer-aided diagnosis and treatment. While convolutional neural networks (CNNs) have long been the standard approach in medical image segmentation, transformer-based architectures have recently gained attention due to their ability to model long-range dependencies. In this study, we systematically benchmark the three hybrid transformer-based models UNETR, SwinUNETR, and UNETR++ against a strong CNN baseline, SegResNet, for volumetric multi-organ segmentation on the heterogeneous RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions worldwide, covering five abdominal organs. All models were trained and evaluated under identical preprocessing and training conditions using the Dice Similarity Coefficient (DSC) as the primary metric. The results show that the CNN-based SegResNet achieves the highest overall performance, outperforming all hybrid transformer-based models across all organs. Among the transformer-based approaches, UNETR++ delivers the most competitive results, while UNETR demonstrates notably faster convergence with fewer training iterations. These findings suggest that, for small- to medium-sized heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs.

Keywords: multi-organ segmentation, abdominal CT, CNN, Transformer, benchmarking, medical image analysis, RATIC dataset, Dice Similarity Coefficient
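
The Dice Similarity Coefficient used as the primary metric is, for one organ, twice the overlap between predicted and reference voxels divided by the total number of voxels in both masks. A minimal NumPy sketch follows; the function name, the integer label-map layout, and the choice to score two empty masks as 1.0 are assumptions for illustration, not details from the paper.

```python
import numpy as np


def dice_coefficient(pred, target, label):
    """Dice Similarity Coefficient for one label in two integer label maps."""
    pred_mask = pred == label
    target_mask = target == label
    intersection = np.logical_and(pred_mask, target_mask).sum()
    denominator = pred_mask.sum() + target_mask.sum()
    if denominator == 0:
        # Assumed convention: the label is absent from both maps.
        return 1.0
    return 2.0 * intersection / denominator


# Toy 2x3 label maps (made up): 0 = background, 1 and 2 = two organs.
pred = np.array([[0, 1, 1],
                 [0, 2, 2]])
target = np.array([[0, 1, 0],
                   [0, 2, 2]])

dsc_organ_1 = dice_coefficient(pred, target, label=1)  # 2 * 1 / (2 + 1)
dsc_organ_2 = dice_coefficient(pred, target, label=2)  # 2 * 2 / (2 + 2)
```

Averaging the per-organ values gives a single summary score, but per-organ reporting, as in the paper, shows which structures a model segments poorly.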

Authors: Bingqi Ma, Linlong Lang, Ming Zhang, Dailan He, Xingtong Ge, Yi Zhang, Guanglu Song, Yu Liu Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18600v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

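Scoring rationale: The paper focuses on cross-modal learning for joint audio-video generation using diffusion models and attention mechanisms. It does not involve large language models (LLMs), the large-model techniques named in the keywords, or applications in science. The keywords target LLMs, their underlying techniques, and AI for Science, whereas this work studies multimodal (audio-video) generation, so none of them is directly relevant.

!!! tip deepseek-chat TL;DR

The paper targets problems in dual-stream Transformer joint audio-video generation, including cross-modal interaction bias and train-inference inconsistency. It proposes the Cross-Modal Context Learning (CCL) framework, whose TARP, LCT, DCR, and UCG components improve convergence speed and generation quality, reaching state-of-the-art performance with substantially fewer resources.

Abstract (Chinese translation)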

基于双流Transformer架构的视听联合生成方法已成为当前研究的主流范式。该方法通过引入预训练的视频扩散模型与音频扩散模型，并结合跨模态交互注意力模块，能够以极少的训练数据生成高质量、时序同步的视听内容。本文首先回顾了双流Transformer范式，并深入分析了其存在的局限性，包括控制跨模态交互的门控机制引发的模型流形变化、跨模态注意力带来的多模态背景区域偏差、训练与推理阶段多模态无分类器引导（CFG）的不一致性，以及多条件间的冲突问题。为缓解上述问题，我们提出了跨模态上下文学习（CCL）方法，并配备了多个精心设计的模块。时序对齐RoPE与分区（TARP）有效增强了音频隐表示与视频隐表示之间的时序对齐能力。跨模态上下文注意力（CCA）模块中的可学习上下文令牌（LCT）与动态上下文路由（DCR）为跨模态信息提供了稳定的无条件锚点，同时根据不同训练任务进行动态路由，进一步提升了模型的收敛速度与生成质量。在推理阶段，无条件上下文引导（UCG）利用LCT提供的无条件支持，促进不同形式的CFG应用，改善了训练-推理一致性，并进一步缓解了条件冲突。通过综合评估，CCL在所需资源显著减少的情况下，相比近期学术方法实现了最先进的性能。

Abstract

The dual-stream transformer architecture-based joint audio-video generation method has become the dominant paradigm in current research. By incorporating pre-trained video diffusion models and audio diffusion models, along with a cross-modal interaction attention module, high-quality, temporally synchronized audio-video content can be generated with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and further analyze its limitations, including model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, and the inconsistencies in multi-modal classifier-free guidance (CFG) during training and inference, as well as conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) effectively enhances the temporal alignment between audio latent and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information, while dynamically routing based on different training tasks, further enhancing the model’s convergence speed and generation quality. During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts. Through comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.

Keywords: audio-video generation, cross-modal learning, diffusion models, transformer architecture, temporal alignment, classifier-free guidance, multimodal generation, attention mechanism
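
For readers unfamiliar with classifier-free guidance (CFG), the standard single-condition form extrapolates from an unconditional prediction toward a conditional one. The sketch below shows only that generic rule; it is not the paper's Unconditional Context Guidance, and the function name, array shapes, and guidance scale are illustrative assumptions.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance on two noise predictions.

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates past the conditional one.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Toy noise predictions (made up) for a two-element latent.
eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 3.0])

guided = cfg_combine(eps_uncond, eps_cond, scale=2.0)
```

With several conditions (for example text plus the other modality), each condition needs its own unconditional counterpart at inference time, which is where the train-inference inconsistencies discussed in the abstract can arise.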

216. ❌ SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

Authors: Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18599v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

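Scoring rationale: The paper focuses on inference acceleration for text-to-image generation and is highly relevant only to 'Speculative Decoding OR Inference Acceleration' (10 points), because SJD-PAC is an improved Speculative Jacobi Decoding framework designed to speed up inference. The other keywords cover large-model training, alignment, and applications; they have no direct connection to this acceleration technique and all receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses the low draft-token acceptance rates that high-entropy regions cause in autoregressive text-to-image generation. It proposes SJD-PAC, which combines a proactive drafting strategy with an adaptive continuation mechanism and achieves a $3.8\times$ inference speedup with lossless image quality.

Abstract (Chinese translation)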

推测性雅可比解码（Speculative Jacobi Decoding，SJD）提供了一种无需草稿模型的加速自回归文本到图像合成方法。然而，视觉生成的高熵特性导致在复杂区域草稿标记接受率较低，形成了严重限制整体吞吐量的瓶颈。为克服这一问题，我们提出了SJD-PAC，一种增强型SJD框架。首先，SJD-PAC采用主动草稿生成策略，以提升这些高熵挑战区域的局部接受率。其次，我们引入了自适应延续机制，在初始拒绝后维持序列验证过程，避免了完全重新采样的需求。这两项优化协同作用，显著提高了每步平均接受长度，在严格保持目标分布的同时提升了推理速度。在标准文本到图像基准测试上的实验表明，SJD-PAC实现了$3.8\times$的加速，且图像质量无损。

Abstract

Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a $3.8\times$ speedup with lossless image quality.

Keywords: Speculative Jacobi Decoding, text-to-image synthesis, inference acceleration, draft-token acceptance, proactive drafting, adaptive continuation, autoregressive generation, lossless image quality
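
The draft-token acceptance rate mentioned in the abstract refers to a probabilistic accept/reject test. The sketch below shows the standard speculative-sampling rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p − q, which keeps the output distributed according to the target model. It is a generic illustration and an assumption about the setting; the proactive drafting and adaptive continuation of SJD-PAC are not implemented, and all names are hypothetical.

```python
import numpy as np


def accept_or_resample(token, p_target, q_draft, rng):
    """Standard speculative-sampling test for one drafted token.

    p_target and q_draft are probability vectors over the vocabulary;
    token must have been drawn from q_draft, so q_draft[token] > 0.
    Returns (final_token, accepted).
    """
    ratio = p_target[token] / q_draft[token]
    if rng.random() < min(1.0, ratio):
        return token, True
    # Rejected: resample from the part of p_target not covered by q_draft.
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual)), False


rng = np.random.default_rng(0)

# When draft and target agree exactly, the drafted token is always accepted.
same = np.array([0.5, 0.5])
agree_result = accept_or_resample(1, same, same, rng)

# When the target gives the drafted token zero mass, it is always rejected.
p_peaked = np.array([1.0, 0.0])
q_flat = np.array([0.5, 0.5])
reject_result = accept_or_resample(1, p_peaked, q_flat, rng)
```

In high-entropy regions the draft and target distributions tend to overlap less, so the ratio test fails more often; this is the bottleneck the abstract describes.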

217. ❌ Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

Authors: Lu Yu, Haiyang Zhang, Changsheng Xu Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18598v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

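Scoring rationale: The paper studies the adversarial robustness of the vision-language model CLIP, at the intersection of computer vision and natural language processing. Its core contribution is improving CLIP's attention mechanism to strengthen adversarial robustness, which mainly involves applying a pre-trained model and optimizing attention. It is unrelated to most keywords (LLM, MoE, RLHF, RAG), which target pure language models or specific large-model techniques. The only relevant keyword is 'Pre-training OR Continual Pre-training OR Domain Adaptation', because the paper builds on the pre-trained CLIP model, although pre-training is not its core contribution, so it receives 5 points (some relevance). No other keyword is addressed.

!!! tip deepseek-chat TL;DR

The paper addresses the vulnerability of the pre-trained vision-language model CLIP to adversarial examples. It proposes Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) and its extension, Complementary Text-Guided Attention (Comp-TGA), which improve zero-shot robust accuracy over the state of the art by 9.58% and 11.95%, respectively, across 16 datasets.

Abstract (Chinese translation)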

得益于卓越的零样本能力，预训练视觉-语言模型（如CLIP）已在众多领域获得广泛关注与应用。然而，研究发现CLIP易受对抗样本的影响。通过实验分析，我们观察到一种现象：对抗扰动会导致文本引导注意力的偏移。基于此发现，我们提出了一种简单而有效的策略：面向零样本鲁棒性的文本引导注意力（TGA-ZSR）。该框架包含两个组件：局部注意力优化模块与全局注意力约束模块。我们的目标是保持CLIP模型的泛化能力并增强其对抗鲁棒性。此外，全局注意力约束模块利用干净样本从目标模型和原始模型中获取文本引导注意力，其目的是在保持模型在干净样本上性能的同时提升整体鲁棒性。然而，我们发现该方法有时会关注无关或虚假特征，这可能导致次优性能并在某些场景下削弱其鲁棒性。为克服这一局限，我们进一步提出了一种新方法：互补文本引导注意力（Comp-TGA）。该方法整合了两种类型的前景注意力：由类别提示引导的注意力以及由非类别提示驱动的反向注意力。这些互补的注意力机制使模型能够捕获更全面、更准确的前景表征。实验验证表明，在16个数据集上，TGA-ZSR与Comp-TGA相较于当前最先进技术，在零样本鲁棒准确率上分别实现了9.58%和11.95%的提升。

Abstract

Due to the impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP), have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: Local Attention Refinement Module and Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.

Keywords: vision-language models, CLIP, adversarial robustness, zero-shot learning, text-guided attention, attention mechanism, foreground attention, complementary attention

218. ❌ Elastic Weight Consolidation Done Right for Continual Learning

Authors: Xuan Liu, Xiaobin Chang Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18596v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

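Scoring rationale: The paper focuses on weight regularization for continual learning, specifically an improvement to Elastic Weight Consolidation (EWC). It analyzes problems in EWC's gradient-based weight importance estimation and proposes the Logits Reversal operation to correct them. The keywords all target large models (LLMs) or applications of deep learning in specific domains such as science, whereas this work improves a general continual learning algorithm and does not involve large-model techniques, large-model applications, or any specific technique named in the keywords (MoE, SFT, RAG, quantization). All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper shows that EWC's reliance on the Fisher Information Matrix causes gradient vanishing and inaccurate importance estimation in continual learning, and that its variant Memory Aware Synapses (MAS) suffers from redundant protection. It proposes the Logits Reversal operation to correct the importance estimation; the resulting method, EWC-DR, significantly outperforms EWC and its variants across continual learning tasks and datasets.

Abstract (Chinese translation)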

持续学习（CL）中的权重正则化方法通过评估并惩罚重要模型权重的变化来缓解灾难性遗忘。弹性权重巩固（EWC）是该框架中一种基础且广泛使用的方法，其基于梯度估计权重重要性。然而，该方法始终表现出次优性能。本文从基于梯度的视角对EWC中的重要性估计进行了系统性分析。我们首次发现，EWC对费舍尔信息矩阵（FIM）的依赖在某些场景下会导致梯度消失和重要性估计不准确。分析还表明，EWC的变体方法记忆感知突触（MAS）对与先前任务无关的参数施加了不必要的约束，即冗余保护问题。因此，EWC及其变体在估计权重重要性时存在根本性的错位，导致性能不佳。为解决这些问题，我们提出Logits反转（Logits Reversal，LR）操作，这是一种简单而有效的改进方法，能够修正EWC的重要性估计。具体而言，在计算FIM过程中反转logit值可有效防止梯度消失和冗余保护。在多种持续学习任务和数据集上的大量实验表明，所提方法显著优于现有EWC及其变体。因此，我们将其命名为“正确实现的EWC”（EWC Done Right，EWC-DR）。

Abstract

Weight regularization methods in continual learning (CL) alleviate catastrophic forgetting by assessing and penalizing changes to important model weights. Elastic Weight Consolidation (EWC) is a foundational and widely used approach within this framework that estimates weight importance based on gradients. However, it has consistently shown suboptimal performance. In this paper, we conduct a systematic analysis of importance estimation in EWC from a gradient-based perspective. For the first time, we find that EWC’s reliance on the Fisher Information Matrix (FIM) results in gradient vanishing and inaccurate importance estimation in certain scenarios. Our analysis also reveals that Memory Aware Synapses (MAS), a variant of EWC, imposes unnecessary constraints on parameters irrelevant to prior tasks, termed the redundant protection. Consequently, both EWC and its variants exhibit fundamental misalignments in estimating weight importance, leading to inferior performance. To tackle these issues, we propose the Logits Reversal (LR) operation, a simple yet effective modification that rectifies EWC’s importance estimation. Specifically, reversing the logit values during the calculation of FIM can effectively prevent both gradient vanishing and redundant protection. Extensive experiments across various CL tasks and datasets show that the proposed method significantly outperforms existing EWC and its variants. Therefore, we refer to it as EWC Done Right (EWC-DR).

Keywords: Continual Learning, Elastic Weight Consolidation, Weight Regularization, Fisher Information Matrix, Gradient Vanishing, Logits Reversal, Catastrophic Forgetting, Importance Estimation
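
The standard EWC penalty that the paper analyzes weights each parameter's squared drift from its previous-task value by a diagonal Fisher estimate, i.e. the mean squared per-sample gradient. The NumPy sketch below shows that baseline only; Logits Reversal is not implemented, and the function names, the regularization strength `lam`, and the toy gradients are assumptions.

```python
import numpy as np


def diagonal_fisher(per_sample_grads):
    """Diagonal Fisher estimate: mean of squared per-sample gradients.

    per_sample_grads has shape (n_samples, n_params) and holds gradients
    of the log-likelihood on the previous task.
    """
    return np.mean(np.square(per_sample_grads), axis=0)


def ewc_penalty(theta, theta_star, fisher_diag, lam):
    """Baseline EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2)


# Toy gradients (made up): the second parameter has zero gradient everywhere.
grads = np.array([[1.0, 0.0],
                  [3.0, 0.0]])
fisher = diagonal_fisher(grads)  # importance per parameter

theta_star = np.array([0.0, 0.0])  # weights after the previous task
theta = np.array([1.0, 1.0])       # weights during the new task
penalty = ewc_penalty(theta, theta_star, fisher, lam=2.0)
```

In the toy data the second parameter has zero gradient on every sample, so its Fisher value is 0 and EWC leaves it unprotected, which illustrates how vanishing gradients translate into low estimated importance.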

219. ❌ AU Codes, Language, and Synthesis: Translating Anatomy to Text for Facial Behavior Synthesis

Authors: Jiahe Wang, Cong Liang, Xuandong Huang, Yuxin Wang, Xin Yun, Yi Wu, Yanan Chang, Shangfei Wang Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18588v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

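Scoring rationale: The paper focuses on facial behavior synthesis, proposing a new method that describes Action Units (AUs) in natural language and introducing the BP4D-AUText dataset and the VQ-AUFace generative model. Although it involves text-to-image generation, its core is computer vision and facial animation; it does not touch on the topics covered by the keywords, such as large language models (LLMs), innovations in deep learning fundamentals (MoE, scaling laws, fine-tuning techniques), reasoning methods (chain of thought), agent systems, or AI for Science (bioinformatics). No keyword is relevant to the paper, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

To address the anatomically implausible artifacts that existing facial behavior synthesis methods produce when handling conflicting Action Units (AUs), the paper proposes describing AUs in natural language, builds the BP4D-AUText dataset, and develops the VQ-AUFace model, yielding facial expressions that are more anatomically plausible, behaviorally rich, and perceptually convincing.

Abstract (Chinese translation)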

面部行为合成仍是一个关键但尚未被充分探索的挑战。尽管文本到面部模型已取得进展，但它们通常依赖于粗略的情绪类别，这些类别缺乏捕捉人类非语言交流全部细微差别所需的精度。动作单元（Action Units, AUs）提供了一种更精确且基于解剖学基础的替代方案。然而，当前基于AU的方法通常将AUs编码为独热向量，将复合表情建模为单个AUs的简单线性组合。这种线性在处理冲突AUs——即那些激活同一面部肌肉但产生相反动作的AUs——时会变得问题重重。此类情况会导致解剖学上不合理的伪影和不自然的运动叠加。为解决这一问题，我们提出了一种新方法，通过AUs的自然语言描述来表征面部行为。该方法保留了AU框架的表现力，同时能够对复杂和冲突的AUs进行显式建模，并释放了现代文本到图像模型在高保真面部合成方面的潜力。为支持这一方向，我们引入了BP4D-AUText，这是首个用于复杂面部行为的大规模文本-图像配对数据集。该数据集通过将基于规则的动态AU文本处理器应用于BP4D和BP4D+数据集而合成。我们进一步提出了VQ-AUFace，这是一种生成模型，利用面部结构先验从文本合成真实且多样化的面部行为。大量的定量实验和用户研究表明，我们的方法显著优于现有方法。它在生成解剖学上合理、行为丰富且在感知上令人信服的面部表情方面表现出色，尤其是在涉及冲突AUs的挑战性条件下。

Abstract

Facial behavior synthesis remains a critical yet underexplored challenge. While text-to-face models have made progress, they often rely on coarse emotion categories, which lack the nuance needed to capture the full spectrum of human nonverbal communication. Action Units (AUs) provide a more precise and anatomically grounded alternative. However, current AU-based approaches typically encode AUs as one-hot vectors, modeling compound expressions as simple linear combinations of individual AUs. This linearity becomes problematic when handling conflicting AUs–defined as those which activate the same facial muscle with opposing actions. Such cases lead to anatomically implausible artifacts and unnatural motion superpositions. To address this, we propose a novel method that represents facial behavior through natural language descriptions of AUs. This approach preserves the expressiveness of the AU framework while enabling explicit modeling of complex and conflicting AUs. It also unlocks the potential of modern text-to-image models for high-fidelity facial synthesis. Supporting this direction, we introduce BP4D-AUText, the first large-scale text-image paired dataset for complex facial behavior. It is synthesized by applying a rule-based Dynamic AU Text Processor to the BP4D and BP4D+ datasets. We further propose VQ-AUFace, a generative model that leverages facial structural priors to synthesize realistic and diverse facial behaviors from text. Extensive quantitative experiments and user studies demonstrate that our approach significantly outperforms existing methods. It excels in generating facial expressions that are anatomically plausible, behaviorally rich, and perceptually convincing, particularly under challenging conditions involving conflicting AUs.

Keywords: Facial Behavior Synthesis, Action Units (AUs), Natural Language Descriptions, BP4D-AUText Dataset, VQ-AUFace Model, Conflicting AUs, Text-to-Image Models, Anatomically Plausible Expressions
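
The contrast the abstract draws between vector-encoded AUs and natural-language AU descriptions can be sketched as follows. This is only an illustration, not the paper's rule-based Dynamic AU Text Processor: the AU names follow the standard FACS labels, while the intensity words, the sentence template, and the function names `multi_hot` and `describe` are hypothetical. AU12 and AU15 are used as an illustrative pair that moves the lip corners in opposite directions; the abstract does not list which pairs the authors treat as conflicting.

```python
# Standard FACS names for a few Action Units.
AU_NAMES = {
    1: "inner brow raiser",
    4: "brow lowerer",
    12: "lip corner puller",
    15: "lip corner depressor",
}

def multi_hot(active, au_ids):
    """Vector encoding: one slot per AU, with no way to say how AUs interact."""
    return [1 if au in active else 0 for au in au_ids]

def describe(active_with_intensity):
    """Text encoding: each AU is named and qualified (hypothetical template)."""
    parts = [
        f"{AU_NAMES[au]} ({level})"
        for au, level in sorted(active_with_intensity.items())
    ]
    return "A face showing " + ", ".join(parts) + "."

au_ids = [1, 4, 12, 15]
vector = multi_hot({12, 15}, au_ids)
prompt = describe({12: "strong", 15: "slight"})
print(vector)
print(prompt)
```

The vector only records that both AUs are on, whereas the text form can carry relative intensity or ordering that a text-to-image model may condition on.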

220. ❌ Color image restoration based on nonlocal saturation-value similarity

Authors: Wei Wang, Yakun Li Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18586v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

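Scoring rationale: The paper studies a nonlocal variational method based on saturation-value similarity for color image restoration, which belongs to traditional image processing and computer vision. It does not involve large models, deep learning, AI for Science, or any technique among the scoring keywords (LLMs, MoE, RLHF, RAG, quantization, and so on). No keyword is relevant to the paper's topic, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a nonlocal variational method based on saturation-value similarity for color image restoration and shows experimentally that it outperforms the other tested methods in visual quality and quantitative metrics.

Abstract (Chinese translation)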

本文提出并发展了一种基于饱和度-明度相似性的新型非局部变分技术，用于彩色图像复原。在传统非局部方法中，图像块通常直接从彩色图像的红、绿、蓝通道中提取，由于块相似性主要基于独立通道的灰度值，色彩信息难以得到精细描述。本文的主要目标是提出并发展一种新颖的非局部正则化方法，该方法通过考虑彩色图像在饱和度-明度通道中图像块的相似性来实现。具体而言，我们首先通过将彩色图像块的饱和度-明度相似性融入所提出的非局部梯度中，建立了基于饱和度-明度相似性的非局部全变分模型，该模型能够描述两个相邻彩色图像块在饱和度与明度上的相似性。随后，基于此饱和度-明度相似性非局部全变分，构建了相应的非局部变分模型。此外，我们设计了一种高效且有效的算法，采用布雷格曼算子分裂法对所提出的优化问题进行数值求解，并对算法的收敛性进行了研究。数值实验表明，在视觉质量及多项定量指标——包括峰值信噪比、结构相似性指数、四元数结构相似性指数以及S-CIELAB色彩误差——方面，所提出模型的性能均优于其他对比测试方法。

Abstract

In this paper, we propose and develop a novel nonlocal variational technique based on saturation-value similarity for color image restoration. In traditional nonlocal methods, image patches are extracted from red, green and blue channels of a color image directly, and the color information can not be described finely because the patch similarity is mainly based on the grayscale value of independent channel. The main aim of this paper is to propose and develop a novel nonlocal regularization method by considering the similarity of image patches in saturation-value channel of a color image. In particular, we first establish saturation-value similarity based nonlocal total variation by incorporating saturation-value similarity of color image patches into the proposed nonlocal gradients, which can describe the saturation and value similarity of two adjacent color image patches. The proposed nonlocal variational models are then formulated based on saturation-value similarity based nonlocal total variation. Moreover, we design an effective and efficient algorithm to solve the proposed optimization problem numerically by employing bregmanized operator splitting method, and we also study the convergence of the proposed algorithms. Numerical examples are presented to demonstrate that the performance of the proposed models is better than that of other testing methods in terms of visual quality and some quantitative metrics including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), quaternion structural similarity index (QSSIM) and S-CIELAB color error.

Keywords: color image restoration, nonlocal variational method, saturation-value similarity, nonlocal total variation, bregmanized operator splitting, PSNR, SSIM, QSSIM
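
The idea of measuring patch similarity in saturation and value rather than per RGB channel can be sketched as below. This is a minimal illustration under stated assumptions: it uses the common HSV definitions (value = max channel, saturation = (max - min) / max) and a Gaussian patch weight as in classical nonlocal means; the paper's own saturation-value definition and nonlocal gradient may differ. The function names and the bandwidth `h` are hypothetical.

```python
import numpy as np

def saturation_value(rgb):
    """HSV-style saturation and value for a float image (H, W, 3) in [0, 1]."""
    v = rgb.max(axis=-1)
    chroma = v - rgb.min(axis=-1)
    s = np.where(v > 0, chroma / np.maximum(v, 1e-12), 0.0)
    return s, v

def patch(channel, i, j, r):
    """Square patch of radius r centered at (i, j)."""
    return channel[i - r:i + r + 1, j - r:j + r + 1]

def sv_weight(rgb, p, q, r=1, h=0.1):
    """Gaussian similarity of two patches compared in saturation and value."""
    s, v = saturation_value(rgb)
    d2 = sum(
        np.sum((patch(ch, p[0], p[1], r) - patch(ch, q[0], q[1], r)) ** 2)
        for ch in (s, v)
    )
    return float(np.exp(-d2 / h ** 2))

gray = np.full((5, 5, 3), 0.5)
with_red = gray.copy()
with_red[1, 1] = [1.0, 0.0, 0.0]  # one saturated red pixel

print(sv_weight(gray, (1, 1), (3, 3)))      # identical patches
print(sv_weight(with_red, (1, 1), (3, 3)))  # patches differ in saturation and value
```

Weights like these would typically scale the differences between pixel pairs inside a nonlocal total variation term, so that only patches with similar color character regularize each other.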

221. ❌ HAViT: Historical Attention Vision Transformer

Authors: Swarnendu Banik, Manish Das, Shiv Ram Dubey, Satish Kumar Singh Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18585v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

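Scoring rationale: HAViT targets architectural improvements to the Vision Transformer in computer vision, proposing a cross-layer attention propagation method. It is unrelated to all scoring keywords (which center on large language models, training techniques, inference optimization, agent systems, and so on), so every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

To address the restricted information flow caused by attention operating independently in each Vision Transformer layer, the paper proposes a cross-layer attention propagation method that preserves and integrates historical attention matrices, improving accuracy on CIFAR-100 (75.74% to 77.07%) and TinyImageNet (57.82% to 59.07%).

Abstract (Chinese translation)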

视觉Transformer在计算机视觉领域表现卓越，但其注意力机制在各层间独立运行，限制了信息流动与特征学习。我们提出一种有效的跨层注意力传播方法，该方法在编码器层间保存并整合历史注意力矩阵，从而为视觉Transformer中的层间信息流提供了原理性的优化。这一方法能够在Transformer层级结构中实现注意力模式的渐进式优化，增强特征获取与优化动态性。该方法仅需极少的架构改动，仅增加注意力矩阵存储与融合操作。在CIFAR-100和TinyImageNet数据集上的综合实验表明，该方法能持续提升模型精度：ViT在CIFAR-100上的性能从75.74%提升至77.07%（+1.33%），在TinyImageNet上从57.82%提升至59.07%（+1.25%）。跨架构验证显示，该方法在不同Transformer变体上均能带来类似增益，其中CaiT模型性能提升1.01%。系统分析表明，历史注意力融合超参数（alpha = 0.45）在所有配置中均为最优值，在当前注意力信息与历史注意力信息间提供了理想平衡。随机初始化策略持续优于零值初始化，表明多样化的初始注意力模式能加速收敛并提升最终性能。我们的代码已公开于https://github.com/banik-s/HAViT。

Abstract

Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at https://github.com/banik-s/HAViT.

Keywords: Vision Transformer, cross-layer attention, historical attention, attention propagation, encoder layers, feature learning, CIFAR-100, TinyImageNet
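
The blending of current and historical attention described in the abstract can be sketched in a few lines. The abstract gives the blending weight (alpha = 0.45) and says random initialization of the history works better than zeros, but it does not state the exact update rule; the convex combination and the choice to carry the blended matrix forward as the next layer's history are assumptions. The function name `blended_attention_stack` is hypothetical, and the authors' repository should be consulted for the actual rule.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def blended_attention_stack(scores_per_layer, alpha=0.45, seed=0):
    """Blend each layer's attention with a running history (assumed rule)."""
    rng = np.random.default_rng(seed)
    n = scores_per_layer[0].shape[-1]
    history = softmax(rng.normal(size=(n, n)))  # random initial history
    blended_layers = []
    for scores in scores_per_layer:
        current = softmax(scores)
        blended = (1.0 - alpha) * current + alpha * history
        blended_layers.append(blended)
        history = blended  # assumption: the blend becomes the next history
    return blended_layers

rng = np.random.default_rng(1)
layer_scores = [rng.normal(size=(4, 4)) for _ in range(3)]
attn = blended_attention_stack(layer_scores)
print(attn[-1].sum(axis=-1))  # each row still sums to 1
```

Because both inputs are row-stochastic, the convex blend stays a valid attention matrix, so the only extra cost is storing one matrix per head and one weighted sum per layer.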

222. ❌ UEPS: Robust and Efficient MRI Reconstruction

Authors: Xiang Zhou, Hong Shang, Zijian Zhan, Tianyu He, Jintao Meng, Dong Liang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18572v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

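Scoring rationale: The paper focuses on UEPS, a deep learning model for medical imaging (MRI) reconstruction, an application of AI in science and biomedicine, so it is somewhat related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). It does not involve large language models (LLMs), MoE, small models, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning techniques, agents, quantization, accelerated decoding, hallucination mitigation, interpretability, world models, model merging, or in-context learning; those keywords are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the poor robustness of deep unrolled models under domain shift in accelerated MRI reconstruction, the paper proposes UEPS, a new architecture that does not depend on coil sensitivity map estimation, and reports state-of-the-art robustness with low-latency inference suitable for real-time deployment across 10 out-of-distribution test sets.

Abstract (Chinese translation)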

深度展开模型已成为加速磁共振成像重建的先进技术，但其在域偏移下的鲁棒性仍是临床应用的重大障碍。本研究指出线圈灵敏度图估计是限制模型泛化能力的主要瓶颈。为此，我们提出UEPS——一种新型深度展开模型架构，其包含三项关键创新：（1）采用非依赖线圈灵敏度图的展开扩展设计，通过独立重建各线圈数据消除对CSM的依赖；（2）渐进式分辨率重建机制，利用k空间到图像的映射实现高效的由粗到细优化；（3）针对磁共振一维欠采样特性定制的稀疏注意力模块。这些基于物理原理的设计在提升鲁棒性的同时实现了计算效率的优化。我们构建了大规模零样本迁移基准测试集，涵盖解剖结构、成像视角、对比度、设备厂商、场强及线圈配置等10种跨临床场景的分布外测试集。大量实验表明，UEPS在所有分布外测试中持续显著优于现有深度展开模型、端到端方法、扩散模型及无训练方法，以适用于实时部署的低延迟推理能力实现了当前最优的鲁棒性表现。

Abstract

Deep unrolled models (DUMs) have become the state of the art for accelerated MRI reconstruction, yet their robustness under domain shift remains a critical barrier to clinical adoption. In this work, we identify coil sensitivity map (CSM) estimation as the primary bottleneck limiting generalization. To address this, we propose UEPS, a novel DUM architecture featuring three key innovations: (i) an Unrolled Expanded (UE) design that eliminates CSM dependency by reconstructing each coil independently; (ii) progressive resolution, which leverages k-space-to-image mapping for efficient coarse-to-fine refinement; and (iii) sparse attention tailored to MRI’s 1D undersampling nature. These physics-grounded designs enable simultaneous gains in robustness and computational efficiency. We construct a large-scale zero-shot transfer benchmark comprising 10 out-of-distribution test sets spanning diverse clinical shifts – anatomy, view, contrast, vendor, field strength, and coil configurations. Extensive experiments demonstrate that UEPS consistently and substantially outperforms existing DUM, end-to-end, diffusion, and untrained methods across all OOD tests, achieving state-of-the-art robustness with low-latency inference suitable for real-time deployment.

Keywords: MRI reconstruction, deep unrolled models, robustness, domain shift, coil sensitivity map, zero-shot transfer, sparse attention, real-time deployment
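
Two ingredients mentioned in the abstract, 1D undersampling and coil-by-coil reconstruction without coil sensitivity maps, can be illustrated with a classical baseline. The sketch below is not UEPS: it replaces the learned network with a zero-filled inverse FFT per coil and combines coils by root-sum-of-squares, a standard CSM-free combination that is assumed here only for illustration. The mask layout (every fourth line plus a fully sampled center) and all function names are hypothetical.

```python
import numpy as np

AXES = (-2, -1)

def to_kspace(image):
    """Centered 2-D FFT over the last two axes."""
    return np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(image, axes=AXES), axes=AXES), axes=AXES)

def to_image(kspace):
    """Centered 2-D inverse FFT over the last two axes."""
    return np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(kspace, axes=AXES), axes=AXES), axes=AXES)

def undersample_1d_mask(n_pe, accel=4, n_center=8):
    """Keep every accel-th phase-encode line plus a fully sampled center band."""
    mask = np.zeros(n_pe, dtype=bool)
    mask[::accel] = True
    c = n_pe // 2
    mask[c - n_center // 2:c + n_center // 2] = True
    return mask

def zero_filled_rss(kspace, mask):
    """Reconstruct each coil independently, then combine without any CSM."""
    masked = kspace * mask[None, :, None]  # mask is 1-D: whole lines are dropped
    coil_images = to_image(masked)
    return np.sqrt(np.sum(np.abs(coil_images) ** 2, axis=0))

rng = np.random.default_rng(0)
coils = rng.normal(size=(2, 16, 16))  # toy two-coil image data
kspace = to_kspace(coils)
mask = undersample_1d_mask(16, accel=4, n_center=4)
recon = zero_filled_rss(kspace, mask)
print(recon.shape, int(mask.sum()))
```

Because the mask varies along one axis only, aliasing is confined to that direction, which is the structure the abstract says its sparse attention is tailored to.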

223. ❌ CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Authors: Jiacheng Tang, Zhiyuan Zhou, Zhuolin He, Jia Zhang, Kai Zhang, Jian Pu Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18561v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

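Scoring rationale: CausalVAD addresses causal confusion in end-to-end autonomous driving and proposes a de-confounding training framework based on causal intervention. Although it is an AI application, its content has no direct connection to the scoring keywords (which center on large-model techniques, training methods, inference optimization, agent systems, and so on). It does not involve language models, model training techniques, reasoning methods, agent systems, or AI-for-science applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

To address causal confusion in end-to-end autonomous driving models, the paper proposes the CausalVAD framework, which removes spurious associations through causal intervention and achieves state-of-the-art planning accuracy and safety on the nuScenes benchmark.

Abstract (Chinese translation)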

面向规划的端到端驾驶模型展现出巨大潜力，但其本质上学习的是统计相关性而非真实的因果关系。这种缺陷会导致因果混淆问题，即模型利用数据集偏差作为捷径，严重损害其在复杂场景下的可靠性与安全性。为解决这一问题，我们提出了CausalVAD——一种基于因果干预的去混淆训练框架。其核心是设计了稀疏因果干预方案（Sparse Causal Intervention Scheme, SCIS），这是一个轻量级即插即用模块，用于在神经网络中实例化后门调整理论。SCIS构建了一个表征潜在驾驶场景的原型字典，并利用该字典对模型的稀疏向量化查询进行干预。这一步骤主动消除了由混淆变量引发的虚假关联，从而从下游任务的表征中剔除干扰因素。在nuScenes等基准测试上的大量实验表明，CausalVAD实现了最先进的规划精度与安全性。此外，我们的方法在针对数据偏差和专门设计用于诱发因果混淆的噪声场景中，均表现出卓越的鲁棒性。

Abstract

Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model’s sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby eliminating spurious factors from the representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.

Keywords: autonomous driving, causal intervention, de-confounding, end-to-end driving, planning accuracy, causal confusion, sparse causal intervention, robustness
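
The backdoor adjustment that the abstract says SCIS instantiates has a simple discrete form, P(Y | do(X=x)) = sum over z of P(Y | X=x, Z=z) P(Z=z). The toy example below is not the paper's module; the probability tables are made up to show how a confounder Z creates a correlation between X and Y that disappears under adjustment. In the paper's setting the prototype dictionary would presumably play the role of the strata z, which is an interpretation of the abstract rather than a stated detail.

```python
import numpy as np

def backdoor_adjust(p_y_given_xz, p_z):
    """P(Y=1 | do(X=x)): average over the confounder's marginal P(z)."""
    return p_y_given_xz @ p_z

def observational(p_y_given_xz, p_z_given_x):
    """P(Y=1 | X=x): average over P(z | x), which lets Z leak into X's effect."""
    return np.sum(p_y_given_xz * p_z_given_x, axis=1)

# Rows index x, columns index z. Y depends on Z only, not on X.
p_y_given_xz = np.array([[0.2, 0.6],
                         [0.2, 0.6]])
p_z = np.array([0.5, 0.5])
# X is strongly correlated with Z in the (hypothetical) training data.
p_z_given_x = np.array([[0.9, 0.1],
                        [0.1, 0.9]])

observed = observational(p_y_given_xz, p_z_given_x)
do_effect = backdoor_adjust(p_y_given_xz, p_z)
print(observed)   # differs across x: a spurious association
print(do_effect)  # equal across x: no causal effect of X on Y
```

A model fit to the observational quantity would learn X as a shortcut; the adjusted quantity shows that the shortcut carries no causal signal.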

224. ❌ HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Authors: Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18558v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

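Scoring rationale: The paper proposes HiMu, a framework that uses a text-only LLM to decompose a query into a hierarchical logic tree and routes its leaves to lightweight vision and audio experts, addressing frame selection in long video question answering. Relevant keywords: (1) 'Context Window Extension OR Long Context LLMs' (10 points): directly addresses long-video reasoning under the finite context windows of LVLMs; (2) 'Large Language Models OR LLMs OR Foundation Models' (8 points): uses an LLM for query decomposition; (3) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (5 points): involves a hierarchical logic tree and reasoning; (4) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5 points): compared against agent-based methods; (5) 'Tool Use OR Function Calling OR API Tool Use' (5 points): routes predicates to expert modules; (6) 'Mixture of Experts OR MoE OR Sparse Models' (5 points): uses lightweight experts. The remaining keywords are unrelated to the paper or not covered by it.

!!! tip deepseek-chat TL;DR

To address the finite context windows that constrain large vision-language models in long video question answering, the paper proposes HiMu, a framework that uses a single text-only LLM call to decompose the query into a hierarchical logic tree and route its predicates to multimodal experts, advancing the efficiency-accuracy Pareto front on several benchmarks.

Abstract (Chinese translation)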

长视频问答需要对长时序上下文进行推理，这使得帧选择对于受限于有限上下文窗口的大型视觉语言模型至关重要。现有方法面临尖锐的权衡：基于相似性的选择器速度快，但将组合式查询压缩为单个密集向量，丢失了子事件顺序和跨模态绑定；基于智能体的方法通过迭代式LVLM推理恢复此结构，但代价高昂。我们提出HiMu，一个无需训练即可弥合此差距的框架。仅需一次纯文本LLM调用，即可将查询分解为层次化逻辑树，其叶节点为原子谓词，每个谓词被路由至轻量级专家模块（涵盖视觉领域如CLIP、开放词汇检测、OCR以及音频领域如ASR、CLAP）。生成的信号经过归一化和时序平滑处理以对齐不同模态，并通过强制执行时序顺序与邻接关系的模糊逻辑运算符自底向上组合，最终生成连续满足度曲线。在Video-MME、LongVideoBench和HERBench-Lite上的评估表明，HiMu推进了效率-准确率的帕累托前沿：在使用Qwen3-VL 8B模型处理16帧时，其性能超越所有竞争性选择器；在使用GPT-4o模型时，其表现优于在32-512帧上运行的智能体系统，同时所需计算量（FLOPs）减少约10倍。

Abstract

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.

Keywords: Long-form video question answering, Frame selection, Large vision-language models, Hierarchical logic tree, Multimodal experts, Temporal context, Efficiency-accuracy Pareto front
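
The bottom-up composition step described in the abstract (normalize, smooth, combine with fuzzy-logic operators, then pick frames from the satisfaction curve) can be sketched as follows. The operators are assumptions: min and max are the classic fuzzy AND and OR, and `then` is one hypothetical way to enforce temporal order (a running maximum of the first signal gated by the second). The abstract does not specify HiMu's actual operators, and the per-frame expert scores here are made up.

```python
import numpy as np

def normalize(x):
    """Min-max scale a per-frame score track to [0, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def smooth(x, width=3):
    """Moving-average temporal smoothing."""
    return np.convolve(x, np.ones(width) / width, mode="same")

def fuzzy_and(a, b):
    return np.minimum(a, b)

def fuzzy_or(a, b):
    return np.maximum(a, b)

def then(a, b):
    """b holds now and a held at this frame or earlier (assumed operator)."""
    return np.minimum(np.maximum.accumulate(a), b)

def top_k_frames(curve, k):
    """Indices of the k highest-scoring frames, in temporal order."""
    return sorted(np.argsort(curve)[::-1][:k].tolist())

# Hypothetical expert outputs over four frames.
door_opens = np.array([0.0, 1.0, 0.0, 0.0])    # e.g. from a visual expert
says_hello = np.array([0.0, 0.0, 0.0, 1.0])    # e.g. from an ASR expert

forward = then(door_opens, says_hello)   # "door opens, then someone says hello"
backward = then(says_hello, door_opens)  # the reverse order is not satisfied
print(forward, backward)
print(top_k_frames(np.array([0.1, 0.9, 0.3, 0.8]), k=2))
```

A single dense similarity score could not tell the two orderings apart, which is the compositional information the abstract says similarity-based selectors lose.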

225. ❌ End-to-End QGAN-Based Image Synthesis via Neural Noise Encoding and Intensity Calibration

Authors: Xue Yang, Rigui Zhou, Shizheng Jia, Dax Enshan Koh, Siong Thye Goh, Yaochong Li, Hongyu Chen, Fuhui Xiong Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18554v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

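Scoring rationale: The paper focuses on quantum generative adversarial networks (QGANs) for image synthesis, which belongs to quantum machine learning. Almost all keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on) and are unrelated to the paper's quantum computing and image generation topic. Only the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', is somewhat related, since quantum computing can be viewed as an application of AI in science; the paper does not explicitly mention bioinformatics or cheminformatics, so it receives 5 points (somewhat related).

!!! tip deepseek-chat TL;DR

To address the inability of quantum generative adversarial networks (QGANs) to generate complete images directly, the paper proposes the ReQGAN framework, which uses a Neural Noise Encoder and an Intensity Calibration module to synthesize images end to end with a single quantum circuit, validated on MNIST and Fashion-MNIST.

Abstract (Chinese translation)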

量子生成对抗网络（QGANs）为在近期量子设备上学习数据分布提供了一条前景广阔的路径。然而，现有用于图像合成的QGANs避开了直接的全图像生成，依赖于经典后处理或基于图像块的方法。这些方法削弱了量子生成器的作用，且难以捕捉全局图像语义。为解决此问题，我们提出了ReQGAN，这是一个端到端的框架，它使用单个D量子比特的量子电路来合成整个N=2^D像素的图像。ReQGAN克服了阻碍直接像素生成的两个基本瓶颈：(1) 僵化的经典-量子噪声接口，以及(2) 归一化量子统计输出与期望的像素强度空间之间的不匹配。我们引入了一个可学习的神经噪声编码器（Neural Noise Encoder）用于自适应态制备，以及一个可微分的强度校准模块（Intensity Calibration module），将测量结果映射到一个稳定且视觉上有意义的像素域。在MNIST和Fashion-MNIST数据集上的实验表明，ReQGAN在严格的量子比特预算下实现了稳定的训练和有效的图像合成，消融研究验证了每个组件的贡献。

Abstract

Quantum Generative Adversarial Networks (QGANs) offer a promising path for learning data distributions on near-term quantum devices. However, existing QGANs for image synthesis avoid direct full-image generation, relying on classical post-processing or patch-based methods. These approaches dilute the quantum generator’s role and struggle to capture global image semantics. To address this, we propose ReQGAN, an end-to-end framework that synthesizes an entire N=2^D-pixel image using a single D-qubit quantum circuit. ReQGAN overcomes two fundamental bottlenecks hindering direct pixel generation: (1) the rigid classical-to-quantum noise interface and (2) the output mismatch between normalized quantum statistics and the desired pixel-intensity space. We introduce a learnable Neural Noise Encoder for adaptive state preparation and a differentiable Intensity Calibration module to map measurements to a stable, visually meaningful pixel domain. Experiments on MNIST and Fashion-MNIST demonstrate that ReQGAN achieves stable training and effective image synthesis under stringent qubit budgets, with ablation studies verifying the contribution of each component.

Keywords: Quantum Generative Adversarial Networks, QGAN, Image Synthesis, End-to-End Framework, Neural Noise Encoder, Intensity Calibration, Quantum Circuit, MNIST
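
The second bottleneck named in the abstract, the mismatch between normalized measurement statistics and pixel intensities, follows from simple arithmetic: a D-qubit register yields N = 2^D outcome probabilities that sum to 1, so the average value is 1/N, far below a typical pixel range of [0, 1]. The sketch below simulates this with a classical state vector. The `calibrate` function (an affine map followed by a sigmoid, with hypothetical `gain` and `bias` parameters) is only a stand-in for the paper's learnable Intensity Calibration module, whose form the abstract does not give.

```python
import numpy as np

def measurement_probs(state):
    """Born-rule probabilities of a (normalized) state vector."""
    state = np.asarray(state, dtype=complex)
    state = state / np.linalg.norm(state)
    return np.abs(state) ** 2

def calibrate(probs, gain, bias):
    """Hypothetical calibration: rescale by N, then squash into (0, 1)."""
    n = probs.size
    return 1.0 / (1.0 + np.exp(-(gain * n * probs + bias)))

D = 3                       # qubits
uniform = np.ones(2 ** D)   # equal superposition over all basis states
probs = measurement_probs(uniform)
pixels = calibrate(probs, gain=1.0, bias=-1.0)

print(probs.size, probs.sum(), probs.max())  # N outcomes, total mass 1
print(pixels)                                 # intensities in (0, 1)
```

As a size check, a 28x28 MNIST image has 784 pixels, so D = 10 (1024 outcomes) is the smallest register with 2^D >= 784; how the paper matches image size to register size is not stated in the abstract.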

226. ❌ CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

Authors: Xiang Chen, Fangfang Yang, Chunlei Meng, Chengyin Hu, Ang Li, Yiwei Wei, Jiahuan Long, Jiujiang Guo Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18545v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

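Scoring rationale: The paper studies the robustness of medical vision-language models (MVLMs) in clinical workflows. It proposes the CoDA framework to simulate distribution-shift attacks that arise in the clinical image-processing pipeline, and it evaluates how reliably multimodal large language models (MLLMs) act as auditors of image authenticity. It is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points) because it focuses on AI applications in medical imaging. It is moderately relevant to 'Large Language Models OR LLMs OR Foundation Models' (5 points) because it evaluates MLLMs, and to 'Instruction Tuning OR Alignment OR Value Alignment' (5 points) because it uses teacher-guided token-space adaptation with alignment to improve robustness. The remaining keywords have no direct relation to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the CoDA framework to simulate distribution-shift attacks in the clinical image-processing pipeline, exposes robustness vulnerabilities of medical vision-language models in real clinical workflows, and shows that lightweight alignment can improve deployment robustness.

Abstract (Chinese translation)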

医学视觉-语言模型（MVLMs）正日益被用作放射学流程中的感知骨干和多模态助手的视觉前端，但其在真实临床工作流程下的可靠性仍未得到充分探索。先前的鲁棒性评估通常假设输入数据洁净且经过人工筛选，或仅研究孤立的图像损坏形式，忽视了常规的采集、重建、显示及传输操作——这些操作在保持临床可读性的同时会改变图像统计特性。为填补这一空白，我们提出了CoDA，一种链式分布偏移框架，通过组合模拟采集过程的阴影效应、重建与显示的重映射、以及传输与导出过程中的图像退化，构建临床可信的流程偏移。在掩码结构相似性约束下，CoDA联合优化各阶段组合与参数，以在保持视觉合理性的同时诱发模型失效。在脑部MRI、胸部X射线和腹部CT数据上，CoDA显著降低了CLIP风格MVLMs的零样本性能，且链式组合的破坏性始终强于任何单一阶段。我们还评估了多模态大语言模型（MLLMs）作为图像真实性与质量（而非病理）的技术真实性审核者的能力。实验发现，商用多模态模型在CoDA偏移样本上表现出审核可靠性下降及持续的高置信度错误，而我们所测试的医学专用MLLMs在医学图像质量审核方面存在明显缺陷。最后，我们提出一种基于教师引导的令牌空间适应与局部块对齐的事后修复策略，该策略提升了模型对已归档CoDA输出图像的识别准确率。总体而言，我们的研究揭示了MVLM部署中基于临床实践的威胁面，并表明轻量级的对齐方法能够提升部署时的鲁棒性。

Abstract

Medical vision–language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.

Keywords: Medical vision-language models, Chain-of-distribution attacks, Clinical robustness, Multimodal large language models, Image quality auditing, Token-space adaptation, Post-hoc repair, Medical imaging

227. ❌ Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

Authors: Yongwei Jiang, Yixiong Zou, Yuhua Li, Ruixuan Li Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18541v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

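Scoring rationale: The paper studies cross-domain few-shot object detection (CD-FSOD), which belongs to computer vision rather than to direct research on large language models or deep learning principles. It involves fine-tuning pretrained detectors, which relates to 'Pre-training' and 'Post-training' (5 points each), but it does not touch on large models, MoE, quantization, inference acceleration, alignment, RAG, or similar keywords. Other keywords, such as AI for Science and explainable AI, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes a biologically inspired center-periphery attention refinement framework that addresses the target-domain astigmatism problem in cross-domain few-shot object detection, improving detection accuracy and setting new state-of-the-art results on six CD-FSOD benchmarks.

Abstract (Chinese translation)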

跨域少样本目标检测（CD-FSOD）旨在将预训练的检测器从源域适应到标注有限的目标域，其面临严重的域偏移和数据稀缺问题。在本研究中，我们发现了一个先前被忽视的现象：模型在目标域中表现出分散且不集中的注意力，导致定位不精确和冗余预测，正如人类无法聚焦于视觉物体一样。因此，我们将其称为目标域散光问题。通过对Transformer各层注意力距离的分析，我们发现常规微调本质上呈现出缓解此问题的趋势，但效果仍远未令人满意，这正是本文旨在增强的方向。受人类中央凹式视觉系统的生物学启发，我们通过一个中心-外围注意力细化框架来增强微调的内在趋势，该框架包含：（1）正模式细化模块，利用类别特定原型重塑对语义物体的注意力，模拟视觉中心区域；（2）负上下文调制模块，通过建模背景上下文增强边界判别能力，模拟视觉外围区域；以及（3）文本语义对齐模块，通过跨模态线索强化中心-外围区分。我们这种受生物启发的方法将散光式注意力转化为聚焦模式，显著提升了对目标域的适应能力。在六个具有挑战性的CD-FSOD基准测试上的实验一致表明，检测精度得到改善，并取得了新的最先进结果。

Abstract

Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, just like a human cannot focus on visual objects. Therefore, we call it the target-domain Astigmatism problem. Analysis on attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but results are still far from satisfactory, which we aim to enhance in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning’s inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.

Keywords: Cross-domain few-shot object detection, Target-domain Astigmatism, Attention refinement, Center-periphery attention, Transformer layers, Fine-tuning, Bio-inspired approach, State-of-the-art

228. ❌ SCISSR

Authors: Haonan Ping, Jian Jiang, Cheng Yuan, Qizhen Sun, Lv Wu, Yutong Ban Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18544v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

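Scoring rationale: SCISSR focuses on medical image segmentation and proposes a scribble-prompted framework for interactive surgical scene segmentation. It is unrelated to most of the LLM-oriented keywords because its core is an image segmentation task in computer vision rather than natural language processing. Only two keywords are relevant: (1) 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points), because the paper explicitly uses LoRA adapters for parameter-efficient fine-tuning; and (2) 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), because the paper applies AI to the biomedical domain (surgical scenes). Other keywords, such as LLMs, MoE, and Scaling Laws, are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes SCISSR, a scribble-prompted framework for interactive surgical scene segmentation. By adding a lightweight Scribble Encoder and LoRA adapters, it fine-tunes efficiently while preserving the pretrained model's capabilities, and it outperforms iterative point prompting on the EndoVis 2018 and CholecSeg8k datasets.

Abstract (Chinese translation)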

由于不规则形状、细薄结构、镜面反射及频繁遮挡等问题，手术场景中组织与器械的精确分割通常需要大量标注工作。尽管SAM模型支持点、框和掩码提示，但点提示往往过于稀疏，框提示则过于粗略，难以准确定位此类具有挑战性的目标。本文提出SCISSR——一种可基于涂鸦提示的交互式手术场景分割框架。该框架引入了一个轻量级涂鸦编码器，能够将自由绘制的涂鸦转化为与掩码解码器兼容的密集提示嵌入，从而允许用户通过在错误区域绘制修正笔触来迭代优化目标对象的分割结果。由于所有新增模块（包括涂鸦编码器、空间门控融合模块及LoRA适配器）仅通过标准嵌入接口与主干网络交互，本框架不依赖于单一模型：本研究基于SAM 2构建，但相同组件无需结构修改即可迁移至其他提示驱动分割架构（如SAM 3）。为保留预训练模型的原有能力，我们仅训练这些轻量级新增模块，同时保持主干网络参数冻结。在EndoVis 2018数据集上的实验显示出优异的域内性能，而在分布外数据集CholecSeg8k上的评估进一步证实了其跨手术领域的鲁棒性。SCISSR在EndoVis 2018上经过五轮交互达到95.41%的Dice系数，在CholecSeg8k上经过三轮交互达到96.30%的Dice系数，在两个基准测试中均优于迭代点提示方法。

Abstract

Accurate segmentation of tissues and instruments in surgical scenes is annotation-intensive due to irregular shapes, thin structures, specularities, and frequent occlusions. While SAM models support point, box, and mask prompts, points are often too sparse and boxes too coarse to localize such challenging targets. We present SCISSR, a scribble-promptable framework for interactive surgical scene segmentation. It introduces a lightweight Scribble Encoder that converts freehand scribbles into dense prompt embeddings compatible with the mask decoder, enabling iterative refinement for a target object by drawing corrective strokes on error regions. Because all added modules (the Scribble Encoder, Spatial Gated Fusion, and LoRA adapters) interact with the backbone only through its standard embedding interfaces, the framework is not tied to a single model: we build on SAM 2 in this work, yet the same components transfer to other prompt-driven segmentation architectures such as SAM 3 without structural modification. To preserve pre-trained capabilities, we train only these lightweight additions while keeping the remaining backbone frozen. Experiments on EndoVis 2018 demonstrate strong in-domain performance, while evaluation on the out-of-distribution CholecSeg8k further confirms robustness across surgical domains. SCISSR achieves 95.41% Dice on EndoVis 2018 with five interaction rounds and 96.30% Dice on CholecSeg8k with three interaction rounds, outperforming iterative point prompting on both benchmarks.

Keywords: surgical segmentation, scribble prompting, interactive refinement, LoRA adapters, parameter-efficient fine-tuning, SAM models, medical image analysis, AI for healthcare
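
The SCISSR results above are reported as Dice scores (95.41% and 96.30%). For readers unfamiliar with the metric, the sketch below computes the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), for two binary masks. The masks are made-up toy data, `dice_coefficient` is an illustrative helper rather than the authors' evaluation code, and returning 1.0 for two empty masks is a convention assumed here.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as nested lists."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 3x3 masks: the prediction misses one of the four target pixels.
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]]

score = dice_coefficient(pred, target)
print(f"Dice = {score:.4f}")
```

Here the masks share 3 foreground pixels out of 3 + 4 in total, so the score is 6/7.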

229. ❌ 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Authors: Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18524v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

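Scoring rationale: The paper focuses on 3D-aware video generation and customization, which belongs to computer vision and generative modeling rather than to core innovation in large language models (LLMs) or deep learning principles. Keyword relevance is as follows: (1) 'Pre-training' and 'Post-training' each receive 5 points because the paper mentions single-view pre-training and fine-tuning (for example, the multi-view joint optimization of 3Dapter), although this is not LLM pre-training or fine-tuning; (2) all other keywords receive 0 points because the paper does not address LLMs, MoE, Scaling Laws, alignment, reasoning, agents, compression, AI for Science, or related topics.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of spatial priors in existing 2D methods for 3D object customization. It proposes the 3DreamBooth framework, which decouples spatial geometry from temporal motion to achieve high-fidelity, view-consistent, 3D-aware video generation.

Abstract (Chinese translation)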

为定制化主体创建动态且视角一致的视频，在沉浸式VR/AR、虚拟制作和下一代电子商务等众多新兴应用领域具有迫切需求。然而，尽管在主体驱动视频生成方面进展迅速，现有方法主要将主体视为二维实体，侧重于通过单视角视觉特征或文本提示来传递身份。由于真实世界的主体本质上是三维的，将这些以二维为中心的方法应用于三维物体定制时，暴露出一个根本性局限：它们缺乏重建三维几何所需的全面空间先验知识。因此，在合成新视角时，它们不得不为不可见区域生成看似合理但任意的细节，而非保持真实的三维身份。由于多视角视频数据集的稀缺，实现真正的三维感知定制仍然具有挑战性。虽然可以尝试在有限的视频序列上对模型进行微调，但这通常会导致时间维度上的过拟合。为解决这些问题，我们引入了一个新颖的三维感知视频定制框架，包含3DreamBooth和3Dapter两个组件。3DreamBooth通过单帧优化范式，将空间几何与时间运动解耦。通过将更新限制在空间表征上，它有效地将鲁棒的三维先验知识融入模型，而无需进行耗时的基于视频的训练。为了增强细粒度纹理并加速收敛，我们整合了3Dapter——一个视觉条件模块。在单视角预训练之后，3Dapter通过非对称条件策略与主生成分支进行多视角联合优化。这种设计使得该模块能够充当动态选择性路由器，从一个极小的参考集中查询特定视角的几何提示。项目页面：https://ko-lani.github.io/3DreamBooth/

Abstract

Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/

Keywords: 3D-aware video customization, subject-driven video generation, 3D geometry reconstruction, multi-view optimization, temporal overfitting, visual conditioning module, view-consistent synthesis, spatial-temporal decoupling

230. ❌ Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Authors: Liwei Che, Zhiyu Xue, Yihao Quan, Benlin Liu, Zeru Shi, Michelle Hurst, Jacob Feldman, Ruixiang Tang, Ranjay Krishna, Vladimir Pavlovic Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18523v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

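Scoring rationale: The paper studies the counting mechanism and interpretability of large vision-language models (LVLMs). Its core contribution is two new interpretability methods (Visual Activation Patching and HeadLens) that reveal a counting circuit, together with a lightweight intervention strategy that improves model performance. It is highly relevant to 'Mechanistic Interpretability' (10 points), which is the paper's core methodology. It is relevant to 'Large Language Models' (8 points) because LVLMs are visual extensions of LLMs, and to 'Chain of Thought' and 'System 2 Thinking' (8 points each) because counting involves multi-step reasoning and deliberate thinking. It is relevant to 'Post-training' (8 points) and 'PEFT' (5 points) because it uses a fine-tuning intervention, and weakly relevant to 'Pre-training' (5 points) because it builds on pretrained models. Other keywords, such as MoE, SLMs, RAG, and quantization, are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

By introducing new interpretability methods, the paper reveals the counting-circuit mechanism in large vision-language models. Its lightweight intervention strategy improves counting accuracy and general visual reasoning: for Qwen2.5-VL, an average of +8.36% on out-of-distribution counting benchmarks and +1.54% on complex general visual reasoning tasks.

Abstract (Chinese translation)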

计数可作为对大型视觉语言模型推理能力的一种简单而有效的测试；它迫使模型识别每个独立对象并将其累加。本研究通过结合受控合成与真实世界基准测试及机制分析，探究了大型视觉语言模型如何实现计数功能。结果表明，这些模型展现出类人的计数行为：在较小数量上表现精确，而对较大数量则呈现噪声估计。我们引入了两种新颖的可解释性方法——视觉激活修补和HeadLens，并运用它们揭示了一个结构化的“计数回路”，该回路在多种视觉推理任务中广泛共享。基于这些发现，我们提出一种轻量级干预策略：利用简单且大量可得的合成图像，专门针对计数任务对任意预训练大型视觉语言模型进行微调。尽管微调范围有限，该干预措施不仅提升了在分布内合成数据上的计数准确率，还使Qwen2.5-VL模型在分布外计数基准测试中平均提升+8.36%，在复杂通用视觉推理任务上平均获得+1.54%的性能增益。这些发现凸显了计数在视觉推理中的核心影响力，并为通过针对性增强计数机制来提升整体视觉推理能力提供了潜在路径。

Abstract

Counting serves as a simple but powerful test of a Large Vision-Language Model’s (LVLM’s) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured “counting circuit” that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.

Keywords: Large Vision-Language Models, Mechanistic Interpretability, Counting Circuit, Visual Reasoning, Fine-tuning, Visual Activation Patching, HeadLens, Intervention Strategy
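
The abstract above names Visual Activation Patching without describing it. As background only, the sketch below shows generic activation patching on a toy two-layer network: run a clean and a corrupted input, copy one hidden activation from the clean run into the corrupted run, and measure how much of the clean output is restored. The network, its weights (`W1`, `W2`), and the helper names are all hypothetical; this is not the paper's method or HeadLens.

```python
# Hypothetical toy network: 2 inputs -> 3 ReLU hidden units -> 1 output.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [3.0, 0.0, 1.0]


def hidden(x):
    """Hidden activations: ReLU of each weighted sum of the inputs."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def readout(h):
    """Scalar output: weighted sum of the hidden activations."""
    return sum(w * hi for w, hi in zip(W2, h))


def patching_effects(clean_x, corrupt_x):
    """Fraction of the clean-vs-corrupt output gap restored by patching each unit."""
    clean_h, corrupt_h = hidden(clean_x), hidden(corrupt_x)
    clean_out, corrupt_out = readout(clean_h), readout(corrupt_h)
    gap = clean_out - corrupt_out
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same output")

    effects = []
    for i in range(len(corrupt_h)):
        patched = list(corrupt_h)
        patched[i] = clean_h[i]  # overwrite one activation with its clean value
        effects.append((readout(patched) - corrupt_out) / gap)
    return effects


effects = patching_effects(clean_x=[2.0, 1.0], corrupt_x=[0.0, 1.0])
print(effects)
```

Units whose patching restores most of the gap are the ones the output depends on for this input pair. In this toy case unit 0 carries most of the effect, and unit 1 carries none because its activation is identical in both runs.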

231. ❌ CAFlow: Adaptive-Depth Single-Step Flow Matching for Efficient Histopathology Super-Resolution

Authors: Elad Yoshai, Ariel D. Yoshai, Natan T. Shaked Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18513v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

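Scoring rationale: CAFlow targets super-resolution in digital pathology using an adaptive-depth single-step flow-matching framework, which belongs to computer vision and medical image processing. Most of the keywords concern large language models (LLMs), and the paper involves no LLM techniques, training methods, inference optimization, or agent systems. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': digital pathology is adjacent to bioinformatics, but the paper does not directly discuss bioinformatics or cheminformatics methods and only applies AI to science (medical imaging), so it receives 5 points (moderate relevance). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

CAFlow proposes an adaptive-depth single-step flow-matching framework for efficient super-resolution in digital pathology. Its adaptive routing saves 33% of compute at a cost of only 0.12 dB, and the method is validated on multi-organ histopathology data.

Abstract (Chinese translation)

在数字病理学中，全切片图像通常具有超过十亿像素的分辨率，这使得计算密集型的生成式超分辨率方法难以在实际部署中常规应用。我们提出了CAFlow，一种自适应深度的单步流匹配框架，该框架将每个图像块路由至能保持重建质量的最浅网络出口。CAFlow在像素重排重组后的空间中进行流匹配，将空间计算量减少16倍，同时支持直接推理。我们发现，将一半的训练数据专门用于精确的t=0样本对于单步生成质量至关重要（若不如此，性能将下降1.5 dB）。其主干网络FlowResNet（190万参数）在四个早期出口中混合了卷积和窗口自注意力模块，计算量范围从3.1到13.3 GFLOPs。一个轻量级的出口分类器（约6K参数）能以仅0.12 dB的性能代价实现33%的计算节省。在多器官组织病理学图像的4倍超分辨率任务中，自适应路由实现了31.72 dB的峰值信噪比，而全深度推理为31.84 dB；同时，最浅出口的性能比双三次插值高出+1.9 dB，且计算量比SwinIR-light模型少2.8倍。该方法能泛化到未参与训练的结肠组织图像，质量损失极小（-0.02 dB）；在8倍放大任务中，其性能优于所有计算量相当的基线模型，并与参数量大得多的SwinIR-Medium模型保持竞争力。下游的细胞核分割任务证实了临床相关结构得以保留。该模型在单块GPU上训练时间不足5小时，且自适应路由能将全切片图像的推理时间从数分钟缩短至数秒。

Abstract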

In digital pathology, whole-slide images routinely exceed gigapixel resolution, making computationally intensive generative super-resolution (SR) impractical for routine deployment. We introduce CAFlow, an adaptive-depth single-step flow-matching framework that routes each image tile to the shallowest network exit that preserves reconstruction quality. CAFlow performs flow matching in pixel-unshuffled rearranged space, reducing spatial computation by 16x while enabling direct inference. We show that dedicating half of training to exact t=0 samples is essential for single-step quality (-1.5 dB without it). The backbone, FlowResNet (1.90M parameters), mixes convolution and window self-attention blocks across four early exits spanning 3.1 to 13.3 GFLOPs. A lightweight exit classifier (~6K parameters) achieves 33% compute savings at only 0.12 dB cost. On multi-organ histopathology x4 SR, adaptive routing achieves 31.72 dB PSNR versus 31.84 dB at full depth, while the shallowest exit exceeds bicubic by +1.9 dB at 2.8x less compute than SwinIR-light. The method generalizes to held-out colon tissue with minimal quality loss (-0.02 dB), and at x8 upscaling it outperforms all comparable-compute baselines while remaining competitive with the much larger SwinIR-Medium model. Downstream nuclei segmentation confirms preservation of clinically relevant structure. The model trains in under 5 hours on a single GPU, and adaptive routing can reduce whole-slide inference from minutes to seconds.

Keywords: flow matching, super-resolution, digital pathology, adaptive-depth, computational efficiency, histopathology, early exits, inference acceleration
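
CAFlow's abstract attributes a 16x reduction in spatial computation to running flow matching in pixel-unshuffled space. A minimal sketch of that rearrangement is below. It assumes a downscale factor of 4, which is inferred from the 16x figure because the abstract does not state the factor, and it uses a single-channel image stored as nested lists. `pixel_unshuffle` is an illustrative helper, not the authors' code.

```python
def pixel_unshuffle(image, r):
    """Rearrange an H x W image into r*r channels of size (H/r) x (W/r).

    Channel (dy, dx) collects the pixels at offset (dy, dx) inside each r x r block,
    so no pixel is discarded; spatial positions are traded for channels.
    """
    h, w = len(image), len(image[0])
    if h % r or w % r:
        raise ValueError("image height and width must be divisible by r")

    channels = []
    for dy in range(r):
        for dx in range(r):
            channels.append(
                [[image[y * r + dy][x * r + dx] for x in range(w // r)]
                 for y in range(h // r)]
            )
    return channels


# Toy 8x8 image whose pixel value encodes its position (row * 8 + column).
image = [[row * 8 + col for col in range(8)] for row in range(8)]
channels = pixel_unshuffle(image, r=4)  # r=4 is an assumption, see the lead-in

positions_before = len(image) * len(image[0])
positions_after = len(channels[0]) * len(channels[0][0])
spatial_reduction = positions_before // positions_after

print(len(channels), spatial_reduction)
```

With a factor of 4, each 4x4 block collapses to one spatial position with 16 channels, which is where a 16x drop in spatial positions would come from. The reported quality cost of adaptive routing can also be checked directly from the abstract's figures: 31.84 dB − 31.72 dB = 0.12 dB.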

232. ❌ OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting

Authors: Hongjia Zhai, Qi Zhang, Xiaokun Pan, Xiyu Zhang, Yitong Dong, Huaqi Zhang, Dan Xu, Guofeng Zhang Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18510v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

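Scoring rationale: OnlinePG targets computer vision and robotics. It studies an online open-vocabulary panoptic mapping system that uses 3D Gaussian Splatting for geometric reconstruction and open-vocabulary perception. The scoring keywords all concern large language models, deep learning principles, AI for Science, and related topics, none of which the paper addresses. Its core topics are 3D scene understanding, online mapping, instance segmentation, and robotic perception, which have no direct or indirect connection to any keyword in the list.

!!! tip deepseek-chat TL;DR

The paper proposes OnlinePG, a system that addresses the fact that existing methods are mostly offline or lack instance-level understanding. By integrating 3D Gaussian Splatting, it performs online open-vocabulary panoptic mapping, outperforming existing online methods on widely used datasets while maintaining real-time efficiency.

Abstract (Chinese translation)

开放词汇场景理解与在线全景建图对于具身应用感知和交互环境至关重要。然而，现有方法多为离线系统或缺乏实例级理解，限制了其在现实世界机器人任务中的适用性。本文提出OnlinePG，一种新颖有效的系统，它在在线环境下利用3D高斯泼溅（3D Gaussian Splatting）技术，将几何重建与开放词汇感知相融合。技术上，为实现在线全景建图，我们采用了一种高效的局部到全局范式，并结合滑动窗口机制。为构建局部一致性地图，我们设计了一个联合利用几何与语义线索的3D片段聚类图，将滑动窗口内不一致的片段融合为完整实例。随后，为更新全局地图，我们为局部3D高斯地图构建了具有空间属性的显式网格，并通过鲁棒的双向二分3D高斯实例匹配将其融合至全局地图中。最后，我们利用3D空间属性网格内融合的视觉语言模型（VLM）特征来实现开放词汇场景理解。在广泛使用的数据集上进行的大量实验表明，我们的方法在在线方法中取得了更优的性能，同时保持了实时效率。

Abstract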

Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within sliding window into complete instances. Subsequently, to update the global map, we construct explicit grids with spatial attributes for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance among online approaches, while maintaining real-time efficiency.

Keywords: Online Panoptic Mapping, 3D Gaussian Splatting, Open-vocabulary Scene Understanding, Embodied Applications, Instance-level Understanding, Real-time Efficiency, Robotic Tasks, Geometric Reconstruction

233. ❌ Foundations and Architectures of Artificial Intelligence for Motor Insurance

Authors: Teerapong Panboonyuen Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18508v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

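Scoring rationale: The paper focuses on applying AI to motor insurance, in particular domain adaptation built on transformer architectures. It is highly relevant to 'Domain Adaptation' (8 points) because the paper explicitly mentions 'domain-adapted transformer architectures'. It has some connection to 'Large Language Models' (5 points) because the transformer is the underlying architecture of LLMs, although the paper does not explicitly mention LLMs. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, and RAG do not appear in the abstract and are therefore scored 0. The paper is applied-AI work but does not involve bioinformatics or cheminformatics, so 'AI for Science' is scored 0.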

!!! tip deepseek-chat TL;DR

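The paper proposes a vertically integrated AI paradigm that uses domain-adapted transformer architectures to automate vehicle damage analysis, claims evaluation, and underwriting workflows end to end in motor insurance.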

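Abstract translation

Drawing on large-scale real-world deployment experience, this handbook systematically sets out the theoretical foundations and architectures of artificial intelligence for motor insurance. It formalizes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack for vehicle risk assessment and claims processing. At its core, it develops domain-adapted Transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows. These components are combined into a scalable pipeline designed around the practical constraints observed in Thailand's nationwide motor insurance systems. Beyond model design, the handbook emphasizes the co-evolution of learning algorithms and MLOps practices, establishing a principled framework for turning modern AI into reliable, production-grade systems in high-stakes industrial environments.

Abstract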

This handbook presents a systematic treatment of the foundations and architectures of artificial intelligence for motor insurance, grounded in large-scale real-world deployment. It formalizes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack for automotive risk assessment and claims processing. At its core, the handbook develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows. These components are composed into a scalable pipeline operating under practical constraints observed in nationwide motor insurance systems in Thailand. Beyond model design, the handbook emphasizes the co-evolution of learning algorithms and MLOps practices, establishing a principled framework for translating modern artificial intelligence into reliable, production-grade systems in high-stakes industrial environments.

Keywords: artificial intelligence, motor insurance, transformer architectures, domain adaptation, multimodal reasoning, vehicle damage analysis, claims processing, MLOps

234. ❌ Robustness, Cost, and Attack-Surface Concentration in Phishing Detection

Authors: Julian Allagan, Mohamed Elbakary, Zohreh Safari, Weizheng Gao, Gabrielle Morgan, Essence Morgan, Vladimir Deriglazov Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19204v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

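Scoring rationale: The paper studies adversarial robustness in phishing website detection using traditional machine learning models (logistic regression, random forests, gradient boosted trees, XGBoost) and feature engineering. It does not involve large language models, deep learning, or any technique among the scoring keywords. Its focus on feature economics, a cost-aware evasion framework, and model robustness analysis is entirely unrelated to the keyword topics of large-model techniques, training methods, inference optimization, alignment, and AI-for-science applications.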

!!! tip deepseek-chat TL;DR

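The paper studies the robustness of phishing website detectors under adversarial feature manipulation, finding that robustness converges across machine learning models under a cost-aware evasion framework and is determined mainly by feature economics rather than model complexity.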

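Abstract translation

Phishing website detectors built on engineered website features achieve near-perfect accuracy under i.i.d. evaluation, but the security of a real deployment depends on robustness to post-deployment feature manipulation. We study this gap through a cost-aware evasion framework that models discrete, monotone feature edits under an explicit attacker budget. We introduce three diagnostics: minimal evasion cost (MEC), the evasion survival rate $S(B)$ under budget $B$, and the robustness concentration index (RCI).

On the UCI Phishing Websites benchmark (11,055 instances, 30 ternary features), logistic regression, random forests, gradient boosted trees, and XGBoost all reach $\mathrm{AUC}\ge 0.979$ under static evaluation. Under budget-constrained, sanitization-style evasion, robustness converges across model architectures: with the full feature set the median MEC is 2, and more than 80% of successful minimal-cost evasions concentrate on three low-cost surface features. Feature restriction improves robustness only when it removes all dominant low-cost feature transitions. Under strict cost schedules, infrastructure-leaning feature sets leave ensemble models with 17-19% of samples that cannot be evaded, while the median MEC among evadable instances is unchanged. We formalize this convergence: if a positive fraction of correctly detected phishing instances can be evaded through a single feature transition of cost $c_{\min}$, then no classifier can raise the corresponding MEC quantile above $c_{\min}$ without modifying the feature representation or the cost model. Adversarial robustness in phishing detection is governed by feature economics rather than model complexity.

Abstract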

Phishing detectors built on engineered website features attain near-perfect accuracy under i.i.d. evaluation, yet deployment security depends on robustness to post-deployment feature manipulation. We study this gap through a cost-aware evasion framework that models discrete, monotone feature edits under explicit attacker budgets. Three diagnostics are introduced: minimal evasion cost (MEC), the evasion survival rate $S(B)$, and the robustness concentration index (RCI). On the UCI Phishing Websites benchmark (11,055 instances, 30 ternary features), Logistic Regression, Random Forests, Gradient Boosted Trees, and XGBoost all achieve $\mathrm{AUC}\ge 0.979$ under static evaluation. Under budgeted sanitization-style evasion, robustness converges across architectures: the median MEC equals 2 with full features, and over 80% of successful minimal-cost evasions concentrate on three low-cost surface features. Feature restriction improves robustness only when it removes all dominant low-cost transitions. Under strict cost schedules, infrastructure-leaning feature sets exhibit 17-19% infeasible mass for ensemble models, while the median MEC among evadable instances remains unchanged. We formalize this convergence: if a positive fraction of correctly detected phishing instances admit evasion through a single feature transition of minimal cost $c_{\min}$, no classifier can raise the corresponding MEC quantile above $c_{\min}$ without modifying the feature representation or cost model. Adversarial robustness in phishing detection is governed by feature economics rather than model complexity.

Keywords: phishing detection, adversarial robustness, cost-aware evasion, feature manipulation, minimal evasion cost, robustness concentration, machine learning models, feature economics
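
To illustrate the minimal evasion cost (MEC) diagnostic named in this entry, here is a minimal brute-force sketch. It is not the authors' implementation: the linear detector, the toy convention that a positive score means "phishing", the `edits` table of allowed single-feature changes with their costs, and every number in the example are assumptions made up for this illustration.

```python
from itertools import combinations


def minimal_evasion_cost(x, weights, bias, edits, budget):
    """Return the cheapest total edit cost that flips a linear detector, or None.

    x       : list of ternary feature values in {-1, 0, 1}
    weights : detector weights; a score > 0 means "phishing" (toy convention)
    edits   : {feature_index: (new_value, cost)} allowed single-feature edits
    budget  : maximum total cost the attacker may spend
    """
    def is_phishing(features):
        return sum(w * f for w, f in zip(weights, features)) + bias > 0

    if not is_phishing(x):
        return 0  # already classified as benign, nothing to evade
    best = None
    indices = list(edits)
    # Enumerate every subset of allowed edits (fine for a handful of features).
    for k in range(1, len(indices) + 1):
        for subset in combinations(indices, k):
            cost = sum(edits[i][1] for i in subset)
            if cost > budget or (best is not None and cost >= best):
                continue  # over budget, or no cheaper than the best found
            candidate = list(x)
            for i in subset:
                candidate[i] = edits[i][0]
            if not is_phishing(candidate):
                best = cost
    return best


x = [1, 1, 1]
weights = [1.0, 1.0, 2.0]
edits = {0: (-1, 1), 1: (-1, 1), 2: (-1, 3)}
print(minimal_evasion_cost(x, weights, -1.0, edits, budget=5))  # 2
print(minimal_evasion_cost(x, weights, -1.0, edits, budget=1))  # None
```

With these toy values the cheapest evasion edits the two cost-1 features together (total cost 2), which beats the single cost-3 edit. With a budget of 1 no evasion exists and the function returns `None`.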

235. ❌ The Exponentially Weighted Signature

Authors: Alexandre Bloch, Samuel N. Cohen, Terry Lyons, Joël Mouterde, Benjamin Walker Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19198v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

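Scoring rationale: The paper studies path signature theory in mathematics and proposes the Exponentially Weighted Signature (EWS), which improves the representation of multidimensional paths so that the contextual relevance of historical information is handled better. Its content belongs entirely to mathematics and signal processing, drawing on tools such as differential equations, tensor algebra, and Fourier transforms, with application to SDE regression tasks. All scoring keywords relate directly to large models, deep learning, AI applications, or associated techniques (training, alignment, reasoning, agents, and so on), none of which the paper addresses, so every keyword has relevance 0.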

!!! tip deepseek-chat TL;DR

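To address the inability of the traditional path signature to distinguish the importance of historical information, the paper proposes the Exponentially Weighted Signature (EWS), which uses bounded linear operators to obtain cross-channel coupling and richer memory dynamics, and empirically demonstrates on SDE regression tasks that it is more expressive than the classical signature and the exponentially fading memory signature.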

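Abstract translation

The signature is a canonical representation of a multidimensional path over an interval. However, it treats all historical information equally and has no built-in mechanism for contextualizing the relevance of past information. To address this, we introduce the Exponentially Weighted Signature (EWS), which generalizes the Exponentially Fading Memory (EFM) signature from diagonal linear operators to general bounded linear operators. These operators enable cross-channel coupling at the level of temporal weighting and support richer memory dynamics, including oscillatory, growth, and regime-dependent behavior, while retaining the algebraic strengths of the classical signature. We show that the EWS is the unique solution of a linear controlled differential equation on the tensor algebra, and that it generalizes both state-space models and the Laplace and Fourier transforms of the path. The group-like structure of the EWS supports efficient computation and makes the framework suitable for gradient-based learning, with the full semigroup action parametrized by, and learned through, its generator. Using this framework, we empirically demonstrate the expressivity gap between the EWS and both the classical signature and the EFM on two regression tasks based on stochastic differential equations (SDEs).

Abstract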

The signature is a canonical representation of a multidimensional path over an interval. However, it treats all historical information uniformly, offering no intrinsic mechanism for contextualising the relevance of the past. To address this, we introduce the Exponentially Weighted Signature (EWS), generalising the Exponentially Fading Memory (EFM) signature from diagonal to general bounded linear operators. These operators enable cross-channel coupling at the level of temporal weighting together with richer memory dynamics including oscillatory, growth, and regime-dependent behaviour, while preserving the algebraic strengths of the classical signature. We show that the EWS is the unique solution to a linear controlled differential equation on the tensor algebra, and that it generalises both state-space models and the Laplace and Fourier transforms of the path. The group-like structure of the EWS enables efficient computation and makes the framework amenable to gradient-based learning, with the full semigroup action parametrised by and learned through its generator. We use this framework to empirically demonstrate the expressivity gap between the EWS and both the signature and EFM on two SDE-based regression tasks.

Keywords: Exponentially Weighted Signature, path signature, controlled differential equation, tensor algebra, memory dynamics, SDE regression, gradient-based learning, group-like structure
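
The idea of weighting a path's history exponentially can be seen in the simplest possible case: a scalar path and only the first signature level. The sketch below is not the EWS construction from the paper. It is a hypothetical helper, `fading_memory_feature`, that uses a crude discretisation in which each increment is applied at the end of its time step and everything accumulated earlier is discounted by `exp(-decay * dt)`.

```python
import math


def fading_memory_feature(path, times, decay):
    """Exponentially discounted sum of the increments of a scalar path.

    Approximates y(T) = integral of exp(-decay * (T - s)) dX_s with the
    recursion y <- exp(-decay * dt) * y + dX, one step per sample.
    """
    y = 0.0
    for k in range(1, len(path)):
        dt = times[k] - times[k - 1]
        dx = path[k] - path[k - 1]
        y = math.exp(-decay * dt) * y + dx  # discount the past, add the new increment
    return y


path = [0.0, 1.0, 1.0]
times = [0.0, 1.0, 2.0]
print(fading_memory_feature(path, times, decay=0.0))          # total increment
print(fading_memory_feature(path, times, decay=math.log(2)))  # older jump is halved
```

With `decay=0` the recursion returns the total increment of the path, which is the first level of the ordinary signature. With a positive decay the jump that happened one time unit before the end contributes only about half as much. The recursion has the same shape as a linear state-space update, which fits the abstract's remark that the EWS generalises state-space models.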

236. ❌ Improving RCT-Based Treatment Effect Estimation Under Covariate Mismatch via Calibrated Alignment

Authors: Amir Asiaee, Samhita Pal Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19186v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

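Scoring rationale: The paper studies a statistical method that supplements randomized controlled trials (RCTs) with observational data to estimate conditional average treatment effects (CATE). Its core concern is covariate mismatch, for which it proposes CALM, an embedding alignment and calibration framework. The work belongs to causal inference and statistical machine learning and is entirely unrelated to the vast majority of keywords (large-model principles, training, inference, alignment, application paradigms, and so on). The only marginally relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the study can be viewed as an application of AI to science (specifically the analysis of medical and clinical trial data). However, the paper emphasizes traditional statistical machine learning rather than AI or large models, so it receives 5 points (some connection).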

!!! tip deepseek-chat TL;DR

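To address covariate mismatch between randomized controlled trials (RCTs) and observational studies (OS), the paper proposes CALM, a calibrated alignment embedding method that makes more effective use of observational data to supplement RCTs when estimating conditional average treatment effects (CATE), and supports its advantage over traditional imputation with theoretical analysis and extensive simulations.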

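Abstract translation

Randomized controlled trials (RCTs) are the gold standard for estimating heterogeneous treatment effects, but they often lack the statistical power to detect effect heterogeneity. Large observational studies (OS) can supplement RCTs for conditional average treatment effect (CATE) estimation, but the key obstacle is covariate mismatch: the two data sources measure different covariates that only partially overlap. We propose CALM (Calibrated ALignment under covariate Mismatch), which bypasses imputation by learning embeddings that map each source's features into a common representation space. The outcome models from the observational study are transferred to the RCT embedding space and calibrated with trial data, while preserving the causal identification that randomization provides. The finite-sample risk bound decomposes into alignment error, outcome-model complexity, and calibration complexity terms, which identifies the conditions under which embedding alignment outperforms imputation. In the calibration-based linear variant the framework protects against negative transfer, whereas the neural variant can be fragile under severe distribution shift. Under sparse linear models, the embedding approach strictly generalizes imputation. Simulations across 51 settings confirm that (1) calibration-based methods are equivalent in linear CATE scenarios, and (2) the neural embedding variant wins by a large margin in all 22 nonlinear-regime settings.

Abstract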

Randomized controlled trials (RCTs) are the gold standard for estimating heterogeneous treatment effects, yet they are often underpowered for detecting effect heterogeneity. Large observational studies (OS) can supplement RCTs for conditional average treatment effect (CATE) estimation, but a key barrier is covariate mismatch: the two sources measure different, only partially overlapping, covariates. We propose CALM (Calibrated ALignment under covariate Mismatch), which bypasses imputation by learning embeddings that map each source’s features into a common representation space. OS outcome models are transferred to the RCT embedding space and calibrated using trial data, preserving causal identification from randomization. Finite-sample risk bounds decompose into alignment error, outcome-model complexity, and calibration complexity terms, identifying when embedding alignment outperforms imputation. Under the calibration-based linear variant, the framework provides protection against negative transfer; the neural variant can be vulnerable under severe distributional shift. Under sparse linear models, the embedding approach strictly generalizes imputation. Simulations across 51 settings confirm that (i) calibration-based methods are equivalent for linear CATEs, and (ii) the neural embedding variant wins all 22 nonlinear-regime settings with large margins.

Keywords: Randomized Controlled Trials (RCT), Observational Studies (OS), Conditional Average Treatment Effect (CATE), Covariate Mismatch, Embedding Alignment, Calibration, Causal Inference, Transfer Learning

237. ❌ MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Authors: Masoumeh Shafieinejad, Xi He, Mahshid Alinoori, John Jewell, Sana Ayromlou, Wei Pang, Veronica Chatrath, Garui Sharma, Deval Pandya Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19185v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

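Scoring rationale: The paper evaluates the privacy of synthetic tabular data generated by diffusion models, in particular its resistance to membership inference attacks. Its core topics are diffusion models, synthetic data, privacy attacks, and tabular data, which are entirely unrelated to the scoring keywords (all focused on large language models and related techniques), so every keyword scores 0.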

!!! tip deepseek-chat TL;DR

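Through the development of novel black-box and white-box membership inference attacks targeting diffusion models, the paper evaluates the privacy gain of diffusion-generated synthetic tabular data and finds that its privacy resilience still needs deeper study.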

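Abstract translation

Synthetic data is often regarded as the ultimate solution for data anonymization and privacy-preserving data publishing. Synthetic data produced by generative models such as diffusion models is expected to preserve the statistical properties of the original dataset while withstanding privacy attacks. Recent advances in diffusion models have shown good results on many data types, but their privacy resilience, especially for tabular data, remains largely unexplored. The MIDST challenge aims to quantify the privacy gain of tabular data generated by diffusion models, with a focus on resistance to membership inference attacks. Given the heterogeneity and complexity of tabular data, membership inference attacks were tested against several target models, including diffusion models for single tables of mixed data types and diffusion models for multi-relational tables with interconnected constraints. As a key outcome, MIDST spurred the development of novel black-box and white-box membership inference attacks against these target diffusion models, enabling a comprehensive evaluation of their privacy efficacy. The MIDST GitHub repository is available at https://github.com/VectorInstitute/MIDST

Abstract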

Synthetic data is often perceived as a silver-bullet solution to data anonymization and privacy-preserving data publishing. Drawn from generative models like diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Recent developments of diffusion models have been effective on a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored. MIDST challenge sought a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, multiple target models were explored for MIAs, including diffusion models for single tables of mixed data types and multi-relational tables with interconnected constraints. MIDST inspired the development of novel black-box and white-box MIAs tailored to these target diffusion models as a key outcome, enabling a comprehensive evaluation of their privacy efficacy. The MIDST GitHub repository is available at https://github.com/VectorInstitute/MIDST

Keywords: synthetic data, diffusion models, privacy, membership inference attacks, tabular data, privacy resilience, black-box attacks, white-box attacks
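
A membership inference attack asks whether a given record was part of the generator's training set. One simple black-box heuristic, shown here only as a generic baseline and not as a method from the MIDST challenge, scores each candidate by its distance to the closest synthetic record, on the assumption that training members tend to sit closer to the synthetic data. The function names, the threshold, and the purely numeric toy records are assumptions; the challenge itself covers mixed-type and multi-relational tables.

```python
import math


def closest_distance(record, synthetic_rows):
    """Euclidean distance from a record to its nearest synthetic row."""
    return min(math.dist(record, row) for row in synthetic_rows)


def membership_scores(candidates, synthetic_rows):
    """Higher score = more likely a training member under this heuristic."""
    return [-closest_distance(c, synthetic_rows) for c in candidates]


def predict_members(candidates, synthetic_rows, threshold):
    """Flag candidates whose score reaches the (hypothetical) threshold."""
    return [s >= threshold for s in membership_scores(candidates, synthetic_rows)]


synthetic = [(0.0, 0.0), (1.0, 1.0)]
candidates = [(0.1, 0.0), (5.0, 5.0)]
print(predict_members(candidates, synthetic, threshold=-0.5))  # [True, False]
```

The first candidate lies 0.1 away from a synthetic row and is flagged as a likely member; the second is more than 5 units from every synthetic row and is not. In practice the threshold would have to be calibrated on records with known membership.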

238. ❌ DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge

Authors: Yuegui Huang, Zhiyuan Fang, Weiqi Luo, Ruoyu Wu, Wuhui Chen, Zibin Zheng Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19172v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

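Scoring rationale: DyMoE focuses on efficient inference for MoE models on edge devices; its core contribution is a dynamic mixed-precision quantization framework. The highly relevant keywords are: 1) 'Mixture of Experts OR MoE OR Sparse Models' (10 points): the paper directly studies inference optimization for MoE architectures; 2) 'Quantization OR Model Compression OR Low-bit Weights' (10 points): the core method is mixed-precision quantization; 3) 'Speculative Decoding OR Inference Acceleration' (8 points): inference is accelerated through dynamic scheduling and prefetching; 4) 'Large Language Models OR LLMs OR Foundation Models' (8 points): the paper targets inference optimization for large models, with MoE as a large-model architecture; 5) 'Small Language Models OR SLMs OR On-device AI' (8 points): the focus on edge deployment falls under on-device AI. Other keywords such as Scaling Laws, Pre-training, and Alignment are unrelated to the paper's inference-optimization theme and score 0.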

!!! tip deepseek-chat TL;DR

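To address the high memory footprint and I/O overhead of MoE inference on resource-constrained edge devices, the paper proposes DyMoE, a dynamic mixed-precision quantization framework that combines importance-aware quantization, depth-adaptive scheduling, and look-ahead prefetching, reducing TTFT by 3.44x-22.7x and speeding up TPOT by up to 14.58x on commercial edge hardware while preserving accuracy.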

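Abstract translation

Although Mixture-of-Experts (MoE) models are computationally efficient, the excessive memory footprint and I/O overhead inherent in multi-expert architectures pose severe challenges for real-time inference on resource-constrained edge platforms. Existing static methods are limited by a rigid latency-accuracy trade-off, whereas we observe that expert importance is highly skewed and depends on network depth. Based on these findings, we propose DyMoE, a dynamic mixed-precision quantization framework designed for high-performance edge inference. Exploiting the skewed distribution of expert importance and depth-dependent sensitivity, the framework introduces: (1) an importance-aware prioritization mechanism that dynamically quantizes experts at runtime; (2) a depth-adaptive scheduling strategy that protects semantic integrity in critical layers; and (3) look-ahead prefetching that overlaps I/O latency. Experiments on commercial edge hardware show that, compared with state-of-the-art offloading baselines, DyMoE reduces Time-to-First-Token (TTFT) by 3.44x to 22.7x and achieves up to a 14.58x speedup in Time-Per-Output-Token (TPOT), enabling accuracy-preserving real-time MoE inference on resource-constrained edge devices.

Abstract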

Despite the computational efficiency of MoE models, the excessive memory footprint and I/O overhead inherent in multi-expert architectures pose formidable challenges for real-time inference on resource-constrained edge platforms. While existing static methods struggle with a rigid latency-accuracy trade-off, we observe that expert importance is highly skewed and depth-dependent. Motivated by these insights, we propose DyMoE, a dynamic mixed-precision quantization framework designed for high-performance edge inference. Leveraging insights into expert importance skewness and depth-dependent sensitivity, DyMoE introduces: (1) importance-aware prioritization to dynamically quantize experts at runtime; (2) depth-adaptive scheduling to preserve semantic integrity in critical layers; and (3) look-ahead prefetching to overlap I/O stalls. Experimental results on commercial edge hardware show that DyMoE reduces Time-to-First-Token (TTFT) by 3.44x-22.7x and up to a 14.58x speedup in Time-Per-Output-Token (TPOT) compared to state-of-the-art offloading baselines, enabling real-time, accuracy-preserving MoE inference on resource-constrained edge devices.

Keywords: Mixture of Experts, MoE, Edge Inference, Mixed-Precision Quantization, Dynamic Scheduling, Inference Acceleration, Resource-Constrained Devices, Time-to-First-Token
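
The importance-aware prioritization idea can be sketched as a ranking step: order experts by an importance score and keep only the top share at higher precision. The abstract does not specify the importance metric, the bit widths, or the selection rule, so `assign_bit_widths`, `keep_fraction`, the 8-bit and 4-bit levels, and the scores below are all hypothetical.

```python
def assign_bit_widths(importance, high_bits=8, low_bits=4, keep_fraction=0.25):
    """Give the most important experts more bits; the rest get fewer.

    importance    : {expert_id: non-negative importance score}
    keep_fraction : share of experts kept at high precision (hypothetical knob)
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    n_high = max(1, round(keep_fraction * len(ranked)))  # always keep at least one
    return {
        expert: (high_bits if rank < n_high else low_bits)
        for rank, expert in enumerate(ranked)
    }


# A skewed toy importance profile: one expert dominates.
scores = {"e0": 0.70, "e1": 0.15, "e2": 0.10, "e3": 0.05}
print(assign_bit_widths(scores))  # {'e0': 8, 'e1': 4, 'e2': 4, 'e3': 4}
```

When importance is as skewed as in this toy profile, most of the memory saving comes from the many low-importance experts, while the few dominant ones keep their precision. A depth-dependent variant could pass a different `keep_fraction` per layer.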

239. ❌ Rigorous Error Certification for Neural PDE Solvers: From Empirical Residuals to Solution Guarantees

Authors: Amartya Mukherjee, Maxwell Fitzsimmons, David C. Del Rey Fernández, Jun Liu Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19165v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

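Scoring rationale: The paper studies error certification and generalization bounds for physics-informed neural networks (PINNs) that solve partial differential equations (PDEs). It belongs to AI for Science and so has some connection to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). It does not involve large language models (LLMs), innovations in deep learning techniques (such as MoE, quantization, or inference acceleration), training and alignment methods (such as RLHF or SFT), agent systems, or other listed large-model techniques, so all other keywords score 0.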

!!! tip deepseek-chat TL;DR

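To address the lack of solution-space generalization error guarantees when physics-informed neural networks solve partial differential equations, the paper establishes generalization bounds linking residual control to solution-space error, proves that vanishing residual error within a compact subset of the solution space guarantees convergence to the true solution, and provides deterministic and probabilistic convergence results together with certified error bounds.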

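Abstract translation

Uncertainty quantification for partial differential equations is traditionally based on discretization theory, which controls solution error through mesh refinement. Physics-informed neural networks depart fundamentally from this paradigm: they approximate solutions by minimizing residual losses at collocation points, which introduces new sources of error from optimization, sampling, representational capacity, and overfitting. Generalization error in the solution space therefore remains an open problem.

Our main theoretical contribution is to establish generalization bounds that link residual control to solution-space error. We prove that when the neural approximations lie in a compact subset of the solution space, vanishing residual error guarantees convergence to the true solution. We derive deterministic and probabilistic convergence results and provide certified generalization bounds that translate residual, boundary, and initial errors into explicit solution-error guarantees.

Abstract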

Uncertainty quantification for partial differential equations is traditionally grounded in discretization theory, where solution error is controlled via mesh/grid refinement. Physics-informed neural networks fundamentally depart from this paradigm: they approximate solutions by minimizing residual losses at collocation points, introducing new sources of error arising from optimization, sampling, representation, and overfitting. As a result, the generalization error in the solution space remains an open problem. Our main theoretical contribution establishes generalization bounds that connect residual control to solution-space error. We prove that when neural approximations lie in a compact subset of the solution space, vanishing residual error guarantees convergence to the true solution. We derive deterministic and probabilistic convergence results and provide certified generalization bounds translating residual, boundary, and initial errors into explicit solution error guarantees.

Keywords: Physics-informed neural networks, Partial differential equations, Error certification, Generalization bounds, Residual control, Solution-space error, Neural PDE solvers, Uncertainty quantification
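
The residual loss that physics-informed neural networks minimize at collocation points can be sketched on a one-dimensional toy problem. Everything here is an assumption for illustration: the problem $-u'' = f$ on (0, 1) with zero boundary values, the closed-form candidate functions that stand in for a trained network, and the finite-difference approximation of $u''$ (a real PINN would typically use automatic differentiation).

```python
import math


def second_derivative(u, x, h=1e-3):
    """Central finite-difference approximation of u''(x)."""
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / (h * h)


def residual_loss(u, f, points):
    """Mean squared residual of -u'' = f at the given collocation points."""
    return sum((-second_derivative(u, x) - f(x)) ** 2 for x in points) / len(points)


def source_term(x):
    return math.pi ** 2 * math.sin(math.pi * x)


def exact_solution(x):
    return math.sin(math.pi * x)


def perturbed_solution(x):
    # Same boundary values as the exact solution, but the wrong interior shape.
    return math.sin(math.pi * x) + 0.1 * math.sin(3 * math.pi * x)


points = [(i + 0.5) / 20 for i in range(20)]  # interior collocation points
print(residual_loss(exact_solution, source_term, points))      # close to zero
print(residual_loss(perturbed_solution, source_term, points))  # clearly larger
```

Both candidates satisfy the boundary conditions, so only the interior residual separates them. The question the paper addresses is the converse direction: under what conditions a small residual like the first one certifies a small error in the solution itself.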

240. ❌ Fast and Effective Computation of Generalized Symmetric Matrix Factorization

Authors: Lei Yang, Han Wan, Min Zhang, Ling Liang Journal/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19147v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

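Scoring rationale: The paper studies computational methods for generalized symmetric matrix factorization, which belongs to numerical optimization and matrix computation and is entirely unrelated to the scoring keywords (all centered on large models, deep learning techniques, and their applications). It does not involve any large-model technique, training method, inference optimization, alignment, application, or related concept.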

!!! tip deepseek-chat TL;DR

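The paper proposes an average-type nonmonotone alternating updating method (A-NAUM) for generalized symmetric matrix factorization, proves its convergence, and demonstrates its efficiency numerically.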

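Abstract translation

This paper studies a nonconvex, nonsmooth, and non-Lipschitz generalized symmetric matrix factorization model that unifies a wide range of matrix factorization formulations arising in machine learning, image science, engineering, and related fields. We first establish two exactness properties. On the modeling side, we prove an exact penalty property: under suitable conditions, when the penalty parameter is sufficiently large but finite, the symmetry-inducing quadratic penalty enforces symmetry and thus exactly recovers the corresponding symmetric formulation. On the algorithmic side, we introduce an auxiliary-variable splitting formulation and establish an exact relaxation relationship that rigorously links the stationary points of the original objective function to those of a relaxed potential function. Building on these exactness properties, we propose an average-type nonmonotone alternating updating method (A-NAUM) based on the relaxed potential function. At each iteration, A-NAUM alternately updates the two factor blocks by (approximately) minimizing the potential function, while the auxiliary block is updated in closed form. To ensure convergence and improve practical performance, we further incorporate an average-type nonmonotone line search and show that it is well-defined under mild conditions. In addition, based on the Kurdyka-Łojasiewicz property and its associated exponent, we establish global convergence of the entire sequence to a stationary point and derive convergence rates. Finally, numerical experiments on real datasets confirm the effectiveness of A-NAUM.

Abstract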

In this paper, we study a nonconvex, nonsmooth, and non-Lipschitz generalized symmetric matrix factorization model that unifies a broad class of matrix factorization formulations arising in machine learning, image science, engineering, and related areas. We first establish two exactness properties. On the modeling side, we prove an exact penalty property showing that, under suitable conditions, the symmetry-inducing quadratic penalty enforces symmetry whenever the penalty parameter is sufficiently large but finite, thereby exactly recovering the associated symmetric formulation. On the algorithmic side, we introduce an auxiliary-variable splitting formulation and establish an exact relaxation relationship that rigorously links stationary points of the original objective function to those of a relaxed potential function. Building on these exactness properties, we propose an average-type nonmonotone alternating updating method (A-NAUM) based on the relaxed potential function. At each iteration, A-NAUM alternately updates the two factor blocks by (approximately) minimizing the potential function, while the auxiliary block is updated in closed form. To ensure the convergence and enhance practical performance, we further incorporate an average-type nonmonotone line search and show that it is well-defined under mild conditions. Moreover, based on the Kurdyka-Łojasiewicz property and its associated exponent, we establish global convergence of the entire sequence to a stationary point and derive convergence rate results. Finally, numerical experiments on real datasets demonstrate the efficiency of A-NAUM.

Keywords: generalized symmetric matrix factorization, nonconvex optimization, alternating updating method, exact penalty property, Kurdyka-Łojasiewicz property, convergence analysis, numerical experiments, machine learning applications

241. ❌ Enhancing Pretrained Model-based Continual Representation Learning via Guided Random Projection

Authors: Ruilin Li, Heming Zou, Xiufeng Yan, Zheming Liang, Jie Yang, Chenliang Li, Xue Yang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19145v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

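Scoring rationale: The paper focuses on continual representation learning built on pre-trained models and proposes an improved random projection layer method (SCL-MGSM) to address domain gaps. It is highly relevant only to the keyword 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points), because it concerns continual learning and domain adaptation with pre-trained models (PTMs). No other keyword applies: the paper does not discuss large language models, fine-tuning techniques, reasoning methods, agent systems, compression techniques, or AI-for-science applications, and instead addresses continual learning in general computer vision and representation learning.

!!! tip deepseek-chat TL;DR

The paper proposes SCL-MGSM, a method that uses a data-guided random projection layer to strengthen the representations of pre-trained models in continual learning. It addresses the limited expressivity and numerical instability caused by domain gaps and achieves superior performance on multiple class-incremental learning benchmarks.

Abstract translation (Chinese)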

基于随机投影层（Random Projection Layer, RPL）的持续表征学习新范式表明，在预训练模型（Pre-trained Model, PTM）基础上构建时能展现出优越性能。此类方法在PTM后插入随机初始化的RPL，以增强初始阶段的特征表征能力；随后，在持续学习阶段使用线性分类头进行解析更新。然而，当预训练表征与目标域之间存在显著领域差异时，随机初始化的RPL在大规模领域偏移下表达能力有限。虽然大幅增加RPL维度可提升表达能力，但也会导致特征矩阵病态化，从而破坏线性分类头递归解析更新的稳定性。为此，我们提出带有记忆保护监督机制的随机持续学习器（Stochastic Continual Learner with MemoryGuard Supervisory Mechanism, SCL-MGSM）。与随机初始化不同，MGSM通过一种基于数据指导的原则性机制构建投影层，逐步选择与目标对齐的随机基，使PTM表征适应下游任务。这有助于构建紧凑且表达能力强的RPL，同时提升解析更新的数值稳定性。在多个无示例类增量学习（Class Incremental Learning, CIL）基准上的大量实验表明，SCL-MGSM相比现有先进方法取得了更优的性能。

Abstract

Recent paradigms in Random Projection Layer (RPL)-based continual representation learning have demonstrated superior performance when building upon a pre-trained model (PTM). These methods insert a randomly initialized RPL after a PTM to enhance feature representation in the initial stage. Subsequently, a linear classification head is used for analytic updates in the continual learning stage. However, under severe domain gaps between pre-trained representations and target domains, a randomly initialized RPL exhibits limited expressivity under large domain shifts. While largely scaling up the RPL dimension can improve expressivity, it also induces an ill-conditioned feature matrix, thereby destabilizing the recursive analytic updates of the linear head. To this end, we propose the Stochastic Continual Learner with MemoryGuard Supervisory Mechanism (SCL-MGSM). Unlike random initialization, MGSM constructs the projection layer via a principled, data-guided mechanism that progressively selects target-aligned random bases to adapt the PTM representation to downstream tasks. This facilitates the construction of a compact yet expressive RPL while improving the numerical stability of analytic updates. Extensive experiments on multiple exemplar-free Class Incremental Learning (CIL) benchmarks demonstrate that SCL-MGSM achieves superior performance compared to state-of-the-art methods.

Keywords: Continual Representation Learning, Pre-trained Model, Random Projection Layer, Domain Adaptation, Class Incremental Learning, Analytic Updates, Feature Representation, Numerical Stability

242. ❌ SHAPCA: Consistent and Interpretable Explanations for Machine Learning Models on Spectroscopy Data

Authors: Mingxing Zhang, Nicola Rossberg, Simone Innocente, Katarzyna Komolibus, Rekha Gautam, Barry O’Sullivan, Luca Longo, Andrea Visentin Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19141v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

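Scoring rationale: The paper focuses on the interpretability of machine learning models for spectroscopy data and proposes the SHAPCA framework, which combines PCA and SHAP. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, and so on). It is highly relevant only to 'Mechanistic Interpretability OR Explainable AI' (its core topic, 10 points) and is also related to 'AI for Science OR Bioinformatics OR Cheminformatics' (applied to biomedical analysis, 8 points).

!!! tip deepseek-chat TL;DR

To address unstable model explanations caused by the high dimensionality and collinearity of spectroscopy data, the study proposes the SHAPCA framework, which combines PCA dimensionality reduction with SHAP explanations to provide consistent, interpretable feature-importance analysis in the original input space.

Abstract translation (Chinese)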

近年来，机器学习模型在化学和生物医学分析的光谱数据集中的应用日益广泛。为使其成功落地，尤其是在临床和安全关键场景中，从业者和研究人员必须能够理解并信任模型预测背后的推理逻辑。然而，光谱数据固有的高维度和强共线性对模型可解释性构成了根本性挑战。这些特性不仅使模型训练复杂化，还会削弱解释的稳定性和一致性，导致特征重要性在多次训练运行中出现波动。特征提取技术已被用于降低输入维度；但这些新特征阻碍了预测结果与原始信号之间的关联。本研究提出SHAPCA，一种可解释的机器学习流程，它结合了主成分分析（用于降维）和沙普利加性解释（用于事后解释），以在原始输入空间中提供解释，使从业者能够理解并将其与生物组分关联起来。该框架支持从全局和局部视角进行分析，既能揭示驱动模型整体行为的光谱波段，也能识别影响个体预测的实例特异性特征。数值分析结果表明，该框架得出的结果具有可解释性，且在不同运行间表现出更高的一致性。

Abstract

In recent years, machine learning models have been increasingly applied to spectroscopic datasets for chemical and biomedical analysis. For their successful adoption, particularly in clinical and safety-critical settings, professionals and researchers must be able to understand and trust the reasoning behind model predictions. However, the inherently high dimensionality and strong collinearity of spectroscopy data pose a fundamental challenge to model explainability. These properties not only complicate model training but also undermine the stability and consistency of explanations, leading to fluctuations in feature importance across repeated training runs. Feature extraction techniques have been used to reduce the input dimensionality; however, these new features hinder the connection between the prediction and the original signal. This study proposes SHAPCA, an explainable machine learning pipeline that combines Principal Component Analysis (for dimensionality reduction) and SHapley Additive exPlanations (for post hoc explanation) to provide explanations in the original input space, which a practitioner can interpret and link back to the biological components. The proposed framework enables analysis from both global and local perspectives, revealing the spectral bands that drive overall model behaviour as well as the instance-specific features that influence individual predictions. Numerical analysis demonstrated the interpretability of the results and greater consistency across different runs.

Keywords: Explainable AI, Machine Learning, Spectroscopy Data, SHAP, Principal Component Analysis, Model Interpretability, Feature Importance, Biomedical Analysis
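The abstract does not say how SHAPCA maps explanations from principal components back to the original wavelengths. The sketch below shows one way this can work in the simplest case, assuming a linear model fitted on centered PCA scores. In that case the Shapley value of component k is `w_k * z_k`, and the same prediction can be attributed to input feature j as `(x_j - mean_j) * sum_k loadings[j][k] * w_k`. The names `mean`, `loadings`, `weights`, and `x`, and all numbers, are hypothetical. This is an illustration of the idea, not the paper's procedure.

```python
def pc_attributions(x, mean, loadings, weights):
    """Attribution per principal component: w_k * z_k, with z = (x - mean) @ loadings."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    scores = [
        sum(centered[j] * loadings[j][k] for j in range(len(x)))
        for k in range(len(weights))
    ]
    return [w * z for w, z in zip(weights, scores)]


def input_attributions(x, mean, loadings, weights):
    """Attribution per input feature: (x_j - mean_j) * sum_k loadings[j][k] * w_k."""
    return [
        (xi - mi) * sum(l * w for l, w in zip(row, weights))
        for xi, mi, row in zip(x, mean, loadings)
    ]


# Hypothetical setup: 3 input wavelengths, 2 principal components.
mean = [1.0, 2.0, 3.0]                               # training mean of each input
loadings = [[2 / 3, 1 / 3], [2 / 3, -2 / 3], [1 / 3, 2 / 3]]  # loadings[j][k]
weights = [2.0, -1.0]                                # linear model on the PCA scores
x = [2.0, 1.0, 4.0]                                  # one spectrum to explain

pc = pc_attributions(x, mean, loadings, weights)
inp = input_attributions(x, mean, loadings, weights)
print("per component:", pc)
print("per input feature:", inp)
```

Both attribution vectors sum to the same value, the model output relative to the prediction at the training mean, so moving to the input space redistributes the explanation without changing its total.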

243. ❌ Hierarchical Latent Structure Learning through Online Inference

Authors: Ines Aitsahalia, Kiyohito Iigaya Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19139v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

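Scoring rationale: The paper proposes HOLMES, a computational framework for hierarchical latent structure learning through online inference, which belongs to Bayesian modeling and online learning within machine learning. It mainly covers hierarchical latent variable models, the nested Chinese Restaurant Process, sequential Monte Carlo inference, and online inference algorithms, and focuses on how learning systems discover hierarchical structure in sequential data. All scoring keywords concern specific techniques or application settings such as large language models, deep learning techniques, and AI applications in science, whereas this paper studies a general machine learning framework and a computational model of cognition, and does not involve large-model techniques, deep learning innovations, or AI applications in a specific scientific field. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The study proposes HOLMES, a computational framework for hierarchical latent structure learning through online inference. It addresses the limitations of existing models in balancing generalization with discrimination and in learning hierarchical structure online. In simulations it matched the predictive performance of flat models while learning more compact representations that support one-shot transfer to higher-level latent categories.

Abstract translation (Chinese)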

学习系统必须在经验泛化与任务相关细节的辨别之间取得平衡。因此，有效的学习需要能够同时支持这两者的表征。在线潜在原因模型支持增量推断，但假设了扁平划分；而分层贝叶斯模型虽能捕捉多层次结构，却通常需要离线推断。我们提出了多层次经验结构的分层在线学习模型（Hierarchical Online Learning of Multiscale Experience Structure, HOLMES），这是一个通过在线推断进行分层潜在结构学习的计算框架。HOLMES将嵌套中式餐馆过程的先验变体与序列蒙特卡洛推断相结合，以在无需对潜在结构进行显式监督的情况下，对分层潜在表征进行可处理的逐试次推断。在模拟实验中，HOLMES在预测性能上与扁平模型相当，同时学习了更紧凑的表征，这些表征支持向更高层次潜在类别的一次性迁移。在一个具有嵌套时间结构的上下文依赖任务中，HOLMES相较于扁平模型也提升了结果预测的准确性。这些结果为发现序列数据中的分层结构提供了一个可处理的计算框架。

Abstract

Learning systems must balance generalization across experiences with discrimination of task-relevant details. Effective learning therefore requires representations that support both. Online latent-cause models support incremental inference but assume flat partitions, whereas hierarchical Bayesian models capture multilevel structure but typically require offline inference. We introduce the Hierarchical Online Learning of Multiscale Experience Structure (HOLMES) model, a computational framework for hierarchical latent structure learning through online inference. HOLMES combines a variation on the nested Chinese Restaurant Process prior with sequential Monte Carlo inference to perform tractable trial-by-trial inference over hierarchical latent representations without explicit supervision over the latent structure. In simulations, HOLMES matched the predictive performance of flat models while learning more compact representations that supported one-shot transfer to higher-level latent categories. In a context-dependent task with nested temporal structure, HOLMES also improved outcome prediction relative to flat models. These results provide a tractable computational framework for discovering hierarchical structure in sequential data.

Keywords: hierarchical latent structure learning, online inference, nested Chinese Restaurant Process, sequential Monte Carlo inference, trial-by-trial inference, one-shot transfer, context-dependent task, sequential data
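To make the nested Chinese Restaurant Process prior concrete, here is a minimal sketch of sampling from the standard version. Each customer walks down a tree and, at every level, joins existing child k with probability `n_k / (n + alpha)` or opens a new child with probability `alpha / (n + alpha)`. HOLMES uses a variation on this prior together with sequential Monte Carlo inference, and neither of those is reproduced here. The function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def sample_ncrp_paths(num_customers, depth, alpha, seed=0):
    """Draw one root-to-leaf path per customer from a standard nested CRP."""
    rng = random.Random(seed)
    # Maps a node (tuple of child indices from the root) to {child index: visits}.
    child_counts = defaultdict(dict)
    paths = []
    for _customer in range(num_customers):
        node = ()
        for _level in range(depth):
            counts = child_counts[node]
            total = sum(counts.values())
            # Uniform draw on [0, total + alpha): the first `total` units of mass
            # belong to existing children, the last `alpha` units to a new child.
            threshold = rng.random() * (total + alpha)
            choice = len(counts)  # default: open a new child
            running = 0.0
            for child, count in counts.items():
                running += count
                if threshold < running:
                    choice = child
                    break
            counts[choice] = counts.get(choice, 0) + 1
            node = node + (choice,)
        paths.append(node)
    return paths


paths = sample_ncrp_paths(num_customers=20, depth=3, alpha=1.0)
print(len(set(paths)), "distinct leaf paths among", len(paths), "customers")
```

Customers who share a prefix of their path share the higher-level categories, which is the kind of multilevel structure that a flat partition cannot express.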

244. ❌ On Optimizing Multimodal Jailbreaks for Spoken Language Models

Authors: Aravind Krishnan, Karolina Stańczak, Dietrich Klakow Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19127v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

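Scoring rationale: The paper studies multimodal jailbreak attacks on Spoken Language Models (SLMs). It is highly relevant to 'Large Language Models' (10 points) because SLMs are built on LLM backbones. It was also rated 10 points for 'Small Language Models OR SLMs OR On-device AI'. In this paper, however, 'SLMs' abbreviates Spoken Language Models, not Small Language Models, so that match rests on the shared abbreviation rather than on model size. It is moderately related to 'Instruction Tuning OR Alignment OR Value Alignment' (5 points) because it involves safety alignment and jailbreak attacks. Other keywords such as MoE, Scaling Laws, and Pre-training are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper studies multimodal jailbreak attacks on Spoken Language Models and proposes JAMA, a joint optimization framework that achieves jailbreak success rates 1.5x to 10x higher than unimodal attacks on four state-of-the-art SLMs. It also finds that unimodal safety measures are insufficient to protect SLMs.

Abstract translation (Chinese)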

随着语音语言模型（SLM）融合语音与文本模态，它们继承了其大型语言模型（LLM）主干的安全漏洞，并扩展了攻击面。先前研究表明，SLM易受越狱攻击，即对抗性提示可诱导有害响应。然而，现有攻击大多仍为单模态，仅单独优化文本或音频。本研究探索基于梯度的多模态越狱攻击，提出JAMA（联合音频-文本多模态攻击）——一种结合文本贪婪坐标梯度（GCG）与音频投影梯度下降（PGD）的联合多模态优化框架，以同步扰动两种模态。在四种前沿SLM和四种音频类型上的评估表明，JAMA的越狱成功率较单模态攻击提升1.5倍至10倍。我们分析了该联合攻击的运行机制，并证明采用序列近似方法可使其速度提升4至6倍。我们的研究结果表明，单模态安全措施不足以构建鲁棒的SLM。代码与数据已公开于https://repos.lsv.uni-saarland.de/akrishnan/multimodal-jailbreak-slm。

Abstract

As Spoken Language Models (SLMs) integrate speech and text modalities, they inherit the safety vulnerabilities of their LLM backbone and an expanded attack surface. SLMs have been previously shown to be susceptible to jailbreaking, where adversarial prompts induce harmful responses. Yet existing attacks largely remain unimodal, optimizing either text or audio in isolation. We explore gradient-based multimodal jailbreaks by introducing JAMA (Joint Audio-text Multimodal Attack), a joint multimodal optimization framework combining Greedy Coordinate Gradient (GCG) for text and Projected Gradient Descent (PGD) for audio, to simultaneously perturb both modalities. Evaluations across four state-of-the-art SLMs and four audio types demonstrate that JAMA surpasses unimodal jailbreak rate by 1.5x to 10x. We analyze the operational dynamics of this joint attack and show that a sequential approximation method makes it 4x to 6x faster. Our findings suggest that unimodal safety is insufficient for robust SLMs. The code and data are available at https://repos.lsv.uni-saarland.de/akrishnan/multimodal-jailbreak-slm

Keywords: Spoken Language Models, multimodal jailbreaks, joint optimization, Greedy Coordinate Gradient, Projected Gradient Descent, safety vulnerabilities, adversarial prompts, audio-text perturbation

245. ❌ Position: Spectral GNNs Are Neither Spectral Nor Superior for Node Classification

Authors: Qin Jiang, Chengjia Wang, Michael Lones, Dongdong Chen, Wei Pang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19091v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

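Scoring rationale: The paper focuses on the theoretical analysis of graph neural networks (GNNs), specifically the theoretical foundations and effectiveness of spectral GNNs for node classification. Its core contribution is a critical analysis of the theoretical flaws of spectral GNNs, arguing that their practical effectiveness stems from their equivalence to message-passing neural networks (MPNNs) rather than from the spectral methods they claim. All given keywords relate directly to large language models (LLMs), deep learning techniques, or AI applications in science, whereas this paper is a theoretical analysis of graph neural networks (a specific type of deep learning model) and has no direct connection to keywords such as LLMs, MoE, Scaling Laws, pre-training/post-training, alignment, inference acceleration, or AI for Science. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper critically analyzes the theoretical foundations of spectral graph neural networks (Spectral GNNs) for node classification, arguing that they neither effectively capture graph spectral information nor reliably improve performance, and that their practical effectiveness stems from their equivalence to message-passing neural networks (MPNNs).

Abstract translation (Chinese)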

用于节点分类的谱图神经网络（Spectral GNNs）承诺在图上进行频域滤波，但其理论基础存在缺陷。近期研究表明，图拉普拉斯特征向量通常不具备真正傅里叶基的关键性质，但未能解释谱GNNs在实际应用中的成功。本文指出两个理论漏洞：（1）常用的“图傅里叶基”并非图信号的经典傅里叶基；（2）通过范德蒙德系统，（n-1）次多项式（n为节点数）可精确插值任意谱响应，因此常见的“多项式逼近”论述缺乏理论依据。图卷积网络（GCN）的有效性通常被归因于谱低通滤波，但我们证明其低通与高通行为完全源于消息传递动力学，而非基于图傅里叶变换的谱表述。随后我们分析了两个代表性的有向谱模型——MagNet与HoloNet。其报告的有效性并非来自谱特性：而是源于将其退化为强大消息传递神经网络（MPNNs）的实现问题。当严格按照宣称的谱算法实现时，模型性能显著下降。本立场论文指出：对于节点分类任务，谱GNNs既未实质捕捉图谱信息，也未稳定提升性能；其竞争优势更应解释为与MPNNs的等效性，有时还受益于与设计初衷不一致的实现方式。

Abstract

Spectral Graph Neural Networks (Spectral GNNs) for node classification promise frequency-domain filtering on graphs, yet rest on flawed foundations. Recent work shows that graph Laplacian eigenvectors do not in general have the key properties of a true Fourier basis, but leaves the empirical success of Spectral GNNs unexplained. We identify two theoretical glitches: (1) commonly used “graph Fourier bases” are not classical Fourier bases for graph signals; (2) (n-1)-degree polynomials (n = number of nodes) can exactly interpolate any spectral response via a Vandermonde system, so the usual “polynomial approximation” narrative is not theoretically justified. The effectiveness of GCN is commonly attributed to spectral low-pass filtering, yet we prove that low- and high-pass behaviors arise solely from message-passing dynamics rather than Graph Fourier Transform-based spectral formulations. We then analyze two representative directed spectral models, MagNet and HoloNet. Their reported effectiveness is not spectral: it arises from implementation issues that reduce them to powerful MPNNs. When implemented consistently with the claimed spectral algorithms, performance becomes weak. This position paper argues that: for node classification, Spectral GNNs neither meaningfully capture the graph spectrum nor reliably improve performance; competitive results are better explained by their equivalence to MPNNs, sometimes aided by implementations inconsistent with their intended design.

Keywords: Spectral Graph Neural Networks, node classification, graph Fourier basis, message-passing neural networks, theoretical analysis, graph Laplacian, polynomial approximation, MPNNs
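The Vandermonde argument in the abstract can be checked on a toy case. If a graph's n eigenvalues are distinct, a polynomial of degree n-1 reproduces any prescribed response on those eigenvalues exactly, so the polynomial filter is an interpolant rather than an approximation. The sketch below solves the Vandermonde system in exact rational arithmetic. The eigenvalues and target response are hypothetical values chosen for illustration, and the distinct-eigenvalue condition is an assumption of this sketch.

```python
from fractions import Fraction


def solve_vandermonde(eigenvalues, responses):
    """Return c_0..c_{n-1} such that sum_k c_k * lam**k equals the response at each lam."""
    n = len(eigenvalues)
    # Augmented system: row i is [1, lam_i, ..., lam_i^(n-1) | g_i].
    rows = [
        [Fraction(lam) ** k for k in range(n)] + [Fraction(g)]
        for lam, g in zip(eigenvalues, responses)
    ]
    # Gauss-Jordan elimination; a pivot always exists when eigenvalues are distinct.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


def evaluate(coeffs, lam):
    """Evaluate the polynomial filter at one eigenvalue."""
    return sum(c * Fraction(lam) ** k for k, c in enumerate(coeffs))


# Hypothetical spectrum of a 4-node graph and an arbitrary target response.
eigenvalues = [0, Fraction(1, 2), 1, 2]
target = [1, -3, 7, 0]

coeffs = solve_vandermonde(eigenvalues, target)
fitted = [evaluate(coeffs, lam) for lam in eigenvalues]
print("exact match:", fitted == [Fraction(t) for t in target])
```

Because the fit is exact for any target, the degree of the polynomial relative to n, not the quality of an approximation, determines what such a filter can express on a fixed graph.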

Authors: Mohamed Badi, Chaouki Ben Issaid, Mehdi Bennis Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19067v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

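Scoring rationale: The paper studies multi-modal federated learning (the CoMFed framework), focusing on communication efficiency, latent-space alignment, and robustness, with application to human activity recognition. All keywords relate directly to large language models (LLMs), deep learning techniques, or AI-for-Science applications, whereas this paper does not involve LLMs, innovations in deep learning model architecture, or AI applications in science, so all keywords are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes the CoMFed framework, which addresses heterogeneous client modalities, difficult feature-space alignment, and high communication costs in multi-modal federated learning. By regularizing alignment in the latent space, it achieves competitive accuracy with low overhead on human activity recognition tasks.

Abstract translation (Chinese)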

联邦学习（Federated Learning, FL）使得分布式设备能够在无需共享原始数据的情况下进行协作式模型训练，但将FL应用于多模态场景会带来显著挑战。客户端通常拥有异构的模态和模型架构，这使得在保护隐私并最小化通信开销的同时，高效对齐特征空间变得困难。为解决此问题，我们提出了CoMFed，一种通信高效的多模态联邦学习框架，该框架利用可学习的投影矩阵来生成压缩的潜在表征。通过潜在空间正则化器，这些表征在客户端之间得到对齐，从而提升了跨模态一致性以及对异常值的鲁棒性。在人类活动识别基准测试上的实验表明，CoMFed能够以极小的开销实现具有竞争力的准确率。

Abstract

Federated learning (FL) enables collaborative model training across distributed devices without sharing raw data, but applying FL to multi-modal settings introduces significant challenges. Clients typically possess heterogeneous modalities and model architectures, making it difficult to align feature spaces efficiently while preserving privacy and minimizing communication costs. To address this, we introduce CoMFed, a Communication-Efficient Multi-Modal Federated Learning framework that uses learnable projection matrices to generate compressed latent representations. A latent-space regularizer aligns these representations across clients, improving cross-modal consistency and robustness to outliers. Experiments on human activity recognition benchmarks show that CoMFed achieves competitive accuracy with minimal overhead.

Keywords: Federated Learning, Multi-modal Learning, Communication Efficiency, Latent Space Alignment, Heterogeneous Modalities, Privacy Preservation, Human Activity Recognition, Robust Learning

247. ❌ Hardness of High-Dimensional Linear Classification

Authors: Alexander Munteanu, Simon Omlor, Jeff M. Phillips Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19061v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

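Scoring rationale: The paper studies the computational complexity of high-dimensional linear classification, a theoretical problem in theoretical computer science and computational geometry, and is entirely unrelated to all keywords (which concern large models, deep learning techniques, and their applications). It does not involve any large-model techniques, training methods, inference optimization, alignment, applications, or related concepts.

!!! tip deepseek-chat TL;DR

The paper studies the computational complexity of high-dimensional linear classification (the Maximum Halfspace Discrepancy problem). Through reductions from Affine Degeneracy testing and the k-Sum problem, it establishes lower bounds that are exponential in the dimension, closing the theoretical gap between existing polynomial lower bounds and exponential upper bounds.

Abstract translation (Chinese)

我们针对线性分类建模的最大半空间差异问题建立了新的维度指数级下界。该问题及其近似形式均是计算几何与机器学习中的基础性问题。然而，目前已知的上界仅为$O(n^d)$和相应的$\tilde{O}(1/\varepsilon^d)$，且已有的多项式下界无法支持维度依赖的指数性增长。通过从广泛被认可的仿射退化性检验与$k$-和问题的困难性猜想出发进行归约，我们在多对数因子内填补了这一空白。基于仿射退化性检验，我们的归约给出了匹配的下界$\tilde{\Omega}(n^d)$及相应的$\tilde{\Omega}(1/\varepsilon^d)$；在$k$-和问题的假设下，则得到$\tilde{\Omega}(n^{d/2})$及相应的$\tilde{\Omega}(1/\varepsilon^{d/2})$下界。若将计算模型限制为侧向性查询（这对应许多现代算法与计算范式中广泛实现并优化的常见设定），则第一组下界在无条件情况下依然成立。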

Abstract

We establish new exponential-in-dimension lower bounds for the Maximum Halfspace Discrepancy problem, which models linear classification. Both are fundamental problems in computational geometry and machine learning in their exact and approximate forms. However, only $O(n^d)$ and, respectively, $\tilde{O}(1/\varepsilon^d)$ upper bounds are known, complemented by polynomial lower bounds that do not support the exponential-in-dimension dependence. We close this gap up to polylogarithmic terms by reduction from widely believed hardness conjectures for Affine Degeneracy testing and $k$-Sum problems. Our reductions yield matching lower bounds of $\tilde{\Omega}(n^d)$ and, respectively, $\tilde{\Omega}(1/\varepsilon^d)$ based on Affine Degeneracy testing, and $\tilde{\Omega}(n^{d/2})$ and, respectively, $\tilde{\Omega}(1/\varepsilon^{d/2})$ conditioned on $k$-Sum. The first bound also holds unconditionally if the computational model is restricted to make sidedness queries, which corresponds to a widespread setting implemented and optimized in many contemporary algorithms and computing paradigms.

Keywords: linear classification, high-dimensional, computational complexity, lower bounds, Maximum Halfspace Discrepancy, Affine Degeneracy, k-Sum, sidedness queries

248. ❌ Fast and Interpretable Autoregressive Estimation with Neural Network Backpropagation

Authors: Anaísa Lucena, Ana Martins, Armando J. Pinho, Sónia Gouveia Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19041v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是时间序列分析中的自回归模型参数估计方法，提出了一种基于神经网络反向传播的快速可解释估计方法。论文内容完全聚焦于传统统计模型（AR模型）的优化计算问题，没有涉及任何大语言模型、深度学习技术原理创新、大模型在不同领域的应用、或AI for Science等关键词相关的主题。所有关键词都涉及大模型、深度学习、AI应用等现代AI技术，而本文研究的是传统统计模型的数值优化问题，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于神经网络反向传播的自回归模型参数估计方法，相比传统条件最大似然估计，新方法在保持可解释性的同时实现了更快的计算速度和更好的收敛性。

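Abstract translation

Autoregressive (AR) models remain widely used in time series analysis because of their interpretability, but conventional parameter estimation methods are computationally expensive and prone to convergence problems. This paper proposes a neural network (NN) based approach to AR estimation that embeds the autoregressive structure directly in a feedforward neural network and estimates the coefficients by backpropagation while preserving interpretability. Simulation experiments on 125,000 synthetic AR(p) time series with short-term dependence (1 ≤ p ≤ 5) show that the proposed NN-based method consistently recovers the model coefficients for every series, whereas conditional maximum likelihood (CML) estimation fails to converge in about 55% of cases. When both methods converge, their estimation accuracy is comparable, with negligible differences in relative error, $R^2$, and perplexity/likelihood. When CML fails, the NN-based method still provides reliable estimates. In all cases the NN estimator delivers a substantial gain in computational efficiency, with a median speedup of 12.6x and up to 34.2x at higher model orders. Overall, the results indicate that gradient-descent NN optimization offers a fast and efficient alternative for interpretable AR parameter estimation.

Abstract (original)

Autoregressive (AR) models remain widely used in time series analysis due to their interpretability, but conventional parameter estimation methods can be computationally expensive and prone to convergence issues. This paper proposes a Neural Network (NN) formulation of AR estimation by embedding the autoregressive structure directly into a feedforward NN, enabling coefficient estimation through backpropagation while preserving interpretability. Simulation experiments on 125,000 synthetic AR(p) time series with short-term dependence (1 <= p <= 5) show that the proposed NN-based method consistently recovers model coefficients for all series, while Conditional Maximum Likelihood (CML) fails to converge in approximately 55% of cases. When both methods converge, estimation accuracy is comparable with negligible differences in relative error, $R^2$, and perplexity/likelihood. However, when CML fails, the NN-based approach still provides reliable estimates. In all cases, the NN estimator achieves substantial computational gains, reaching a median speedup of 12.6x and up to 34.2x for higher model orders. Overall, results demonstrate that gradient-descent NN optimization can provide a fast and efficient alternative for interpretable AR parameter estimation.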
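The idea of reading AR coefficients off a network's weights can be illustrated with a minimal sketch: an AR(p) model is a single linear unit without bias whose inputs are the p previous values, so gradient descent on the squared one-step prediction error recovers the coefficients. The function names, learning rate, epoch count, and the synthetic AR(2) data below are assumptions chosen for illustration; the abstract does not specify the authors' architecture or optimizer, and this is not their implementation.

```python
import random


def simulate_ar(coeffs, n, noise_std=1.0, seed=0):
    """Generate n samples of x_t = sum_k coeffs[k] * x_{t-1-k} + Gaussian noise."""
    rng = random.Random(seed)
    p = len(coeffs)
    x = [0.0] * p  # zero initial history (an assumption of this sketch)
    for _ in range(n):
        pred = sum(c * x[-1 - k] for k, c in enumerate(coeffs))
        x.append(pred + rng.gauss(0.0, noise_std))
    return x[p:]


def fit_ar_gd(series, p, lr=0.05, epochs=200):
    """Estimate AR(p) coefficients by full-batch gradient descent on the MSE.

    The weight vector w plays the role of a bias-free linear layer:
    w[0] multiplies x_{t-1}, w[1] multiplies x_{t-2}, and so on.
    """
    w = [0.0] * p
    rows = [(series[t - p:t][::-1], series[t]) for t in range(p, len(series))]
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * p
        for lags, target in rows:
            err = sum(wk * lk for wk, lk in zip(w, lags)) - target
            for k in range(p):
                grad[k] += 2.0 * err * lags[k] / n
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


if __name__ == "__main__":
    data = simulate_ar([0.5, -0.3], 1000, noise_std=1.0, seed=1)
    print(fit_ar_gd(data, p=2))
```

Because the fitted weights are the AR coefficients themselves, interpretability is preserved by construction; a framework with automatic differentiation would compute the same gradient via backpropagation.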

Keywords: Autoregressive models, Neural network backpropagation, Parameter estimation, Time series analysis, Computational efficiency, Interpretability, Conditional Maximum Likelihood, Convergence issues

249. ❌ When Differential Privacy Meets Wireless Federated Learning: An Improved Analysis for Privacy and Convergence

Authors: Chen Yaoling, Liang Hao, Tu Xiaotong Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19040v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

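Scoring rationale: The paper studies differential privacy protection in wireless federated learning, at the intersection of privacy preservation and distributed machine learning, and is unrelated to every scoring keyword (all of which focus on large-model techniques, training methods, inference optimization, applications, and so on). It does not involve large models, deep learning technical principles, or AI-for-Science applications.

!!! tip deepseek-chat TL;DR

The paper studies the precise characterization of privacy loss and the convergence analysis of differential privacy in wireless federated learning, proposes a theoretical privacy-utility trade-off framework for general smooth non-convex loss objectives, and validates the theory with numerical experiments.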

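Abstract translation

Differentially private wireless federated learning (DPWFL) is a promising framework for protecting sensitive user data. However, the foundational question of how to characterize privacy loss precisely remains unresolved, and existing work is further limited by convergence analyses that rely on strict convexity assumptions or ignore the effect of gradient clipping. To overcome these problems, this paper presents a comprehensive privacy and convergence analysis of DPWFL with general smooth non-convex loss objectives. Our analysis explicitly accounts for device selection and mini-batch sampling, and shows that the privacy loss can converge to a constant rather than diverge with the number of iterations. We also establish convergence guarantees with gradient clipping and derive an explicit privacy-utility trade-off. Numerical experiments validate our theoretical findings.

Abstract (original)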

Differentially private wireless federated learning (DPWFL) is a promising framework for protecting sensitive user data. However, foundational questions on how to precisely characterize privacy loss remain open, and existing work is further limited by convergence analyses that rely on restrictive convexity assumptions or ignore the effect of gradient clipping. To overcome these issues, we present a comprehensive analysis of privacy and convergence for DPWFL with general smooth non-convex loss objectives. Our analysis explicitly incorporates both device selection and mini-batch sampling, and shows that the privacy loss can converge to a constant rather than diverge with the number of iterations. Moreover, we establish convergence guarantees with gradient clipping and derive an explicit privacy-utility trade-off. Numerical results validate our theoretical findings.

Keywords: Differential Privacy, Wireless Federated Learning, Privacy Loss, Convergence Analysis, Gradient Clipping, Non-convex Optimization, Privacy-Utility Trade-off, Device Selection

250. ❌ Towards Verifiable AI with Lightweight Cryptographic Proofs of Inference

Authors: Pranay Anchuri, Matteo Campanelli, Paul Cesaretti, Rosario Gennaro, Tushar M. Jois, Hasan S. Kayman, Tugce Ozdemir Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19025v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

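Scoring rationale: The paper mainly studies the verifiability of large-model inference and proposes a verification framework based on lightweight cryptographic proofs and statistical sampling. It explicitly reports experiments on Llama-2-7B, so it is highly relevant to 'Large Language Models' (8 points). Other keywords such as MoE, SLMs, training methods, inference optimization, and AI applications are not addressed in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

To address the verifiability of inference results from cloud-deployed large models, the paper proposes a verification framework based on statistical sampling and lightweight cryptographic proofs, cutting proving time from minutes to milliseconds, and validates it on ResNet-18 and Llama-2-7B.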

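Abstract translation

When large AI models are deployed as cloud services, clients have no guarantee that a response is correct or that it was produced by the specified model. Rerunning inference locally is infeasible for large models, and existing cryptographic proof systems, although they provide strong correctness guarantees, impose prohibitive prover overhead (for example, hundreds of seconds per query for billion-parameter models). This paper proposes a verification framework and protocol that builds on statistical properties of neural networks and replaces full cryptographic proofs with a lightweight sampling-based approach. We formalize the conditions under which trace separation between functionally dissimilar models can be used to argue the security of verifiable inference protocols. The prover commits to the inference execution trace with Merkle-tree-based vector commitments and opens only a small number of entries along randomly sampled paths from output to input. The resulting protocol trades soundness for efficiency, a trade-off well suited to auditing, to large-scale deployments where repeated queries amplify the detection probability, and to rational-incentive settings where provers are penalized when a violation is detected. Compared with state-of-the-art cryptographic proof systems, our approach reduces proving time by several orders of magnitude, from minutes to milliseconds, with a moderate increase in proof size. Experiments on ResNet-18 classifiers and Llama-2-7B confirm that common architectures exhibit the statistical properties the protocol requires, and that natural adversarial strategies (gradient-descent reconstruction, inverse transforms, logit swapping) fail to produce traces that evade detection. We also present a protocol in the refereed delegation model, in which two competing servers allow the correct output to be identified in a logarithmic number of rounds.

Abstract (original)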

When large AI models are deployed as cloud-based services, clients have no guarantee that responses are correct or were produced by the intended model. Rerunning inference locally is infeasible for large models, and existing cryptographic proof systems – while providing strong correctness guarantees – introduce prohibitive prover overhead (e.g., hundreds of seconds per query for billion-parameter models). We present a verification framework and protocol that replaces full cryptographic proofs with a lightweight, sampling-based approach grounded in statistical properties of neural networks. We formalize the conditions under which trace separation between functionally dissimilar models can be leveraged to argue the security of verifiable inference protocols. The prover commits to the execution trace of inference via Merkle-tree-based vector commitments and opens only a small number of entries along randomly sampled paths from output to input. This yields a protocol that trades soundness for efficiency, a tradeoff well-suited to auditing, large-scale deployment settings where repeated queries amplify detection probability, and scenarios with rationally incentivized provers who face penalties upon detection. Our approach reduces proving times by several orders of magnitude compared to state-of-the-art cryptographic proof systems, going from the order of minutes to the order of milliseconds, with moderately larger proofs. Experiments on ResNet-18 classifiers and Llama-2-7B confirm that common architectures exhibit the statistical properties our protocol requires, and that natural adversarial strategies (gradient-descent reconstruction, inverse transforms, logit swapping) fail to produce traces that evade detection. We additionally present a protocol in the refereed delegation model, where two competing servers enable correct output identification in a logarithmic number of rounds.
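The commit-then-open step described in the abstract can be sketched with a plain Merkle tree: the prover hashes every entry of the execution trace into a tree, publishes only the root, and later reveals a few sampled entries together with their sibling paths. Everything below is an illustrative assumption (SHA-256, float64 serialization, duplicate-last-node padding, uniformly sampled positions, and all function names). It does not reproduce the paper's output-to-input path sampling or its statistical consistency checks.

```python
import hashlib
import random
import struct


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def encode_trace(values):
    """Serialize each trace value (e.g. an activation) as a little-endian float64."""
    return [struct.pack("<d", v) for v in values]


def build_tree(leaves):
    """Return all tree levels; levels[0] holds leaf hashes, levels[-1] == [root]."""
    level = [_h(b"\x00" + leaf) for leaf in leaves]  # 0x00 prefix marks leaves
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2:
            level = level + [level[-1]]  # pad odd levels by repeating the last node
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [_h(b"\x01" + level[i] + level[i + 1])  # 0x01 marks inner nodes
                 for i in range(0, len(level), 2)]


def open_entry(levels, index):
    """Collect the sibling hashes needed to recompute the root from one leaf."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index //= 2
    return path


def verify_entry(root, leaf, index, path):
    """Verifier side: recompute the root from the opened leaf and its path."""
    node = _h(b"\x00" + leaf)
    for sibling in path:
        pair = node + sibling if index % 2 == 0 else sibling + node
        node = _h(b"\x01" + pair)
        index //= 2
    return node == root


def sample_openings(levels, num_leaves, k, seed):
    """Open k distinct, uniformly sampled trace positions."""
    rng = random.Random(seed)
    return [(i, open_entry(levels, i)) for i in rng.sample(range(num_leaves), k)]
```

Each opening costs a logarithmic number of hashes in the trace length, which is consistent with the abstract's point that sampling a few entries is far cheaper than proving the whole computation, at the price of weaker soundness.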

Keywords: verifiable AI, cryptographic proofs, inference verification, large models, sampling-based verification, Merkle-tree commitments, Llama-2-7B, cloud-based services

251. ❌ Revisiting OmniAnomaly for Anomaly Detection: performance metrics and comparison with PCA-based models

Authors: Bruna Alves, Ana Martins, Armando J. Pinho, Sónia Gouveia Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18985v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

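Scoring rationale: The paper focuses on benchmark evaluation methodology for multivariate time series anomaly detection (MTSAD), comparing the deep learning model OmniAnomaly with a PCA-based linear baseline. It concerns the evaluation of deep learning in one specific application (anomaly detection) but does not touch the core topics covered by the keywords, such as large language models (LLMs), innovations in large-model principles, or AI for Science. None of the keywords relates to its content, so all relevance scores are 0.

!!! tip deepseek-chat TL;DR

The study re-evaluates OmniAnomaly, a widely used deep learning model for multivariate time series anomaly detection, against a PCA-based linear baseline and finds that under a unified evaluation protocol PCA matches or even outperforms OmniAnomaly, calling into question the added value of complex models under current benchmarking practice.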

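Abstract translation

Deep learning models have become the dominant approach to multivariate time series anomaly detection (MTSAD) and are often reported to outperform classical statistical methods by a wide margin. However, these gains are frequently obtained under heterogeneous thresholding strategies and evaluation protocols, which makes fair comparison difficult. This study revisits OmniAnomaly, a widely used stochastic recurrent model, and compares it systematically with a simple linear baseline based on principal component analysis (PCA) on the Server Machine Dataset (SMD). Both methods are evaluated under exactly the same thresholding and evaluation procedure, with 100 repeated runs for each of the 28 machines in the dataset. Performance is measured with point-level precision, recall, and F1-score, with and without point adjustment and under different strategies for aggregating across machines and runs, and the corresponding standard deviations are reported. The results show large performance variation across machines and show that PCA can match OmniAnomaly and even outperform it when point adjustment is not applied. These findings question the added value of more complex architectures under current benchmarking practice and highlight the critical role of evaluation methodology in MTSAD research.

Abstract (original)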

Deep learning models have become the dominant approach for multivariate time series anomaly detection (MTSAD), often reporting substantial performance improvements over classical statistical methods. However, these gains are frequently evaluated under heterogeneous thresholding strategies and evaluation protocols, making fair comparisons difficult. This work revisits OmniAnomaly, a widely used stochastic recurrent model for MTSAD, and systematically compares it with a simple linear baseline based on Principal Component Analysis (PCA) on the Server Machine Dataset (SMD). Both methods are evaluated under identical thresholding and evaluation procedures, with experiments repeated across 100 runs for each of the 28 machines in the dataset. Performance is evaluated using Precision, Recall and F1-score at point-level, with and without point-adjustment, and under different aggregation strategies across machines and runs, with the corresponding standard deviations also reported. The results show large variability across machines and show that PCA can achieve performance comparable to OmniAnomaly, and even outperform it when point-adjustment is not applied. These findings question the added value of more complex architectures under current benchmarking practices and highlight the critical role of evaluation methodology in MTSAD research.
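Why point-adjustment matters so much for the reported numbers can be seen in a small sketch. Under the commonly used point-adjustment convention, if a detector flags at least one point inside a ground-truth anomalous segment, every point of that segment is counted as detected before precision, recall, and F1 are computed. The toy labels and function names below are assumptions for illustration; the abstract does not give the paper's thresholding rule, so none is modeled here.

```python
def point_adjust(pred, truth):
    """Flag a whole ground-truth anomaly segment if any of its points was flagged."""
    adjusted = list(pred)
    i, n = 0, len(truth)
    while i < n:
        if truth[i]:
            j = i
            while j < n and truth[j]:
                j += 1  # j is one past the end of the anomalous segment
            if any(pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted


def precision_recall_f1(pred, truth):
    """Point-level precision, recall and F1 for binary label sequences."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    truth = [0, 1, 1, 1, 0, 0, 1, 1, 0]
    pred = [0, 0, 1, 0, 0, 1, 0, 0, 0]
    print(precision_recall_f1(pred, truth))                       # raw scores
    print(precision_recall_f1(point_adjust(pred, truth), truth))  # adjusted scores
```

In this toy case a single hit inside the first segment lifts F1 from roughly 0.29 to roughly 0.67, which illustrates how the choice of protocol can dominate a comparison between models.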

Keywords: anomaly detection, multivariate time series, OmniAnomaly, PCA, evaluation methodology, benchmarking, deep learning, performance comparison

252. ❌ Book your room in the Turing Hotel! A symmetric and distributed Turing Test with multiple AIs and humans

Authors: Christian Di Maio, Tommaso Guidi, Luigi Quarantiello, Jack Bell, Marco Gori, Stefano Melacci, Vincenzo Lomonaco Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18981v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

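Scoring rationale: The paper's core topic is a Turing Test for LLMs in a distributed multi-agent environment, so it is highly relevant to 'Large Language Models', 'LLM Agents', and 'Multi-agent Systems' (10 points each), since it explicitly studies how LLMs interact and are evaluated within a multi-agent community. Other keywords such as MoE, SFT, and RAG concern specific technical principles or application areas that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new distributed Turing Test based on a mixed community of LLMs and human participants; the experiments show that current LLMs can still be mistaken for humans in group interaction, while human traits remain identifiable.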

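Abstract translation

This paper reports our experimental experience with "TuringHotel", a novel extension of the Turing Test built on interactions within a mixed community of large language models (LLMs) and human participants. The one-to-one interaction of the classical Turing Test is reinterpreted in a group setting in which humans and artificial agents take part together in time-limited discussions and, interestingly, both act as judges and respondents at the same time. The community is instantiated on the new platform UNaIVERSE (https://unaiverse.io), where the platform's built-in programming tools are used to create a "World" that defines the roles and interaction dynamics. All communication runs over an authenticated peer-to-peer network, so that no third party can access the exchanged content. The platform also gives human participants a unified interface accessible from mobile devices and laptops, which was a key part of the experimental experience described here. We invited 17 human participants and 19 large language models to the experiment, and the results show that current models are still sometimes mistaken for humans. Interestingly, several unexpected misjudgments occurred, suggesting that the human "fingerprint" remains identifiable, though not entirely unambiguous, despite the high-quality language skills of the AI participants. We believe this is the first experiment of its kind conducted in a distributed setting, and that similar initiatives could be of national interest in supporting ongoing experiments and competitions that monitor the evolution of large language models over time.

Abstract (original)

In this paper, we report our experience with "TuringHotel", a novel extension of the Turing Test based on interactions within mixed communities of Large Language Models (LLMs) and human participants. The classical one-to-one interaction of the Turing Test is reinterpreted in a group setting, where both human and artificial agents engage in time-bounded discussions and, interestingly, are both judges and respondents. This community is instantiated in the novel platform UNaIVERSE (https://unaiverse.io), creating a "World" which defines the roles and interaction dynamics, facilitated by the platform's built-in programming tools. All communication occurs over an authenticated peer-to-peer network, ensuring that no third parties can access the exchange. The platform also provides a unified interface for humans, accessible via both mobile devices and laptops, that was a key component of the experience in this paper. Results of our experimentation involving 17 human participants and 19 LLMs revealed that current models are still sometimes confused as humans. Interestingly, there are several unexpected mistakes, suggesting that human fingerprints are still identifiable but not fully unambiguous, despite the high-quality language skills of artificial participants. We argue that this is the first experiment conducted in such a distributed setting, and that similar initiatives could be of national interest to support ongoing experiments and competitions aimed at monitoring the evolution of large language models over time.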

Keywords: Turing Test, Large Language Models, multi-agent systems, distributed interaction, human-AI interaction, autonomous agents, peer-to-peer network, UNaIVERSE platform

253. ❌ Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives

Authors: S. Akash, Pratik Gajane, Jawar Singh Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18972v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

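Scoring rationale: The paper studies multi-dueling bandit algorithms, which belong to online learning and reinforcement learning, and focuses on unified algorithms that are optimal under both stochastic and adversarial preferences. It does not involve large models, deep learning, AI for Science, or any technical topic among the scoring keywords, so no keyword is relevant to it.

!!! tip deepseek-chat TL;DR

The paper resolves a fundamental question in multi-dueling bandits by proposing the first unified algorithms that are optimal under both stochastic and adversarial preferences, with separate algorithms and theoretical guarantees for the Condorcet and Borda objectives.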

Abstract translation

In the multi-dueling bandit problem, the learner selects $m \geq 2$ arms in each round and observes only the winner. This setting arises naturally in many applications such as ranking and recommendation systems, yet a fundamental question has remained open: is there a single algorithm that achieves optimal performance in both stochastic and adversarial environments without knowing which kind of environment it faces? We answer in the affirmative, presenting the first best-of-both-worlds algorithms for multi-dueling bandits under the Condorcet and Borda objectives. For the Condorcet setting we propose \texttt{MetaDueling}, a black-box reduction that turns any dueling bandit algorithm into a multi-dueling bandit algorithm by converting multi-way winner feedback into an unbiased pairwise signal. Instantiating the reduction with \texttt{Versatile-DB} gives the first best-of-both-worlds algorithm for multi-dueling bandits: it achieves $O(\sqrt{KT})$ pseudo-regret under adversarial preferences and the instance-optimal $O\!\left(\sum_{i \neq a^\star} \frac{\log T}{\Delta_i}\right)$ pseudo-regret under stochastic preferences, in both cases without prior knowledge of the environment type. For the Borda setting we propose \AlgBorda, an algorithm for both stochastic and adversarial environments that achieves $O\left(K^2 \log KT + K \log^2 T + \sum_{i: \Delta_i^{\mathrm{B}} > 0} \frac{K\log KT}{(\Delta_i^{\mathrm{B}})^2}\right)$ regret in stochastic environments and $O\left(K \sqrt{T \log KT} + K^{1/3} T^{2/3} (\log K)^{1/3}\right)$ regret in adversarial environments, again without prior knowledge of the environment type. We complement the upper bounds for the Condorcet setting with matching lower bounds. For the Borda setting, our upper bounds are near-optimal relative to the lower bounds (within a factor of $K$) and match the best known results in the literature.

Abstract (original)

Multi-dueling bandits, where a learner selects $m \geq 2$ arms per round and observes only the winner, arise naturally in many applications including ranking and recommendation systems, yet a fundamental question has remained open: can a single algorithm perform optimally in both stochastic and adversarial environments, without knowing which regime it faces? We answer this affirmatively, providing the first best-of-both-worlds algorithms for multi-dueling bandits under both Condorcet and Borda objectives. For the Condorcet setting, we propose \texttt{MetaDueling}, a black-box reduction that converts any dueling bandit algorithm into a multi-dueling bandit algorithm by transforming multi-way winner feedback into an unbiased pairwise signal. Instantiating our reduction with \texttt{Versatile-DB} yields the first best-of-both-worlds algorithm for multi-dueling bandits: it achieves $O(\sqrt{KT})$ pseudo-regret against adversarial preferences and the instance-optimal $O\!\left(\sum_{i \neq a^\star} \frac{\log T}{\Delta_i}\right)$ pseudo-regret under stochastic preferences, both simultaneously and without prior knowledge of the regime. For the Borda setting, we propose \AlgBorda, a stochastic-and-adversarial algorithm that achieves $O\left(K^2 \log KT + K \log^2 T + \sum_{i: \Delta_i^{\mathrm{B}} > 0} \frac{K\log KT}{(\Delta_i^{\mathrm{B}})^2}\right)$ regret in stochastic environments and $O\left(K \sqrt{T \log KT} + K^{1/3} T^{2/3} (\log K)^{1/3}\right)$ regret against adversaries, again without prior knowledge of the regime. We complement our upper bounds with matching lower bounds for the Condorcet setting. For the Borda setting, our upper bounds are near-optimal with respect to the lower bounds (within a factor of $K$) and match the best-known results in the literature.

Keywords: multi-dueling bandits, best-of-both-worlds algorithms, stochastic preferences, adversarial preferences, Condorcet objective, Borda objective, regret bounds, online learning

254. ❌ Maximum-Entropy Exploration with Future State-Action Visitation Measures

Authors: Adrien Bolland, Gaspard Lambrechts, Damien Ernst Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18965v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

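Scoring rationale: The paper studies exploration strategies in maximum-entropy reinforcement learning, focusing on maximizing the entropy of the state-action feature distribution, and belongs to reinforcement learning. All scoring keywords target large models, deep learning technical principles, or AI-for-Science applications, whereas this paper involves no large models, language models, model training or fine-tuning, inference optimization, AI agents, or scientific AI applications, so all relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes a maximum-entropy exploration method based on future state-action visitation measures, using intrinsic rewards to maximize the entropy of the feature distribution along trajectories; experiments show that it improves feature visitation within individual trajectories and speeds up learning convergence for exploration-only agents.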

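Abstract translation

Maximum-entropy reinforcement learning encourages agents to explore states and actions so as to maximize the entropy of some distribution, by providing intrinsic rewards proportional to an entropy function. This paper studies an intrinsic reward proportional to the entropy of the discounted distribution of state-action features visited in future time steps. The approach rests on two results. First, we prove that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, and this lower bound is related to an alternative maximum-entropy objective. Second, we show that the distribution used to define the intrinsic reward is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments show that, as the lower bound suggests, the new objective markedly improves feature visitation within individual trajectories, at the cost of a slight decrease in expected feature visitation across different trajectories. The method also improves the learning convergence speed of exploration-only agents. On the selected benchmarks, control performance remains similar for most methods.

Abstract (original)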

Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments highlight that the new objective leads to improved visitation of features within individual trajectories, in exchange for slightly reduced visitation of features in expectation over different trajectories, as suggested by the lower bound. It also leads to improved convergence speed for learning exploration-only agents. Control performance remains similar across most methods on the considered benchmarks.
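To make the quantity being maximized more concrete, the sketch below computes the discounted distribution of discrete features visited along one finite trajectory and its Shannon entropy. It is a toy illustration under stated assumptions: features are hashable labels, the trajectory is finite and the weights are renormalized over it, and the function names are hypothetical. The paper's intrinsic reward, its lower bound, and its off-policy fixed-point estimator are not reproduced here.

```python
import math
from collections import defaultdict


def discounted_visitation(features, gamma):
    """Discounted distribution of features along one finite trajectory.

    The feature seen at step t receives weight gamma**t; weights are
    renormalized so that they sum to one over the observed trajectory.
    """
    weights = defaultdict(float)
    for t, x in enumerate(features):
        weights[x] += gamma ** t
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution given as a dict."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


if __name__ == "__main__":
    revisiting = ["a", "a", "a", "a"]  # keeps returning to the same feature
    exploring = ["a", "b", "c", "d"]   # visits a new feature at every step
    for name, traj in [("revisiting", revisiting), ("exploring", exploring)]:
        print(name, entropy(discounted_visitation(traj, gamma=0.9)))
```

A trajectory that keeps revisiting one feature has zero entropy, while one that spreads its discounted visits over many features scores higher, which is the behavior an entropy-proportional intrinsic reward is meant to encourage.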

Keywords: Maximum entropy reinforcement learning, Intrinsic rewards, State-action features, Exploration, Trajectory visitation, Off-policy estimation, Convergence speed, Control performance

255. ❌ BVSIMC: Bayesian Variable Selection-Guided Inductive Matrix Completion for Improved and Interpretable Drug Discovery

Authors: Sijian Fan, Liyan Xiong, Dayuan Wang, Guoshuai Cai, Ray Bai Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18957v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

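Scoring rationale: The paper focuses on Bayesian variable selection and inductive matrix completion for drug discovery, which falls under AI for Science (bioinformatics and cheminformatics), so it is highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). Its emphasis on interpretability gives it some relevance to 'Mechanistic Interpretability OR Explainable AI' (5 points). The other keywords mainly concern large-model principles, training methods, inference optimization, agent systems, and similar topics that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Bayesian Variable Selection-Guided Inductive Matrix Completion (BVSIMC) to improve predictive accuracy and interpretability in drug discovery, and validates its superior performance on tuberculosis drug resistance prediction and drug repositioning.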

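Abstract translation

Recent advances in drug discovery show that incorporating side information (for example, chemical properties of drugs and genomic information about diseases) often improves prediction performance substantially. However, these side features vary widely in relevance and are often noisy and high-dimensional. We propose Bayesian Variable Selection-Guided Inductive Matrix Completion (BVSIMC), a new Bayesian model that performs variable selection on side features in drug discovery. By learning sparse latent embeddings, BVSIMC improves both predictive accuracy and interpretability. We validate the method through simulation studies and two drug discovery applications: 1) predicting drug resistance in Mycobacterium tuberculosis, and 2) predicting new drug-disease associations in computational drug repositioning. On both synthetic and real data, BVSIMC outperforms several other state-of-the-art methods in prediction performance. In the two real examples, BVSIMC further reveals the most clinically meaningful side features.

Abstract (original)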

Recent advances in drug discovery have demonstrated that incorporating side information (e.g., chemical properties about drugs and genomic information about diseases) often greatly improves prediction performance. However, these side features can vary widely in relevance and are often noisy and high-dimensional. We propose Bayesian Variable Selection-Guided Inductive Matrix Completion (BVSIMC), a new Bayesian model that enables variable selection from side features in drug discovery. By learning sparse latent embeddings, BVSIMC improves both predictive accuracy and interpretability. We validate our method through simulation studies and two drug discovery applications: 1) prediction of drug resistance in Mycobacterium tuberculosis, and 2) prediction of new drug-disease associations in computational drug repositioning. On both synthetic and real data, BVSIMC outperforms several other state-of-the-art methods in terms of prediction. In our two real examples, BVSIMC further reveals the most clinically meaningful side features.

Keywords: Bayesian Variable Selection, Inductive Matrix Completion, Drug Discovery, Drug Resistance Prediction, Drug Repositioning, Interpretable AI, Sparse Latent Embeddings, Bioinformatics
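
To make the idea of variable selection through sparse latent embeddings concrete, here is a minimal sketch of the standard inductive matrix completion score, in which a drug-disease pair is scored as a bilinear form of their side features. It is not the BVSIMC model itself: the Bayesian priors are omitted, and the names `imc_score`, `selected_features`, `A`, and `B` are hypothetical. A side feature whose row in a loading matrix is entirely zero has no effect on any prediction, which is what makes the selected features readable.

```python
def matvec_T(matrix, vector):
    """Return matrix^T @ vector for a row-major list-of-lists matrix."""
    cols = len(matrix[0])
    return [sum(matrix[r][c] * vector[r] for r in range(len(matrix)))
            for c in range(cols)]


def imc_score(drug_features, disease_features, A, B):
    """Bilinear score x^T A B^T y (assumed generic IMC form)."""
    u = matvec_T(A, drug_features)      # latent embedding of the drug
    v = matvec_T(B, disease_features)   # latent embedding of the disease
    return sum(ui * vi for ui, vi in zip(u, v))


def selected_features(loading, tol=1e-8):
    """Indices of side features whose loading row is not (numerically) zero."""
    return [i for i, row in enumerate(loading)
            if max(abs(w) for w in row) > tol]


# Toy loadings: drug feature 1 and disease feature 1 are switched off.
A = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
B = [[1.0, 1.0], [0.0, 0.0]]
score = imc_score([1.0, 5.0, 1.0], [2.0, 9.0], A, B)
```

In this toy example, changing the second drug feature leaves `score` unchanged because its row in `A` is zero.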

256. ❌ Context Bootstrapped Reinforcement Learning

Authors: Saaket Agashe, Jayanth Srinivasa, Gaowen Liu, Ramana Kompella, Xin Eric Wang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18953v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

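Scoring rationale: The paper proposes Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training with few-shot demonstrations and centers on learning reasoning patterns. It is highly relevant (10 points) to 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning' because it targets tasks that require novel reasoning patterns. It is also highly relevant (10 points) to 'In-context Learning/Many-shot Learning' because the method uses few-shot demonstrations as context. It is fairly relevant (8 points) to 'Large Language Models/LLMs/Foundation Models' because the method is validated across model families, although the paper does not explicitly describe them as large models. Other keywords such as MoE, quantization, and RAG are unrelated to the paper (0 points).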

!!! tip deepseek-chat TL;DR

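To address the exploration inefficiency of Reinforcement Learning from Verifiable Rewards (RLVR), the paper proposes Context Bootstrapped Reinforcement Learning (CBRL), which injects few-shot demonstrations on a curriculum schedule to improve success rate and exploration efficiency, and validates it on reasoning tasks and a domain-specific programming language.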

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL’s practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.

Keywords: Reinforcement Learning, Context Bootstrapping, Few-shot Demonstrations, Reasoning Patterns, Exploration Efficiency, RLVR, Curriculum Learning, Domain-specific Knowledge
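
The injection curriculum described in the abstract can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the abstract only says the probability starts high and anneals to zero, so the linear schedule, the starting value `p_start`, the `anneal_fraction`, and the function names are all hypothetical.

```python
import random


def injection_probability(step, total_steps, p_start=0.5, anneal_fraction=0.5):
    """Linearly anneal from p_start to 0 over the first part of training.

    The linear shape and both defaults are assumptions for illustration.
    """
    anneal_steps = max(1, int(total_steps * anneal_fraction))
    if step >= anneal_steps:
        return 0.0
    return p_start * (1.0 - step / anneal_steps)


def build_prompt(task_prompt, demonstrations, step, total_steps, rng,
                 k=2, p_start=0.5):
    """Stochastically prepend up to k few-shot demonstrations to the prompt."""
    p = injection_probability(step, total_steps, p_start=p_start)
    if demonstrations and rng.random() < p:
        shots = rng.sample(demonstrations, min(k, len(demonstrations)))
        return "\n\n".join(shots + [task_prompt])
    return task_prompt  # late in training the model gets no assistance


rng = random.Random(0)
demos = ["Q: 1+1\nA: 2", "Q: 2+2\nA: 4"]
early = build_prompt("Q: 3+3\nA:", demos, step=0, total_steps=100, rng=rng,
                     p_start=1.0)
late = build_prompt("Q: 3+3\nA:", demos, step=99, total_steps=100, rng=rng)
```

Once the probability reaches zero, every training prompt is the bare task, which matches the abstract's requirement that the policy must ultimately succeed without demonstrations.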

257. ❌ Balancing Performance and Fairness in Explainable AI for Anomaly Detection in Distributed Power Plants Monitoring

Authors: Corneille Niyonkuru, Marcellin Atemkeng, Gabin Maxime Nguegnang, Arnaud Nguembang Fadja Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18954v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

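Scoring rationale: The paper focuses on applying traditional machine learning (ensemble methods such as LightGBM and XGBoost) to anomaly detection in distributed power plants, with an emphasis on interpretability (SHAP) and fairness (DIR). It does not involve large language models (LLMs), deep learning, or any of the large-model techniques named in the scoring keywords. The only relevant keywords are 'Mechanistic Interpretability OR Explainable AI' (because SHAP is used for interpretability analysis) and 'AI for Science OR Bioinformatics OR Cheminformatics' (because AI is applied to industrial energy management, a science/engineering domain), but the relevance is weak because neither is the paper's core contribution, which is ensemble ML and fairness. All other keywords concern large-model techniques, training methods, or inference optimization and are unrelated, so they score 0.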

!!! tip deepseek-chat TL;DR

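The paper proposes an ensemble machine learning framework that combines resampling techniques with SHAP-based interpretability analysis to address class imbalance and fairness in anomaly detection for diesel generators in Cameroon, achieving high performance (F1-score 0.99) and low bias (DIR ≈ 0.95).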

Abstract

Reliable anomaly detection in distributed power plant monitoring systems is essential for ensuring operational continuity and reducing maintenance costs, particularly in regions where telecom operators heavily rely on diesel generators. However, this task is challenged by extreme class imbalance, lack of interpretability, and potential fairness issues across regional clusters. In this work, we propose a supervised ML framework that integrates ensemble methods (LightGBM, XGBoost, Random Forest, CatBoost, GBDT, AdaBoost) and baseline models (Support Vector Machine, K-Nearest Neighbors, Multilayer Perceptrons, and Logistic Regression) with advanced resampling techniques (SMOTE with Tomek Links and ENN) to address imbalance in a dataset of diesel generator operations in Cameroon. Interpretability is achieved through SHAP (SHapley Additive exPlanations), while fairness is quantified using the Disparate Impact Ratio (DIR) across operational clusters. We further evaluate model generalization using Maximum Mean Discrepancy (MMD) to capture domain shifts between regions. Experimental results show that ensemble models consistently outperform baselines, with LightGBM achieving an F1-score of 0.99 and minimal bias across clusters (DIR $\approx 0.95$). SHAP analysis highlights fuel consumption rate and runtime per day as dominant predictors, providing actionable insights for operators. Our findings demonstrate that it is possible to balance performance, interpretability, and fairness in anomaly detection, paving the way for more equitable and explainable AI systems in industrial power management. Finally, beyond offline evaluation, we also discuss how the trained models can be deployed in practice for real-time monitoring. We show how containerized services can process in real-time, deliver low-latency predictions, and provide interpretable outputs for operators.

Keywords: anomaly detection, ensemble methods, class imbalance, SHAP, fairness, diesel generators, power plant monitoring, real-time deployment
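
For readers unfamiliar with the fairness metric, the sketch below computes a Disparate Impact Ratio across clusters. The abstract does not state how DIR is extended to more than two clusters, so the choice here (lowest positive-prediction rate divided by the highest) is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict


def positive_rates(predictions, clusters):
    """Fraction of positive (anomaly) predictions within each cluster."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [positives, total]
    for y_hat, cluster in zip(predictions, clusters):
        counts[cluster][0] += int(y_hat == 1)
        counts[cluster][1] += 1
    return {c: pos / total for c, (pos, total) in counts.items()}


def disparate_impact_ratio(predictions, clusters):
    """Lowest cluster rate divided by the highest (assumed multi-group form)."""
    rates = positive_rates(predictions, clusters)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no cluster receives positive predictions: treat as parity
    return min(rates.values()) / highest


preds = [1, 0, 1, 1, 1, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
dir_value = disparate_impact_ratio(preds, groups)
```

A value of 1.0 means every cluster is flagged at the same rate. Under this definition, the reported DIR of about 0.95 would indicate nearly equal rates across clusters.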

258. ❌ Unified Taxonomy for Multivariate Time Series Anomaly Detection using Deep Learning

Authors: Bruna Alves, Armando J. Pinho, Sónia Gouveia Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18941v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

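Scoring rationale: The paper is a taxonomy study of deep learning methods for multivariate time series anomaly detection (MTSAD) and proposes a unified taxonomy with 11 dimensions. Although it covers deep learning models (Transformer-based models in particular), every scoring keyword targets large language models (LLMs) and related techniques (such as MoE, RLHF, and RAG), reasoning methods (such as CoT), optimization techniques (such as quantization), or specific application areas (such as AI for Science). The paper does not discuss LLMs, the technical principles of large models, or AI applications in science at all; it concentrates on a deep learning taxonomy specific to time series analysis, so every keyword receives a relevance of 0.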

!!! tip deepseek-chat TL;DR

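To address the lack of systematic categorization in multivariate time series anomaly detection, the paper proposes a unified taxonomy with 11 dimensions across three parts (Input, Output, and Model) and reveals the field's convergence toward Transformer-based, reconstruction, and prediction models.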

Abstract

The topic of Multivariate Time Series Anomaly Detection (MTSAD) has grown rapidly over the past years, with a steady rise in publications and Deep Learning (DL) models becoming the dominant paradigm. To address the lack of systematization in the field, this study introduces a novel and unified taxonomy with eleven dimensions over three parts (Input, Output and Model) for the categorization of DL-based MTSAD methods. The dimensions were established in a two-fold approach. First, they were derived from a comprehensive analysis of methodological studies. Second, insights from review papers were incorporated. Furthermore, the proposed taxonomy was validated using an additional set of recent publications, providing a clear overview of methodological trends in MTSAD. Results reveal a convergence toward Transformer-based and reconstruction and prediction models, setting the foundation for emerging adaptive and generative trends. Building on and complementing existing surveys, this unified taxonomy is designed to accommodate future developments, allowing for new categories or dimensions to be added as the field progresses. This work thus consolidates fragmented knowledge in the field and provides a reference point for future research in MTSAD.

Keywords: Multivariate Time Series Anomaly Detection, Deep Learning, Unified Taxonomy, Transformer-based models, Reconstruction models, Prediction models, Methodological trends, Survey

259. ❌ Kernel Single-Index Bandits: Estimation, Inference, and Learning

Authors: Sakshi Arya, Satarupa Bhattacharjee, Bharath K. Sriperumbudur Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18938v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

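Scoring rationale: The paper studies the single-index model in contextual bandits, which belongs to reinforcement learning and statistical learning, and focuses on estimation, inference, and regret analysis under adaptive sampling. None of the keywords, which concern large models, the technical principles of deep learning, or AI applications in science, relate to it, so every keyword receives a relevance of 0.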

!!! tip deepseek-chat TL;DR

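The paper studies estimation and inference for single-index contextual bandits under adaptive sampling, proposes an ε-greedy algorithm that incorporates kernel methods, and proves asymptotic normality of the estimators and finite-time regret bounds for the algorithm.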

Abstract

We study contextual bandits with finitely many actions in which the reward of each arm follows a single-index model with an arm-specific index parameter and an unknown nonparametric link function. We consider a regime in which arms correspond to stable decision options and covariates evolve adaptively under the bandit policy. This setting creates significant statistical challenges: the sampling distribution depends on the allocation rule, observations are dependent over time, and inverse-propensity weighting induces variance inflation. We propose a kernelized $\varepsilon$-greedy algorithm that combines Stein-based estimation of the index parameters with inverse-propensity-weighted kernel ridge regression for the reward functions. This approach enables flexible semiparametric learning while retaining interpretability. Our analysis develops new tools for inference with adaptively collected data. We establish asymptotic normality for the single-index estimator under adaptive sampling, yielding valid confidence regions, and derive a directional functional central limit theorem for the RKHS estimator, which provides asymptotically valid pointwise confidence intervals. The analysis relies on concentration bounds for inverse-weighted Gram matrices together with martingale central limit theorems. We further obtain finite-time regret guarantees, including $\tilde{O}(\sqrt{T})$ rates under common-link Lipschitz conditions, showing that semiparametric structure can be exploited without sacrificing statistical efficiency. These results provide a unified framework for simultaneous learning and inference in single-index contextual bandits.

Keywords: contextual bandits, single-index model, adaptive sampling, kernel ridge regression, asymptotic normality, regret analysis, semiparametric learning
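
Two ingredients named in the abstract, ε-greedy action selection and inverse-propensity-weighted kernel ridge regression, can be illustrated with the simplified sketch below. It makes several assumptions: the index is taken to be a known scalar (the Stein-based index estimation is omitted), the kernel is a Gaussian RBF, and the weighted estimator solves $(WK + \lambda I)\alpha = Wy$ with $W$ the diagonal matrix of inverse propensities. All function names are hypothetical.

```python
import math
import random


def rbf(x, z, bandwidth=1.0):
    """Gaussian RBF kernel on scalar inputs (assumed kernel choice)."""
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ipw_krr(xs, ys, propensities, lam=0.1):
    """Weighted kernel ridge fit; each sample is weighted by 1 / propensity."""
    n = len(xs)
    w = [1.0 / p for p in propensities]
    a = [[w[i] * rbf(xs[i], xs[j]) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(a, [w[i] * ys[i] for i in range(n)])
    return lambda x: sum(al * rbf(x, xi) for al, xi in zip(alpha, xs))


def epsilon_greedy(estimates, epsilon, rng):
    """Return (action, propensity with which that action was chosen)."""
    k = len(estimates)
    greedy = max(range(k), key=lambda a: estimates[a])
    action = rng.randrange(k) if rng.random() < epsilon else greedy
    propensity = epsilon / k + (1.0 - epsilon if action == greedy else 0.0)
    return action, propensity
```

The propensity returned by `epsilon_greedy` is what a bandit loop would store alongside each reward so that `fit_ipw_krr` can reweight the adaptively collected samples.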

260. ❌ An Optimised Greedy-Weighted Ensemble Framework for Financial Loan Default Prediction

Authors: Ezekiel Nii Noye Nortey, Jones Asante-Koranteng, Marcellin Atemkeng, Theophilus Ansah-Narh, David Mensah, Rebecca Davis, Ravenhill Adjetey Laryea Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18927v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

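Scoring rationale: The paper focuses on traditional machine learning for financial loan default prediction and proposes an optimised greedy-weighted ensemble framework involving Particle Swarm Optimisation, a neural-network meta-learner, and feature analysis. It does not mention large models, the technical principles of deep learning, or AI applications in science; none of the keywords relate to it, so every keyword receives a relevance of 0.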

!!! tip deepseek-chat TL;DR

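The study proposes an optimised greedy-weighted ensemble framework to improve the accuracy and interpretability of financial loan default prediction, reaching an AUC of 0.80 and a default recall of 0.81 on the Lending Club dataset.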

Abstract

Accurate prediction of loan defaults is a central challenge in credit risk management, particularly in modern financial datasets characterised by nonlinear relationships, class imbalance, and evolving borrower behaviour. Traditional statistical models and static ensemble methods often struggle to maintain reliable performance under such conditions. This study proposes an Optimised Greedy-Weighted Ensemble framework for loan default prediction that dynamically allocates model weights based on empirical predictive performance. The framework integrates multiple machine learning classifiers, with their hyperparameters first optimised using Particle Swarm Optimisation. Model predictions are then combined via a regularised greedy weighting mechanism. At the same time, a neural-network-based meta-learner is employed within stacked-ensemble to capture higher-order relationships among model outputs. Experiments conducted on the Lending Club dataset demonstrate that the proposed framework improves predictive performance compared with individual classifiers. The BlendNet ensemble achieved the strongest results with an AUC of 0.80, a macro-average F1-score of 0.73, and a default recall of 0.81. Calibration analysis further shows that tree-based ensembles such as Extra Trees and Gradient Boosting provide the most reliable probability estimates, while the stacked ensemble offers superior ranking capability. Feature analysis using Recursive Feature Elimination identifies revolving utilisation, annual income, and debt-to-income ratio as the most influential predictors of loan default. These findings demonstrate that performance-driven ensemble weighting can improve both predictive accuracy and interpretability in credit risk modelling. The proposed framework provides a scalable data-driven approach to support institutional credit assessment, risk monitoring, and financial decision-making.

Keywords: loan default prediction, ensemble framework, greedy weighting, Particle Swarm Optimisation, stacked ensemble, credit risk modelling, feature analysis, Lending Club dataset
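
The abstract does not spell out the regularised greedy weighting mechanism, so the sketch below substitutes a well-known relative, greedy forward selection with replacement (Caruana-style ensemble selection), to show how weights can be assigned from validation performance. The Brier score as the selection metric, the absence of regularisation, and the function names are all assumptions.

```python
def brier(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def greedy_weights(model_probs, labels, rounds=10):
    """Repeatedly add the model that most lowers the blended validation loss.

    model_probs maps a model name to its validation probabilities.
    The returned weight of a model is the share of rounds in which it won.
    """
    names = list(model_probs)
    counts = {name: 0 for name in names}
    running = [0.0] * len(labels)  # sum of the selected models' probabilities
    for t in range(1, rounds + 1):
        def loss_if_added(name):
            blended = [(r + p) / t for r, p in zip(running, model_probs[name])]
            return brier(blended, labels)

        best = min(names, key=loss_if_added)
        counts[best] += 1
        running = [r + p for r, p in zip(running, model_probs[best])]
    return {name: counts[name] / rounds for name in names}


labels = [1, 0, 1, 0]
weights = greedy_weights(
    {"strong": [0.9, 0.1, 0.8, 0.2], "uninformative": [0.5, 0.5, 0.5, 0.5]},
    labels,
)
```

Because selection is with replacement, a model that keeps winning accumulates weight, and a model that never helps ends with weight zero.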

261. ❌ Neural Galerkin Normalizing Flow for Transition Probability Density Functions of Diffusion Models

Authors: Riccardo Saporiti, Fabio Nobile Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18907v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

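Scoring rationale: The paper proposes a Neural Galerkin Normalizing Flow framework for approximating the transition probability density function of a diffusion process, which belongs to scientific computing and the numerical solution of partial differential equations. Its core is solving the Fokker-Planck equation with normalizing flows and the Neural Galerkin method, which is entirely unrelated to most keywords (large language models, training techniques, inference optimization, alignment, agents, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the method is a scientific AI application (for Bayesian inference, simulation, and diffusion bridge generation with stochastic differential equations); since it is neither bioinformatics nor cheminformatics, it receives 5 points (some relevance).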

!!! tip deepseek-chat TL;DR

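The paper proposes a new framework that combines the Neural Galerkin method with normalizing flows to efficiently approximate the transition probability density function of a diffusion process, addressing the solution of high-dimensional Fokker-Planck equations and yielding a surrogate model that is cheap to evaluate online after offline training.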

Abstract

We propose a new Neural Galerkin Normalizing Flow framework to approximate the transition probability density function of a diffusion process by solving the corresponding Fokker-Planck equation with an atomic initial distribution, parametrically with respect to the location of the initial mass. By using Normalizing Flows, we look for the solution as a transformation of the transition probability density function of a reference stochastic process, ensuring that our approximation is structure-preserving and automatically satisfies positivity and mass conservation constraints. By extending Neural Galerkin schemes to the context of Normalizing Flows, we derive a system of ODEs for the time evolution of the Normalizing Flow’s parameters. Adaptive sampling routines are used to evaluate the Fokker-Planck residual in meaningful locations, which is of vital importance to address high-dimensional PDEs. Numerical results show that this strategy captures key features of the true solution and enforces the causal relationship between the initial datum and the density function at subsequent times. After completing an offline training phase, online evaluation becomes significantly more cost-effective than solving the PDE from scratch. The proposed method serves as a promising surrogate model, which could be deployed in many-query problems associated with stochastic differential equations, like Bayesian inference, simulation, and diffusion bridge generation.

Keywords: Neural Galerkin, Normalizing Flow, Fokker-Planck equation, diffusion process, transition probability density, adaptive sampling, surrogate model, stochastic differential equations
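
The claim that a normalizing-flow parametrization automatically satisfies positivity and mass conservation follows from the change-of-variables formula. The one-dimensional sketch below illustrates this with an affine flow applied to a reference transition density. The choice of standard Brownian motion as the reference process, the affine map, and the function names are assumptions made for illustration; the paper's flow and reference process may differ.

```python
import math


def reference_density(z, t):
    """Transition density at time t of Brownian motion started at 0 (assumed)."""
    return math.exp(-z * z / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def flow_density(x, t, shift, log_scale):
    """Density of x = shift + exp(log_scale) * z, by change of variables."""
    scale = math.exp(log_scale)  # always positive, so the map is invertible
    z = (x - shift) / scale
    return reference_density(z, t) / scale  # |dz/dx| = 1 / scale


def total_mass(t, shift, log_scale, lo=-30.0, hi=30.0, n=20000):
    """Midpoint-rule integral of the flow density over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(flow_density(lo + (i + 0.5) * h, t, shift, log_scale)
                   for i in range(n))


mass = total_mass(t=0.5, shift=1.0, log_scale=0.3)
```

Whatever values `shift` and `log_scale` take, the density stays positive and integrates to one, so a scheme that evolves the flow parameters in time never has to enforce these constraints separately.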

262. ❌ Uniform a priori bounds and error analysis for the Adam stochastic gradient descent optimization method

Authors: Steffen Dereich, Thang Do, Arnulf Jentzen Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18899v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

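Scoring rationale: The paper is a theoretical analysis of the Adam optimizer, studying its convergence and error bounds, and is foundational theory for deep learning optimization methods. All scoring keywords concern specific directions such as large-model techniques, training methods, inference optimization, and application areas, whereas this paper addresses none of them and discusses only general optimization theory, so it is unrelated to every keyword and scores 0 throughout.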

!!! tip deepseek-chat TL;DR

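The paper establishes, for the first time, uniform a priori bounds for the Adam optimizer on strongly convex stochastic optimization problems, providing an unconditional error analysis and removing the boundedness assumption that earlier work relied on.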

Abstract

The adaptive moment estimation (Adam) optimizer proposed by Kingma & Ba (2014) is presumably the most popular stochastic gradient descent (SGD) optimization method for the training of deep neural networks (DNNs) in artificial intelligence (AI) systems. Despite its groundbreaking success in the training of AI systems, it still remains an open research problem to provide a complete error analysis of Adam, not only for optimizing DNNs but even when applied to strongly convex stochastic optimization problems (SOPs). Previous error analysis results for strongly convex SOPs in the literature provide conditional convergence analyses that rely on the assumption that Adam does not diverge to infinity but remains uniformly bounded. It is the key contribution of this work to establish uniform a priori bounds for Adam and, thereby, to provide – for the first time – an unconditional error analysis for Adam for a large class of strongly convex SOPs.

Keywords: Adam optimizer, stochastic gradient descent, error analysis, uniform a priori bounds, strongly convex optimization, convergence analysis, deep neural networks
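
As a reminder of the object being analysed, the sketch below runs the standard Adam update of Kingma & Ba on a one-dimensional strongly convex problem with optionally noisy gradients. The quadratic objective, the constant learning rate, and the helper names are assumptions chosen for illustration; the paper's setting (for example its step-size conditions) may differ.

```python
import math
import random


def adam_minimize(grad, theta0, steps=5000, lr=0.01,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on a scalar parameter and return the final iterate."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias corrections
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def noisy_quadratic_grad(target, noise_std, rng):
    """Stochastic gradient of f(theta) = 0.5 * (theta - target)^2."""
    return lambda theta: (theta - target) + rng.gauss(0.0, noise_std)


rng = random.Random(0)
theta = adam_minimize(noisy_quadratic_grad(3.0, 0.0, rng), theta0=0.0)
```

The boundedness question studied in the paper concerns whether iterates like `theta` can be shown to stay uniformly bounded without assuming it in advance.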

263. ❌ Authority-Level Priors: An Under-Specified Constraint in Hierarchical Predictive Processing

Authors: Marcela Palejova Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18888v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

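Scoring rationale: The paper studies hierarchical predictive processing theory in cognitive neuroscience and proposes Authority-Level Priors (ALPs) as meta-structural constraints to explain the asymmetry between autonomic regulation and explicit belief updating; it belongs to theoretical neuroscience and computational cognitive modeling. All scoring keywords concern large models, deep learning techniques, and their applications, whereas this paper involves none of them: it discusses no AI model, training method, inference technique, or AI-for-science application. The relevance for every keyword is therefore 0.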

!!! tip deepseek-chat TL;DR

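The paper addresses the asymmetry between autonomic regulation and explicit belief updating in the hierarchical predictive processing framework. It proposes Authority-Level Priors (ALPs) as meta-structural constraints that specify which identity-level hypotheses regulate autonomic and behavioural control under uncertainty, and its computational formalisation yields testable predictions.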

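Abstract translation

Hierarchical predictive processing explains adaptive behaviour through precision-weighted inference. Explicit belief revision often fails to produce corresponding changes in stress reactivity or autonomic regulation. This asymmetry suggests that the framework under-specifies a governance-level constraint, namely which identity-level hypotheses regulate autonomic and behavioural control under uncertainty. We introduce Authority-Level Priors as meta-structural constraints that define a regulatory-admissible subset of identity-level hypotheses. Authority-Level Priors are neither additional representational states nor hyperpriors over precision; they constrain which hypotheses are available for regulatory control. Precision determines influence conditional on admissibility; Authority-Level Priors determine admissibility itself. This explains why explicit belief updating can change representational beliefs while autonomic threat responses stay stable. The computational formalisation restricts policy optimisation to policies generated by authorised hypotheses, which yields testable predictions about stress-reactivity dynamics, recovery time constants, compensatory control engagement, and behavioural persistence. At the neurobiological level, Authority-Level Priors are realised through distributed prefrontal arbitration and control networks. The proposal is compatible with the variational active inference framework and introduces no additional inferential operators; instead it formalises a boundary condition needed for a determinate identity-regulation mapping. The model generates falsifiable predictions: governance-level shifts should produce measurable changes in stress-reactivity curves, recovery dynamics, compensatory cognitive effort, and the durability of behavioural change. Authority-Level Priors are put forward as an architectural hypothesis to be evaluated through computational modelling and longitudinal stress-induction paradigms.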

Abstract

Hierarchical predictive processing explains adaptive behaviour through precision-weighted inference. Explicit belief revision often fails to produce corresponding changes in stress reactivity or autonomic regulation. This asymmetry suggests the framework leaves under-specified a governance-level constraint concerning which identity-level hypotheses regulate autonomic and behavioural control under uncertainty. We introduce Authority-Level Priors (ALPs) as meta-structural constraints defining a regulatory-admissible subset (Hauth, a subset of H) of identity-level hypotheses. ALPs are not additional representational states nor hyperpriors over precision; they constrain which hypotheses are admissible for regulatory control. Precision determines influence conditional on admissibility; ALPs determine admissibility itself. This explains why explicit belief updating modifies representational beliefs while autonomic threat responses remain stable. A computational formalisation restricts policy optimisation to policies generated by authorised hypotheses, yielding testable predictions concerning stress-reactivity dynamics, recovery time constants, compensatory control engagement, and behavioural persistence. Neurobiologically, ALPs manifest through distributed prefrontal arbitration and control networks. The proposal is compatible with variational active inference and introduces no additional inferential operators, instead formalising a boundary condition required for determinate identity-regulation mapping. The model generates falsifiable predictions: governance shifts should produce measurable changes in stress-reactivity curves, recovery dynamics, compensatory cognitive effort, and behavioural change durability. ALPs are advanced as an architectural hypothesis to be evaluated through computational modelling and longitudinal stress-induction paradigms.

Keywords: Hierarchical predictive processing, Authority-Level Priors, Autonomic regulation, Computational formalisation, Stress reactivity, Variational active inference, Identity-regulation mapping, Prefrontal arbitration
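
The distinction the abstract draws between admissibility and precision can be illustrated with a toy calculation. This is only one possible reading, not the paper's formalisation: the function `regulatory_weights`, the precision values, and the boolean mask standing in for the admissible subset are all hypothetical.

```python
import numpy as np

def regulatory_weights(precisions, admissible):
    """Precision-weighted influence, restricted to an admissible subset.

    The mask decides which hypotheses may contribute at all; precision
    only distributes influence among those that pass the mask.
    """
    precisions = np.asarray(precisions, dtype=float)
    mask = np.asarray(admissible, dtype=bool)
    if not mask.any():
        raise ValueError("at least one hypothesis must be admissible")
    masked = np.where(mask, precisions, 0.0)
    return masked / masked.sum()

# Hypothetical example: the third hypothesis has the highest precision
# but is not admissible, so it receives zero regulatory weight.
weights = regulatory_weights([4.0, 1.0, 5.0], [True, True, False])
print(weights)
```

In this toy setting, raising the precision of an inadmissible hypothesis never changes its weight, which mirrors the claim that belief updating alone need not alter regulation.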

264. ❌ RadioDiff-FS: Physics-Informed Manifold Alignment in Few-Shot Diffusion Models for High-Fidelity Radio Map Construction

Authors: Xiucheng Wang, Zixuan Guo, Nan Cheng Journal/Source: arXiv Publication date: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18865v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

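Scoring rationale: RadioDiff-FS applies diffusion models to radio map construction, which falls under AI for Science (highly relevant, 10 points). It involves domain adaptation of a pre-trained model: a pre-trained main-path generator is adapted to a multipath-rich target domain with a small number of high-fidelity samples (relevant, 8 points). The paper does not touch large language models, reasoning techniques, alignment, efficient fine-tuning, agent systems, or the other keywords, so those score 0.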

!!! tip deepseek-chat TL;DR

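The paper tackles data scarcity and poor cross-domain generalization in high-fidelity radio map construction. It proposes RadioDiff-FS, a few-shot diffusion framework based on physics-informed manifold alignment, which under limited supervision reduces NMSE by 59.5% on static radio maps and by 74.0% on dynamic radio maps relative to a vanilla diffusion baseline.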

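Abstract translation

Radio maps (RMs) provide a spatially continuous characterization of propagation and are essential for 6G network planning, but building high-fidelity RMs remains challenging. Rigorous electromagnetic solvers have prohibitive computational latency, while data-driven models need large labeled datasets and generalize poorly from simplified simulations to complex multipath environments. This paper proposes RadioDiff-FS, a few-shot diffusion framework that adapts a pre-trained main-path generator to multipath-rich target domains using only a small number of high-fidelity samples. The adaptation rests on a theoretical decomposition of the multipath RM into a dominant main-path component and a directionally sparse residual. The decomposition shows that the cross-domain shift corresponds to a bounded, geometrically structured feature translation rather than an arbitrary distribution change. A Direction-Consistency Loss (DCL) is then introduced to constrain the diffusion model's score updates along physically plausible propagation directions, suppressing the phase-inconsistent artifacts that appear in the low-data regime. Experiments show that, relative to the vanilla diffusion baseline, RadioDiff-FS reduces the normalized mean squared error (NMSE) by 59.5% on static RMs and by 74.0% on dynamic RMs, and achieves a structural similarity index (SSIM) of 0.9752 and a peak signal-to-noise ratio (PSNR) of 36.37 dB under severely limited supervision.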

Abstract

Radio maps (RMs) provide spatially continuous propagation characterizations essential for 6G network planning, but high-fidelity RM construction remains challenging. Rigorous electromagnetic solvers incur prohibitive computational latency, while data-driven models demand massive labeled datasets and generalize poorly from simplified simulations to complex multipath environments. This paper proposes RadioDiff-FS, a few-shot diffusion framework that adapts a pre-trained main-path generator to multipath-rich target domains with only a small number of high-fidelity samples. The adaptation is grounded in a theoretical decomposition of the multipath RM into a dominant main-path component and a directionally sparse residual. This decomposition shows that the cross-domain shift corresponds to a bounded and geometrically structured feature translation rather than an arbitrary distribution change. A Direction-Consistency Loss (DCL) is then introduced to constrain diffusion score updates along physically plausible propagation directions, suppressing phase-inconsistent artifacts that arise in the low-data regime. Experiments show that RadioDiff-FS reduces NMSE by 59.5% on static RMs and by 74.0% on dynamic RMs relative to the vanilla diffusion baseline, achieving an SSIM of 0.9752 and a PSNR of 36.37 dB under severely limited supervision.

Keywords: Radio map construction, Diffusion models, Few-shot learning, Domain adaptation, Physics-informed learning, Multipath propagation, Direction-Consistency Loss, High-fidelity simulation
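
The reported metrics can be made concrete with a short sketch. The definitions below are the common textbook forms of NMSE and PSNR; the abstract does not state the paper's exact normalization or data range, so treat `nmse`, `psnr`, `relative_reduction`, and the toy arrays as assumptions.

```python
import numpy as np

def nmse(pred, ref):
    """Normalized mean squared error: error energy over reference energy."""
    return float(np.sum((pred - ref) ** 2) / np.sum(ref ** 2))

def psnr(pred, ref, data_range=1.0):
    """Peak signal-to-noise ratio in dB for maps scaled to `data_range`."""
    mse = float(np.mean((pred - ref) ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

def relative_reduction(baseline, improved):
    """Fractional reduction of a metric relative to a baseline value."""
    return (baseline - improved) / baseline

# Toy radio map: a constant reference and a prediction offset by 0.1.
ref_map = np.ones((4, 4))
pred_map = ref_map + 0.1
toy_nmse = nmse(pred_map, ref_map)
toy_psnr = psnr(pred_map, ref_map)
# Made-up metric values, only to show how a percentage reduction is computed.
toy_reduction = relative_reduction(0.5, 0.25)
```

A statement such as "reduces NMSE by 59.5%" corresponds to `relative_reduction` returning 0.595 for the baseline and improved NMSE values.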

265. ❌ Data-driven construction of machine-learning-based interatomic potentials for gas-surface scattering dynamics: the case of NO on graphite

Authors: Samuel Del Fré, Gilberto A. Alou Angulo, Maurice Monnerville, Alejandro Rivero Santamaría Journal/Source: arXiv Publication date: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18864v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

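Scoring rationale: The paper develops machine-learning interatomic potentials (MLIPs) for gas-surface scattering dynamics and belongs to scientific computing and molecular dynamics simulation. Its content is unrelated to the vast majority of keywords (large models, training techniques, inference optimization, alignment, agents, and so on), which mainly target natural language processing and large language model technology. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the study applies AI to science (specifically computational chemistry and molecular dynamics) and shares the scientific-computing character of bioinformatics and cheminformatics, so it receives 10 points (highly relevant, core content).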

!!! tip deepseek-chat TL;DR

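The study develops a data-driven workflow for building a machine-learning interatomic potential to simulate the scattering of nitric oxide from a graphite surface. An active-learning strategy refines the potential, and the resulting simulations run at far lower cost than AIMD while reproducing the main experimental trends.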

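Abstract translation

Accurate atomistic simulation of gas-surface scattering requires potential energy surfaces that stay reliable over a broad range of configurations and energies while retaining the efficiency needed for large-scale trajectory sampling. This work develops a data-driven workflow for constructing a machine-learning interatomic potential (MLIP) tailored to gas-surface scattering dynamics, using the scattering of nitric oxide (NO) from highly oriented pyrolytic graphite (HOPG) as the benchmark system. Starting from an initial ab initio molecular dynamics (AIMD) dataset, we describe local atomic environments with SOAP descriptors and analyze them in a reduced feature space obtained by principal component analysis. Farthest point sampling is then used to build a compact training set, and the resulting Deep Potential model is refined with a query-by-committee active-learning strategy that draws on additional configurations extracted from molecular dynamics simulations over wider ranges of incident energy and surface temperature. The final MLIP reproduces the reference energies and forces with high fidelity and enables large-scale molecular dynamics simulations of NO scattering from graphite at a computational cost far below that of AIMD. These simulations give detailed insight into adsorption energetics, trapping versus direct scattering probabilities, translational energy loss, angular distributions, and rotational excitation. Overall, the simulation results reproduce the main experimental trends and show that descriptor-guided sampling combined with active learning is an efficient and transferable strategy for constructing MLIPs for gas-surface interactions.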

Abstract

Accurate atomistic simulations of gas-surface scattering require potential energy surfaces that remain reliable over broad configurational and energetic ranges while retaining the efficiency needed for extensive trajectory sampling. Here, we develop a data-driven workflow for constructing a machine-learning interatomic potential (MLIP) tailored to gas-surface scattering dynamics, using nitric oxide (NO) scattering from highly oriented pyrolytic graphite (HOPG) as a benchmark system. Starting from an initial ab initio molecular dynamics (AIMD) dataset, local atomic environments are described by SOAP descriptors and analyzed in a reduced feature space obtained through principal component analysis. Farthest point sampling is then used to build a compact training set, and the resulting Deep Potential model is refined through a query-by-committee active-learning strategy using additional configurations extracted from molecular dynamics simulations over extended ranges of incident energies and surface temperatures. The final MLIP reproduces reference energies and forces with high fidelity and enables large-scale molecular dynamics simulations of NO scattering from graphite at a computational cost far below that of AIMD. The simulations provide detailed insight into adsorption energetics, trapping versus direct scattering probabilities, translational energy loss, angular distributions, and rotational excitation. Overall, the results reproduce the main experimental trends and demonstrate that descriptor-guided sampling combined with active learning offers an efficient and transferable strategy for constructing MLIPs for gas-surface interactions.

Keywords: machine-learning interatomic potential, gas-surface scattering, active learning, molecular dynamics, NO on graphite, data-driven workflow, SOAP descriptors, Deep Potential model
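
Farthest point sampling, which the abstract uses to pick a compact training set, is a simple greedy procedure. The sketch below is a generic version operating on plain feature vectors with Euclidean distance; the paper applies it in a PCA-reduced SOAP descriptor space, and the function name, the starting index, and the toy points here are assumptions.

```python
import numpy as np

def farthest_point_sampling(features, k, start=0):
    """Greedily pick k rows, each one farthest from everything picked so far."""
    features = np.asarray(features, dtype=float)
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(features - features[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Toy 2D "descriptor" space: two nearly duplicate points and three corners.
points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
chosen = farthest_point_sampling(points, 3)
print(chosen)
```

The near-duplicate point at index 1 is skipped in favour of distant corners, which is why this sampling tends to give compact but diverse training sets.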

266. ❌ BeamAgent: LLM-Aided MIMO Beamforming with Decoupled Intent Parsing and Alternating Optimization for Joint Site Selection and Precoding

Authors: Xiucheng Wang, Yue Zhang, Nan Cheng Journal/Source: arXiv Publication date: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18855v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

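Scoring rationale: The core of the paper is integrating an LLM as a semantic translator into wireless communication optimization, which makes it applied LLM research in a specific domain (wireless communications). It is highly relevant to 'Large Language Models' (10 points) because it explicitly uses an LLM to process natural language descriptions. It is highly relevant to 'LLM Agents' (10 points) because the proposed BeamAgent framework is essentially an LLM-aided agent system for parsing intent and optimizing. It has some relevance to 'AI for Science' (5 points): wireless communication optimization can be viewed as an engineering-science application, but the paper does not explicitly mention bioinformatics or cheminformatics. The other keywords (such as MoE, SFT, and RAG) are not covered in the paper and score 0.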

!!! tip deepseek-chat TL;DR

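The paper proposes BeamAgent, a framework that decouples the LLM, used as a semantic translator, from a gradient-based optimizer in order to solve joint base station site selection and precoding design in MIMO beamforming. In experiments it outperforms exhaustive zero-forcing by 7.1 dB and comes within 3.3 dB of the expert upper bound.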

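Abstract translation

Integrating large language models (LLMs) into wireless communication optimization is a promising but challenging direction. Existing approaches use the LLM either as a black-box solver or as a code generator, coupling it tightly with numerical computation. However, LLMs lack the precision that physical-layer optimization requires, and the scarcity of wireless training data makes domain-specific fine-tuning impractical. We propose BeamAgent, an LLM-aided MIMO beamforming framework that explicitly decouples semantic intent parsing from numerical optimization. The LLM acts only as a semantic translator that converts natural language descriptions into structured spatial constraints. A dedicated gradient-based optimizer then jointly solves the discrete base station site selection problem and the continuous precoding design through an alternating optimization algorithm. A scene-aware prompting mechanism enables reasoning grounded in the actual spatial environment without fine-tuning, and a multi-round interaction mechanism with dual-layer intent classification ensures robust verification of the constraints. A penalty-based loss function enforces the dark-zone power constraints while freeing optimization degrees of freedom to maximize bright-zone gain. In experiments on a ray-tracing-based urban MIMO scenario, BeamAgent achieves a bright-zone power of 84.0 dB, outperforming exhaustive zero-forcing by 7.1 dB under the same dark-zone constraint. The end-to-end system comes within 3.3 dB of the expert upper bound, and the full optimization completes in under 2 seconds on a laptop.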

Abstract

Integrating large language models (LLMs) into wireless communication optimization is a promising yet challenging direction. Existing approaches either use LLMs as black-box solvers or code generators, tightly coupling them with numerical computation. However, LLMs lack the precision required for physical-layer optimization, and the scarcity of wireless training data makes domain-specific fine-tuning impractical. We propose BeamAgent, an LLM-aided MIMO beamforming framework that explicitly decouples semantic intent parsing from numerical optimization. The LLM serves solely as a semantic translator that converts natural language descriptions into structured spatial constraints. A dedicated gradient-based optimizer then jointly solves the discrete base station site selection and continuous precoding design through an alternating optimization algorithm. A scene-aware prompt enables grounded spatial reasoning without fine-tuning, and a multi-round interaction mechanism with dual-layer intent classification ensures robust constraint verification. A penalty-based loss function enforces dark-zone power constraints while releasing optimization degrees of freedom for bright-zone gain maximization. Experiments on a ray-tracing-based urban MIMO scenario show that BeamAgent achieves a bright-zone power of 84.0 dB, outperforming exhaustive zero-forcing by 7.1 dB under the same dark-zone constraint. The end-to-end system reaches within 3.3 dB of the expert upper bound, with the full optimization completing in under 2 s on a laptop.

Keywords: LLM-aided optimization, MIMO beamforming, semantic intent parsing, alternating optimization, site selection, precoding design, wireless communication, gradient-based optimizer
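
A penalty-based loss of the kind the abstract mentions can be sketched as follows. The abstract does not give the paper's actual loss, so the hinge-style penalty, the function `penalty_loss`, the channel matrices, and the penalty weight are all assumptions made for illustration.

```python
import numpy as np

def penalty_loss(w, H_bright, H_dark, dark_limit, penalty=100.0):
    """Negative mean bright-zone power plus a penalty on dark-zone violations.

    w        : precoding vector (one entry per antenna)
    H_bright : channel rows for bright-zone sample points
    H_dark   : channel rows for dark-zone sample points
    """
    bright_power = np.mean(np.abs(H_bright @ w) ** 2)
    dark_power = np.abs(H_dark @ w) ** 2
    # Only power above the limit is penalized; compliant points cost nothing.
    violation = np.maximum(dark_power - dark_limit, 0.0)
    return float(-bright_power + penalty * np.sum(violation))

# Toy two-antenna setup with one bright-zone and one dark-zone point.
H_bright = np.array([[1.0, 0.0]])
H_dark = np.array([[0.0, 1.0]])
loss_ok = penalty_loss(np.array([1.0, 0.0]), H_bright, H_dark, dark_limit=0.5)
loss_bad = penalty_loss(np.array([1.0, 1.0]), H_bright, H_dark, dark_limit=0.5)
```

Because the penalty is zero whenever the dark-zone limit is met, a gradient-based optimizer is free to spend its remaining degrees of freedom on bright-zone power, which is the trade-off the abstract describes.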

267. ❌ Learn for Variation: Variationally Guided AAV Trajectory Learning in Differentiable Environments

Authors: Xiucheng Wang, Zhenye Chen, Nan Cheng Journal/Source: arXiv Publication date: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18853v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

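Scoring rationale: The paper studies a reinforcement learning framework for trajectory planning of autonomous aerial vehicles (AAVs) that uses gradient information in place of sparse reward signals; it sits at the intersection of robot control, reinforcement learning, and communication networks. All keywords relate to large-model techniques such as large language models, model training, inference optimization, alignment, and agent systems, whereas this paper involves neither large models nor applications of deep learning to science, and it uses no large-model techniques. According to the stated research background, a paper must be innovative in large-model techniques or scientific applications to earn points, and this paper does not meet those conditions.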

!!! tip deepseek-chat TL;DR

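The paper proposes L4V, a gradient-guided trajectory learning framework for planning the trajectories of autonomous aerial vehicles that collect data in 6G IoT networks. It trains a deterministic neural policy with backpropagation through time over an end-to-end differentiable computational graph and outperforms several baselines in mission completion time, average transmission rate, and training cost.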

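Abstract translation

Autonomous aerial vehicles (AAVs) empower sixth-generation (6G) Internet-of-Things (IoT) networks through mobility-driven data collection. However, conventional reward-driven reinforcement learning for AAV trajectory planning suffers from severe credit assignment problems and training instability, because sparse scalar rewards cannot capture the long-term, nonlinear effects of sequential actions. To address these challenges, this paper proposes Learn for Variation (L4V), a gradient-informed trajectory learning framework that replaces high-variance scalar reward signals with dense, analytically grounded policy gradients. Specifically, the coupled evolution of AAV kinematics, distance-dependent channel gains, and per-user data-collection progress is first unrolled into an end-to-end differentiable computational graph. Backpropagation through time then serves as a discrete adjoint solver, propagating the exact sensitivities of the cumulative mission objective back to every control action and policy parameter. These structured gradients are used to train a deterministic neural policy with temporal smoothness regularization and gradient clipping. Extensive simulations show that L4V consistently outperforms representative baselines, including a genetic algorithm, DQN, A2C, and DDPG, in mission completion time, average transmission rate, and training cost.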

Abstract

Autonomous aerial vehicles (AAVs) empower sixth-generation (6G) Internet-of-Things (IoT) networks through mobility-driven data collection. However, conventional reward-driven reinforcement learning for AAV trajectory planning suffers from severe credit assignment issues and training instability, because sparse scalar rewards fail to capture the long-term and nonlinear effects of sequential movements. To address these challenges, this paper proposes Learn for Variation (L4V), a gradient-informed trajectory learning framework that replaces high-variance scalar reward signals with dense and analytically grounded policy gradients. Particularly, the coupled evolution of AAV kinematics, distance-dependent channel gains, and per-user data-collection progress is first unrolled into an end-to-end differentiable computational graph. Backpropagation through time then serves as a discrete adjoint solver, which propagates exact sensitivities from the cumulative mission objective to every control action and policy parameter. These structured gradients are used to train a deterministic neural policy with temporal smoothness regularization and gradient clipping. Extensive simulations demonstrate that L4V consistently outperforms representative baselines, including a genetic algorithm, DQN, A2C, and DDPG, in mission completion time, average transmission rate, and training cost.

Keywords: Autonomous aerial vehicles, Trajectory learning, Reinforcement learning, Differentiable environments, Policy gradients, 6G IoT networks, Data collection, Gradient-informed framework
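
The idea of backpropagation through time acting as a discrete adjoint solver can be shown on a much simpler system than the paper's. The sketch below assumes single-integrator dynamics and a quadratic tracking cost, both chosen only for illustration; it has no channel model or neural policy, and every name in it is hypothetical.

```python
import numpy as np

def rollout_cost(actions, x0, target, dt=0.1):
    """Unroll x_{t+1} = x_t + dt * a_t and sum squared distance to the target."""
    x = np.asarray(x0, dtype=float)
    cost = 0.0
    for a in np.asarray(actions, dtype=float):
        x = x + dt * a
        cost += float(np.sum((x - target) ** 2))
    return cost

def rollout_grad(actions, x0, target, dt=0.1):
    """Exact gradient of rollout_cost w.r.t. every action via a backward pass."""
    actions = np.asarray(actions, dtype=float)
    x = np.asarray(x0, dtype=float)
    states = []
    for a in actions:                      # forward pass: store visited states
        x = x + dt * a
        states.append(x)
    grads = np.zeros_like(actions)
    adjoint = np.zeros_like(x)             # sensitivity of the cost to the state
    for t in reversed(range(len(actions))):
        adjoint = adjoint + 2.0 * (states[t] - target)
        grads[t] = dt * adjoint            # chain rule through x = x_prev + dt * a
    return grads

actions = np.array([[1.0, 0.0], [0.5, -0.5], [0.0, 1.0]])
x0 = np.zeros(2)
target = np.array([1.0, 1.0])
grads = rollout_grad(actions, x0, target)
```

Each action receives its own exact sensitivity to the cumulative cost, in contrast to a single scalar reward shared across the whole trajectory.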

268. ❌ A Model Ensemble-Based Post-Processing Framework for Fairness-Aware Prediction

Authors: Zhouting Zhao, Tin Lok James Ng Journal/Source: arXiv Publication date: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18838v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

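Scoring rationale: The paper studies a post-processing framework for fairness-aware prediction in machine learning that uses model ensembling. Its content focuses entirely on fairness in traditional machine learning and does not involve large models, deep learning principles, AI-for-science applications, or any of the specific techniques in the scoring keywords. The relevance for every keyword is therefore 0.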

!!! tip deepseek-chat TL;DR

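The paper proposes a model-ensembling post-processing framework that improves the fairness of machine learning models while preserving predictive performance, and validates it on classification, regression, and survival analysis tasks.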

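Abstract translation

Striking the best balance between predictive performance and fairness remains a fundamental challenge in machine learning. This study proposes a post-processing framework that promotes fairness-aware prediction by leveraging model ensembling. The method is designed to operate independently of any specific model's internals, so it applies broadly across learning tasks, model architectures, and fairness definitions. Through extensive experiments on classification, regression, and survival analysis, we show that the framework effectively improves fairness while maintaining predictive accuracy or affecting it only minimally.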

Abstract

Striking an optimal balance between predictive performance and fairness continues to be a fundamental challenge in machine learning. In this work, we propose a post-processing framework that facilitates fairness-aware prediction by leveraging model ensembling. Designed to operate independently of any specific model internals, our approach is widely applicable across various learning tasks, model architectures, and fairness definitions. Through extensive experiments spanning classification, regression, and survival analysis, we demonstrate that the framework effectively enhances fairness while maintaining, or only minimally affecting, predictive accuracy.

Keywords: fairness-aware prediction, post-processing framework, model ensembling, predictive performance, machine learning, classification, regression, survival analysis
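
The abstract does not describe how the ensemble is formed, so the following is only one simple way ensembling can work as a model-agnostic post-processing step: blend the output scores of two already-trained models and measure a fairness gap on the result. The convex blend, the demographic parity metric, and the toy scores are all assumptions, not the paper's method.

```python
import numpy as np

def demographic_parity_gap(scores, group, threshold=0.5):
    """Absolute difference in positive-prediction rates between two groups."""
    pred = scores >= threshold
    return float(abs(pred[group == 0].mean() - pred[group == 1].mean()))

def blend(scores_a, scores_b, alpha):
    """Convex combination of two models' scores; needs outputs only."""
    return alpha * scores_a + (1.0 - alpha) * scores_b

# Hypothetical scores from two models on four individuals in two groups.
group = np.array([0, 0, 1, 1])
scores_a = np.array([0.9, 0.7, 0.4, 0.1])   # favours group 0
scores_b = np.array([0.6, 0.2, 0.8, 0.3])   # balanced across groups
gap_a = demographic_parity_gap(scores_a, group)
gap_blend = demographic_parity_gap(blend(scores_a, scores_b, 0.5), group)
```

Because only prediction scores are used, the same step applies to any underlying model, which matches the abstract's point about independence from model internals.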

269. ❌ Model Order Reduction of Cerebrovascular Hemodynamics Using POD-Galerkin and Reservoir Computing-based Approach

Authors: Rahul Halder, Arash Hajisharifi, Kabir Bakhshaei, Gianluigi Rozza Journal/Source: arXiv Publication date: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18837v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

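Scoring rationale: The paper studies model order reduction for cerebrovascular hemodynamics, comparing the physics-based POD-Galerkin method with the data-driven POD-Reservoir Computing method. It belongs to computational fluid dynamics and scientific computing and is unrelated to all the large-model and deep learning keywords (such as LLM, MoE, RLHF, and RAG). The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to scientific computing (blood flow simulation); since it is not core bioinformatics or cheminformatics, it receives 5 points (some relevance).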

!!! tip deepseek-chat TL;DR

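The paper studies model order reduction for cerebrovascular hemodynamics, comparing POD-Galerkin with POD-Reservoir Computing. Both achieve computational speed-ups on the order of 10^2 to 10^3 over full-order simulation and can serve as efficient, accurate surrogate models for blood flow prediction.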

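Abstract translation

This study compares a physics-based intrusive approach with a data-driven non-intrusive framework as model order reduction strategies for simulating unsteady hemodynamics in the cerebrovascular system. First, proper orthogonal decomposition (POD) compresses high-fidelity three-dimensional computational fluid dynamics snapshots of an idealised basilar artery bifurcation into a low-dimensional latent space. We evaluate two models: a POD-Galerkin model, which projects the Navier-Stokes equations onto the reduced basis, and a POD-Reservoir Computing model, which learns the temporal evolution of the coefficients through a recurrent architecture. A multi-harmonic, multi-amplitude training signal is introduced to improve training efficiency. Both approaches achieve computational speed-ups on the order of 10^2 to 10^3 relative to full-order simulation, demonstrating their potential as efficient and accurate surrogate models for predicting flow quantities such as wall shear stress.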

Abstract

We investigate model order reduction (MOR) strategies for simulating unsteady hemodynamics within cerebrovascular systems, contrasting a physics-based intrusive approach with a data-driven non-intrusive framework. High-fidelity 3D Computational Fluid Dynamics (CFD) snapshots of an idealised basilar artery bifurcation are first compressed into a low-dimensional latent space using Proper Orthogonal Decomposition (POD). We evaluate the performance of a POD-Galerkin (POD-G) model, which projects the Navier-Stokes equations onto the reduced basis, against a POD-Reservoir Computing (POD-RC) model that learns the temporal evolution of coefficients through a recurrent architecture. A multi-harmonic and multi-amplitude training signal is introduced to improve training efficiency. Both methodologies achieve computational speed-ups on the order of 10^2 to 10^3 compared to full-order simulations, demonstrating their potential as efficient and accurate surrogates for predicting flow quantities such as wall shear stress.

Keywords: model order reduction, hemodynamics, cerebrovascular, POD-Galerkin, reservoir computing, computational fluid dynamics, wall shear stress, Navier-Stokes equations
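
The POD compression step shared by both models can be sketched with a singular value decomposition. The snapshot matrix below is synthetic rank-2 data standing in for CFD output, and `pod_basis` is a hypothetical helper; the sketch covers neither the Galerkin projection nor the reservoir computing stage.

```python
import numpy as np

def pod_basis(snapshots, r):
    """Return the temporal mean, the first r POD modes, and all singular values.

    snapshots: array of shape (n_dof, n_time), one column per time step.
    """
    mean = snapshots.mean(axis=1, keepdims=True)
    modes, sing_vals, _ = np.linalg.svd(snapshots - mean, full_matrices=False)
    return mean, modes[:, :r], sing_vals

# Synthetic snapshots built from two spatial patterns (rank 2 by construction).
rng = np.random.default_rng(0)
patterns = rng.standard_normal((200, 2))
amplitudes = rng.standard_normal((2, 50))
snapshots = patterns @ amplitudes

mean, basis, sing_vals = pod_basis(snapshots, r=2)
latent = basis.T @ (snapshots - mean)      # low-dimensional coefficients over time
reconstruction = mean + basis @ latent     # lift back to the full space
```

A reduced model then only has to evolve the few rows of `latent` in time, either by projecting the governing equations (POD-Galerkin) or by learning the dynamics from data (POD-RC), which is where the reported speed-ups come from.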

270. ❌ Seasoning Generative Models for a Generalization Aftertaste

Authors: Hisham Husain, Valentin De Bortoli, Richard Nock Journal/Source: arXiv Publication date: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18817v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

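Scoring rationale: The paper studies a theoretical improvement to the generalization of generative models: a discriminator-guided method, based on strong duality for f-divergences, refines any generative model and provably improves its generalization. Although the paper involves generative models (such as GANs and diffusion models), every keyword targets large language models (LLMs) and their specific techniques (such as MoE, RLHF, and RAG), applications (such as AI for Science), or optimization methods (such as quantization and inference acceleration). The paper's core is generalization theory for generative models, and it does not involve LLMs, LLM-specific techniques, AI-for-science applications, or LLM-related optimization. No keyword is therefore relevant.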

!!! tip deepseek-chat TL;DR

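The paper proposes a discriminator-guided method for refining generative models, proves via a strong-duality result for $f$-divergences that the refinement improves generalization, and shows that the recipe subsumes a recent score-based diffusion approach (Kim et al., 2022).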

Abstract

The use of discriminators to train or fine-tune generative models has proven to be a rather successful framework. A notable example is Generative Adversarial Networks (GANs) that minimize a loss incurred by training discriminators along with other paradigms that boost generative models via discriminators that satisfy weak learner constraints. More recently, even diffusion models have shown advantages with some kind of discriminator guidance. In this work, we extend a strong-duality result related to $f$-divergences which gives rise to a discriminator-guided recipe that allows us to *refine* any generative model. We then show that the refined generative models provably improve generalization, compared to its non-refined counterpart. In particular, our analysis reveals that the gap in generalization is improved based on the Rademacher complexity of the discriminator set used for refinement. Our recipe subsumes a recently introduced score-based diffusion approach (Kim et al., 2022) that has shown great empirical success, however allows us to shed light on the generalization guarantees of this method by virtue of our analysis. Thus, our work provides a theoretical validation for existing work, suggests avenues for new algorithms, and contributes to our understanding of generalization in generative models at large.

Keywords: generative models, discriminator-guided refinement, f-divergences, generalization, Rademacher complexity, diffusion models, theoretical analysis
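
The refinement idea in the abstract can be illustrated on a toy discrete example. The sketch below is not the paper's recipe. It assumes an ideal discriminator d(x) = p(x) / (p(x) + q(x)), for which d / (1 - d) equals the density ratio p / q, and reweights the model q by that ratio. The distributions and function names are hypothetical.

```python
def optimal_discriminator(p, q):
    """Ideal discriminator output d(x) = p(x) / (p(x) + q(x)) per outcome."""
    return [pi / (pi + qi) for pi, qi in zip(p, q)]


def refine(q, d):
    """Reweight model q by the density ratio d / (1 - d), then renormalize."""
    weights = [qi * di / (1.0 - di) for qi, di in zip(q, d)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical data distribution p and imperfect model q over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

d = optimal_discriminator(p, q)
refined = refine(q, d)  # matches p up to floating-point error
```

With an ideal discriminator the refined model recovers p. With a learned, imperfect discriminator the correction is only approximate; the abstract ties the resulting generalization gap to the Rademacher complexity of the discriminator set.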

271. ❌ Signals of Success and Struggle: Early Prediction and Physiological Signatures of Human Performance across Task Complexity

Authors: Yufei Cao, Penny Sweetser, Ziyu Chen, Xuanying Zhu Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18798v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

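Scoring rationale: The paper predicts human performance in a game task, using ocular and cardiac physiological signals for early prediction and mechanism analysis; it belongs to human-computer interaction, physiological computing, and cognitive science. Every scoring keyword concerns large models, deep learning techniques, or AI applications in science, none of which the paper addresses, so every keyword receives a relevance of 0.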

!!! tip deepseek-chat TL;DR

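The study predicts user performance in a complex game task from early ocular and cardiac signals. The fusion model reaches a balanced accuracy of 0.86, and high performers show more targeted gaze, more stable cardiac activation, and a more positive affective experience.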

Abstract

User performance is crucial in interactive systems, capturing how effectively users engage with task execution. Prospectively predicting performance enables the timely identification of users struggling with task demands. While ocular and cardiac signals are widely used to characterise performance-relevant visual behaviour and physiological activation, their potential for early prediction and for revealing the physiological mechanisms underlying performance differences remains underexplored. We conducted a within-subject experiment in a game environment with naturally unfolding complexity, using early ocular and cardiac signals to predict later performance and to examine physiological and self-reported group differences. Results show that the ocular-cardiac fusion model achieves a balanced accuracy of 0.86, and the ocular-only model shows comparable predictive power. High performers exhibited targeted gaze and adjusted visual sampling, and sustained more stable cardiac activation as demands intensified, with a more positive affective experience. These findings demonstrate the feasibility of cross-session prediction from early physiology, providing interpretable insights into performance variation and facilitating future proactive intervention.

Keywords: human performance prediction, ocular signals, cardiac signals, early prediction, physiological mechanisms, game environment, task complexity, affective experience
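
The abstract reports a balanced accuracy of 0.86. As a reminder of what that metric measures, here is a minimal sketch using the standard definition (the mean of per-class recall). The labels are hypothetical; the paper's actual class setup is not given in the abstract.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 1 = high performer, 0 = low performer.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]  # always predicts the majority class

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A classifier that always predicts the majority class looks good on plain accuracy but scores only 0.5 on balanced accuracy, which is why the metric suits imbalanced performance groups.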

272. ❌ SRRM: Improving Recursive Transport Surrogates in the Small-Discrepancy Regime

Authors: Yufei Zhang, Tao Wang, Jingyi Zhang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18781v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

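Scoring rationale: The paper studies surrogates for the Wasserstein distance in computational statistics (Recursive Rank Matching and its improved variant, SRRM), which falls under mathematics, statistics, and computational methods. It does not touch on large models, deep learning, AI techniques, or AI applications in science. All keywords focus on AI topics such as large-model techniques, AI applications, training methods, and inference optimization, and are unrelated to the paper's statistical methodology.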

!!! tip deepseek-chat TL;DR

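The paper studies the statistical behavior of recursive partitioning methods for approximating the Wasserstein distance, identifies the dominant mechanism behind the loss of resolution in the small-discrepancy regime, and proposes SRRM, an improved method with higher surrogate fidelity.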

Abstract

Recursive partitioning methods provide computationally efficient surrogates for the Wasserstein distance, yet their statistical behavior and their resolution in the small-discrepancy regime remain insufficiently understood. We study Recursive Rank Matching (RRM) as a representative instance of this class under a population-anchored reference. In this setting, we establish consistency and an explicit convergence rate for the anchored empirical RRM under the quadratic cost. We then identify a dominant mismatch mechanism responsible for the loss of resolution in the small-discrepancy regime. Based on this analysis, we introduce Selective Recursive Rank Matching (SRRM), which suppresses the resulting dominant mismatches and yields a higher-fidelity practical surrogate for the Wasserstein distance at moderate additional computational cost.

Keywords: Wasserstein distance, recursive partitioning, statistical behavior, small-discrepancy regime, convergence rate, surrogate method, computational efficiency, empirical analysis
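
For intuition about rank matching under the quadratic cost, the one-dimensional case is a useful reference: the optimal coupling between two equal-size samples pairs them by rank. The sketch below computes that 1D empirical 2-Wasserstein distance. It is not RRM or SRRM, whose recursive partitioning is not specified in the abstract; the function name is hypothetical.

```python
import math


def wasserstein2_1d(xs, ys):
    """Empirical 2-Wasserstein distance between equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In 1D, sorting both samples and pairing by rank is the optimal coupling.
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(xs))


print(wasserstein2_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0]))  # 1.0
```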

273. ❌ Enhancing the Parameterization of Reservoir Properties for Data Assimilation Using Deep VAE-GAN

Authors: Marcio Augusto Sampaio, Paulo Henrique Ranazzi, Martin Julian Blunt Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18766v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

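Scoring rationale: The paper applies deep learning to petroleum reservoir simulation, using a VAE-GAN model for parameterization to improve data assimilation. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), which specifically target large language models (LLMs) and related techniques. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies deep learning to petroleum engineering, which can be viewed as scientific computing or geoscience, but it is not core bioinformatics or cheminformatics, hence a relevance of 5 (moderately related).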

!!! tip deepseek-chat TL;DR

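The study addresses data assimilation for non-Gaussian parameters in petroleum reservoir simulation. It combines the strengths of VAEs and GANs in a VAE-GAN model that, in two case studies, achieves both high-quality reservoir descriptions and a good history match on the production curves.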

Abstract

Currently, the methods called Iterative Ensemble Smoothers, especially the method called Ensemble Smoother with Multiple Data Assimilation (ESMDA) can be considered state-of-the-art for history matching in petroleum reservoir simulation. However, this approach has two important limitations: the use of an ensemble with finite size to represent the distributions and the Gaussian assumption in parameter and data uncertainties. This latter is particularly important because many reservoir properties have non-Gaussian distributions. Parameterization involves mapping non-Gaussian parameters to a Gaussian field before the update and then mapping them back to the original domain to forward the ensemble through the reservoir simulator. A promising approach to perform parameterization is through deep learning models. Recent studies have shown that Generative Adversarial Networks (GAN) performed poorly concerning data assimilation, but generated more geologically plausible realizations of the reservoir, while the Variational Autoencoder (VAE) performed better than the GAN in data assimilation, but generated less geologically realistic models. This work is innovative in combining the strengths of both to implement a deep learning model called Variational Autoencoder Generative Adversarial Network (VAE-GAN) integrated with ESMDA. The methodology was applied in two case studies, one case being categorical and the other with continuous values of permeability. Our findings demonstrate that by applying the VAE-GAN model we can obtain high quality reservoir descriptions (just like GANs) and a good history matching on the production curves (just like VAEs) simultaneously.

Keywords: Reservoir simulation, Data assimilation, VAE-GAN, Parameterization, Ensemble Smoother, Non-Gaussian distributions, Deep learning, History matching

274. ❌ Holter-to-Sleep: AI-Enabled Repurposing of Single-Lead ECG for Sleep Phenotyping

Authors: Donglin Xie, Qingshuo Zhao, Jingyu Wang, Shijia Geng, Jiarui Jin, Jun Li, Rongrong Guo, Guangkun Nie, Gongzheng Tang, Yuxi Zhou, Thomas Penzel, Shenda Hong Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18714v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

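Scoring rationale: The paper develops an AI method for sleep phenotyping from single-lead ECG, a biomedical AI application. It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on), which mainly target large language models and deep learning techniques themselves. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the study applies AI to biomedicine (specifically sleep and cardiovascular health), which falls under AI for Science and relates to bioinformatics, hence a relevance of 10 (highly related, core content).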
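
The table for this paper lists a relevance of 10.0/10 next to a score of 0.0, which follows from the weight column being 0.0. The sketch below reproduces that arithmetic together with the pass threshold stated in the report configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The per-keyword formula (weight times relevance, capped at the per-keyword maximum of 15) is an assumption; the report does not state it.

```python
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def pass_threshold(total_weight):
    """Pass score as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight, relevance):
    """ASSUMED per-keyword formula: weight * relevance, capped at the maximum."""
    return min(MAX_PER_KEYWORD, weight * relevance)


print(round(pass_threshold(27.0), 1))  # 26.6
print(keyword_score(0.0, 10.0))        # 0.0
```

Under any formula that multiplies by the weight, a weight of 0.0 forces the keyword score to 0.0 regardless of relevance.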

!!! tip deepseek-chat TL;DR

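The study proposes Holter-to-Sleep, an AI framework that takes single-lead ECG as its sole input and supports both overnight sleep phenotyping and Holter-grade cardiac phenotyping from the same recording, offering a scalable approach to large-scale cardio-sleep association studies.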

Abstract

Sleep disturbances are tightly linked to cardiovascular risk, yet polysomnography (PSG), the clinical reference standard, remains resource-intensive and poorly suited for multi-night, home-based, and large-scale screening. Single-lead electrocardiography (ECG), already ubiquitous in Holter and patch-based devices, enables comfortable long-term acquisition and encodes sleep-relevant physiology through autonomic modulation and cardiorespiratory coupling. Here, we present a proof-of-concept Holter-to-Sleep framework that, using single-lead ECG as the sole input, jointly supports overnight sleep phenotyping and Holter-grade cardiac phenotyping within the same recording, and further provides an explicit analytic pathway for scalable cardio-sleep association studies. The framework is developed and validated on a pooled multi-center PSG sample of 10,439 studies spanning four public cohorts, with independent external evaluation to assess cross-cohort generalizability, and additional real-world feasibility assessment using overnight patch-ECG recordings via objective-subjective consistency analysis. This integrated design enables robust extraction of clinically meaningful overnight sleep phenotypes under heterogeneous populations and acquisition conditions, and facilitates systematic linkage between ECG-derived sleep metrics and arrhythmia-related Holter phenotypes. Collectively, the Holter-to-Sleep paradigm offers a practical foundation for low-burden, home-deployable, and scalable cardio-sleep monitoring and research beyond traditional PSG-centric workflows.

Keywords: single-lead ECG, sleep phenotyping, Holter, AI framework, cardio-sleep monitoring, polysomnography, autonomic modulation, multi-center validation

275. ❌ Off-Policy Learning with Limited Supply

Authors: Koichi Tanaka, Ren Kishimoto, Bushun Kawagishi, Yusuke Narita, Yasuo Yamamoto, Nobuyuki Shimizu, Yuta Saito Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18702v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

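Scoring rationale: The paper studies off-policy learning in contextual bandits, focusing on resource allocation under limited-supply constraints. Its content centers on conventional machine learning methods for reinforcement learning, recommendation systems, and online advertising, and does not involve large models, deep learning techniques, or AI for Science. All keywords focus on large-model techniques and their applications and are unrelated to the paper's reinforcement learning and recommendation topic.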

!!! tip deepseek-chat TL;DR

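The paper studies off-policy learning in contextual bandits under limited-supply constraints and proposes a new method, OPLS, which is shown to outperform existing methods on synthetic and real-world datasets.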

Abstract

We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.

Keywords: off-policy learning, contextual bandits, limited supply, resource allocation, recommendation systems, online advertising, policy optimization, inventory constraints
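
The abstract's argument that greedy selection can be suboptimal under limited supply can be seen in a two-user toy example. The sketch below is not the OPLS algorithm. It compares a greedy rule with a hypothetical "relative" rule that scores each item by how much better it is for the current user than for the average user. The reward table, supply, and function names are assumptions for illustration.

```python
def allocate(rewards, supply, score):
    """Serve users in order; each gets the in-stock item with the best score."""
    stock = dict(supply)
    total = 0.0
    for user, row in rewards.items():
        available = [item for item in row if stock[item] > 0]
        if not available:
            continue
        choice = max(available, key=lambda item: score(user, item))
        stock[choice] -= 1
        total += row[choice]
    return total


# Hypothetical expected rewards per (user, item); one unit of each item.
rewards = {
    "u1": {"A": 1.0, "B": 0.9},
    "u2": {"A": 0.8, "B": 0.1},
}
supply = {"A": 1, "B": 1}

# Average reward of each item across all users, used as a baseline.
item_mean = {
    item: sum(r[item] for r in rewards.values()) / len(rewards) for item in supply
}

greedy = allocate(rewards, supply, lambda u, i: rewards[u][i])
relative = allocate(rewards, supply, lambda u, i: rewards[u][i] - item_mean[i])
```

In this toy setting the greedy rule gives item A to the first user and leaves the second user with item B, for a total expected reward of about 1.1. The relative rule gives B to the first user and keeps A for the second, for a total of about 1.7.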

276. ❌ Revisiting Label Inference Attacks in Vertical Federated Learning: Why They Are Vulnerable and How to Defend

Authors: Yige Liu, Dexuan Xu, Zimai Guo, Yongzhi Cao, Hanpin Wang Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18680v1

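Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0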
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

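Scoring rationale: The paper studies label inference attacks and defenses in vertical federated learning, which belongs to federated learning security and is unrelated to every scoring keyword (all of which concern large models, deep learning techniques, or scientific AI applications). The paper does not mention large models, language models, training methods, inference techniques, agent systems, or scientific AI applications.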

!!! tip deepseek-chat TL;DR

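The paper shows that the success of existing label inference attacks in vertical federated learning stems from distribution alignment between features and labels, which makes the attacks fragile, and proposes a zero-overhead defense based on layer adjustment.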

Abstract

Vertical federated learning (VFL) allows an active party with a top model, and multiple passive parties with bottom models to collaborate. In this scenario, passive parties possessing only features may attempt to infer active party’s private labels, making label inference attacks (LIAs) a significant threat. Previous LIA studies have claimed that well-trained bottom models can effectively represent labels. However, we demonstrate that this view is misleading and exposes the vulnerability of existing LIAs. By leveraging mutual information, we present the first observation of the “model compensation” phenomenon in VFL. We theoretically prove that, in VFL, the mutual information between layer outputs and labels increases with layer depth, indicating that bottom models primarily extract feature information while the top model handles label mapping. Building on this insight, we introduce task reassignment to show that the success of existing LIAs actually stems from the distribution alignment between features and labels. When this alignment is disrupted, the performance of LIAs declines sharply or even fails entirely. Furthermore, the implications of this insight for defenses are also investigated. We propose a zero-overhead defense technique based on layer adjustment. Extensive experiments across five datasets and five representative model architectures indicate that shifting cut layers forward to increase the proportion of top model layers in the entire model not only improves resistance to LIAs but also enhances other defenses.

Keywords: Vertical Federated Learning, Label Inference Attacks, Model Compensation, Mutual Information, Task Reassignment, Layer Adjustment, Defense Technique, Privacy Protection
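
The abstract's analysis is phrased in terms of mutual information between layer outputs and labels. As a small illustration of that quantity, the sketch below computes a plug-in estimate of I(X; Y) for paired discrete samples. It is not the paper's estimator, and the "shallow" and "deep" layer outputs are hypothetical discretized values chosen to show the two extremes.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


labels = [0, 0, 1, 1]
shallow = [0, 1, 0, 1]  # hypothetical output that is independent of the labels
deep = [0, 0, 1, 1]     # hypothetical output that determines the labels

mi_shallow = mutual_information(shallow, labels)  # 0 nats
mi_deep = mutual_information(deep, labels)        # log(2) nats
```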

277. ❌ OCP: Orthogonal Constrained Projection for Sparse Scaling in Industrial Commodity Recommendation

Authors: Chen Sun, Beilin Xu, Boheng Tan, Jiacheng Wang, Yuefeng Sun, Rite Bo, Ying He, Yaqiang Zang, Pinghua Gong Journal/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18697v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

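Scoring rationale: The paper addresses sparse scaling and embedding representation in industrial commodity recommendation and proposes an Orthogonal Constrained Projection method. Although it touches on sparse models and scalability, every keyword targets large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, CoT, quantization, and so on), whereas the paper studies a conventional, non-LLM recommendation system and does not involve large models, new deep learning principles, or scientific applications. All keywords therefore score 0, for a weighted total of 0.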

!!! tip deepseek-chat TL;DR

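The paper targets industrial commodity recommendation, where traditional Item-Id vocabularies suffer from low-frequency information interference and representation collapse under sparse scaling. It proposes Orthogonal Constrained Projection (OCP), which enforces orthogonality to optimize the embedding representation. Experiments and industrial deployment show that OCP accelerates loss convergence and improves scalability, with a 12.97% increase in UCXR and an 8.9% uplift in GMV on JD.com.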

Abstract

In industrial commodity recommendation systems, the representation quality of Item-Id vocabularies directly impacts the scalability and generalization ability of recommendation models. A key challenge is that traditional Item-Id vocabularies, when subjected to sparse scaling, suffer from low-frequency information interference, which restricts their expressive power for massive item sets and leads to representation collapse. To address this issue, we propose an Orthogonal Constrained Projection method to optimize embedding representation. By enforcing orthogonality, the projection constrains the backpropagation manifold, aligning the singular value spectrum of the learned embeddings with the orthogonal basis. This alignment ensures high singular entropy, thereby preserving isotropic generalized features while suppressing spurious correlations and overfitting to rare items. Empirical results demonstrate that OCP accelerates loss convergence and enhances the model’s scalability; notably, it enables consistent performance gains when scaling up dense layers. Large-scale industrial deployment on JD.com further confirms its efficacy, yielding a 12.97% increase in UCXR and an 8.9% uplift in GMV, highlighting its robust utility for scaling up both sparse vocabularies and dense architectures.

Keywords: industrial commodity recommendation, sparse scaling, orthogonal constrained projection, embedding representation, representation collapse, singular value spectrum, scalability, JD.com deployment
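The abstract links orthogonality to "high singular entropy" without giving a formula. The sketch below is one plausible reading, not the paper's definition: it assumes singular entropy is the Shannon entropy of the singular values normalized to sum to 1. The function name `singular_entropy` and the toy matrices are illustrative assumptions.

```python
import numpy as np

def singular_entropy(embeddings: np.ndarray) -> float:
    """Shannon entropy of the normalized singular-value spectrum (assumed definition)."""
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so the logarithm is defined
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
dim = 16

# Collapsed embeddings: every row is a multiple of one direction (rank 1).
collapsed = np.outer(rng.normal(size=200), rng.normal(size=dim))

# Orthogonal matrix: all singular values equal 1, so the spectrum is flat.
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

h_collapsed = singular_entropy(collapsed)   # close to 0
h_orthogonal = singular_entropy(q)          # close to log(dim), the maximum
```

Under this assumed definition, a flat spectrum attains the maximum entropy `log(dim)`, while a collapsed (rank-1) embedding table has entropy near zero.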

278. ❌ Enhancing Multi-Corpus Training in SSL-Based Anti-Spoofing Models: Domain-Invariant Feature Extraction

Authors: Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18657v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

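Scoring rationale: The paper focuses on speech anti-spoofing detection and studies domain-invariant feature extraction for multi-corpus training, using self-supervised learning (SSL) models and a gradient reversal layer. The scoring keywords all concern large language models, deep-learning principles, or innovative applications of AI in science, while the paper's subject matter (speech processing, anti-spoofing detection, domain adaptation) has no direct connection to them, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the unstable performance caused by multi-corpus training in speech anti-spoofing detection. It proposes a domain-invariant feature extraction framework that uses multi-task learning and a gradient reversal layer to reduce corpus-specific information, lowering the average equal error rate by 20% relative to the baseline across four datasets.

Abstract (Chinese translation)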

语音欺骗检测的性能在不同训练与评估语料库间常存在差异。在说话人识别和语音识别等领域，利用多语料库通常能提升模型的鲁棒性和性能。然而，我们的欺骗检测实验表明，多语料库训练并不能持续提升性能，甚至可能导致性能下降。我们假设数据集特定的偏差损害了模型的泛化能力，从而引发性能不稳定。为解决此问题，我们提出了一种不变域特征提取（Invariant Domain Feature Extraction，IDFE）框架，该框架采用多任务学习和梯度反转层，以最小化学习嵌入中的语料库特定信息。在四个不同数据集上的评估显示，与基线相比，IDFE框架将平均等错误率降低了20%。

Abstract

The performance of speech spoofing detection often varies across different training and evaluation corpora. Leveraging multiple corpora typically enhances robustness and performance in fields like speaker recognition and speech recognition. However, our spoofing detection experiments show that multi-corpus training does not consistently improve performance and may even degrade it. We hypothesize that dataset-specific biases impair generalization, leading to performance instability. To address this, we propose an Invariant Domain Feature Extraction (IDFE) framework, employing multi-task learning and a gradient reversal layer to minimize corpus-specific information in learned embeddings. The IDFE framework reduces the average equal error rate by 20% compared to the baseline, assessed across four varied datasets.

Keywords: speech spoofing detection, multi-corpus training, domain-invariant feature extraction, gradient reversal layer, self-supervised learning, anti-spoofing models, equal error rate, dataset-specific biases
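For readers unfamiliar with the gradient reversal layer named in the abstract, here is a minimal sketch with hand-written forward and backward passes. It is the generic technique, not the authors' IDFE implementation; the class name, the scaling factor `lam`, and the toy arrays are assumptions made for illustration.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales the incoming gradient by -lam in the backward pass."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Features pass through unchanged to the corpus classifier.
        return x

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        # The encoder receives the negated gradient, so it is pushed to make
        # the corpus harder to predict from the embedding.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
features = np.array([[0.2, -1.0], [1.5, 0.3]])
out = grl.forward(features)

# Hypothetical gradient arriving from a corpus-classification loss.
grad_from_corpus_classifier = np.ones_like(features)
grad_to_encoder = grl.backward(grad_from_corpus_classifier)
```

In a multi-task setup of this kind, the main spoofing-detection loss would reach the encoder with its usual sign, and only the corpus-classification branch would pass through the reversal.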

279. ❌ A Theoretical Comparison of No-U-Turn Sampler Variants: Necessary and Sufficient Convergence Conditions and Mixing Time Analysis under Gaussian Targets

Authors: Samuel Gruffaz, Kyurae Kim, Fares Guehtar, Hadrien Duval-decaix, Pacôme Trautmann Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18640v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

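Scoring rationale: The paper studies the theoretical convergence and mixing time of No-U-Turn Sampler (NUTS) variants, which belongs to Bayesian statistics and computational mathematics. It does not involve large models, deep learning, AI techniques, or AI-for-science applications, and none of the keywords (large-model techniques, training methods, inference optimization, AI applications, and so on) apply, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper gives a theoretical comparison of two No-U-Turn Sampler variants, NUTS-mul and NUTS-BPS. It derives the first necessary conditions for geometric ergodicity of both variants and the first sufficient conditions for NUTS-mul, and shows that on a standard Gaussian target, when initialized in the typical set, both mixing times scale as $O(d^{1/4})$ up to logarithmic factors, with strictly smaller constants for NUTS-BPS.

Abstract (Chinese translation)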

无转向采样器（No-U-Turn Sampler，简称NUTS）是现代贝叶斯软件库的核心计算工具，然而其定性与定量收敛保证直到近期才得以建立。其两种主要变体——分别采用多项式采样（multinomial sampling）和偏置渐进采样（biased progressive sampling）进行索引选择的NUTS-mul与NUTS-BPS——在理论比较方面仍存在显著空白。本文通过三项贡献填补了这一空白。首先，我们首次推导出两种变体几何遍历性的必要条件。其次，我们首次为NUTS-mul建立了几何遍历性与遍历性的充分条件。第三，我们首次在标准高斯分布上获得了NUTS-BPS的混合时间结果。我们的研究表明，NUTS-mul与NUTS-BPS展现出近乎一致的定性行为，其几何遍历性取决于目标分布的尾部特性。然而，二者在收敛速率上存在定量差异。更精确地说，当在标准高斯测度的典型集中初始化时，NUTS-mul与NUTS-BPS的混合时间均按$O(d^{1/4})$尺度变化（忽略对数因子），其中$d$表示维度。尽管如此，NUTS-BPS对应的常数项严格更小。

Abstract

The No-U-Turn Sampler (NUTS) is the computational workhorse of modern Bayesian software libraries, yet its qualitative and quantitative convergence guarantees were established only recently. A significant gap remains in the theoretical comparison of its two main variants: NUTS-mul and NUTS-BPS, which use multinomial sampling and biased progressive sampling, respectively, for index selection. In this paper, we address this gap in three contributions. First, we derive the first necessary conditions for geometric ergodicity for both variants. Second, we establish the first sufficient conditions for geometric ergodicity and ergodicity for NUTS-mul. Third, we obtain the first mixing time result for NUTS-BPS on a standard Gaussian distribution. Our results show that NUTS-mul and NUTS-BPS exhibit nearly identical qualitative behavior, with geometric ergodicity depending on the tail properties of the target distribution. However, they differ quantitatively in their convergence rates. More precisely, when initialized in the typical set of the canonical Gaussian measure, the mixing times of both NUTS-mul and NUTS-BPS scale as $O(d^{1/4})$ up to logarithmic factors, where $d$ denotes the dimension. Nevertheless, the associated constants are strictly smaller for NUTS-BPS.

Keywords: No-U-Turn Sampler, NUTS, geometric ergodicity, mixing time, Gaussian distribution, Markov chain Monte Carlo, convergence analysis, Bayesian computation

280. ❌ Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle

Authors: Kevin Song Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18642v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

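Scoring rationale: The paper studies model-free policy optimization in infinite-shoe blackjack, using dynamic programming as an exact benchmark to evaluate classical algorithms such as REINFORCE, SPSA, and CEM under sparse state visitation and dynamic action masking. It does not involve large language models, deep learning, AI for Science, or any technique named in the scoring keywords, so none of the keywords are relevant.

!!! tip deepseek-chat TL;DR

The paper uses infinite-shoe blackjack as a benchmark environment and an exact dynamic-programming oracle to measure how well the model-free optimizers REINFORCE, SPSA, and CEM recover the optimal policy under sparse state visitation and action masking. REINFORCE performs best, but all methods show substantial cell-conditional regret.

Abstract (Chinese translation)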

无限牌靴赌场二十一点为动态遮蔽行动下的离散随机控制提供了一个严格且可精确验证的基准。在固定的维加斯式规则集（S17，3:2赔付，庄家偷看，任意两张牌可加倍，分牌后可加倍，最多重分至四手）下，我们推导出一个覆盖4,600个规范决策单元的精确动态规划（DP）预言机。该预言机产生了真实行动值、最优策略标签以及每手牌-0.00161的理论期望值（EV）。为评估样本高效的策略恢复能力，通过模拟交互训练了三种无模型优化器：采用单元指数移动平均基线的遮蔽REINFORCE、同步扰动随机逼近（SPSA）以及交叉熵方法（CEM）。REINFORCE的样本效率最高，在10^6手牌后实现了46.37%的行动匹配率和-0.04688的EV，优于CEM（39.46%，7.5x10^6次评估）和SPSA（38.63%，4.8x10^6次评估）。然而，所有方法均表现出显著的单元条件遗憾，表明尽管奖励收敛平滑，策略层面仍存在持续误差。这一差距说明，对于具有严重状态访问稀疏性和动态行动遮蔽的表格化环境，挑战依然存在，而聚合奖励曲线可能掩盖关键的局部失败。作为阴性对照，研究从理论上证明并经验证实在独立同分布抽牌且不进行计数的条件下，最优投注规模会坍缩至台面最低限额。此外，增大赌注在未改善期望值的同时，严格增加了波动性与破产风险。这些结果凸显了使用精确预言机和阴性对照的必要性，以避免将随机变异误认为真实的算法性能。

Abstract

Infinite-shoe casino blackjack provides a rigorous, exactly verifiable benchmark for discrete stochastic control under dynamically masked actions. Under a fixed Vegas-style ruleset (S17, 3:2 payout, dealer peek, double on any two, double after split, resplit to four), an exact dynamic programming (DP) oracle was derived over 4,600 canonical decision cells. This oracle yielded ground-truth action values, optimal policy labels, and a theoretical expected value (EV) of -0.00161 per hand. To evaluate sample-efficient policy recovery, three model-free optimizers were trained via simulated interaction: masked REINFORCE with a per-cell exponential moving average baseline, simultaneous perturbation stochastic approximation (SPSA), and the cross-entropy method (CEM). REINFORCE was the most sample-efficient, achieving a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands, outperforming CEM (39.46%, 7.5x10^6 evaluations) and SPSA (38.63%, 4.8x10^6 evaluations). However, all methods exhibited substantial cell-conditional regret, indicating persistent policy-level errors despite smooth reward convergence. This gap shows that tabular environments with severe state-visitation sparsity and dynamic action masking remain challenging, while aggregate reward curves can obscure critical local failures. As a negative control, it was proven and empirically confirmed that under i.i.d. draws without counting, optimal bet sizing collapses to the table minimum. In addition, larger wagers strictly increased volatility and ruin without improving expectation. These results highlight the need for exact oracles and negative controls to avoid mistaking stochastic variability for genuine algorithmic performance.

Keywords: blackjack, policy optimization, reinforcement learning, dynamic programming, masked actions, model-free optimization, REINFORCE, stochastic control
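The abstract's "masked REINFORCE with a per-cell exponential moving average baseline" can be pictured with a small tabular sketch for a single decision cell. This is the textbook form of the technique under assumed hyperparameters (`lr`, `ema`) and an assumed action ordering, not the paper's code.

```python
import numpy as np

def masked_softmax(logits: np.ndarray, legal: np.ndarray) -> np.ndarray:
    """Softmax over legal actions only; illegal actions get probability 0."""
    masked = np.where(legal, logits, -np.inf)
    masked = masked - masked.max()  # stabilize; at least one action must be legal
    exp = np.exp(masked)
    return exp / exp.sum()

def reinforce_update(logits, legal, action, reward, baseline, lr=0.1, ema=0.9):
    """One REINFORCE step for one cell, with an EMA baseline for that cell."""
    probs = masked_softmax(logits, legal)
    grad_logp = -probs
    grad_logp[action] += 1.0  # d log pi(a) / d logits = onehot(a) - probs
    new_logits = logits + lr * (reward - baseline) * grad_logp
    new_baseline = ema * baseline + (1.0 - ema) * reward
    return new_logits, new_baseline

# Hypothetical cell with actions [hit, stand, double, split]; only hit and stand are legal.
logits = np.zeros(4)
legal = np.array([True, True, False, False])
probs = masked_softmax(logits, legal)
new_logits, new_baseline = reinforce_update(logits, legal, action=1, reward=1.0, baseline=0.0)
```

Because masked actions have zero probability, their logits receive zero gradient, which is one way sparse visitation and masking can leave parts of the table untrained.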

281. ❌ Cyber-Resilient Digital Twins: Discriminating Attacks for Safe Critical Infrastructure Control

Authors: Mohammadhossein Homaei, Iman Khazrak, Rubén Molano, Andrés Caro, Mar Ávila Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18613v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

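Scoring rationale: The paper focuses on cyber-security defense for industrial cyber-physical systems, using digital twins, temporal convolutional networks, and model predictive control. This is unrelated to the scoring keywords, which all concern large-model or deep-learning principles or specific application areas, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper addresses cyber-attack threats to industrial cyber-physical systems with i-SDT, an intelligent self-defending digital twin that combines predictive modelling, multi-class attack discrimination, and adaptive resilient control. On the SWaT and WADI datasets it reports higher detection accuracy, 44.1% fewer false alarms, and 56.3% lower operational costs in simulation-in-the-loop evaluation, while keeping operations running without shutdown.

Abstract (Chinese translation)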

工业信息物理系统（ICPS）正面临日益增长的网络安全威胁，攻击者常利用传感器与控制环节的漏洞发起攻击。数字孪生（Digital Twin, DT）技术可通过预测建模检测异常，但现有方法无法区分攻击类型，且往往依赖代价高昂的全系统停机。本文提出智能自防御数字孪生（i-SDT），融合了液压正则化预测建模、多类攻击判别与自适应弹性控制技术。采用可微分守恒约束的时间卷积网络（Temporal Convolutional Networks, TCNs）捕捉标称动态特性，提升了对对抗性操纵的鲁棒性。结合最大均值差异（Maximum Mean Discrepancy, MMD）的循环残差编码器在潜在空间中实现了正常运行状态与单阶段、多阶段攻击的分离。当攻击被确认后，模型预测控制（Model Predictive Control, MPC）利用具备不确定性感知能力的数字孪生预测来维持安全运行，无需停机。在SWaT和WADI数据集上的评估表明：在仿真在环评估中，检测准确率显著提升，误报率降低44.1%，运行成本减少56.3%；亚秒级推理延迟证实了在工厂级工作站上实时运行的可行性。i-SDT在保持运行弹性的同时，推动了自主网络物理防御技术的发展。

Abstract

Industrial Cyber-Physical Systems (ICPS) face growing threats from cyber-attacks that exploit sensor and control vulnerabilities. Digital Twin (DT) technology can detect anomalies via predictive modelling, but current methods cannot distinguish attack types and often rely on costly full-system shutdowns. This paper presents i-SDT (intelligent Self-Defending DT), combining hydraulically-regularized predictive modelling, multi-class attack discrimination, and adaptive resilient control. Temporal Convolutional Networks (TCNs) with differentiable conservation constraints capture nominal dynamics and improve robustness to adversarial manipulations. A recurrent residual encoder with Maximum Mean Discrepancy (MMD) separates normal operation from single- and multi-stage attacks in latent space. When attacks are confirmed, Model Predictive Control (MPC) uses uncertainty-aware DT predictions to keep operations safe without shutdown. Evaluation on SWaT and WADI datasets shows major gains in detection accuracy, 44.1% fewer false alarms, and 56.3% lower operational costs in simulation-in-the-loop evaluation. With sub-second inference latency confirming real-time feasibility on plant-level workstations, i-SDT advances autonomous cyber-physical defense while maintaining operational resilience.

Keywords: Digital Twin, Cyber-Physical Systems, Attack Discrimination, Temporal Convolutional Networks, Model Predictive Control, Resilient Control, Industrial Security, Anomaly Detection
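The abstract uses Maximum Mean Discrepancy (MMD) to separate normal from attack behaviour in latent space. The sketch below computes a standard biased estimate of squared MMD with an RBF kernel on synthetic latents. The kernel choice, the bandwidth, and the synthetic data are assumptions; the paper's encoder and loss are not reproduced here.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """Pairwise RBF kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2(x: np.ndarray, y: np.ndarray, bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD between the samples x and y."""
    return float(
        rbf_kernel(x, x, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
    )

rng = np.random.default_rng(0)
normal_latents = rng.normal(0.0, 1.0, size=(200, 4))
more_normal = rng.normal(0.0, 1.0, size=(200, 4))
attack_latents = rng.normal(3.0, 1.0, size=(200, 4))  # synthetic shifted distribution

same = mmd2(normal_latents, more_normal)       # small: same distribution
shifted = mmd2(normal_latents, attack_latents) # larger: distributions differ
```

A training objective could then encourage a large MMD between normal and attack latents, although how i-SDT uses the quantity is not specified beyond the abstract.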

282. ❌ Breaking Hard Isomorphism Benchmarks with DRESS

Authors: Eduar Castrillo Velilla Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18582v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

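Scoring rationale: The paper studies graph isomorphism and proposes the Δ-DRESS method for graph fingerprinting. It belongs to graph theory and algorithms and is unrelated to the listed keywords on large models, deep learning, and AI applications.

!!! tip deepseek-chat TL;DR

The paper presents Δ-DRESS, a graph-fingerprinting method based on single-vertex deletion. It achieves 100% within-family separation across 34 benchmark families covering 51,816 distinct graph instances, and it separates the classical Rook vs. Shrikhande pair, which the original 3-WL algorithm cannot distinguish.

Abstract (Chinese translation)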

本文研究了单点删除变体$Δ$-DRESS，该方法是更广泛的DRESS框架的一部分。我们通过实证证明，在DRESS图指纹上应用单层顶点删除的$Δ$-DRESS方法，在所有测试的强正则图（SRG）参数族中均实现了唯一的指纹识别。研究涵盖了16个参数族共51,718个非同构强正则图，包括完整的Spence集合（12个族，43,703个顶点数最多为64的图）以及四个额外的SRG族（每个族最多包含4,466个图）。结合18个额外的困难图族（共102个图，包括Miyazaki、Chang、Paley、拉丁方和Steiner构造），$Δ$-DRESS在覆盖51,816个不同图实例的34个基准族中实现了100%的族内区分，隐式解决了超过5.76亿个族内非同构图对。此外，经典的Rook $L_2(4)$与Shrikhande图对（SRG(16,6,2,2)）已知无法被原始3-WL算法区分，而$Δ$-DRESS成功将其分离，证明$Δ$-DRESS突破了3-WL的理论界限。该方法对每个图在多项式时间$\mathcal{O}(n \cdot I \cdot m \cdot d_{\max})$内运行；组合指纹的流式实现使用$\mathcal{O}(m + B + n)$内存，其中$B$为直方图箱数，而本文报告的实验额外保留了完整的删除子图多重集矩阵用于事后分析。

Abstract

In this paper we study the single-deletion variant $Δ$-DRESS, part of the broader DRESS framework. We demonstrate empirically that $Δ$-DRESS, a single level of vertex deletion applied to the DRESS graph fingerprint, achieves unique fingerprints within each tested SRG parameter family across all 51,718 non-isomorphic strongly regular graphs (SRGs) considered, spanning 16 parameter families: the complete Spence collection (12 families, 43,703 graphs on up to 64 vertices) plus four additional SRG families with up to 4,466 graphs per family. Combined with 18 additional hard graph families (102 graphs including Miyazaki, Chang, Paley, Latin square, and Steiner constructions), $Δ$-DRESS achieves 100% within-family separation across 34 benchmark families covering 51,816 distinct graph instances, implicitly resolving over 576 million within-family non-isomorphic pairs. Moreover, the classical Rook $L_2(4)$ vs. Shrikhande pair, SRG(16,6,2,2), is known to be indistinguishable by the original 3-WL algorithm, yet $Δ$-DRESS separates it, proving that $Δ$-DRESS escapes the theoretical boundaries of 3-WL. The method runs in polynomial time $\mathcal{O}(n \cdot I \cdot m \cdot d_{\max})$ per graph; a streamed implementation of the combined fingerprint uses $\mathcal{O}(m + B + n)$ memory, where $B$ is the number of histogram bins, while the experiments reported here additionally retain the full deleted-subgraph multiset matrix for post-hoc analysis.

Keywords: graph isomorphism, Δ-DRESS, strongly regular graphs, graph fingerprint, 3-WL algorithm, vertex deletion, polynomial-time algorithm, benchmark separation
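To illustrate why a single level of vertex deletion can add distinguishing power, the sketch below applies the deletion idea to a deliberately weak stand-in fingerprint. The stand-in (a multiset of degree and neighbour-degree signatures) is NOT DRESS, whose definition is not given in the abstract; all function names are hypothetical.

```python
from collections import Counter

def base_fingerprint(adj):
    """Stand-in fingerprint (not DRESS): multiset of (degree, sorted neighbour degrees)."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    signatures = Counter(
        (deg[v], tuple(sorted(deg[u] for u in adj[v]))) for v in adj
    )
    return tuple(sorted(signatures.items()))

def delete_vertex(adj, v):
    """Return the graph with vertex v and its incident edges removed."""
    return {u: nbrs - {v} for u, nbrs in adj.items() if u != v}

def deletion_fingerprint(adj):
    """Multiset of base fingerprints over all single-vertex-deleted subgraphs."""
    sub_prints = Counter(base_fingerprint(delete_vertex(adj, v)) for v in adj)
    return tuple(sorted(sub_prints.items()))

def from_edges(n, edges):
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

# Two non-isomorphic 2-regular graphs on six vertices.
cycle6 = from_edges(6, [(i, (i + 1) % 6) for i in range(6)])
two_triangles = from_edges(6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])

base_equal = base_fingerprint(cycle6) == base_fingerprint(two_triangles)
deleted_equal = deletion_fingerprint(cycle6) == deletion_fingerprint(two_triangles)
```

Here the base fingerprint cannot tell the 6-cycle from two disjoint triangles, but after deleting one vertex the subgraphs (a path versus an edge plus a triangle) differ, so the deletion multiset separates them. The paper's claim concerns much harder strongly regular graphs and a different base fingerprint.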

283. ❌ Attack by Unlearning: Unlearning-Induced Adversarial Attacks on Graph Neural Networks

Authors: Jiahao Zhang, Yilong Wang, Suhang Wang Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18570v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

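Scoring rationale: The paper studies approximate unlearning in graph neural networks (GNNs) and its security vulnerabilities, which belongs to graph machine learning. It does not involve large language models (LLMs), innovation in deep-learning principles, or AI for Science applications. The scoring keywords concern large-model techniques, training methods, inference optimization, alignment, scientific applications, and so on, whereas the paper focuses on privacy compliance and adversarial attacks for GNNs and is unrelated to them.

!!! tip deepseek-chat TL;DR

The paper exposes a security vulnerability in approximate unlearning for graph neural networks: carefully designed unlearning requests can induce significant accuracy degradation, raising concerns about the robustness of GNN unlearning under real-world regulatory demands.

Abstract (Chinese translation)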

图神经网络（GNN）广泛应用于从社交网络、推荐系统和金融平台等领域的图结构数据中学习。为遵守GDPR、CCPA和PIPEDA等隐私法规，近似图遗忘学习——旨在无需完全重新训练即可从已训练模型中消除特定数据点的影响——已成为可信图学习日益重要的组成部分。然而，近似遗忘学习通常会导致细微的性能下降，这可能引发负面且非预期的副作用。本研究表明，此类性能退化可能被放大为对抗性攻击。我们提出遗忘破坏攻击的概念：攻击者将精心选择的节点注入训练图中，随后请求删除这些节点。由于删除请求受法律强制要求且无法拒绝，此攻击面既不可避免又具有隐蔽性：模型在训练期间表现正常，但仅在应用遗忘操作后精度急剧下降。在技术上，我们将此攻击建模为一个双层优化问题：为应对黑盒遗忘与标签稀缺的挑战，我们通过基于梯度的更新来近似模拟遗忘过程，并采用代理模型为优化生成伪标签。跨基准测试与多种遗忘算法的广泛实验表明，少量精心设计的遗忘请求即可导致显著的精度下降，这引发了关于现实世界法规要求下图神经网络遗忘学习鲁棒性的迫切担忧。源代码将在论文录用后公开。

Abstract

Graph neural networks (GNNs) are widely used for learning from graph-structured data in domains such as social networks, recommender systems, and financial platforms. To comply with privacy regulations like the GDPR, CCPA, and PIPEDA, approximate graph unlearning, which aims to remove the influence of specific data points from trained models without full retraining, has become an increasingly important component of trustworthy graph learning. However, approximate unlearning often incurs subtle performance degradation, which may incur negative and unintended side effects. In this work, we show that such degradations can be amplified into adversarial attacks. We introduce the notion of unlearning corruption attacks, where an adversary injects carefully chosen nodes into the training graph and later requests their deletion. Because deletion requests are legally mandated and cannot be denied, this attack surface is both unavoidable and stealthy: the model performs normally during training, but accuracy collapses only after unlearning is applied. Technically, we formulate this attack as a bi-level optimization problem: to overcome the challenges of black-box unlearning and label scarcity, we approximate the unlearning process via gradient-based updates and employ a surrogate model to generate pseudo-labels for the optimization. Extensive experiments across benchmarks and unlearning algorithms demonstrate that small, carefully designed unlearning requests can induce significant accuracy degradation, raising urgent concerns about the robustness of GNN unlearning under real-world regulatory demands. The source code will be released upon paper acceptance.

Keywords: Graph Neural Networks, Unlearning, Adversarial Attacks, Privacy Regulations, Bi-level Optimization, Robustness, Approximate Unlearning, Unlearning Corruption Attacks

284. ❌ WarPGNN: A Parametric Thermal Warpage Analysis Framework with Physics-aware Graph Neural Network

Authors: Haotian Lu, Jincong Lu, Sachin Sachdeva, Sheldon X. -D. Tan Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18581v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

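Scoring rationale: The paper, "WarPGNN: A Parametric Thermal Warpage Analysis Framework with Physics-aware Graph Neural Network", uses graph neural networks (GNNs) for thermal warpage analysis, an application of AI to scientific computing (specifically chip-package engineering). The keywords all relate directly to large language model (LLM) techniques, training methods, inference optimization, alignment, agents, and so on, and the paper uses no large-model technique, only a GNN for physical simulation. The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI in an engineering-science setting but is neither bioinformatics nor cheminformatics, so it receives a relevance of 5 (some connection). The remaining keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes WarPGNN, a parametric thermal warpage analysis framework built on a physics-aware graph neural network, for fast and accurate prediction of thermal-induced warpage in chip packages. It reports more than 205.91x speedup over a 2-D efficient FEM-based method while keeping the full-scale normalized RMSE at 1.26%.

Abstract (Chinese translation)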

随着系统级封装（SiP）芯粒（chiplet）设计和异质2.5D/3D集成技术的兴起，热致翘曲已成为一个关键的可靠性问题。尽管传统的数值方法能够提供高精度的结果，但其计算成本往往过高，限制了其在复杂芯粒-封装系统中的可扩展性。本文提出WarPGNN，一种基于图神经网络（GNNs）的高效、精确的参数化热翘曲分析框架。该框架直接在由版图构建的图上进行操作，能够实现快速的翘曲感知版图探索，并在不同封装配置间展现出强大的可迁移性。我们的方法首先将多芯片版图编码为精简传递闭包图（rTCG），随后通过一个基于图卷积网络（GCN）的编码器提取层次化结构特征，再经由一个受U-Net启发的解码器，从图特征嵌入中重建翘曲分布图。此外，针对翘曲数据分布的长尾特性，我们开发了一种物理信息损失函数，并改进了一种基于图同构网络（GIN）的消息传递编码器，从而进一步提升了对极端情况的学习能力以及图嵌入的表达力。数值实验表明，与基于二维高效有限元法（FEM）的方法相比，WarPGNN实现了超过205.91倍的加速；与三维有限元法软件COMSOL相比，加速比超过119766.64倍，同时保持了相当的精度，其全量程归一化均方根误差仅为1.26%，翘曲值误差为2.21%。与近期基于DeepONet的模型相比，我们的方法在达到相近预测精度和推理加速的同时，训练时间降低了3.4倍。此外，WarPGNN在未见数据集上表现出卓越的可迁移性，归一化均方根误差最高为3.69%，且运行时间相近。

Abstract

With the advent of system-in-package (SiP) chiplet-based design and heterogeneous 2.5D/3D integration, thermal-induced warpage has become a critical reliability concern. While conventional numerical approaches can deliver highly accurate results, they often incur prohibitively high computational costs, limiting their scalability for complex chiplet-package systems. In this paper, we present WarPGNN, an efficient and accurate parametric thermal warpage analysis framework powered by Graph Neural Networks (GNNs). By operating directly on graphs constructed from the floorplans, WarPGNN enables fast warpage-aware floorplan exploration and exhibits strong transferability across diverse package configurations. Our method first encodes multi-die floorplans into reduced Transitive Closure Graphs (rTCGs), then a Graph Convolution Network (GCN)-based encoder extracts hierarchical structural features, followed by a U-Net inspired decoder that reconstructs warpage maps from graph feature embeddings. Furthermore, to address the long-tailed pattern of warpage data distribution, we developed a physics-informed loss and revised a message-passing encoder based on Graph Isomorphic Network (GIN) that further enhance learning performance for extreme cases and expressiveness of graph embeddings. Numerical results show that WarPGNN achieves more than 205.91x speedup compared with the 2-D efficient FEM-based method and over 119766.64x acceleration with 3-D FEM method COMSOL, respectively, while maintaining comparable accuracy at only 1.26% full-scale normalized RMSE and 2.21% warpage value error. Compared with recent DeepONet-based model, our method achieved comparable prediction accuracy and inference speedup with 3.4x lower training time. In addition, WarPGNN demonstrates remarkable transferability on unseen datasets with up to 3.69% normalized RMSE and similar runtime.

Keywords: thermal warpage analysis, graph neural networks, chiplet-package systems, physics-informed loss, parametric framework, computational acceleration, floorplan exploration, transferability
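The abstract reports accuracy as "full-scale normalized RMSE" without defining the normalization. A common convention, assumed here and not confirmed by the paper, divides the RMSE by the range (max minus min) of the reference field. The function name and the toy warpage map are illustrative.

```python
import numpy as np

def full_scale_nrmse(pred: np.ndarray, truth: np.ndarray) -> float:
    """RMSE divided by the range of the reference values (assumed normalization)."""
    rmse = np.sqrt(np.mean((pred - truth) ** 2))
    full_scale = truth.max() - truth.min()
    return float(rmse / full_scale)

# Toy reference warpage map spanning 0 to 10 (arbitrary units), with a
# prediction that is uniformly off by 0.1.
truth = np.linspace(0.0, 10.0, 101).reshape(1, -1).repeat(4, axis=0)
pred = truth + 0.1

nrmse = full_scale_nrmse(pred, truth)  # 0.1 / 10 = 0.01, i.e. 1%
```

Under this convention, the reported 1.26% would mean the RMSE is about 1.26% of the span of the reference warpage values.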

285. ❌ Transformers Learn Robust In-Context Regression under Distributional Uncertainty

Authors: Hoang T. H. Cao, Hai D. V. Trinh, Tho Quan, Lan V. Truong Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18564v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

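Scoring rationale: The paper studies the in-context learning ability of Transformer models under distributional uncertainty. It is highly relevant to 'In-context Learning OR Many-shot Learning' (10 points), since this is the paper's core topic. It is fairly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points), because the Transformer is the base architecture of LLMs and the paper examines its learning mechanism. The other keywords, such as MoE, SFT, RAG, and quantization, are not addressed in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

The paper studies whether Transformers can learn linear regression in context under realistic distributional uncertainty (non-Gaussian coefficients, heavy-tailed noise, and non-i.i.d. prompts) and finds that they match or outperform classical baselines, showing robust in-context adaptation.

Abstract (Chinese translation)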

近期研究表明，在独立同分布数据、高斯噪声和高斯回归系数等严格假设下，Transformer模型能够对线性回归任务进行上下文学习。然而，现实世界的数据往往违背这些假设：输入、噪声及回归系数的分布通常未知且非高斯分布，同时在提示序列中可能存在依赖性。这引发了一个根本性问题：在现实分布不确定性的条件下，Transformer能否有效进行上下文学习？本文研究了在广泛分布偏移下的噪声线性回归上下文学习，包括非高斯系数、重尾噪声以及非独立同分布的提示序列。我们将Transformer模型与在相应最大似然准则下最优或次优的经典基线方法进行比较。在所有实验设定中，Transformer模型始终匹配或超越这些基线方法，展现出超越经典估计器的鲁棒性上下文适应能力。

Abstract

Recent work has shown that Transformers can perform in-context learning for linear regression under restrictive assumptions, including i.i.d. data, Gaussian noise, and Gaussian regression coefficients. However, real-world data often violate these assumptions: the distributions of inputs, noise, and coefficients are typically unknown, non-Gaussian, and may exhibit dependency across the prompt. This raises a fundamental question: can Transformers learn effectively in-context under realistic distributional uncertainty? We study in-context learning for noisy linear regression under a broad range of distributional shifts, including non-Gaussian coefficients, heavy-tailed noise, and non-i.i.d. prompts. We compare Transformers against classical baselines that are optimal or suboptimal under the corresponding maximum-likelihood criteria. Across all settings, Transformers consistently match or outperform these baselines, demonstrating robust in-context adaptation beyond classical estimators.

Keywords: Transformers, in-context learning, linear regression, distributional uncertainty, non-Gaussian coefficients, heavy-tailed noise, non-i.i.d. prompts, robust adaptation
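
As a rough illustration of why the appropriate classical baseline depends on the noise model, the sketch below contrasts two maximum-likelihood estimators for one-dimensional regression through the origin: ordinary least squares (the MLE under Gaussian noise) and least absolute deviations (the MLE under Laplace noise). The toy data, the function names `ols_slope` and `lad_slope`, and the single-outlier setup are assumptions made for illustration; the abstract does not say which baselines the paper actually uses.

```python
def ols_slope(xs, ys):
    """Least-squares slope for y = w * x (the MLE under Gaussian noise)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def lad_slope(xs, ys):
    """Least-absolute-deviations slope for y = w * x (the MLE under Laplace noise).

    Minimizing sum |y_i - w * x_i| equals minimizing sum |x_i| * |y_i / x_i - w|,
    so the minimizer is a weighted median of the ratios y_i / x_i with weights |x_i|.
    """
    pairs = sorted((y / x, abs(x)) for x, y in zip(xs, ys) if x != 0)
    half = sum(weight for _, weight in pairs) / 2.0
    running = 0.0
    for ratio, weight in pairs:
        running += weight
        if running >= half:
            return ratio
    raise ValueError("need at least one nonzero x")


# Hypothetical toy data: the true slope is 2, and the last response is an outlier.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 100.0]

print("OLS slope:", ols_slope(xs, ys))  # pulled far above 2 by the outlier
print("LAD slope:", lad_slope(xs, ys))  # stays at the true slope
```

On this toy data a single outlier moves the least-squares slope to about 10.18, while the weighted-median estimate stays at 2. This is the kind of gap between estimators that a heavy-tailed noise setting is meant to expose.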

286. ❌ Learning Decision-Sufficient Representations for Linear Optimization

Authors: Yuhan Ye, Saurabh Amin, Asuman Ozdaglar Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18551v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

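Scoring rationale: The paper studies learning decision-sufficient representations for linear optimization, at the intersection of operations research, optimization theory, and machine learning, but it does not involve large models, deep learning, AI for Science, or any specific technique in the keyword list. It focuses on classical optimization and statistical learning concepts such as linear programming, data compression, computational complexity, and PAC theory, and has no connection to the large-model techniques, training methods, inference optimization, or AI applications covered by the keywords.

!!! tip deepseek-chat TL;DR

The paper studies how to construct compressed datasets that suffice to recover optimal decisions in linear programs, proves that computing the decision-relevant dimension is NP-hard and that checking global sufficiency is coNP-hard, proposes a polynomial-time cutting-plane algorithm for a pointwise relaxation, and obtains a distribution-free PAC generalization guarantee.

Abstract (Chinese translation)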

我们研究如何构建压缩数据集，使其足以在线性规划中恢复最优决策，其中成本向量 $c$ 未知且位于先验集合 $\mathcal{C}$ 中。Bennouna 等人近期的工作通过内在的决策相关维度 $d^\star$，给出了充分决策数据集（sufficient decision datasets, SDDs）的精确几何刻画。然而，他们构建最小规模 SDD 的算法需要求解混合整数规划。本文中，我们建立了计算 $d^\star$ 是 NP 难的，以及判断一个数据集是否全局充分是 coNP 难的硬度结果，从而解决了 Bennouna 等人最近提出的一个开放问题。为应对这一最坏情况下的难解性，我们引入了逐点充分性这一松弛概念，它仅要求对单个成本向量具有充分性。在非退化条件下，我们提出了一种多项式时间的割平面算法，用于构建逐点充分决策数据集。在成本向量独立同分布的数据驱动场景中，我们进一步提出了一种累积算法，该算法聚合样本间的决策相关方向，产生一个规模至多为 $d^\star$ 的稳定压缩方案。这导出了一个与分布无关的 PAC 保证：以高概率（相对于训练样本），新抽取成本向量上逐点充分性失败的概率至多为 $\tilde{O}(d^\star/n)$，且该速率在对数因子范围内是紧的。最后，我们将决策充分表示应用于上下文线性优化，获得了泛化界尺度为 $\tilde{O}(\sqrt{d^\star/n})$ 而非 $\tilde{O}(\sqrt{d/n})$ 的压缩预测器，其中 $d$ 为环境成本维度。

Abstract

We study how to construct compressed datasets that suffice to recover optimal decisions in linear programs with an unknown cost vector $c$ lying in a prior set $\mathcal{C}$. Recent work by Bennouna et al. provides an exact geometric characterization of sufficient decision datasets (SDDs) via an intrinsic decision-relevant dimension $d^\star$. However, their algorithm for constructing minimum-size SDDs requires solving mixed-integer programs. In this paper, we establish hardness results showing that computing $d^\star$ is NP-hard and deciding whether a dataset is globally sufficient is coNP-hard, thereby resolving a recent open problem posed by Bennouna et al. To address this worst-case intractability, we introduce pointwise sufficiency, a relaxation that requires sufficiency for an individual cost vector. Under nondegeneracy, we provide a polynomial-time cutting-plane algorithm for constructing pointwise-sufficient decision datasets. In a data-driven regime with i.i.d. costs, we further propose a cumulative algorithm that aggregates decision-relevant directions across samples, yielding a stable compression scheme of size at most $d^\star$. This leads to a distribution-free PAC guarantee: with high probability over the training sample, the pointwise sufficiency failure probability on a fresh draw is at most $\tilde{O}(d^\star/n)$, and this rate is tight up to logarithmic factors. Finally, we apply decision-sufficient representations to contextual linear optimization, obtaining compressed predictors with generalization bounds scaling as $\tilde{O}(\sqrt{d^\star/n})$ rather than $\tilde{O}(\sqrt{d/n})$, where $d$ is the ambient cost dimension.

Keywords: linear optimization, decision-sufficient representations, data compression, NP-hardness, PAC guarantee, contextual linear optimization, generalization bounds, cutting-plane algorithm

287. ❌ SINDy-KANs: Sparse identification of non-linear dynamics through Kolmogorov-Arnold networks

Authors: Amanda A. Howard, Nicholas Zolman, Bruno Jacob, Steven L. Brunton, Panos Stinis Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18548v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

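Scoring rationale: The paper combines Kolmogorov-Arnold networks (KANs) with the SINDy method for symbolic regression and dynamical-system modeling, aiming to make machine learning models more interpretable. It is unrelated to most keywords (LLMs, MoE, training techniques, inference optimization, agents, and so on), which target large language models and related techniques. It is, however, highly relevant to 'Mechanistic Interpretability OR Explainable AI' (8 points), since its core goal is to improve the interpretability of KANs, and relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points), since the method is applied to dynamical-system modeling and symbolic regression tasks in science. No other keyword applies.

!!! tip deepseek-chat TL;DR

The paper proposes SINDy-KANs, which combines Kolmogorov-Arnold networks with sparse identification of nonlinear dynamics to make learned models of dynamical systems more interpretable, and reports accurate equation discovery on several symbolic regression tasks.

Abstract (Chinese translation)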

Kolmogorov-Arnold网络（Kolmogorov-Arnold networks，简称KANs）已成为提升机器学习可解释性的一种潜在途径。然而，KANs所学习到的解未必具有可解释性，即其结构可能不够稀疏或简约。非线性动力学稀疏辨识（sparse identification of nonlinear dynamics，简称SINDy）是一种互补性方法，能够从数据中学习动力学系统的稀疏方程；但该方法所学习的方程受限于预设的函数库。在本研究中，我们提出SINDy-KANs方法，该方法同时训练一个KAN网络和一个类SINDy表示，通过在每一层激活函数上应用SINDy来提升KAN表示的可解释性，同时保持深度KAN所能实现的函数复合能力。我们将所提方法应用于包括动力学系统在内的多项符号回归任务，结果表明该方法能在多种系统中实现精确的方程发现。

Abstract

Kolmogorov-Arnold networks (KANs) have arisen as a potential way to enhance the interpretability of machine learning. However, solutions learned by KANs are not necessarily interpretable, in the sense of being sparse or parsimonious. Sparse identification of nonlinear dynamics (SINDy) is a complementary approach that allows for learning sparse equations for dynamical systems from data; however, learned equations are limited by the library. In this work, we present SINDy-KANs, which simultaneously train a KAN and a SINDy-like representation to increase interpretability of KAN representations with SINDy applied at the level of each activation function, while maintaining the function compositions possible through deep KANs. We apply our method to a number of symbolic regression tasks, including dynamical systems, to show accurate equation discovery across a range of systems.

Keywords: Kolmogorov-Arnold networks, SINDy, interpretability, sparse identification, dynamical systems, symbolic regression, equation discovery, machine learning
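
For readers unfamiliar with SINDy, the sketch below shows sequentially thresholded least squares, the sparse regression commonly used with SINDy, on a fixed candidate library. It is plain SINDy-style regression, not the SINDy-KAN training procedure, which the abstract describes only at a high level. The library `[1, x, x^2]`, the threshold of 0.1, the toy dynamics dx/dt = -2x, and all function names are assumptions chosen for illustration.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting.

    Assumes the system is nonsingular (no check is made for a zero pivot).
    """
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(columns, target):
    """Least-squares coefficients via the normal equations (fine for tiny libraries)."""
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(p * t for p, t in zip(ci, target)) for ci in columns]
    return solve(gram, rhs)


def stlsq(columns, target, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares: fit, drop small terms, refit."""
    active = list(range(len(columns)))
    coef = [0.0] * len(columns)
    for _ in range(max_iter):
        if not active:
            break
        fit = least_squares([columns[i] for i in active], target)
        coef = [0.0] * len(columns)
        for index, value in zip(active, fit):
            coef[index] = value
        kept = [i for i in active if abs(coef[i]) >= threshold]
        if kept == active:
            break  # support is stable, coef holds the final refit
        for i in active:
            if i not in kept:
                coef[i] = 0.0
        active = kept
    return coef


# Hypothetical toy system: dx/dt = -2 * x, sampled without noise.
xs = [i * 0.25 for i in range(-4, 9)]
library = [[1.0] * len(xs), xs, [x * x for x in xs]]  # candidate terms 1, x, x^2
derivative = [-2.0 * x for x in xs]

print(stlsq(library, derivative))  # only the coefficient of x should be nonzero
```

The limitation noted in the abstract is visible here: the recovered equation can only contain terms that were placed in `library` beforehand.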

288. ❌ HEP Statistical Inference for UAV Fault Detection: CLs, LRT, and SBI Applied to Blade Damage

Authors: Khushiyant Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18546v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

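Scoring rationale: The paper applies statistical methods from particle physics (LRT, CLs, SNPE) to UAV propeller fault detection, an application of statistical inference to an engineering problem. All keywords except the last concern the technical principles of large models and deep learning (LLM architecture, training, inference, alignment, agents, and so on), whereas this paper uses no large model and contributes nothing new to those principles. The last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', scores 5 points because the paper applies AI and statistical methods to a science and engineering problem (UAV fault detection), which falls under a broad reading of 'AI for Science' but is not a core bioinformatics or cheminformatics application.

!!! tip deepseek-chat TL;DR

The study applies three statistical methods from particle physics (the likelihood ratio test, the CLs method, and sequential neural posterior estimation) to multirotor propeller fault detection. On the real-flight UAV-FD dataset it reaches an AUC of 0.862, outperforming the CUSUM, autoencoder, and LSTM autoencoder baselines, and it returns calibrated posteriors over fault severity.

Abstract (Chinese translation)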

本文将粒子物理学中的三种统计方法迁移至多旋翼螺旋桨故障检测领域：用于二元检测的似然比检验（LRT）、用于控制误报率的CLs修正频率主义方法，以及用于定量故障表征的序列神经后验估计（SNPE）。该系统基于与转子谐波物理相关的频谱特征运行，可输出三类结果：二元检测结果、受控的误报率，以及关于故障严重程度与电机位置的校准后验分布。在UAV-FD数据集（包含18次真实飞行数据、涉及5%与10%桨叶损伤的六旋翼数据集）上，采用留一飞行交叉验证法，系统取得AUC 0.862 +/- 0.007（95%置信区间：0.849–0.876），优于CUSUM（0.708 +/- 0.010）、自编码器（0.753 +/- 0.009）和LSTM自编码器（0.551）。在5%误报率条件下，系统能检测出93%的显著桨叶损伤和81%的轻微损伤。在四旋翼平台PADRE上，仅重新拟合生成模型后AUC即达到0.986。SNPE提供了故障严重程度的完整后验分布（90%可信区间覆盖率为92–100%，平均绝对误差为0.012），因此输出包含不确定性信息，而非仅提供点估计或故障标志。按飞行序列进行的检测实现了100%的故障检出率，总体准确率达94%。

Abstract

This paper transfers three statistical methods from particle physics to multirotor propeller fault detection: the likelihood ratio test (LRT) for binary detection, the CLs modified frequentist method for false alarm rate control, and sequential neural posterior estimation (SNPE) for quantitative fault characterization. Operating on spectral features tied to rotor harmonic physics, the system returns three outputs: binary detection, controlled false alarm rates, and calibrated posteriors over fault severity and motor location. On UAV-FD, a hexarotor dataset of 18 real flights with 5% and 10% blade damage, leave-one-flight-out cross-validation gives AUC 0.862 +/- 0.007 (95% CI: 0.849–0.876), outperforming CUSUM (0.708 +/- 0.010), autoencoder (0.753 +/- 0.009), and LSTM autoencoder (0.551). At 5% false alarm rate the system detects 93% of significant and 81% of subtle blade damage. On PADRE, a quadrotor platform, AUC reaches 0.986 after refitting only the generative models. SNPE gives a full posterior over fault severity (90% credible interval coverage 92–100%, MAE 0.012), so the output includes uncertainty rather than just a point estimate or fault flag. Per-flight sequential detection achieves 100% fault detection with 94% overall accuracy.

Keywords: UAV fault detection, statistical inference, likelihood ratio test, CLs method, sequential neural posterior estimation, blade damage, spectral features, posterior estimation
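
To make the LRT and CLs terminology concrete, the sketch below computes a log-likelihood ratio and the standard CLs ratio, CLs = CL(s+b) / CL(b), for a deliberately simple model. The model is an assumption made for illustration: a single feature `x` that is N(0, 1) under the background-only (healthy) hypothesis and N(mu, 1) under the signal-plus-background (faulty) hypothesis. The paper's spectral features, likelihood models, and thresholds are not given in the abstract and are not reproduced here.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def log_likelihood_ratio(x, mu):
    """log [ N(x; mu, 1) / N(x; 0, 1) ], which simplifies to mu * x - mu**2 / 2."""
    return mu * x - 0.5 * mu * mu


def cls_value(x_obs, mu):
    """CLs = CL(s+b) / CL(b) for the toy Gaussian model with mu > 0.

    The likelihood ratio is increasing in x, so "test statistic at most the
    observed value" is the same event as "x at most x_obs" under each hypothesis.
    """
    cl_sb = normal_cdf(x_obs - mu)  # P(x <= x_obs | signal + background)
    cl_b = normal_cdf(x_obs)        # P(x <= x_obs | background only)
    return cl_sb / cl_b


# Hypothetical observation: x = 0 when the fault hypothesis predicts a shift of 2.
x_obs, mu = 0.0, 2.0
print("log-likelihood ratio:", log_likelihood_ratio(x_obs, mu))
print("CLs:", cls_value(x_obs, mu))  # below 0.05, so mu = 2 is excluded at 95% CL
```

Dividing by CL(b) is what separates CLs from a plain p-value: it makes the test more conservative when the observation is also unlikely under the background-only hypothesis.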

289. ❌ iSatCR: Graph-Empowered Joint Onboard Computing and Routing for LEO Data Delivery

Authors: Jiangtao Luo, Bingbing Xu, Shaohua Xia, Yongyi Ran Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18539v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

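Scoring rationale: The paper studies joint onboard computing and routing optimization in low Earth orbit satellite networks using graph embedding and deep reinforcement learning, which belongs to communication network optimization. The keywords all concern large language models, deep learning principles, and scientific applications of AI, none of which the paper addresses, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes iSatCR, a distributed graph-based method that jointly optimizes onboard computing and routing to relieve the bandwidth bottleneck in delivering large volumes of Earth observation data over low Earth orbit satellite networks. Experiments show that it outperforms baselines, particularly under high load.

Abstract (Chinese translation)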

将低地球轨道卫星产生的大量地球观测数据传回地面处理，会消耗大量在轨带宽并加剧星地链路瓶颈。现有研究多集中于优化原始数据在星座内的路由传输，但难以应对数据量的激增。近年来，星载计算能力的进步使得在轨原位数据处理成为可能，从而显著减少了需下传的数据量。本文提出iSatCR——一种基于分布式图模型的联合优化方法，通过协同优化星载计算与数据路由以提升传输效率。在iSatCR框架中，我们设计了一种利用偏移特征聚合与分布式消息传递的新型图嵌入方法，用以捕捉卫星状态；进而提出一种基于分布式图的深度强化学习算法，在星载存储受限条件下推导联合计算-路由策略，以应对低地球轨道网络的复杂性与动态性。大量实验表明，iSatCR的性能优于现有基线方法，在高负载条件下表现尤为突出。

Abstract

Sending massive Earth observation data produced by low Earth orbit (LEO) satellites back to the ground for processing consumes a large amount of on-orbit bandwidth and exacerbates the space-to-ground link bottleneck. Most prior work has concentrated on optimizing the routing of raw data within the constellation, yet cannot cope with the surge in data volume. Recently, advances in onboard computing have made it possible to process data in situ, thus significantly reducing the data volume to be transmitted. In this paper, we present iSatCR, a distributed graph-based approach that jointly optimizes onboard computing and routing to boost transmission efficiency. Within iSatCR, we design a novel graph embedding utilizing shifted feature aggregation and distributed message passing to capture satellite states, and then propose a distributed graph-based deep reinforcement learning algorithm that derives joint computing-routing strategies under constrained on-board storage to handle the complexity and dynamics of LEO networks. Extensive experiments show iSatCR outperforms baselines, particularly under high load.

Keywords: LEO satellites, onboard computing, routing optimization, graph embedding, deep reinforcement learning, data transmission efficiency, distributed algorithm, satellite networks

290. ❌ GAPSL: A Gradient-Aligned Parallel Split Learning on Heterogeneous Data

Authors: Zheng Lin, Ons Aouedi, Wei Ni, Symeon Chatzinotas, Xianhao Chen Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18540v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

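Scoring rationale: GAPSL optimizes the parallel split learning (PSL) framework within federated learning (FL), addressing the training divergence caused by inconsistent client gradient directions on heterogeneous data. Its core is distributed machine learning systems optimization (gradient alignment, model partitioning, communication efficiency) rather than large-model techniques, new deep learning principles, or applications of large models. The keywords all relate directly to large-model techniques, training methods, inference optimization, and AI applications, whereas this paper studies a distributed training framework for general neural networks on resource-constrained devices and involves no specific large-model technique, training paradigm, or scientific application, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes GAPSL, a gradient-aligned parallel split learning framework that uses leader gradient identification and gradient direction alignment to address training divergence on heterogeneous data. On a prototype testbed it outperforms state-of-the-art baselines in training accuracy and latency.

Abstract (Chinese translation)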

神经网络日益增长的复杂性为在资源受限的客户端设备上实现联邦学习（FL）的普及带来了重大挑战。并行分割学习（PSL）通过模型分割将大量计算工作负载卸载至服务器，从而减轻客户端计算负担，并免去客户端模型聚合步骤以降低通信与部署成本，已成为一种前景广阔的解决方案。由于PSL无需聚合，其训练过程因客户端间梯度方向不一致而面临严重的训练发散问题。为应对这一挑战，我们提出GAPSL——一种梯度对齐的PSL框架，该框架包含两个核心组件：主导梯度识别（LGI）与梯度方向对齐（GDA）。LGI动态选取一组方向一致的客户端梯度来构建主导梯度，以捕捉全局收敛趋势；GDA则采用方向感知正则化方法，将每个客户端的梯度与主导梯度对齐，从而缓解设备间的梯度方向不一致性并提升模型收敛性能。我们在原型计算测试平台上对GAPSL进行了评估。大量实验表明，GAPSL在训练精度与延迟方面均持续优于现有先进基准方法。

Abstract

The increasing complexity of neural networks poses significant challenges for democratizing FL on resource-constrained client devices. Parallel split learning (PSL) has emerged as a promising solution by offloading substantial computing workload to a server via model partitioning, shrinking client-side computing load, and eliminating the client-side model aggregation for reduced communication and deployment costs. Since PSL is aggregation-free, it suffers from severe training divergence stemming from gradient directional inconsistency across clients. To address this challenge, we propose GAPSL, a gradient-aligned PSL framework that comprises two key components: leader gradient identification (LGI) and gradient direction alignment (GDA). LGI dynamically selects a set of directionally consistent client gradients to construct a leader gradient that captures the global convergence trend. GDA employs a direction-aware regularization to align each client’s gradient with the leader gradient, thereby mitigating inter-device gradient directional inconsistency and enhancing model convergence. We evaluate GAPSL on a prototype computing testbed. Extensive experiments demonstrate that GAPSL consistently outperforms state-of-the-art benchmarks in training accuracy and latency.

Keywords: Parallel Split Learning, Federated Learning, Gradient Alignment, Heterogeneous Data, Model Partitioning, Training Divergence, Resource-constrained Devices, Communication Efficiency

291. ❌ Beyond Passive Aggregation: Active Auditing and Topology-Aware Defense in Decentralized Federated Learning

Authors: Sheng Pan, Niansheng Tang Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18538v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

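Scoring rationale: The paper focuses on active auditing and topology-aware defense mechanisms in decentralized federated learning (DFL) against adaptive backdoor attacks. It covers distributed machine learning security, graph topology analysis, and defense strategies, but no large language model technique, new deep learning principle, or application of AI to science. The keywords all relate to large-model techniques, training methods, inference optimization, alignment, compression, agent systems, or scientific AI applications, whereas this paper studies security defenses for federated learning, a different subfield of machine learning, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To defend decentralized federated learning against adaptive backdoor attacks, the paper proposes an active auditing framework and a topology-aware defense strategy that combine a dynamical diffusion model, proactive auditing metrics, and optimized defense placement to mitigate stealthy attacks while preserving primary task performance.

Abstract (Chinese translation)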

去中心化联邦学习（DFL）在面对旨在规避传统被动防御指标的自适应后门攻击时，依然极为脆弱。为应对这一局限，我们将防御范式转向一种新颖的主动干预式审计框架。首先，我们建立了一个动力学模型，以刻画对抗性更新在复杂图拓扑结构中的时空扩散过程。其次，我们引入了一套主动审计指标：随机熵异常、随机平滑Kullback-Leibler散度以及激活峰度。这些指标利用私有探针对本地模型进行压力测试，有效暴露了传统静态检测方法无法发现的潜在后门。此外，我们实施了一种拓扑感知的防御部署策略，以最大化全局聚合的鲁棒性。我们为攻击与防御动态共同演化下系统的收敛性提供了理论性质证明。在不同架构上的数值实证评估表明，我们的主动框架在缓解隐蔽的自适应后门攻击方面，与最先进的防御方法相比具有高度竞争力，同时保持了主要任务的效用。

Abstract

Decentralized Federated Learning (DFL) remains highly vulnerable to adaptive backdoor attacks designed to bypass traditional passive defense metrics. To address this limitation, we shift the defensive paradigm toward a novel active, interventional auditing framework. First, we establish a dynamical model to characterize the spatiotemporal diffusion of adversarial updates across complex graph topologies. Second, we introduce a suite of proactive auditing metrics, stochastic entropy anomaly, randomized smoothing Kullback-Leibler divergence, and activation kurtosis. These metrics utilize private probes to stress-test local models, effectively exposing latent backdoors that remain invisible to conventional static detection. Furthermore, we implement a topology-aware defense placement strategy to maximize global aggregation resilience. We provide theoretical property for the system’s convergence under co-evolving attack and defense dynamics. Numeric empirical evaluations across diverse architectures demonstrate that our active framework is highly competitive with state-of-the-art defenses in mitigating stealthy, adaptive backdoors while preserving primary task utility.

Keywords: Decentralized Federated Learning, Backdoor Attacks, Active Auditing, Topology-aware Defense, Adversarial Updates, Graph Topologies, Defense Placement, Convergence Analysis
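
Of the three auditing metrics named in the abstract, activation kurtosis is the simplest to illustrate. The sketch below computes the excess kurtosis of a list of activation values using the standard population-moment definition. How the paper collects activations from its private probes, and what kurtosis level it treats as suspicious, are not stated in the abstract, so the function name `excess_kurtosis` and both example inputs are assumptions made for illustration.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3, which is 0 for a Gaussian."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0.0:
        raise ValueError("kurtosis is undefined for constant input")
    return m4 / (m2 * m2) - 3.0


# Hypothetical activations: a flat pattern versus one dominated by a single spike.
flat = [-1.0, 1.0, -1.0, 1.0]
spiky = [0.0] * 9 + [10.0]

print("flat :", excess_kurtosis(flat))   # negative: lighter tails than a Gaussian
print("spiky:", excess_kurtosis(spiky))  # positive: one unit dominates the response
```

A heavy-tailed activation pattern such as `spiky` yields a large positive value, which is the general kind of signal a kurtosis-based audit would look for.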

292. ❌ Derivative Discontinuity in Many-Body Perturbation Theory and Chemical Potentials in Random Phase Approximation

Authors: Jiachen Li, Weitao Yang Venue/Source: arxiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19112v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

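Scoring rationale: The paper concerns many-body perturbation theory in computational chemistry, studying chemical potentials and the derivative discontinuity within the random phase approximation; it belongs to theoretical physics and computational chemistry. The keywords all relate to large models, deep learning, and the principles or applications of AI, none of which the paper addresses. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper falls within computational chemistry and can be viewed as scientific computing, but it uses no AI method, so it receives 5 points (some relevance). The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper derives chemical potentials within the random phase approximation and shows that the functional derivative of the GW correlation energy has finite jumps at integer particle numbers, which resolves the apparent inconsistency between accurate GW quasiparticle energies and the large delocalization errors in RPA total energies.

Abstract (Chinese translation)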

我们在随机相位近似（RPA）框架内推导了化学势的解析表达式，该框架等价于使用非相互作用格林函数（$G_s$）计算的$GW$能量泛函。化学势通过两种形式等价的方法获得：一是总能量对粒子数的直接导数，二是通过$G_s$利用链式法则进行泛函导数，两种方法均通过有限差分基准验证。我们证明，$GW$关联能的泛函导数——即$GW$关联自能——在整数粒子数处表现出不连续性，并存在有限跳跃。这解决了精确的$GW$准粒子能量与RPA总能量中观察到的大离域误差之间明显的矛盾，因为标准的$GW$自能忽略了这种非解析行为。我们的结果表明，导数不连续性是关联能泛函的基本特征，类似于精确交换关联能中已知的不连续性。

Abstract

We derive analytical expressions for chemical potentials within the random phase approximation (RPA), equivalently the $GW$ energy functional evaluated using non interacting Green’s functions ($G_s$). The chemical potential is obtained using two formally equivalent approaches: a direct derivative of the total energy with respect to particle number, and a functional derivative via the chain rule through $G_s$, both validated with finite difference benchmarks. We show that the functional derivative of the $GW$ correlation energy, i.e., the $GW$ correlation self energy, exhibits a discontinuity at integer particle numbers with finite jumps. This resolves the apparent inconsistency between accurate $GW$ quasiparticle energies and the large delocalization errors observed in RPA total energies, as standard $GW$ self energies neglect this nonanalytic behavior. Our results suggest that derivative discontinuities are a fundamental feature of correlation energy functionals, analogous to the known discontinuity in the exact exchange correlation energy.

Keywords: chemical potentials, random phase approximation, GW correlation energy, derivative discontinuity, many-body perturbation theory, self-energy, correlation energy functionals, quasiparticle energies
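
The idea of a chemical potential that jumps at integer particle number, checked by finite differences, can be illustrated with a toy Python sketch. This is not the paper's RPA derivation: the piecewise-linear `total_energy` model and every numerical value (`n0`, `e0`, `ionization`, `affinity`) are hypothetical and chosen only to show how one-sided finite differences expose a derivative discontinuity.

```python
def total_energy(n, n0=2, e0=-1.0, ionization=0.9, affinity=0.1):
    """Toy piecewise-linear E(N) around the integer n0 (all values hypothetical)."""
    if n <= n0:
        # Removing electrons costs `ionization` per electron.
        return e0 + ionization * (n0 - n)
    # Adding electrons lowers the energy by `affinity` per electron.
    return e0 - affinity * (n - n0)


def chemical_potential(n, delta=1e-4, side="+"):
    """One-sided finite-difference estimate of dE/dN at particle number n."""
    if side == "+":
        return (total_energy(n + delta) - total_energy(n)) / delta
    return (total_energy(n) - total_energy(n - delta)) / delta


mu_minus = chemical_potential(2, side="-")  # slope just below the integer
mu_plus = chemical_potential(2, side="+")   # slope just above the integer
jump = mu_plus - mu_minus                   # derivative discontinuity
print(mu_minus, mu_plus, jump)
```

In this toy model the left and right slopes are minus the ionization energy and minus the electron affinity, so the jump equals their difference; a central difference taken across the integer would average the two slopes and hide the discontinuity.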

293. ❌ Data-efficient pre-training by scaling synthetic megadocs

Authors: Konwoo Kim, Suhas Kotha, Yejin Choi, Tatsunori Hashimoto, Nick Haber, Percy Liang Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18534v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	7.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

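Scoring rationale: The paper's core topic is data efficiency in pre-training, improved through synthetic data augmentation and the construction of megadocs, so it is highly relevant to 'Pre-training' (10 points). It studies loss scaling under data constraints, which relates to 'Scaling Laws AND Data Quality' (7 points). The work is set in the context of large-model pre-training, which relates to 'Large Language Models' (8 points). Building longer documents improves long-context performance, giving some relevance to 'Context Window Extension' (5 points). The remaining keywords, such as MoE, SFT, RLHF, and RAG, are not touched on in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how synthetic data augmentation and megadoc construction can improve the data efficiency of large-model pre-training when data is the constraint; experiments show these methods improve loss scaling and downstream task performance, raising data efficiency from about 1.48× to 1.80× at 32 generations per document.

Abstract (Chinese translation)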

当预训练受限于数据而非算力时，合成数据增强已成为一种前景广阔的解决方案。我们研究如何设计能实现更优损失缩放的合成数据算法：不仅能在有限算力下降低损失，更能在算力趋近无穷时持续改善效果。我们首先证明，在网页数据中混合完全来自不同分布的合成重述数据进行预训练，能提升模型在原始网页数据独立同分布验证集上的损失表现。通过优化混合比例与训练轮次，随着合成生成数量的增加，损失与基准准确率均得到改善且未出现过拟合，在每文档生成32个重述时，数据效率提升约1.48倍后趋于稳定。从新视角出发，我们发现更优的损失缩放效果：同一文档的合成生成可组合成单个显著更长的“超级文档”，而非多个短文档。我们展示两种构建超级文档的方法：拼接同一网页文档的合成重述，或通过插入推理过程扩展文档。相较于简单重述，这两种方法均能提升独立同分布损失表现、下游基准任务性能，尤其改善长上下文损失，在每文档生成32个合成数据时，将数据效率从1.48倍提升至1.80倍。重要的是，随着合成数据量的增加，超级文档相对于简单重述的优势持续扩大。我们的研究结果揭示了如何设计合成数据算法，使其在数据受限条件下能从算力增长中获得更大收益。

Abstract

Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48\times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from $1.48\times$ to $1.80\times$ at $32$ generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.

Keywords: synthetic data, pre-training, data efficiency, megadocs, loss scaling, data augmentation, long-context, compute scaling
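
The difference between simple rephrasing and the "stitching" construction described in the abstract can be illustrated with a minimal Python sketch. The function names, the separator string, and the toy rephrases are hypothetical; the abstract does not say how the rephrases are joined or ordered, and the "stretching" variant (inserting rationales) is not shown.

```python
from typing import List

# Hypothetical separator between stitched rephrases; the abstract does not specify one.
SEPARATOR = "\n\n"


def simple_rephrase_docs(rephrases: List[str]) -> List[str]:
    """Baseline: every synthetic rephrase is its own short training document."""
    return list(rephrases)


def stitch_megadoc(rephrases: List[str], separator: str = SEPARATOR) -> List[str]:
    """Stitching: join all rephrases of one source document into one long document."""
    return [separator.join(rephrases)]


# Toy stand-ins for model-generated rephrases of a single web document.
rephrases = [f"Rephrase {i} of the same web document." for i in range(32)]

short_docs = simple_rephrase_docs(rephrases)
megadocs = stitch_megadoc(rephrases)

print(len(short_docs), len(megadocs))  # 32 1
```

Both variants contain the same synthetic text; only the document boundaries differ, which is the lever the abstract credits for the improved long-context loss.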

294. ❌ Utility-scale quantum computational chemistry

Authors: Davide Castaldo, Markus Reiher Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19081v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

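Scoring rationale: The paper discusses quantum computing applied to computational chemistry, a topic in chemistry and materials science. It is unrelated to nearly all keywords, which target deep learning and large language model techniques (principles, training methods, inference optimization, agent systems, and so on). The only partially related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper concerns computational chemistry; however, its core is quantum computing rather than AI or deep learning, so the relevance is weak and it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper examines the prospects of quantum computing for utility-scale computational chemistry, arguing that quantum algorithms must adapt to hardware constraints and be integrated into high-throughput computational pipelines to deliver practical value to chemistry.

Abstract (Chinese translation)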

化学与材料科学被广泛视为量子硬件的潜在杀手级应用领域。尽管解锁前所未有的模拟能力这一愿景依然极具吸引力，但量子算法的发展必须适应新兴量子硬件不断演化的约束条件，方能为计算化学实践实现任何优势。与此同时，经典波函数理论方法的持续进步正在缩小广泛量子优势的实现窗口。本文从实用规模应用的更广阔视角探讨量子计算的潜在效益。我们认为，量子算法不仅需要能够对一些传统方法难以描述的挑战性（即强关联）分子结构进行精确计算，还必须支持将量子加速计算实际整合到针对任意分子的常规计算的高通量流程中，最终为社会提供切实价值。

Abstract

Chemistry and materials science are widely regarded as potential killer application fields for quantum hardware. While the dream of unlocking unprecedented simulation capabilities remains compelling, quantum algorithm development must adapt to the evolving constraints of the emerging quantum hardware in order to accomplish any advantage for the computational chemistry practice. At the same time, the continuous advancement of classical wavefunction-theory methods narrows the window for a broad quantum advantage. Here, we explore potential benefits of quantum computation from the broader perspective of utility-scale applications. We argue that quantum algorithms need not only enable accurate calculations for a few challenging, that is strongly correlated, molecular structures, that might be hard to describe with traditional methods. Instead, they must also support the practical integration of quantum-accelerated computations into high-throughput pipelines for routine calculations on arbitrary molecules, ultimately delivering a tangible value to society.

Keywords: quantum computation, computational chemistry, quantum algorithms, utility-scale applications, high-throughput pipelines, quantum hardware, wavefunction-theory, molecular structures

295. ❌ Maximum entropy distributions of wavefunctions at thermal equilibrium

Authors: Jacob T. Willson, Henrik J. Heelweg, Adam P. Willard Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.19060v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

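Scoring rationale: The paper studies the distribution of wavefunctions in quantum statistical mechanics and presents a maximum entropy principle for the wavefunction ensemble at thermal equilibrium (the Scrooge ensemble). All scoring keywords concern large models, deep learning, or AI techniques and applications, whereas this is purely theoretical physics that does not involve artificial intelligence or machine learning. The paper has no connection to any keyword, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper addresses the theoretical question of how a wavefunction ensemble is distributed at thermal equilibrium in quantum systems. It presents a maximum entropy principle for the Scrooge ensemble and finds that a valid equilibrium state requires constraining the measurement entropy to equal the Rényi divergence of the ensemble with respect to the Gibbs state.

Abstract (Chinese translation)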

统计力学揭示，宏观物理系统的性质表现为对统计独立的微观子系统系综的平均，其中每个子系统占据特定的微观状态。在量子系统的某些模型中，这些微观状态即单个量子系统的波函数态。然而，即使处于热平衡条件下，支配波函数系综分布的物理原理尚未得到充分确立。例如，经典的玻尔兹曼分布无法直接应用于波函数，因为它们缺乏确定的能量值。本文提出了一种针对热平衡下量子波函数系综的最大熵原理，即所谓的Scrooge系综。我们指出，仅对能量期望值乃至相应本征态分布的形状施加约束，均无法得到有效的平衡态。研究发现，除上述约束外，还必须约束测量熵等于该系综相对于吉布斯态的Rényi散度，这表明Rényi散度对于量子系统热平衡可能具有尚未被深入探究的物理重要性。

Abstract

Statistical mechanics reveals that the properties of a macroscopic physical system emerge as an average over an ensemble of statistically independent microscopic subsystems, each occupying a specific microstate. In some models of quantum systems, these microstates are the wavefunction states of individual quantum systems. The physical principles that govern the distribution of a wavefunction ensemble, even under conditions of thermal equilibrium, are not well established. For instance, the canonical Boltzmann distribution cannot be applied to wavefunctions because they lack a definite energy. In this manuscript, we present a maximum entropy principle for the quantum wavefunction ensemble at thermal equilibrium, the so-called Scrooge ensemble. We highlight that a constraint on the energy expectation value, or even the shape of the associated eigenstate distribution, fails to yield a valid equilibrium state. We find that in addition to these constraints, one must also constrain the measurement entropy to be equal to the Rényi divergence of the ensemble with respect to the Gibbs state, indicating that the Rényi divergence may have uninvestigated physical importance to thermal equilibrium in quantum systems.

Keywords: maximum entropy, wavefunction ensemble, thermal equilibrium, Scrooge ensemble, Rényi divergence, quantum systems, statistical mechanics, measurement entropy

296. ❌ An SO(3)-equivariant reciprocal-space neural potential for long-range interactions

Authors: Linfeng Zhang, Taoyong Cui, Dongzhan Zhou, Lei Bai, Sufei Zhang, Luca Rossi, Mao Su, Wanli Ouyang, Pheng-Ann Heng Venue/Source: arXiv Published: 2026-03-19 arXiv link: http://arxiv.org/abs/2603.18389v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

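Scoring rationale: The paper develops an SO(3)-equivariant neural interatomic potential (EquiEwald) for molecular and condensed-phase systems that handles long-range electrostatic and polarization interactions. Its core is deep learning applied to scientific computing (computational chemistry and materials science), which falls under 'AI for Science', so that keyword scores 8. The paper does not involve large language model (LLM) techniques, training methods, inference optimization, alignment, agent systems, or any other listed LLM-specific topic, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the inability of existing machine-learning interatomic potentials to model long-range electrostatic and polarization interactions accurately. It proposes EquiEwald, an SO(3)-equivariant neural potential that performs equivariant message passing in reciprocal space to capture anisotropic long-range correlations, and reports improved energy and force accuracy, data efficiency, and long-range extrapolation on periodic and aperiodic benchmarks.

Abstract (Chinese translation)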

长程静电与极化相互作用在分子及凝聚相体系中具有核心地位，但其本质上仍无法与基于局域性的机器学习原子间势函数兼容。尽管现代SO(3)等变神经势函数在短程化学作用上实现了高精度，它们无法表征实际材料中各向异性、缓慢衰减的多极关联；而现有的长程扩展方法要么破坏了SO(3)等变性，要么无法保持能量-力的一致性。本文提出EquiEwald，这是一种统一的神经原子间势函数，它将受Ewald方法启发的倒空间表述嵌入到不可约SO(3)等变框架中。通过借助可学习的等变k空间滤波器与等变逆变换在倒空间进行等变消息传递，EquiEwald能够在不牺牲物理一致性的前提下捕捉各向异性的张量长程关联。在周期与非周期体系测试中，EquiEwald准确复现了与第一性原理参考数据一致的长程静电行为，并持续提升了能量与力的预测精度、数据效率以及长程外推能力。这些成果确立了EquiEwald作为一种具有物理原则性的、能够处理长程作用的机器学习原子间势函数新范式。

Abstract

Long-range electrostatic and polarization interactions play a central role in molecular and condensed-phase systems, yet remain fundamentally incompatible with locality-based machine-learning interatomic potentials. Although modern SO(3)-equivariant neural potentials achieve high accuracy for short-range chemistry, they cannot represent the anisotropic, slowly decaying multipolar correlations governing realistic materials, while existing long-range extensions either break SO(3) equivariance or fail to maintain energy-force consistency. Here we introduce EquiEwald, a unified neural interatomic potential that embeds an Ewald-inspired reciprocal-space formulation within an irreducible SO(3)-equivariant framework. By performing equivariant message passing in reciprocal space through learned equivariant k-space filters and an equivariant inverse transform, EquiEwald captures anisotropic, tensorial long-range correlations without sacrificing physical consistency. Across periodic and aperiodic benchmarks, EquiEwald captures long-range electrostatic behavior consistent with ab initio reference data and consistently improves energy and force accuracy, data efficiency, and long-range extrapolation. These results establish EquiEwald as a physically principled paradigm for long-range-capable machine-learning interatomic potentials.

Keywords: SO(3)-equivariant neural potential, long-range interactions, Ewald summation, reciprocal-space formulation, machine-learning interatomic potentials, electrostatic interactions, anisotropic correlations, energy-force consistency
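
For readers unfamiliar with the classical method that inspires the reciprocal-space formulation, the following Python sketch evaluates the textbook Ewald reciprocal-space energy for point charges in a cubic box (Gaussian units). It is not EquiEwald: there are no learned or equivariant k-space filters, and the real-space and self-interaction terms of a full Ewald sum are omitted. The function name, the box size, `alpha`, and `k_max` are illustrative assumptions.

```python
import cmath
import math
from itertools import product


def ewald_reciprocal_energy(positions, charges, box_length, alpha, k_max):
    """Classical Ewald reciprocal-space energy for a cubic box (Gaussian units).

    E = (2*pi/V) * sum_{k != 0} exp(-k^2 / (4 alpha^2)) / k^2 * |S(k)|^2
    """
    volume = box_length ** 3
    two_pi_over_l = 2.0 * math.pi / box_length
    total = 0.0
    for nx, ny, nz in product(range(-k_max, k_max + 1), repeat=3):
        if nx == ny == nz == 0:
            continue  # the k = 0 term is excluded from the sum
        kx, ky, kz = (two_pi_over_l * n for n in (nx, ny, nz))
        k_sq = kx * kx + ky * ky + kz * kz
        # Structure factor S(k) = sum_j q_j * exp(i k . r_j)
        s_k = sum(
            q * cmath.exp(1j * (kx * x + ky * y + kz * z))
            for q, (x, y, z) in zip(charges, positions)
        )
        total += math.exp(-k_sq / (4.0 * alpha ** 2)) / k_sq * abs(s_k) ** 2
    return 2.0 * math.pi / volume * total


# Toy system: two opposite unit charges in a 10 x 10 x 10 box.
positions = [(1.0, 1.0, 1.0), (2.5, 1.0, 1.0)]
charges = [1.0, -1.0]
e_ref = ewald_reciprocal_energy(positions, charges, box_length=10.0, alpha=0.5, k_max=4)

# Translating every atom only changes the phase of S(k), so the energy is unchanged.
shifted = [(x + 0.3, y - 1.1, z + 2.0) for x, y, z in positions]
e_shift = ewald_reciprocal_energy(shifted, charges, box_length=10.0, alpha=0.5, k_max=4)
print(e_ref, e_shift)
```

The Gaussian damping factor makes the k-space sum converge quickly, which is why slowly decaying interactions are tractable in reciprocal space.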

297. ❌ Elucidating Norrish Type-I reactive pathways by ultrafast X-ray absorption spectroscopy

Authors: Martin Graßl, Pablo Unzueta, Andreas E. Hillers-Bendtsen, Yusong Liu, Diptarka Hait, Alice E. Green, Xinxin Cheng, Felix Allum, Taran Driver, Ruaridh Forbes, James M. Glownia, Erik Isele, Kirk A. Larsen, Xiang Li, Ming-Fu Lin, Razib Obaid, Adam Summers, Emily Thierstein, Jun Wang, James P. Cryan, Matthias F. Kling, Todd J. Martinez, Thomas J. A. Wolf Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18339v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

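Scoring rationale: This is an experimental physical chemistry study that uses ultrafast X-ray absorption spectroscopy and quantum chemical simulations to investigate the mechanism of Norrish type I reactions. It does not involve large models, deep learning, or AI techniques, and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

Using ultrafast X-ray absorption spectroscopy and quantum simulations, the study maps the photochemical population flow in an aromatic carbonyl (acetophenone as the example) relevant to Norrish type I reactions, from the initially excited $^1\pi\pi^*$ state to the long-lived $^3n\pi^*$ state, together with the associated time constants.

Abstract (Chinese translation)

Norrish I型反应能选择性断裂羰基相邻的碳-碳键。尽管该反应与芳香羰基化合物结合已广泛应用于增材制造和牙科紫外线固化领域，但其光化学活性态的本质及布居机制仍未得到充分理解。要获得详细的机理认识，需要绘制涉及内转换和系间窜越的光激发布居流动图谱。本研究以气相苯乙酮为典型芳香羰基模型，结合氧K-edge软X射线时间分辨近边X射线吸收精细结构（TR-NEXAFS）光谱与从头算多重产态（AIMS）模拟，开展了时域研究。利用TR-NEXAFS光谱对具有$n\pi^*$特性态的特殊敏感性，我们观察到在经历$(0.12 \pm 0.02)$ ps无布居转移的初始诱导期后，布居从初始激发的$^1\pi\pi^*$态以$(0.13 \pm 0.02)$ ps的时间常数转移至$^1n\pi^*$态，该结果与AIMS模拟定量吻合。随后，$^1n\pi^*$态的布居通过系间窜越（可能经由$^3\pi\pi^*$态介导）在$(3.17 \pm 0.66)$ ps内衰减至长寿命的$^3n\pi^*$态，该态被推测是发生Norrish I型化学反应的活性态。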

Abstract

Norrish type I reactions selectively cleave carbon-carbon bonds directly adjacent to carbonyl groups. Despite their broad use in combination with aromatic carbonyls for additive manufacturing and dental UV curing applications, the nature of the photochemically active state and its population mechanism remain insufficiently understood. Detailed mechanistic insight requires mapping of the photoexcited population flow involving internal conversion and intersystem crossing. We present a time-domain study of gas phase acetophenone as a prototypical aromatic carbonyl combining soft X-ray time-resolved near-edge X-ray absorption fine structure (TR-NEXAFS) spectroscopy at the oxygen K-edge with ab initio multiple spawning (AIMS) simulations. Exploiting the specific sensitivity of TR-NEXAFS spectroscopy to states with $n\pi^*$ character, we observe population transfer from the initially excited $^1\pi\pi^*$ state to the $^1n\pi^*$ state with a time constant of $(0.13 \pm 0.02)$ ps after an initial induction period of $(0.12 \pm 0.02)$ ps without population transfer, in quantitative agreement with the AIMS simulations. The population in the $^1n\pi^*$ state subsequently decays via intersystem crossing, likely mediated by a $^3\pi\pi^*$ state, within $(3.17 \pm 0.66)$ ps to a long-lived $^3n\pi^*$ state, which is presumed to be active towards Norrish type I chemistry.

Keywords: Norrish type I reactions, ultrafast X-ray absorption spectroscopy, TR-NEXAFS, ab initio multiple spawning, photochemical dynamics, intersystem crossing, carbonyl photochemistry, time-resolved spectroscopy
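
The population flow reported in the abstract can be visualized with a minimal sequential first-order kinetic model in Python. This is a sketch under simplifying assumptions that the abstract does not state: the induction period is treated as a sharp delay, the 3.17 ps value is treated as an exponential time constant, and the possible $^3\pi\pi^*$ intermediate is folded into a single intersystem-crossing step. Uncertainties are ignored and the function name is hypothetical.

```python
import math

# Central values reported in the abstract, in picoseconds.
T_INDUCTION = 0.12  # initial period without population transfer
TAU_IC = 0.13       # 1pipi* -> 1npi* time constant
TAU_ISC = 3.17      # 1npi* -> 3npi* decay time (assumed exponential here)


def populations(t_ps):
    """Return (P_1pipi*, P_1npi*, P_3npi*) at time t_ps for A -> B -> C kinetics."""
    tau = max(t_ps - T_INDUCTION, 0.0)  # nothing moves during the induction period
    k1, k2 = 1.0 / TAU_IC, 1.0 / TAU_ISC
    p_a = math.exp(-k1 * tau)
    # Standard solution for the intermediate of two consecutive first-order steps.
    p_b = k1 / (k2 - k1) * (math.exp(-k1 * tau) - math.exp(-k2 * tau))
    p_c = 1.0 - p_a - p_b  # population is conserved
    return p_a, p_b, p_c


for t in (0.0, 0.25, 1.0, 5.0, 20.0):
    p_a, p_b, p_c = populations(t)
    print(f"t = {t:5.2f} ps  1pipi* = {p_a:.3f}  1npi* = {p_b:.3f}  3npi* = {p_c:.3f}")
```

Because the first step is roughly 24 times faster than the second, the $^1n\pi^*$ population builds up almost completely before it drains into the triplet state.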

298. ❌ Visualization-Based Approach to Condensed-Phase Line Broadening Using Polyene Chains

Authors: Saba Mahmoodpour, Andrew M. Moran Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18291v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

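Scoring rationale: The paper focuses on spectral line broadening in condensed-phase spectroscopy, using polyene chains as a model system and combining visualization with numerical simulation to help undergraduates understand molecule-environment interactions. It covers quantum chemical calculations (a time-dependent Hückel Hamiltonian), molecular orbital theory, spectral simulation, and a MATLAB teaching tool. All scoring keywords concern large language models, deep learning principles, or AI applications in science, whereas this paper belongs to traditional computational chemistry and physical chemistry education and does not involve artificial intelligence, machine learning, or large models, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper presents a visualization-based teaching approach that uses a polyene chain model and numerical simulations to explain the physical origins of line broadening in condensed-phase spectra, and provides MATLAB code for undergraduate instruction.

Abstract (Chinese translation)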

凝聚相光谱线形编码了分子与其环境之间相互作用的强度与时间尺度，但由于这些概念依赖于形式化的理论处理，在本科阶段往往难以引入。我们提出一种基于可视化的方法，将解析结果与数值模拟相结合，以阐明共轭分子体系中光谱线展宽的物理起源。利用含时的休克尔哈密顿量，我们推导了有限多烯链中相干电子运动的闭式表达式，并展示了这些结果如何直接揭示分子轨道结构在光吸收中的作用。通过哈密顿矩阵元的随机涨落引入环境效应，使学生能够观察系统-环境相互作用如何破坏相干运动，并在电子轨迹中产生类散射特征。实空间动画与模拟吸收光谱在微观动力学与实测线形之间建立了直观联系。本文提供的MATLAB代码为将计算与可视化融入本科教学提供了一个易于使用的平台，同时引入了凝聚相光谱学中的关键概念。

Abstract

Condensed-phase spectral line shapes encode the strength and timescale of interactions between molecules and their environments, yet these ideas are often difficult to introduce at the undergraduate level due to their reliance on formal theoretical treatments. We present a visualization-based approach that combines analytic results with numerical simulations to illustrate the physical origins of spectral line broadening in conjugated molecular systems. Using a time-dependent Hückel Hamiltonian, we derive closed-form expressions for coherent electronic motion in finite polyene chains and show how these results provide direct insight into the role of molecular orbital structure in light absorption. Environmental effects are introduced through stochastic fluctuations of the Hamiltonian matrix elements, allowing students to observe how system–environment interactions disrupt coherent motion and produce scattering-like features in electronic trajectories. Real-space animations and simulated absorption spectra provide an intuitive link between microscopic dynamics and measured line shapes. The MATLAB code provided with this work offers an accessible platform for integrating computation and visualization into undergraduate instruction while introducing key concepts in condensed-phase spectroscopy.

Keywords: condensed-phase spectroscopy, spectral line broadening, polyene chains, Hückel Hamiltonian, molecular orbital structure, visualization-based approach, undergraduate education, MATLAB simulation
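
As background for the Hückel model mentioned in the abstract, the Python sketch below checks the textbook closed-form orbitals of a static linear chain against the tridiagonal Hückel matrix. It is not the paper's MATLAB code and does not include the time dependence or stochastic fluctuations described there. The parameter defaults (`alpha = 0`, `beta = -1`, i.e. energies in units of |β|) and the six-site chain are illustrative assumptions.

```python
import math


def huckel_matrix(n, alpha=0.0, beta=-1.0):
    """Tridiagonal Hückel Hamiltonian for a linear chain of n carbon sites."""
    h = [[0.0] * n for _ in range(n)]
    for i in range(n):
        h[i][i] = alpha
        if i + 1 < n:
            h[i][i + 1] = h[i + 1][i] = beta
    return h


def closed_form_orbital(n, k, alpha=0.0, beta=-1.0):
    """Textbook energy and normalized coefficients of orbital k (k = 1..n)."""
    theta = k * math.pi / (n + 1)
    energy = alpha + 2.0 * beta * math.cos(theta)
    norm = math.sqrt(2.0 / (n + 1))
    coeffs = [norm * math.sin(j * theta) for j in range(1, n + 1)]
    return energy, coeffs


def residual(h, energy, coeffs):
    """Largest component of H c - E c; close to zero for a true eigenpair."""
    n = len(coeffs)
    return max(
        abs(sum(h[i][j] * coeffs[j] for j in range(n)) - energy * coeffs[i])
        for i in range(n)
    )


n_sites = 6  # a hexatriene-like chain
h = huckel_matrix(n_sites)
for k in range(1, n_sites + 1):
    e_k, c_k = closed_form_orbital(n_sites, k)
    print(f"k = {k}  E = {e_k:+.4f}  residual = {residual(h, e_k, c_k):.1e}")
```

The sine-shaped coefficients are the standing waves of a particle on a finite chain, which is what makes closed-form treatment of polyenes possible.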

299. ❌ Isotope Effects in 2D correlation infrared Spectra of Water: HEOM Analysis of Molecular Dynamics-Based Machine Learning Models

Authors: Kwanghee Park, Ryotaro Hoshino, Yoshitaka Tanimura Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18276v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

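Scoring rationale: The paper studies liquid water and heavy water through molecular dynamics-based modeling and spectral analysis, computing two-dimensional infrared spectra with the HEOM method; it belongs to computational chemistry and molecular spectroscopy. The 'Machine Learning Models' in the title refer only to machine learning models built from molecular dynamics simulations, not to large language models. None of the LLM-related keywords apply. Only 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 points (some relevance), because the work involves scientific computing (molecular simulation); all other keywords score 0 (unrelated).

!!! tip deepseek-chat TL;DR

The paper models the intramolecular vibrational modes of liquid H2O and D2O and computes their two-dimensional infrared correlation spectra with the HEOM method, comparing the two to explore how isotope substitution affects vibrational energy relaxation and dephasing dynamics.

Abstract (Chinese translation)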

我们对液态H₂O和D₂O的分子内振动模式进行建模、模拟与分析，旨在阐明能量激发、弛豫及振动退相干如何通过非谐性模式间耦合相互作用。本研究采用二维（2D）相关谱作为非线性红外振动光谱中的代表性观测手段。要精确复现这些二维光谱轮廓，不仅需要对分子内振动进行精确的动力学描述，还必须恰当处理由周围分子作为热浴产生的强相互作用所带来的热环境效应。进一步捕捉二维光谱的核心特征，则要求对分子内振动模式与其热浴之间的相互作用采用非马尔可夫、非微扰且非线性的描述框架。为此，我们采用层级运动方程（HEOM）方法来计算二维光谱。通过对比H₂O与D₂O的光谱结果，我们深入探究了支配其复杂能量与相位弛豫动力学的内在机制。

Abstract

We model, simulate, and analyze the intramolecular modes of liquid H2O and D2O to elucidate how energy excitation, relaxation, and vibrational dephasing interplay through anharmonic mode-mode coupling. Our analysis employs two-dimensional (2D) correlation spectra, a representative observable in nonlinear infrared vibrational spectroscopy. Accurate reproduction of these 2D spectral profiles requires not only a precise dynamical description of intramolecular vibrations but also an appropriate treatment of thermal environmental effects arising from strong interactions with surrounding molecules, which act as thermal baths. Capturing the essential features of the 2D spectra further demands a non-Markovian, non-perturbative, and nonlinear description of the interactions between intramolecular modes and their baths. To this end, we adopt a hierarchical equations of motion (HEOM) framework to compute the 2D spectra. By comparing the resulting spectra of H2O and D2O, we explore the underlying mechanisms governing their complex energy and phase relaxation dynamics.

Keywords: 2D correlation infrared spectra, water isotope effects, hierarchical equations of motion (HEOM), molecular dynamics simulation, vibrational dephasing, anharmonic mode-mode coupling, non-Markovian dynamics, thermal environmental effects

300. ❌ sbml4md: A computational platform for System-Bath Modeling via Molecular Dynamics powered by Machine Learning

Authors: Kwanghee Park, Seiji Ueno, Yoshitaka Tanimura Venue/Source: arXiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18274v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究利用机器学习（ML）技术从分子动力学轨迹中提取参数，用于模拟分子液体的非线性振动光谱。虽然论文涉及机器学习在科学计算中的应用，但所有关键词（除最后一个外）都专门针对大语言模型（LLMs）及其相关技术（如训练方法、推理优化、对齐、代理系统等）。论文未提及任何LLM、深度学习技术原理创新或大模型在不同领域的应用。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于计算化学/生物信息学领域，使用ML进行科学模拟，但并非核心创新点，只是工具应用，因此给5分（有一定关联）。其他关键词完全无关，给0分。

!!! tip deepseek-chat TL;DR

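The paper develops the sbml4md algorithm, which uses machine learning to extract parameters from molecular dynamics trajectories in order to simulate the nonlinear vibrational spectra of molecular liquids, avoiding empirical fitting and improving optimization efficiency.

Abstract (translated)

This paper introduces sbml4md, a newly developed algorithm and software package for extracting the parameters of multimode anharmonic Brownian (MAB) models from molecular dynamics (MD) trajectories in order to simulate the nonlinear vibrational spectra of intramolecular modes in molecular liquids. The algorithm uses machine learning (ML) techniques to capture the anharmonicity, intermolecular couplings, and bath correlation functions of each vibrational mode, which avoids empirical fitting and allows environments with spatial and temporal heterogeneity to be modeled. The study provides a set of parameters tailored to the Hierarchical Equations of Motion (HEOM) framework, enabling numerically "exact" simulation of nonlinear vibrational spectra. Building on our previous implementation for intramolecular vibrational modes [Park, Jo, and Tanimura, J. Chem. Phys. 163, 214104 (2025)], the current code improves optimization efficiency by explicitly accounting for intermolecular vibrational contributions. This extension lets sbml4md broaden the applicability of HEOM-based dynamical modeling by seamlessly integrating classical molecular dynamics methods, providing a flexible and scalable framework for simulating linear and nonlinear spectra under realistic conditions with minimal empirical input. The accompanying machine learning code is written in Python and provided as supporting material.

Abstract (original)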

We introduce sbml4md, a newly developed algorithm implemented as a software package to extract parameters of multimode anharmonic Brownian (MAB) models from molecular dynamics (MD) trajectories for simulating nonlinear vibrational spectra of intramolecular modes of molecular liquids. By leveraging machine learning (ML) techniques to capture vibrational anharmonicity, intermolecular couplings, and bath correlation functions for each mode, sbml4md obviates empirical fitting and enables the modeling of environments with spatial and temporal heterogeneity. This work provides a set of parameters specifically tailored for the Hierarchical Equations of Motion (HEOM) framework, enabling numerically “exact” simulations of nonlinear vibrational spectra. Building upon our previous implementation for intramolecular vibrational modes [Park, Jo, and Tanimura, J. Chem. Phys. 163, 214104 (2025)], the present code enhances optimization efficiency by explicitly accounting for intermolecular vibrational contributions. This extension enables sbml4md to broaden the applicability of HEOM-based dynamical modeling by seamlessly integrating classical MD approaches, thereby providing a flexible and scalable framework for simulating both linear and nonlinear spectra under realistic conditions with minimal empirical input. The accompanying ML code, written in Python, is provided as supporting material.

Keywords: sbml4md, System-Bath Modeling, Molecular Dynamics, Machine Learning, Nonlinear Vibrational Spectra, Hierarchical Equations of Motion, Intermolecular Couplings, Python

301. ❌ Spin-Flip Configuration Interaction for Strong Static Correlation in Quantum Electrodynamics

Authors: Braden M. Weight, Zheng Pei, Sergei Tretiak Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18228v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

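Scoring rationale: The paper develops a computational chemistry method within quantum electrodynamics (QED-SF-CIS) and sits at the intersection of computational chemistry and quantum physics. All keywords relate directly to large models, deep learning, or the principles and applications of AI, none of which the paper addresses. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper belongs to computational chemistry (scientific computing); however, it uses no AI or machine learning methods, only traditional quantum chemistry calculations, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and receive 0 points.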

!!! tip deepseek-chat TL;DR

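For quantum electrodynamical systems with strong static correlation, the paper extends the spin-flip configuration interaction singles method to QED-SF-CIS in order to describe accurately the coupling between electronic states and cavity photons in molecular materials, and uses molecular examples to show how cavity coupling can tune bond-breaking processes.

Abstract (translated)

In the computational chemistry of molecular materials, strong static correlation effects arise when electronic states, including the ground state, become quasi-degenerate, for example during bond breaking. Such situations pose major challenges for accurate theoretical treatment. In these systems, many methods based on a single-determinant description, such as Hartree-Fock theory and its time-dependent extension, cannot reproduce the correct topology of the ground- and excited-state potential energy surfaces (for example near conical intersections). When a strongly correlated electronic system is additionally strongly coupled to a quantized radiation field within non-relativistic cavity quantum electrodynamics, the extra photonic degree of freedom brings both new complexity and new opportunities for control. In organometallic complexes, for example, excited cavity photons can alter bond-breaking processes and make geometric and spin-phase transitions tunable. To overcome this bottleneck, this work extends the well-studied spin-flip configuration interaction singles (SF-CIS) method to include quantized cavity photons explicitly, yielding the QED-SF-CIS method. We derive the spin-flip Hamiltonian and find that the double-excitation subspace of the system (single with respect to electronic excitation) must be included in the configurations to describe correctly the singlet electronic states that interact with cavity photons. Using representative molecular examples, we then show how cavity coupling provides additional tunability of bond-breaking processes. Finally, we generalize the method to include higher numbers of photonic excitations, which is necessary in the strong-coupling regime.

Abstract (original)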

In computational chemistry of molecular materials, strong static correlation effects appear when electronic states, often involving the ground state, become quasi-degenerate, as occurs, for example, in bond-breaking processes. Such situations present significant challenges for accurate theoretical treatment. In these regimes, many-body methods involving a single-determinant description, such as Hartree-Fock theory and its time-dependent extension, fail to reproduce the correct topology of the ground and excited state potential energy surfaces (e.g., near conical intersections). When strongly correlated electronic systems are further strongly coupled to a quantized radiation field within the framework of non-relativistic cavity quantum electrodynamics, an additional photonic degree of freedom introduces both new complexity and new opportunities to control. Excited cavity photons can modify bond-breaking processes and enable tunability of geometrical and spin-phase transitions, for instance, in organometallic complexes. To overcome this bottleneck, in this work, we extend the well-studied spin-flip configuration interaction singles (SF-CIS) approach to explicitly include quantized cavity photons leading to QED-SF-CIS method. We derive the spin-flip Hamiltonian and find that the double excitation subspace of the system (single with respect to electronic excitation) must be included in the configurations to properly describe singlet electronic states interacting with cavity photons. We then illustrate, through representative molecular examples, how cavity coupling can provide additional tunability in bond-breaking processes. We finally generalize this approach to include higher numbers of photonic excitations, which are required in the strong coupling regime.

Keywords: spin-flip configuration interaction, quantum electrodynamics, strong static correlation, cavity photons, bond-breaking processes, molecular materials, QED-SF-CIS, electronic states

302. ❌ A Survey of Neural Network Variational Monte Carlo from a Computing Workload Characterization Perspective

Authors: Zhengze Xiao, Xuanzhe Ding, Yuyang Lou, Lixue Cheng, Chaojian Li Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.18126v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

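Scoring rationale: The paper studies Neural Network Variational Monte Carlo (NNVMC) for quantum many-body problems, which falls under AI for Science, so it has some relevance to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). However, the paper focuses on quantum physics computation and does not involve large language models (LLMs), innovations in deep-learning principles, or the other specific techniques in the keyword list (MoE, SFT, RAG, and so on), so all other keywords receive 0 points.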

!!! tip deepseek-chat TL;DR

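The paper surveys Neural Network Variational Monte Carlo (NNVMC) methods from the perspective of computing workload characterization, analyzes the GPU performance bottlenecks of four representative wave-function ansätze, and discusses algorithm-hardware co-design implications for improving scalability.

Abstract (translated)

Neural Network Variational Monte Carlo (NNVMC) combines variational Monte Carlo with expressive neural-network wave-function ansätze and has become a promising paradigm for solving quantum many-body problems. Although NNVMC achieves competitive accuracy with favorable asymptotic scaling, its practical use remains limited by high runtime and memory costs on modern graphics processing units (GPUs). Compared with language and vision workloads, NNVMC execution is shaped by physics-specific stages, including Markov chain Monte Carlo sampling, wave-function construction, and derivative/Laplacian evaluation, which produce heterogeneous kernel behavior and complex bottlenecks. This paper presents a workload-oriented survey and an empirical GPU characterization of four representative ansätze: PauliNet, FermiNet, Psiformer, and Orbformer. Using a unified profiling protocol, we analyze model-level runtime and memory trends, and kernel-level behavior through family breakdown, arithmetic intensity, roofline positioning, and hardware utilization counters. The results show that end-to-end performance is often limited by low-intensity elementwise and data-movement kernels, while the compute/memory balance differs markedly across ansätze and stages. Based on these findings, we discuss the implications for algorithm-hardware co-design of scalable NNVMC systems, including phase-aware scheduling, memory-centric optimization, and heterogeneous acceleration.

Abstract (original)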

Neural Network Variational Monte Carlo (NNVMC) has emerged as a promising paradigm for solving quantum many-body problems by combining variational Monte Carlo with expressive neural-network wave-function ansätze. Although NNVMC can achieve competitive accuracy with favorable asymptotic scaling, practical deployment remains limited by high runtime and memory cost on modern graphics processing units (GPUs). Compared with language and vision workloads, NNVMC execution is shaped by physics-specific stages, including Markov-Chain Monte Carlo sampling, wave-function construction, and derivative/Laplacian evaluation, which produce heterogeneous kernel behavior and nontrivial bottlenecks. This paper provides a workload-oriented survey and empirical GPU characterization of four representative ansätze: PauliNet, FermiNet, Psiformer, and Orbformer. Using a unified profiling protocol, we analyze model-level runtime and memory trends and kernel-level behavior through family breakdown, arithmetic intensity, roofline positioning, and hardware utilization counters. The results show that end-to-end performance is often constrained by low-intensity elementwise and data-movement kernels, while the compute/memory balance varies substantially across ansätze and stages. Based on these findings, we discuss algorithm–hardware co-design implications for scalable NNVMC systems, including phase-aware scheduling, memory-centric optimization, and heterogeneous acceleration.

Keywords: Neural Network Variational Monte Carlo, NNVMC, quantum many-body problems, GPU characterization, workload analysis, algorithm-hardware co-design, performance optimization, wave-function ansätze
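The abstract above refers to arithmetic intensity and roofline positioning. As a rough illustration of those two terms only, the sketch below applies the standard roofline bound, attainable FLOP/s = min(peak compute, peak bandwidth × arithmetic intensity), to two toy kernels. The device figures, the `Kernel` class, and the traffic estimates are all hypothetical assumptions chosen for illustration; none of them come from the paper or its profiling protocol.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from plus written to device memory


def arithmetic_intensity(k: Kernel) -> float:
    """FLOPs per byte of memory traffic."""
    return k.flops / k.bytes_moved


def attainable_flops(k: Kernel, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity(k))


def is_memory_bound(k: Kernel, peak_flops: float, peak_bandwidth: float) -> bool:
    """A kernel left of the ridge point (peak_flops / peak_bandwidth) is memory-bound."""
    return arithmetic_intensity(k) < peak_flops / peak_bandwidth


# Hypothetical device: 100 TFLOP/s peak compute, 2 TB/s peak memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 2e12

# Elementwise float32 add over n values: 1 FLOP and 12 bytes (2 reads, 1 write) each.
n = 1 << 20
elementwise = Kernel("elementwise_add", flops=n, bytes_moved=12 * n)

# Square float32 matmul of size m, assuming each matrix crosses memory exactly once.
m = 4096
matmul_kernel = Kernel("matmul", flops=2 * m**3, bytes_moved=3 * 4 * m * m)

for k in (elementwise, matmul_kernel):
    bound = "memory-bound" if is_memory_bound(k, PEAK_FLOPS, PEAK_BANDWIDTH) else "compute-bound"
    print(f"{k.name}: intensity={arithmetic_intensity(k):.3f} FLOP/B, {bound}")
```

Under these assumptions the elementwise kernel sits far below the ridge point, which is the kind of low-intensity behavior the abstract says often constrains end-to-end performance.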

303. ❌ The Convergence Frontier: Integrating Machine Learning and High Performance Quantum Computing for Next-Generation Drug Discovery

Authors: Narjes Ansari, César Feniou, Nicolaï Gouraud, Daniele Loco, Siwar Badreddine, Baptiste Claudon, Félix Aviat, Marharyta Blazhynska, Kevin Gasperich, Guillaume Michel, Diata Traore, Corentin Villot, Thomas Plé, Olivier Adjoua, Louis Lagardère, Jean-Philip Piquemal Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17790v2

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

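Scoring rationale: The paper studies the convergence of quantum computing, high-performance computing, and machine learning for drug discovery. It belongs to AI for Science and is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It mentions 'ML foundation models, such as FeNNix-Bio1', indicating that foundation models are applied in a scientific domain, which gives some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points). The remaining keywords concern large-model technical details (MoE, RLHF, quantization, and so on), reasoning methods (CoT, agents), or specific applications (tool calling), which the paper does not discuss directly, so they receive 0 points.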
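The scoring table above can be read against the pass-score formula stated in the report's configuration (5.0 + 0.8 × total weight). The sketch below is only one plausible reading: it assumes each keyword's score is weight × relevance and that a paper passes when its total reaches the threshold. The report does not show its actual scoring code, and the function names here are hypothetical. The rows use the values printed in the table, where every weight is listed as 0.0 even though the configuration lists 1.0.

```python
BASE_PASS_SCORE = 5.0        # constant term of the pass-score formula
PASS_SCORE_PER_WEIGHT = 0.8  # multiplier on the total keyword weight


def pass_threshold(total_weight: float) -> float:
    """Pass score as stated in the report configuration."""
    return BASE_PASS_SCORE + PASS_SCORE_PER_WEIGHT * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Assumed scoring rule: sum of weight * relevance over all keywords."""
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) pairs as printed in the table above:
# one keyword at 5.0/10, 25 keywords at 0.0/10, one keyword at 10.0/10.
rows = [(0.0, 5.0)] + [(0.0, 0.0)] * 25 + [(0.0, 10.0)]

score = paper_score(rows)
threshold = pass_threshold(27.0)  # 27 keywords, total weight 27.0
passed = score >= threshold

print(f"score={score:.1f} threshold={threshold:.1f} passed={passed}")
```

With the printed weights of 0.0, any multiplicative rule yields the reported total of 0.0 regardless of relevance, which is consistent with the 0.0 / 26.6 result shown for this paper.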

!!! tip deepseek-chat TL;DR

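The paper examines how combining high-performance computing, machine learning, and quantum computing can address the computational bottleneck of quantum chemistry simulation in drug discovery, and positions quantum-enhanced sampling as a frontier beyond GPUs for optimizing the drug development pipeline and reaching chemical accuracy.

Abstract (translated)

Integrating quantum mechanics into drug discovery marks a decisive shift from empirical trial and error toward quantitative precision. However, the prohibitive cost of ab initio molecular dynamics simulation has historically forced researchers to compromise between chemical accuracy and computational scalability. This paper argues that the convergence of high-performance computing, machine learning, and quantum computing is the definitive solution to this bottleneck. Although machine learning foundation models such as FeNNix-Bio1 enable quantum-accurate simulation, they remain bound by the inherent limits of classical data generation. We describe in detail how high-performance quantum computing, using hybrid quantum processing unit and graphics processing unit (QPU-GPU) architectures, will become the ultimate accelerator for quantum chemistry data. By exploiting Hilbert space mapping, these systems can bypass the heuristics of classical approximations and reach true chemical accuracy. We show how this three-way convergence optimizes the drug discovery pipeline, from initial system preparation to machine-learning-driven high-fidelity simulation. Finally, we position quantum-enhanced sampling as a frontier beyond GPUs for modeling reactive cellular systems and pioneering next-generation materials.

Abstract (original)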

Integrating quantum mechanics into drug discovery marks a decisive shift from empirical trial-and-error toward quantitative precision. However, the prohibitive cost of ab initio molecular dynamics has historically forced a compromise between chemical accuracy and computational scalability. This paper identifies the convergence of High-Performance Computing (HPC), Machine Learning (ML), and Quantum Computing (QC) as the definitive solution to this bottleneck. While ML foundation models, such as FeNNix-Bio1, enable quantum-accurate simulations, they remain tethered to the inherent limits of classical data generation. We detail how High-Performance Quantum Computing (HPQC), utilizing hybrid QPU-GPU architectures, will serve as the ultimate accelerator for quantum chemistry data. By leveraging Hilbert space mapping, these systems can achieve true chemical accuracy while bypassing the heuristics of classical approximations. We show how this tripartite convergence optimizes the drug discovery pipeline, spanning from initial system preparation to ML-driven, high-fidelity simulations. Finally, we position quantum-enhanced sampling as the beyond GPU frontier for modeling reactive cellular systems and pioneering next-generation materials.

Keywords: Quantum Computing, Drug Discovery, Machine Learning, High-Performance Computing, Quantum Chemistry, Molecular Dynamics, Foundation Models, Quantum-enhanced Sampling

304. ❌ In-phase current and temperature oscillations reduce PEM fuel cell resistivity: A modeling study

Authors: Andrei Kulikovsky Journal/Source: arxiv Published: 2026-03-18 arXiv link: http://arxiv.org/abs/2603.17709v2

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

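Scoring rationale: The paper, titled 'In-phase current and temperature oscillations reduce PEM fuel cell resistivity: A modeling study', studies a non-isothermal impedance model of the cathode catalyst layer in proton exchange membrane fuel cells and examines how in-phase current and temperature oscillations reduce resistance. All scoring keywords concern large models, deep learning, AI principles, or applications of AI in science, whereas the paper focuses on electrochemical engineering and physical modeling of fuel cells and involves no artificial intelligence, machine learning, or large language models. It is therefore unrelated to every keyword, and all relevance scores are 0.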

!!! tip deepseek-chat TL;DR

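By building a non-isothermal analytical model of the cathode catalyst layer in a proton exchange membrane fuel cell, the study finds that in-phase harmonic perturbations of current and temperature reduce the impedance and the static polarization resistivity by lowering proton transport losses, and that a particular choice of amplitudes eliminates these losses entirely.

Abstract (translated)

We have developed a non-isothermal analytical model for the impedance of the cathode catalyst layer of a proton exchange membrane fuel cell. Because proton transport losses are lowered, in-phase harmonic perturbations of the current density and temperature reduce the impedance and the static polarization resistivity of the cathode catalyst layer. With a special choice of the amplitudes of the current and temperature perturbations, these losses can be eliminated completely.

Abstract (original)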

We have developed a non-isothermal analytical model for the impedance of the cathode catalyst layer (CCL) in a PEM fuel cell. In-phase harmonic perturbations to the current density and temperature reduce the impedance and the static polarisation resistivity of the CCL due to lowering proton transport losses. A special selection of the current and temperature perturbation amplitudes allows for complete elimination of these losses.

Keywords: PEM fuel cell, cathode catalyst layer, non-isothermal model, impedance, current density, temperature oscillations, proton transport losses, polarization resistivity

305. ❌ Bridging Classical Sensitivity and Quantum Scrambling: A Tutorial on Out-of-Time-Ordered Correlators

Authors: Stephen Wiggins Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16394v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

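Scoring rationale: The paper is a mathematical tutorial in quantum chaos theory that focuses on the conceptual mapping between classical chaos and quantum scrambling, in particular on using the out-of-time-ordered correlator (OTOC) to understand chaotic behavior in quantum systems. Its content belongs entirely to theoretical and mathematical physics: it discusses the foundations of quantum mechanics, chaos theory, and operator theory, and involves no large models, deep learning, AI techniques, or applications of AI in science. All scoring keywords center on large-model techniques and their applications and are unrelated to the paper's physics and mathematics, so every keyword receives a relevance of 0.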

!!! tip deepseek-chat TL;DR

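By detailing the mathematical machinery of the out-of-time-ordered correlator (OTOC), the tutorial addresses the puzzle of how to map the classical notion of sensitivity to initial conditions onto the linear framework of quantum mechanics, and clarifies what the OTOC can and cannot diagnose when distinguishing local instability from global chaos.

Abstract (translated)

In classical dynamical systems, chaotic behavior is usually associated with exponential sensitivity to initial conditions and with global phase-space structure. Translating this geometric concept into the strictly linear framework of quantum mechanics poses a conceptual puzzle. The out-of-time-ordered correlator (OTOC) is often presented as the quantum analogue of the classical butterfly effect, but this popular phrasing can obscure important mathematical differences. This tutorial bridges the gap between applied mathematics and quantum information by detailing the mathematical machinery of the OTOC. We examine how classical sensitivity translates into operator non-commutativity, why standard two-point correlation functions cannot cleanly detect this sensitivity, and how the delocalization of quantum observables relates to the classical notion of mixing. Crucially, we delineate what the OTOC can and cannot diagnose, thereby distinguishing local instability from global chaos. Finally, we provide a precise and usable conceptual map and explore how the Koopman-von Neumann formalism offers a framework for understanding classical and quantum dynamics through a shared linear perspective.

Abstract (original)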

In classical dynamical systems, chaotic behavior is often associated with exponential sensitivity to initial conditions together with global phase-space structure. Translating this geometric concept to the strictly linear framework of quantum mechanics presents a conceptual puzzle. The out-of-time-ordered correlator (OTOC) is often motivated as the quantum analogue of the classical butterfly effect, but this slogan can hide important mathematical distinctions. This tutorial bridges the gap between applied mathematics and quantum information by detailing the mathematical machinery of the OTOC. We explore how classical sensitivity translates to operator non-commutativity, why standard two-point correlation functions fail to cleanly detect this sensitivity, and how the delocalization of quantum observables relates to classical notions of mixing. Crucially, we outline what the OTOC can and cannot diagnose, distinguishing between local instability and global chaos. Ultimately, we provide a precise and usable conceptual map, exploring how the Koopman-von Neumann formalism offers a framework to view classical and quantum dynamics through a shared linear perspective.

Keywords: out-of-time-ordered correlator, quantum chaos, classical sensitivity, operator non-commutativity, Koopman-von Neumann formalism, quantum scrambling, butterfly effect, quantum information
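To make the phrase "operator non-commutativity" in the abstract above concrete, here is a minimal sketch of one common form of the OTOC, the squared commutator C(t) = ⟨[W(t), V]† [W(t), V]⟩, evaluated for a single qubit. The toy Hamiltonian H = (ω/2)·X, the choice W = V = Z, and all function names are assumptions made for illustration and are not taken from the tutorial. For this model the result is 4·sin²(ωt), which oscillates rather than grows, as one would expect from a single qubit with nothing to scramble into.

```python
import math

# 2x2 matrices as nested lists; X and Z are Pauli matrices.
I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]


def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def evolve(op, omega, t):
    """Heisenberg-picture operator U† op U with U = exp(-i H t), H = (omega/2) X.

    Because X squares to the identity, U = cos(theta) I - i sin(theta) X
    with theta = omega * t / 2.
    """
    theta = 0.5 * omega * t
    c, s = math.cos(theta), math.sin(theta)
    u = [[c * I2[i][j] - 1j * s * X[i][j] for j in range(2)] for i in range(2)]
    return matmul(dagger(u), matmul(op, u))


def commutator(a, b):
    ab, ba = matmul(a, b), matmul(b, a)
    return [[ab[i][j] - ba[i][j] for j in range(2)] for i in range(2)]


def otoc(omega, t, state=(1, 0)):
    """Squared-commutator OTOC <psi| [W(t), V]† [W(t), V] |psi> with W = V = Z."""
    comm = commutator(evolve(Z, omega, t), Z)
    m = matmul(dagger(comm), comm)
    psi = [complex(a) for a in state]
    value = sum(psi[i].conjugate() * m[i][j] * psi[j]
                for i in range(2) for j in range(2))
    return value.real


for t in (0.0, 0.5, 1.0, math.pi / 2):
    print(f"t={t:.3f}  C(t)={otoc(1.0, t):.6f}  4*sin^2(t)={4 * math.sin(t) ** 2:.6f}")
```

At t = 0 the operators commute and C vanishes; it becomes nonzero only once time evolution rotates Z into a component along Y.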

Token usage statistics

Total: 983,923 tokens (input 660,585 / output 323,338)

Model	Input	Output	Total
deepseek-chat	544,081	300,380	844,461
glm-4.7	116,504	22,958	139,462
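The per-model rows can be checked against the reported total with a few lines of arithmetic. The sketch below simply re-adds the figures from the table; the dictionary layout and variable names are illustrative choices, not part of the report's tooling.

```python
# (input tokens, output tokens) per model, copied from the table above.
usage = {
    "deepseek-chat": (544_081, 300_380),
    "glm-4.7": (116_504, 22_958),
}

total_input = sum(inp for inp, _ in usage.values())
total_output = sum(out for _, out in usage.values())
total = total_input + total_output

for model, (inp, out) in usage.items():
    print(f"{model}: {inp + out:,} tokens")
print(f"input={total_input:,} output={total_output:,} total={total:,}")
```

The sums agree with the totals printed in the report.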


38. ❌ CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

39. ❌ How Uncertainty Estimation Scales with Sampling in Reasoning Models

40. ❌ FedTrident: Resilient Road Condition Classification Against Poisoning Attacks in Federated Learning

41. ❌ LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

42. ❌ DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering

43. ❌ SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

44. ❌ Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity

45. ❌ CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem

46. ❌ Parallelograms Strike Back: LLMs Generate Better Analogies than People

47. ❌ Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

48. ❌ Man and machine: artificial intelligence and judicial decision making

49. ❌ SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

50. ❌ Behavioral Fingerprints for LLM Endpoint Stability and Identity

51. ❌ What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

52. ❌ Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval