📊 ArXiv 研究报告 (2026-04-14)

生成时间: 2026-04-14 09:15:58 数据源: ArXiv

📌 配置信息

关键词列表（共 27 个，总权重 27.0）

关键词	权重	类型
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	主要
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	主要
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	主要
“Scaling Laws” AND “Data Quality”	1.0	主要
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	主要
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	主要
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	主要
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	主要
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	主要
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	主要
“Context Window Extension” OR “Long Context LLMs”	1.0	主要
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	主要
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	主要
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	主要
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	主要
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	主要
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	主要
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	主要
“Multi-agent Systems” OR “Agent Coordination”	1.0	主要
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	主要
“Speculative Decoding” OR “Inference Acceleration”	1.0	主要
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	主要
“Mechanistic Interpretability” OR “Explainable AI”	1.0	主要
“World Models” AND “General World Models”	1.0	主要
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	主要
“In-context Learning” OR “Many-shot Learning”	1.0	主要
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	主要

评分设置

每个关键词最大分: 15
及格分公式: 5.0 + 0.8 × 总权重
当前及格分: 26.6

📈 论文统计

总抓取: 272 篇
及格论文: 6 篇 (2.2%)

⭐ 及格论文详细分析

1. From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

作者: Chenchen Zhang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09459v1

评分: 50.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	10.0/10	10.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究强化学习（RL）在大语言模型（LLMs）中的信用分配（Credit Assignment）问题，高度相关关键词包括：“Large Language Models”（论文核心研究对象）、“RLHF”（论文研究RL for LLMs，属于RLHF范畴）、“Chain of Thought”（论文明确提及reasoning RL中的chain-of-thought generation）、“Monte Carlo Tree Search”（论文将Monte Carlo列为信用分配方法之一）、“LLM Agents”（论文研究agentic RL，涉及LLM agents）。其他关键词如MoE、SLMs、Scaling Laws、Pre-training等与论文内容无直接关联。

!!! tip deepseek-chat TL;DR

该论文系统综述了强化学习在大语言模型中从推理到智能体场景的信用分配问题，提出了分类框架并贡献了三个可复用资源，揭示了从推理到智能体强化学习的转变如何重塑信用分配格局。

摘要翻译

针对大型语言模型（LLM）的强化学习（RL）日益依赖于稀疏的结果级奖励，然而确定长轨迹中哪些行为导致了最终结果仍然困难。这一信用分配（CA）问题体现在两种范式下：推理强化学习，其信用需在单次思维链生成（500至30K+以上词元）的词元和步骤间分配；以及智能体强化学习，其中多轮环境交互引入了随机状态转移、部分可观测性以及长达100轮以上（100K至1M词元）的交互跨度，使得片段级信用信息逐渐失效。
本文系统综述了2024年至2026年初发表的47种信用分配方法（41种核心方法，6种相邻支撑技术），通过分配粒度（词元级、片段级、步骤级、轮次级、多智能体级）与方法论（蒙特卡洛、时序差分、基于模型、博弈论、信息论）两个维度构建了分类体系。除综述本身外，我们贡献了三项可复用资源：（1）包含分类标签、基线族和证据等级的结构化机器可读文献库；（2）面向未来信用分配论文的规范化报告清单，该清单基于已综述文献验证，可识别系统性方法缺陷；（3）涵盖任务族、元数据要求和可控分叉任务的基准协议规范，并附有方法选择决策树。
我们的综合分析表明，从推理强化学习向智能体强化学习的转变使信用分配格局趋于复杂并发生重构：推理信用分配正围绕过程奖励模型与无评论者群体比较方法趋于成熟，而智能体信用分配则催生了真正创新的方法——事后反事实分析、特权非对称评论者以及轮次级马尔可夫决策过程重构——这些方法在推理强化学习中并无直接先例。

摘要 (Abstract)

Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards – yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500–30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K–1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches – hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations – that have no direct precedent in reasoning RL.

关键词: Credit Assignment, Reinforcement Learning, Large Language Models, Reasoning RL, Agentic RL, Chain-of-Thought, Monte Carlo, Survey

2. Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication

作者: Benjamin Amoh, Geoffrey Parker, Wesley Marrero 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08944v1

评分: 45.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	15.0/10	15.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

评分理由: 论文《Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication》专注于多智能体系统中的通信与协调优化，与大多数大模型技术关键词（如LLMs、MoE、Scaling Laws、Pre-training等）无直接关联。核心相关关键词：1）“Multi-agent Systems” OR “Agent Coordination”（15分）：论文核心研究多智能体协调问题，提出SeqComm-DFL方法优化通信以提升决策质量，是论文的绝对核心。2）“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”（10分）：论文研究多智能体系统，属于智能体研究范畴，但未明确涉及LLM驱动的智能体。3）“World Models” AND “General World Models”（10分）：论文明确提及"communication-augmented world models"，并扩展了Optimal Model Design，涉及世界模型构建。4）“Instruction Tuning” OR “Alignment” OR “Value Alignment”（5分）：论文涉及"value-aware message generation"和"prosocial ordering"，与价值对齐概念有间接关联。5）“AI for Science” OR “Bioinformatics” OR “Cheminformatics”（5分）：论文在协作医疗保健（collaborative healthcare）基准上进行了评估，属于AI在科学/医疗领域的应用。其他关键词（如RLHF、RAG、Quantization等）与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文研究多智能体在部分可观测环境下的协调问题，提出了一种基于价值感知顺序通信的决策聚焦学习框架（SeqComm-DFL），通过在协作医疗和星际争霸多智能体挑战基准上实现显著更高的累积奖励和胜率提升，解决了信息不对称下的协调策略优化问题。

摘要翻译

在部分可观测性下的多智能体协调需要智能体共享互补的私有信息。尽管现有方法通常针对中间目标（如重构精度或互信息）优化消息传递，而非决策质量，本文提出了 SeqComm-DFL，将序列化通信与面向决策的学习相统一以提升任务性能。我们的方法采用 基于价值感知的序列化斯塔克尔伯格条件消息生成：消息以最大化接收方决策质量为目标，并依据优先级顺序生成，每个智能体在其前序智能体的条件下生成消息，其 引导潜力 由它们的亲社会排序决定。我们将最优模型设计扩展至采用 QMIX 因子化的通信增强世界模型，通过隐式微分实现高效的端到端训练。我们证明了信息论边界，表明通信价值随协调差距而扩展，并建立了双层优化 $\mathcal{O}(1/\sqrt{T})$ 的收敛性，其中 $T$ 表示训练迭代次数。在协作医疗和星际争霸多智能体挑战（StarCraft Multi-Agent Challenge, SMAC）基准测试中，SeqComm-DFL 实现了四至六倍的累积奖励提升和超过 13% 的胜率改进，实现了在信息不对称条件下无法达成的协调策略。

摘要 (Abstract)

Multi-agent coordination under partial observability requires agents to share complementary private information. While recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information), rather than decision quality, we introduce \textbf{SeqComm-DFL}, unifying the sequential communication with decision-focused learning for task performance. Our approach features \emph{value-aware message generation with sequential Stackelberg conditioning}: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors. The \emph{guidance potential} determined by their prosocial ordering. We extend Optimal Model Design to communication-augmented world models with QMIX factorization, enabling efficient end-to-end training via implicit differentiation. We prove information-theoretic bounds showing that communication value scales with coordination gaps and establish $\mathcal{O}(1/\sqrt{T})$ convergence for the bilevel optimization, where $T$ denotes the number of training iterations. On collaborative healthcare and StarCraft Multi-Agent Challenge (SMAC) benchmarks, SeqComm-DFL achieves four to six times higher cumulative rewards and over 13% win rate improvements, enabling coordination strategies inaccessible under information asymmetry.

关键词: Multi-agent coordination, Decision-focused learning, Sequential communication, Value-aware message generation, World models, Partial observability, QMIX factorization, Bilevel optimization

3. OASIS: Online Activation Subspace Learning for Memory-Efficient Training

作者: Sakshi Choudhary, Utkarsh Saxena, Kaushik Roy 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09406v1

评分: 44.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文OASIS专注于大语言模型（LLMs）训练中的内存效率问题，通过在线激活子空间学习算法减少激活内存占用，属于大模型技术原理的创新。核心相关关键词：1）“Large Language Models”（10分）- 论文明确针对LLMs训练；2）“PEFT”（10分）- 方法属于参数高效微调范畴，通过低维子空间投影减少内存；3）“Pre-training"和"Post-training”（各8分）- 论文在预训练和微调任务上验证了方法；4）“Quantization”（8分）- 虽非量化，但属于模型压缩/内存优化技术。其他关键词如MoE、SLMs、对齐、推理加速等与论文内容无关。

!!! tip deepseek-chat TL;DR

论文提出OASIS算法，通过在线学习低维激活子空间来减少大语言模型训练时的内存占用，在保持性能的同时实现了比全微调低2倍的内存消耗。

摘要翻译

大型语言模型（LLM）的训练受限于内存需求，其中激活值占用了总内存占用的很大一部分。现有方法通过低秩权重参数化或针对优化器状态的低秩梯度子空间来减少内存，而激活内存则通过架构修改或基于周期性更新投影的压缩方案来处理。我们提出OASIS，一种用于高效内存训练的在线激活子空间学习算法，该算法在训练过程中跟踪并持续更新一个低维激活子空间。中间激活值被投影到这个不断演化的子空间上，从而在不修改前向传播计算的情况下减少内存占用。不断演化的激活子空间诱导出低秩梯度表示，使得梯度和优化器状态都能直接在该子空间内维护，同时一个投影感知的优化器在子空间更新过程中持续迁移优化器状态，以确保训练稳定性。在各种微调和预训练任务中，OASIS实现了比完整微调低达$2\times$的峰值内存占用，同时保持了与之相当的性能，并优于先前的低秩方法。

摘要 (Abstract)

Training large language models (LLMs) is constrained by memory requirements, with activations accounting for a substantial fraction of the total footprint. Existing approaches reduce memory using low-rank weight parameterizations or low-rank gradient subspaces for optimizer states, while activation memory is addressed through architectural modifications or compression schemes based on periodically updated projections. We propose OASIS, an online activation subspace learning algorithm for memory-efficient training that tracks and continuously updates a low-dimensional activation subspace during training. Intermediate activations are projected onto this evolving subspace, reducing memory without modifying forward-pass computations. The evolving activation subspace induces low-rank gradient representations, enabling both gradients and optimizer states to be maintained directly in this subspace, while a projection-aware optimizer consistently transports optimizer states across subspace updates for stable training. Across various finetuning and pretraining tasks, OASIS achieves up to $2\times$ lower peak memory than full fine-tuning while matching its performance and outperforming prior low-rank methods.

关键词: Large Language Models, Memory-efficient Training, Activation Subspace, Parameter-efficient Fine-tuning, Low-rank Projection, Online Learning, Model Compression, Training Optimization

4. SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

作者: Sihang Jiang, Lipeng Ma, Zhonghua Hong, Keyi Wang, Zhiyu Lu, Shisong Chen, Jinghao Zhang, Tianjun Pan, Weijia Zhou, Jiaqing Liang, Yanghua Xiao 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08988v1

评分: 33.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文《SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment》的核心是评估基于LLM的自我进化智能体（SEA），因此与"LLM Agents"和"Large Language Models"高度相关（10分）。论文关注智能体跨任务积累经验、优化策略和自我进化，与"Self-Correction"等概念有一定关联（8分）。论文提到智能体受限于静态工具集，间接涉及"Tool Use"（5分）。其他关键词如MoE、Scaling Laws、RLHF、RAG等，论文未直接讨论或应用，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对当前LLM智能体受限于静态工具和任务间遗忘、无法跨任务积累经验的问题，提出了SEA-Eval基准，通过评估任务内执行可靠性和长期进化性能，发现现有框架存在进化瓶颈，相同成功率下token消耗差异可达31.2倍。

摘要翻译

当前基于大语言模型的智能体在单次任务执行中展现出强大性能，但受限于静态工具集和任务间记忆缺失，无法跨任务边界积累经验或优化策略。尽管自我进化智能体范式已被提出，本文基于数字具身性与连续跨任务进化的视角，为其贡献了新的形式化定义，并首次引入SEA-Eval基准测试框架，该框架通过任务执行可靠性与长期进化性能两个维度评估智能体的自我进化特性。通过将任务组织为连续序列流，并分析随时间推移的成功率与令牌消耗量，SEA-Eval以现有单次任务基准无法实现的方式量化进化增益与结构稳定性。实证评估揭示了当前前沿框架存在显著的进化瓶颈：在序列分析下，相同的成功率背后隐藏着高达31.2倍的令牌消耗差异及分化的进化轨迹。SEA-Eval为推进智能体从单纯任务执行器向真正自我进化的数字实体演进提供了严谨的科学基础。

摘要 (Abstract)

Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience or optimize strategies across task boundaries. While the Self-Evolving Agent (SEA) paradigm has been previously proposed, this paper contributes a new formal definition of SEA grounded in digital embodiment and continuous cross-task evolution, and introduces SEA-Eval, the first benchmark designed to evaluate SEA characteristics across two dimensions, intra-task execution reliability and long-term evolutionary performance. By organizing tasks into sequential streams and analyzing Success Rate and Token Consumption over time, SEA-Eval quantifies evolutionary gain and structural stability in ways that existing episodic benchmarks cannot. Empirical evaluations reveal a significant evolutionary bottleneck in current state-of-the-art frameworks, where identical success rates mask up to 31.2 times differences in token consumption and divergent evolutionary trajectories under sequential analysis. SEA-Eval provides a rigorous scientific foundation for advancing agents from mere task executors toward genuinely self-evolving digital entities.

关键词: Self-Evolving Agents, LLM-based agents, Benchmark evaluation, Episodic assessment, Cross-task evolution, Evolutionary performance, Token consumption, Digital embodiment

5. Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs

作者: Qixuan Huang, Khalid Zaman, Masashi Unoki 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09021v1

评分: 30.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文聚焦于听觉大语言模型（ALLMs）的幻觉缓解问题，提出了一种基于上下文学习的噪声感知方法（NAICL）。核心相关关键词包括：1) “Large Language Models” (权重1.0，评分10.0)：论文明确研究听觉大语言模型（ALLMs），是LLM在音频领域的应用变体，属于核心内容。2) “Hallucination Mitigation” (权重1.0，评分10.0)：论文的核心问题是缓解ALLMs的幻觉问题，提出了NAICL方法并显著降低了幻觉率（从26.53%降至16.98%），是论文的核心创新点。3) “In-context Learning” (权重1.0，评分10.0)：论文提出的NAICL方法本质上是基于上下文学习的插件式方法，通过检索噪声先验作为上下文来引导模型，是该技术的核心应用。其他关键词如MoE、SFT、RAG、量化等，论文未涉及这些具体技术，因此评分为0。论文属于大模型在特定领域（音频）的应用研究，并针对幻觉问题提出了创新性的上下文学习解决方案，符合研究背景中“大模型在不同领域的研究应用”和“创新型强”的描述。

!!! tip deepseek-chat TL;DR

该论文针对听觉大语言模型在音频描述任务中存在的幻觉问题，提出了一种噪声感知的上下文学习方法（NAICL），通过检索和整合噪声先验作为上下文，显著将整体幻觉率从26.53%降低至16.98%。

摘要翻译

听觉大语言模型（ALLMs）在音频理解与推理任务中展现出强大的通用能力，但其可靠性仍受幻觉问题影响。现有的幻觉评估方法被构建为二元分类任务，不足以刻画生成任务中出现的更复杂幻觉模式。此外，当前的幻觉缓解策略依赖于微调，导致计算成本高昂。为应对上述局限，我们提出一种即插即用的噪声感知上下文学习（NAICL）方法。具体而言，我们构建噪声先验库，检索与输入音频相关的噪声样本，并将其作为上下文先验信息融入，从而引导模型在声学证据不足时减少推测性关联，并采用更保守的生成策略。此外，我们为音频描述任务建立了幻觉基准，包括构建Clotho-1K多事件基准数据集、定义四类听觉幻觉类型，并引入幻觉类型分布等指标以支持细粒度分析。实验结果表明，所有被评估的ALLMs均表现出相似的幻觉行为。所提出的NAICL方法将整体幻觉率从26.53%降低至16.98%。

摘要 (Abstract)

Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination issues. Existing hallucination evaluation methods are formulated as binary classification tasks, which are insufficient to characterize the more complex hallucination patterns that arise in generative tasks. Moreover, current hallucination mitigation strategies rely on fine-tuning, resulting in high computational costs. To address the above limitations, we propose a plug-and-play Noise-Aware In-Context Learning (NAICL) method. Specifically, we construct a noise prior library, retrieve noise examples relevant to the input audio, and incorporate them as contextual priors, thereby guiding the model to reduce speculative associations when acoustic evidence is insufficient and to adopt a more conservative generation strategy. In addition, we establish a hallucination benchmark for audio caption tasks including the construction of the Clotho-1K multi-event benchmark dataset, the definition of four types of auditory hallucinations, and the introduction of metrics such as hallucination type distribution to support fine-grained analysis. Experimental results show that all evaluated ALLMs exhibit same hallucination behaviors. Moreover, the proposed NAICL method reduces the overall hallucination rate from 26.53% to 16.98%.

关键词: Auditory Large Language Models, Hallucination Mitigation, In-Context Learning, Noise-Aware, Audio Captioning, Benchmark Dataset, Plug-and-play Method

6. Generalization and Scaling Laws for Mixture-of-Experts Transformers

作者: Mansour Zoubeirou a Mayaki 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09175v1

评分: 28.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	10.0/10	10.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究Mixture-of-Experts (MoE) Transformers的泛化理论和缩放定律，与"Mixture of Experts"和"Scaling Laws"高度相关（10分）。论文涉及Transformer架构和理论分析，与"Large Language Models"有一定关联（8分）。其他关键词如训练方法、推理技术、应用领域等均未在标题或摘要中提及，评为0分。

!!! tip deepseek-chat TL;DR

该论文研究了Mixture-of-Experts Transformers的泛化理论和缩放定律，证明了在考虑活跃参数后，MoE架构的近似和估计权衡与密集网络类似，并推导了模型大小、数据大小和计算最优权衡的神经缩放定律。

摘要翻译

我们提出了一种针对混合专家（Mixture-of-Experts, MoE）Transformer的泛化与缩放理论，该理论清晰地将每输入激活容量与路由组合复杂性分离开来。通过固定路由模式并对其进行联合边界分析，我们推导出一个上确界覆盖数界，其度量熵的缩放与激活参数量相关，并包含MoE特有的路由开销。结合平方损失的标准经验风险最小化（ERM）分析，我们在$d$维流形数据模型和$C^β$目标函数下得到一个泛化界，表明一旦恰当考虑激活参数，逼近与估计之间的权衡关系与稠密网络中类似。我们进一步证明了MoE架构的一个构造性逼近定理，表明在逼近构造下，误差可以通过增加激活容量或增加专家数量来降低，具体取决于主导瓶颈。基于这些结果，我们推导出关于模型规模、数据规模以及计算最优权衡的神经缩放定律。总体而言，我们的研究为理解MoE缩放提供了一个清晰的统计学参考点，明确了哪些行为是由最坏情况理论所保证的，而哪些必须依赖于数据相关的路由结构或优化动态。

摘要 (Abstract)

We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^β$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.

关键词: Mixture-of-Experts, Transformers, generalization theory, scaling laws, active parameters, routing combinatorics, neural scaling laws, approximation theorem

📋 所有论文列表

1. ✅ From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

作者: Chenchen Zhang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09459v1

评分: 50.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	10.0/10	10.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文系统综述了强化学习在大语言模型中从推理到智能体场景的信用分配问题，提出了分类框架并贡献了三个可复用资源，揭示了从推理到智能体强化学习的转变如何重塑信用分配格局。

摘要翻译

针对大型语言模型（LLM）的强化学习（RL）日益依赖于稀疏的结果级奖励，然而确定长轨迹中哪些行为导致了最终结果仍然困难。这一信用分配（CA）问题体现在两种范式下：推理强化学习，其信用需在单次思维链生成（500至30K+以上词元）的词元和步骤间分配；以及智能体强化学习，其中多轮环境交互引入了随机状态转移、部分可观测性以及长达100轮以上（100K至1M词元）的交互跨度，使得片段级信用信息逐渐失效。
本文系统综述了2024年至2026年初发表的47种信用分配方法（41种核心方法，6种相邻支撑技术），通过分配粒度（词元级、片段级、步骤级、轮次级、多智能体级）与方法论（蒙特卡洛、时序差分、基于模型、博弈论、信息论）两个维度构建了分类体系。除综述本身外，我们贡献了三项可复用资源：（1）包含分类标签、基线族和证据等级的结构化机器可读文献库；（2）面向未来信用分配论文的规范化报告清单，该清单基于已综述文献验证，可识别系统性方法缺陷；（3）涵盖任务族、元数据要求和可控分叉任务的基准协议规范，并附有方法选择决策树。
我们的综合分析表明，从推理强化学习向智能体强化学习的转变使信用分配格局趋于复杂并发生重构：推理信用分配正围绕过程奖励模型与无评论者群体比较方法趋于成熟，而智能体信用分配则催生了真正创新的方法——事后反事实分析、特权非对称评论者以及轮次级马尔可夫决策过程重构——这些方法在推理强化学习中并无直接先例。

摘要 (Abstract)

Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards – yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500–30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K–1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches – hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations – that have no direct precedent in reasoning RL.

关键词: Credit Assignment, Reinforcement Learning, Large Language Models, Reasoning RL, Agentic RL, Chain-of-Thought, Monte Carlo, Survey

2. ✅ Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication

作者: Benjamin Amoh, Geoffrey Parker, Wesley Marrero 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08944v1

评分: 45.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	15.0/10	15.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

该论文研究多智能体在部分可观测环境下的协调问题，提出了一种基于价值感知顺序通信的决策聚焦学习框架（SeqComm-DFL），通过在协作医疗和星际争霸多智能体挑战基准上实现显著更高的累积奖励和胜率提升，解决了信息不对称下的协调策略优化问题。

摘要翻译

在部分可观测性下的多智能体协调需要智能体共享互补的私有信息。尽管现有方法通常针对中间目标（如重构精度或互信息）优化消息传递，而非决策质量，本文提出了 SeqComm-DFL，将序列化通信与面向决策的学习相统一以提升任务性能。我们的方法采用 基于价值感知的序列化斯塔克尔伯格条件消息生成：消息以最大化接收方决策质量为目标，并依据优先级顺序生成，每个智能体在其前序智能体的条件下生成消息，其 引导潜力 由它们的亲社会排序决定。我们将最优模型设计扩展至采用 QMIX 因子化的通信增强世界模型，通过隐式微分实现高效的端到端训练。我们证明了信息论边界，表明通信价值随协调差距而扩展，并建立了双层优化 $\mathcal{O}(1/\sqrt{T})$ 的收敛性，其中 $T$ 表示训练迭代次数。在协作医疗和星际争霸多智能体挑战（StarCraft Multi-Agent Challenge, SMAC）基准测试中，SeqComm-DFL 实现了四至六倍的累积奖励提升和超过 13% 的胜率改进，实现了在信息不对称条件下无法达成的协调策略。

摘要 (Abstract)

Multi-agent coordination under partial observability requires agents to share complementary private information. While recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information), rather than decision quality, we introduce \textbf{SeqComm-DFL}, unifying the sequential communication with decision-focused learning for task performance. Our approach features \emph{value-aware message generation with sequential Stackelberg conditioning}: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors. The \emph{guidance potential} determined by their prosocial ordering. We extend Optimal Model Design to communication-augmented world models with QMIX factorization, enabling efficient end-to-end training via implicit differentiation. We prove information-theoretic bounds showing that communication value scales with coordination gaps and establish $\mathcal{O}(1/\sqrt{T})$ convergence for the bilevel optimization, where $T$ denotes the number of training iterations. On collaborative healthcare and StarCraft Multi-Agent Challenge (SMAC) benchmarks, SeqComm-DFL achieves four to six times higher cumulative rewards and over 13% win rate improvements, enabling coordination strategies inaccessible under information asymmetry.

3. ✅ OASIS: Online Activation Subspace Learning for Memory-Efficient Training

作者: Sakshi Choudhary, Utkarsh Saxena, Kaushik Roy 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09406v1

评分: 44.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

论文提出OASIS算法，通过在线学习低维激活子空间来减少大语言模型训练时的内存占用，在保持性能的同时实现了比全微调低2倍的内存消耗。

摘要翻译

大型语言模型（LLM）的训练受限于内存需求，其中激活值占用了总内存占用的很大一部分。现有方法通过低秩权重参数化或针对优化器状态的低秩梯度子空间来减少内存，而激活内存则通过架构修改或基于周期性更新投影的压缩方案来处理。我们提出OASIS，一种用于高效内存训练的在线激活子空间学习算法，该算法在训练过程中跟踪并持续更新一个低维激活子空间。中间激活值被投影到这个不断演化的子空间上，从而在不修改前向传播计算的情况下减少内存占用。不断演化的激活子空间诱导出低秩梯度表示，使得梯度和优化器状态都能直接在该子空间内维护，同时一个投影感知的优化器在子空间更新过程中持续迁移优化器状态，以确保训练稳定性。在各种微调和预训练任务中，OASIS实现了比完整微调低达$2\times$的峰值内存占用，同时保持了与之相当的性能，并优于先前的低秩方法。

摘要 (Abstract)

Training large language models (LLMs) is constrained by memory requirements, with activations accounting for a substantial fraction of the total footprint. Existing approaches reduce memory using low-rank weight parameterizations or low-rank gradient subspaces for optimizer states, while activation memory is addressed through architectural modifications or compression schemes based on periodically updated projections. We propose OASIS, an online activation subspace learning algorithm for memory-efficient training that tracks and continuously updates a low-dimensional activation subspace during training. Intermediate activations are projected onto this evolving subspace, reducing memory without modifying forward-pass computations. The evolving activation subspace induces low-rank gradient representations, enabling both gradients and optimizer states to be maintained directly in this subspace, while a projection-aware optimizer consistently transports optimizer states across subspace updates for stable training. Across various finetuning and pretraining tasks, OASIS achieves up to $2\times$ lower peak memory than full fine-tuning while matching its performance and outperforming prior low-rank methods.

关键词: Large Language Models, Memory-efficient Training, Activation Subspace, Parameter-efficient Fine-tuning, Low-rank Projection, Online Learning, Model Compression, Training Optimization

4. ✅ SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

评分: 33.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文针对当前LLM智能体受限于静态工具和任务间遗忘、无法跨任务积累经验的问题，提出了SEA-Eval基准，通过评估任务内执行可靠性和长期进化性能，发现现有框架存在进化瓶颈，相同成功率下token消耗差异可达31.2倍。

摘要翻译

当前基于大语言模型的智能体在单次任务执行中展现出强大性能，但受限于静态工具集和任务间记忆缺失，无法跨任务边界积累经验或优化策略。尽管自我进化智能体范式已被提出，本文基于数字具身性与连续跨任务进化的视角，为其贡献了新的形式化定义，并首次引入SEA-Eval基准测试框架，该框架通过任务执行可靠性与长期进化性能两个维度评估智能体的自我进化特性。通过将任务组织为连续序列流，并分析随时间推移的成功率与令牌消耗量，SEA-Eval以现有单次任务基准无法实现的方式量化进化增益与结构稳定性。实证评估揭示了当前前沿框架存在显著的进化瓶颈：在序列分析下，相同的成功率背后隐藏着高达31.2倍的令牌消耗差异及分化的进化轨迹。SEA-Eval为推进智能体从单纯任务执行器向真正自我进化的数字实体演进提供了严谨的科学基础。

摘要 (Abstract)

Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience or optimize strategies across task boundaries. While the Self-Evolving Agent (SEA) paradigm has been previously proposed, this paper contributes a new formal definition of SEA grounded in digital embodiment and continuous cross-task evolution, and introduces SEA-Eval, the first benchmark designed to evaluate SEA characteristics across two dimensions, intra-task execution reliability and long-term evolutionary performance. By organizing tasks into sequential streams and analyzing Success Rate and Token Consumption over time, SEA-Eval quantifies evolutionary gain and structural stability in ways that existing episodic benchmarks cannot. Empirical evaluations reveal a significant evolutionary bottleneck in current state-of-the-art frameworks, where identical success rates mask up to 31.2 times differences in token consumption and divergent evolutionary trajectories under sequential analysis. SEA-Eval provides a rigorous scientific foundation for advancing agents from mere task executors toward genuinely self-evolving digital entities.

关键词: Self-Evolving Agents, LLM-based agents, Benchmark evaluation, Episodic assessment, Cross-task evolution, Evolutionary performance, Token consumption, Digital embodiment

5. ✅ Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs

作者: Qixuan Huang, Khalid Zaman, Masashi Unoki 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09021v1

评分: 30.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文针对听觉大语言模型在音频描述任务中存在的幻觉问题，提出了一种噪声感知的上下文学习方法（NAICL），通过检索和整合噪声先验作为上下文，显著将整体幻觉率从26.53%降低至16.98%。

摘要翻译

听觉大语言模型（ALLMs）在音频理解与推理任务中展现出强大的通用能力，但其可靠性仍受幻觉问题影响。现有的幻觉评估方法被构建为二元分类任务，不足以刻画生成任务中出现的更复杂幻觉模式。此外，当前的幻觉缓解策略依赖于微调，导致计算成本高昂。为应对上述局限，我们提出一种即插即用的噪声感知上下文学习（NAICL）方法。具体而言，我们构建噪声先验库，检索与输入音频相关的噪声样本，并将其作为上下文先验信息融入，从而引导模型在声学证据不足时减少推测性关联，并采用更保守的生成策略。此外，我们为音频描述任务建立了幻觉基准，包括构建Clotho-1K多事件基准数据集、定义四类听觉幻觉类型，并引入幻觉类型分布等指标以支持细粒度分析。实验结果表明，所有被评估的ALLMs均表现出相似的幻觉行为。所提出的NAICL方法将整体幻觉率从26.53%降低至16.98%。

摘要 (Abstract)

Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination issues. Existing hallucination evaluation methods are formulated as binary classification tasks, which are insufficient to characterize the more complex hallucination patterns that arise in generative tasks. Moreover, current hallucination mitigation strategies rely on fine-tuning, resulting in high computational costs. To address the above limitations, we propose a plug-and-play Noise-Aware In-Context Learning (NAICL) method. Specifically, we construct a noise prior library, retrieve noise examples relevant to the input audio, and incorporate them as contextual priors, thereby guiding the model to reduce speculative associations when acoustic evidence is insufficient and to adopt a more conservative generation strategy. In addition, we establish a hallucination benchmark for audio caption tasks including the construction of the Clotho-1K multi-event benchmark dataset, the definition of four types of auditory hallucinations, and the introduction of metrics such as hallucination type distribution to support fine-grained analysis. Experimental results show that all evaluated ALLMs exhibit same hallucination behaviors. Moreover, the proposed NAICL method reduces the overall hallucination rate from 26.53% to 16.98%.

关键词: Auditory Large Language Models, Hallucination Mitigation, In-Context Learning, Noise-Aware, Audio Captioning, Benchmark Dataset, Plug-and-play Method

6. ✅ Generalization and Scaling Laws for Mixture-of-Experts Transformers

作者: Mansour Zoubeirou a Mayaki 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09175v1

评分: 28.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	10.0/10	10.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文研究了Mixture-of-Experts Transformers的泛化理论和缩放定律，证明了在考虑活跃参数后，MoE架构的近似和估计权衡与密集网络类似，并推导了模型大小、数据大小和计算最优权衡的神经缩放定律。

摘要翻译

我们提出了一种针对混合专家（Mixture-of-Experts, MoE）Transformer的泛化与缩放理论，该理论清晰地将每输入激活容量与路由组合复杂性分离开来。通过固定路由模式并对其进行联合边界分析，我们推导出一个上确界覆盖数界，其度量熵的缩放与激活参数量相关，并包含MoE特有的路由开销。结合平方损失的标准经验风险最小化（ERM）分析，我们在$d$维流形数据模型和$C^β$目标函数下得到一个泛化界，表明一旦恰当考虑激活参数，逼近与估计之间的权衡关系与稠密网络中类似。我们进一步证明了MoE架构的一个构造性逼近定理，表明在逼近构造下，误差可以通过增加激活容量或增加专家数量来降低，具体取决于主导瓶颈。基于这些结果，我们推导出关于模型规模、数据规模以及计算最优权衡的神经缩放定律。总体而言，我们的研究为理解MoE缩放提供了一个清晰的统计学参考点，明确了哪些行为是由最坏情况理论所保证的，而哪些必须依赖于数据相关的路由结构或优化动态。

摘要 (Abstract)

We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^β$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.

关键词: Mixture-of-Experts, Transformers, generalization theory, scaling laws, active parameters, routing combinatorics, neural scaling laws, approximation theorem

7. ❌ ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

作者: Xiaoke Guo, Songze Li, Zhiqiang Liu, Zhaoyan Gong, Yuanxiang Liu, Huajun Chen, Wen Zhang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08999v1

评分: 26.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文ASTRA专注于解决LLMs在复杂表格问答中的序列化瓶颈，核心是提出一种自适应语义树推理架构。因此，与"Large Language Models"高度相关（10分），因为论文明确使用LLMs进行全局语义感知和推理。与推理相关的关键词"Chain of Thought"和"System 2 Thinking"有一定关联（8分），因为论文涉及多步推理（树搜索导航和符号代码执行）和深度推理（解决推理不透明问题）。其他关键词如MoE、SLMs、训练技术、对齐、RAG、压缩、代理等均未在摘要中提及或直接相关，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对大型语言模型在复杂表格问答中存在的序列化瓶颈（如结构忽视、表示差距和推理不透明），提出了ASTRA自适应语义树推理架构，通过AdaSTR模块将表格重构为逻辑语义树，并利用DuTR双模式推理框架结合树搜索导航和符号代码执行，在复杂表格基准测试中实现了最先进的性能。

摘要翻译

表格序列化仍然是大型语言模型在复杂表格问答任务中的关键瓶颈，其面临结构忽视、表征差距和推理不透明等挑战。现有序列化方法难以捕捉显式层次结构且缺乏模式灵活性，而当前基于树结构的方法则受限于语义适应性不足。为应对这些局限，我们提出ASTRA（自适应语义树推理架构），包含AdaSTR与DuTR两大核心模块。首先，我们引入AdaSTR模块，该模块利用大型语言模型的全局语义感知能力，将表格重构为逻辑语义树。这种序列化方法显式建模层次依赖关系，并采用自适应机制根据表格规模优化构建策略。其次，基于此结构我们提出DuTR模块——一种双模式推理框架，它融合了基于树搜索的文本导航（实现语言对齐）与符号化代码执行（确保精确验证）。在复杂表格基准测试上的实验表明，本方法取得了最先进的性能表现。

摘要 (Abstract)

Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.

关键词: Large Language Models, Table Question Answering, Semantic Tree, Adaptive Reasoning, Tree Search, Symbolic Execution, State-of-the-Art

8. ❌ Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

作者: Lingfeng Huang, Huizhong Guo, Tianjun Wei, Yingpeng Du, Zhu Sun 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09368v1

评分: 25.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究LLM agents作为用户模拟器在推荐系统评估中的应用，并提出了FixATE方法，通过视觉注意力对齐提升模拟保真度。因此，与"Large Language Models"和"LLM Agents"高度相关（10分）。论文使用interpretability operators分析VLM的视觉注意力，与"Mechanistic Interpretability"有一定关联（5分）。其他关键词如MoE、SFT、RAG、量化等均未在论文中涉及，故得0分。

!!! tip deepseek-chat TL;DR

该论文研究如何通过将视觉语言模型的视觉注意力与用户特定的注视模式对齐，来提升大型语言模型代理作为推荐系统用户模拟器的保真度，实验表明该方法能有效改善注意力对齐和点击预测准确性。

摘要翻译

大语言模型智能体正日益被部署为可扩展的用户模拟器，用于推荐系统评估。然而，现有模拟器通过文本或结构化元数据来感知推荐内容，而非真实用户浏览的视觉界面——这是一个关键差距，因为对推荐布局的注意力既是视觉驱动的，也具有高度个性化。我们研究是否将视觉语言模型的视觉注意力与用户特定的注视模式对齐可以提高模拟保真度。对在一个基于轮播的推荐场景中收集的真实眼动追踪数据集的分析表明，用户表现出稳定的个体注视模式，这些模式能强有力地预测点击行为。基于这一发现，我们提出了面向用户仿真的注视对齐微调方法。我们的方法首先通过可解释性算子探查视觉语言模型的内部视觉注意力，以获得与人类注视点可比的槽位级相关性分布；随后学习个性化的软提示，以引导模型的注意力朝向每个用户特有的注视模式。在三种基于可解释性的探查算子和两种架构不同的视觉语言模型骨干上进行的实验表明，该方法在注意力对齐和点击预测准确性方面均取得了持续提升。这些结果表明，让模型“像用户一样观看”是一条可行的路径，能够开发出更忠实复现用户在推荐界面中如何感知与行动的模拟器。

摘要 (Abstract)

Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse-a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model’s (VLM’s) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM’s internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model’s attention toward each user’s characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model “see like the user” is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.

关键词: Large Language Model agents, user simulation, recommender system evaluation, visual attention alignment, gaze patterns, vision-language model, interpretability operators, personalized soft prompts

9. ❌ Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching

作者: Cyrus Zhou, Yufei Jin, Yilin Xu, Yu-Chiang Wang, Chieh-Ju Chao, Monica S. Lam 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08849v1

评分: 20.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

评分理由: 该论文提出SatIR方法，利用大型语言模型（LLMs）将临床推理转化为形式化约束，用于临床试验匹配，属于大模型在生物信息学/医学领域的应用研究。因此，与"Large Language Models"和"AI for Science"高度相关（10分），这两个关键词直接对应论文的核心技术（LLMs）和应用领域（生物医学）。其他关键词涉及大模型技术原理（如MoE、Scaling Laws、训练方法、推理优化等）或特定应用方向（如Agent、工具使用等），论文未涉及这些具体技术，故评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对临床试验匹配中现有检索方法召回率低、精度差和可解释性不足的问题，提出了一种基于约束满足和大型语言模型的可扩展方法SatIR，实验证明其能显著提高相关试验的检索数量、召回率和服务患者比例，且检索速度快。

摘要翻译

临床试验是循证医学的核心，然而尽管ClinicalTrials.gov上登记了超过五十万项试验、每月吸引约两百万用户，许多试验仍难以达到招募目标。现有的检索技术主要基于患者档案与入选标准之间的关键词和嵌入相似性匹配，由于复杂的约束条件，常面临召回率低、精确度有限且可解释性不足的问题。我们提出SatIR，一种基于约束满足的可扩展临床试验检索方法，能够实现患者与相关试验的高精度、可解释匹配。我们的方法采用形式化方法——可满足性模理论（Satisfiability Modulo Theories, SMT）和关系代数——来高效表示和匹配临床试验与患者记录中的关键约束。除了利用既有的医学本体和概念模型，我们使用大语言模型（Large Language Models, LLMs）将关于模糊性、隐含临床假设和不完整患者记录的非形式化推理，转化为明确、精确、可控且可解释的形式化约束。在59名患者和3,621项试验上的评估显示，SatIR在所有三项检索目标上均优于TrialGPT。它为每名患者多检索出32%-72%的相关且符合资格的试验，对有用试验并集的召回率提升了22-38个百分点，并为更多患者提供了至少一项有用试验。检索速度较快，每名患者在3,621项试验中的平均检索时间为2.95秒。这些结果表明SatIR具有可扩展性、高效性和可解释性。

摘要 (Abstract)

Clinical trials are central to evidence-based medicine, yet many struggle to meet enrollment targets, despite the availability of over half a million trials listed on ClinicalTrials.gov, which attracts approximately two million users monthly. Existing retrieval techniques, largely based on keyword and embedding-similarity matching between patient profiles and eligibility criteria, often struggle with low recall, low precision, and limited interpretability due to complex constraints. We propose SatIR, a scalable clinical trial retrieval method based on constraint satisfaction, enabling high-precision and interpretable matching of patients to relevant trials. Our approach uses formal methods – Satisfiability Modulo Theories (SMT) and relational algebra – to efficiently represent and match key constraints from clinical trials and patient records. Beyond leveraging established medical ontologies and conceptual models, we use Large Language Models (LLMs) to convert informal reasoning regarding ambiguity, implicit clinical assumptions, and incomplete patient records into explicit, precise, controllable, and interpretable formal constraints. Evaluated on 59 patients and 3,621 trials, SatIR outperforms TrialGPT on all three evaluated retrieval objectives. It retrieves 32%-72% more relevant-and-eligible trials per patient, improves recall over the union of useful trials by 22-38 points, and serves more patients with at least one useful trial. Retrieval is fast, requiring 2.95 seconds per patient over 3,621 trials. These results show that SatIR is scalable, effective, and interpretable.

关键词: Clinical trials matching, Constraint satisfaction, Large Language Models (LLMs), Information retrieval, Satisfiability Modulo Theories (SMT), Interpretability, Scalability, Bioinformatics

10. ❌ Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks

作者: Wen Qiu, Zhiqiang He, Wei Zhao, Hiroshi Masui 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09028v1

评分: 20.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文提出了一种用于无人机辅助应急通信网络的塑性增强多智能体专家混合（PE-MAMoE）方法，该方法的核心创新在于将专家混合（MoE）架构与多智能体系统相结合，以解决动态环境下的策略适应性问题。因此，与"Mixture of Experts"和"Multi-agent Systems"高度相关（10分），因为MoE是其核心架构，多智能体是其应用框架。论文未涉及大语言模型、训练技术、推理方法、模型压缩、AI for Science等其他关键词，因此这些关键词得分为0分。

!!! tip deepseek-chat TL;DR

该论文解决了无人机辅助应急通信网络中，因用户移动性和流量需求突变导致深度强化学习策略适应性下降的问题，提出了一种塑性增强多智能体专家混合方法，在仿真中显著提升了性能指标并减少了碰撞。

摘要翻译

作为空中基站的无人机可在灾后快速恢复通信连接，但用户移动性与流量需求的突变会改变服务质量权衡关系并引发强烈非平稳性。在此类变化下，深度强化学习策略会因表征坍缩与神经元休眠而丧失适应能力，导致可塑性缺失。本文提出可塑性增强的多智能体专家混合网络（PE-MAMoE），该框架基于多智能体近端策略优化算法构建，采用集中训练与分散执行架构。PE-MAMoE为每架无人机配备稀疏门控的专家混合执行器，其路由器在每步决策中选取单一专家网络。非参数化相位控制器在相位切换后注入短暂、仅针对专家的随机扰动，重置动作对数标准差，退火熵项与学习率，并调度路由器温度——所有设计均旨在重新激活策略可塑性而不破坏安全行为稳定性。我们推导出动态遗憾上界，证明跟踪误差随环境变化量与累积噪声能量共同缩放。在包含移动用户和3GPP标准信道的相位驱动仿真环境中，PE-MAMoE相较最佳基线方法将标准化四分位均值回报提升26.3%，服务用户容量提高12.8%，碰撞率降低约75%。诊断结果证实，该框架在状态切换时能持续保持更高的专家特征秩，并实现周期性休眠神经元恢复。

摘要 (Abstract)

Unmanned aerial vehicles serving as aerial base stations can rapidly restore connectivity after disasters, yet abrupt changes in user mobility and traffic demands shift the quality of service trade-offs and induce strong non-stationarity. Deep reinforcement learning policies suffer from plasticity loss under such shifts, as representation collapse and neuron dormancy impair adaptation. We propose plasticity enhanced multi-agent mixture of experts (PE-MAMoE), a centralized training with decentralized execution framework built on multi-agent proximal policy optimization. PE-MAMoE equips each UAV with a sparsely gated mixture of experts actor whose router selects a single specialist per step. A non-parametric Phase Controller injects brief, expert-only stochastic perturbations after phase switches, resets the action log-standard-deviation, anneals entropy and learning rate, and schedules the router temperature, all to re-plasticize the policy without destabilizing safe behaviors. We derive a dynamic regret bound showing the tracking error scales with both environment variation and cumulative noise energy. In a phase-driven simulator with mobile users and 3GPP-style channels, PE-MAMoE improves normalized interquartile mean return by 26.3% over the best baseline, increases served-user capacity by 12.8%, and reduces collisions by approximately 75%. Diagnostics confirm persistently higher expert feature rank and periodic dormant-neuron recovery at regime switches.

关键词: Multi-agent Systems, Mixture of Experts, Unmanned Aerial Vehicles, Emergency Communication Networks, Deep Reinforcement Learning, Dynamic Adaptation, Plasticity Enhancement, Proximal Policy Optimization

11. ❌ Text-Conditioned Multi-Expert Regression Framework for Fully Automated Multi-Abutment Design

作者: Mianjie Zheng, Xinquan Yang, Xuefen Liu, Xuguang Li, Kun Tang, He Meng, Linlin Shen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09047v1

评分: 18.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

评分理由: 该论文提出了一种用于牙科种植体基台设计的深度学习框架TEMAD，其中核心创新是引入了System-Prompted Mixture-of-Experts (SPMoE)机制，这与关键词"Mixture of Experts"高度相关（10分）。论文属于AI在生物医学领域的应用，与"AI for Science"有一定关联（8分）。其他关键词主要涉及大语言模型、训练方法、推理技术等通用AI主题，而本文专注于特定领域的计算机视觉和回归任务，因此相关性为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为TEMAD的文本条件多专家回归框架，解决了牙科种植体基台设计自动化程度低、依赖人工干预的问题，通过集成植入部位识别和系统感知的专家选择机制，在多基台场景中实现了全自动设计并取得了最先进的性能。

摘要翻译

牙种植体基台作为种植体固位体与修复冠之间的几何与生物力学界面，其设计目前高度依赖人工操作且耗时较长。尽管已有研究提出采用深度神经网络辅助牙医设计基台，但现有方法大多仍以手动或半自动为主，需要临床医生大量干预，且在多基台场景中缺乏可扩展性。为克服这些局限，我们提出了TEMAD——一种完全自动化、文本驱动的多专家架构，用于多基台设计。该框架将种植位点定位、种植系统识别及兼容基台参数回归集成至统一流程中。具体而言，我们引入了种植位点识别网络（ISIN）以自动定位种植位点，并将该信息传递给后续的多基台回归网络。我们进一步设计了牙齿条件特征线性调制模块（TC-FiLM），该模块利用牙齿嵌入向量自适应校准网格表征，实现基于位置的特征调制。此外，系统提示的混合专家机制（SPMoE）借助种植系统提示信息引导专家选择，确保系统感知的回归过程。在大规模基台设计数据集上的大量实验表明，与现有方法相比，TEMAD实现了最先进的性能，尤其在多基台设置中表现突出，验证了其在全自动牙科种植规划中的有效性。

摘要 (Abstract)

Dental implant abutments serve as the geometric and biomechanical interface between the implant fixture and the prosthetic crown, yet their design relies heavily on manual effort and is time-consuming. Although deep neural networks have been proposed to assist dentists in designing abutments, most existing approaches remain largely manual or semi-automated, requiring substantial clinician intervention and lacking scalability in multi-abutment scenarios. To address these limitations, we propose TEMAD, a fully automated, text-conditioned multi-expert architecture for multi-abutment design. This framework integrates implant site localization and implant system, compatible abutment parameter regression into a unified pipeline. Specifically, we introduce an Implant Site Identification Network (ISIN) to automatically localize implant sites and provide this information to the subsequent multi-abutment regression network. We further design a Tooth-Conditioned Feature-wise Linear Modulation (TC-FiLM) module, which adaptively calibrates mesh representations using tooth embeddings to enable position-specific feature modulation. Additionally, a System-Prompted Mixture-of-Experts (SPMoE) mechanism leverages implant system prompts to guide expert selection, ensuring system-aware regression. Extensive experiments on a large-scale abutment design dataset show that TEMAD achieves state-of-the-art performance compared to existing methods, particularly in multi-abutment settings, validating its effectiveness for fully automated dental implant planning.

关键词: dental implant abutment design, multi-expert architecture, fully automated, text-conditioned, implant site localization, mixture-of-experts, regression framework, dental AI

12. ❌ VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

作者: Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09531v1

评分: 15.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究使用大语言模型（LLMs）生成合成数据来训练视觉语言模型（VLMs），因此与"Large Language Models"高度相关（10分）。论文涉及使用合成数据训练模型，这与监督微调（SFT）有一定关联（5分），因为SFT是训练VLMs的常见方法，但论文未明确提及SFT。其他关键词如MoE、SLMs、Scaling Laws、RLHF、RAG、CoT、Agents、Quantization等均未在论文中涉及，因此得0分。

!!! tip deepseek-chat TL;DR

该论文研究了如何利用大语言模型生成合成图像数据来训练视觉语言模型，以解决其在视觉感知任务（如空间理解和视角识别）上的不足，实验结果表明该方法能显著提升模型在多个基准测试上的性能。

摘要翻译

视觉语言模型（VLMs）在空间理解和视角识别等视觉感知任务上仍面临困难。一个可能的原因是自然图像数据集对低层次视觉技能提供的监督有限。这引出了一个实际问题：仅通过任务关键词（如深度顺序）生成的有针对性的合成监督能否解决这些缺陷？为探究此问题，我们提出了VisionFoundry——一种任务感知的合成数据生成流程，该流程仅以任务名称作为输入，利用大语言模型（LLMs）生成问题、答案以及文本到图像（T2I）提示，随后通过T2I模型合成图像，并使用专有VLM验证一致性，整个过程无需参考图像或人工标注。借助VisionFoundry，我们构建了VisionFoundry-10K，这是一个包含10个任务、涵盖1万个图像-问题-答案三元组的合成视觉问答（VQA）数据集。在VisionFoundry-10K上训练的模型在视觉感知基准测试中取得了显著提升：在MMVP上提升7%，在CV-Bench-3D上提升10%，同时保持了更广泛的模型能力，并显示出随数据量增加而积极扩展的特性。我们的研究结果表明，有限的任务针对性监督是造成此瓶颈的重要因素，而合成监督为VLMs更系统化的训练提供了一条可行路径。

摘要 (Abstract)

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.

关键词: Vision-language models, Synthetic data generation, Large language models, Visual perception, Visual question answering, Task-targeted supervision, Text-to-image models, Data scaling

13. ❌ Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy

作者: Jiaheng Dai, Huanrong Liu, Tailai Zhou, Tongyu Jia, Qin Liu, Yutong Ban, Zeju Li, Yu Gao, Xin Ma, Qingbiao Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09051v1

评分: 5.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

评分理由: 该论文研究机器人辅助肾部分切除术中精细动作分割的计算机视觉任务，使用I3D特征和时序模型（如MS-TCN++、DiffAct）在临床视频数据集上进行评估。论文内容与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、智能体等）完全无关，因为这些关键词均聚焦于自然语言处理或通用大模型领域，而本文是纯粹的计算机视觉/医疗影像分析研究。唯一略有相关的关键词是"AI for Science" OR “Bioinformatics” OR “Cheminformatics”，因为论文属于AI在生物医学（手术视频分析）中的应用，但并非核心匹配（论文未使用大模型或深度学习创新技术，而是传统时序模型），因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出了一个用于机器人辅助肾部分切除术中肾缝合精细动作分割的基准SIA-RAPN，在50个临床视频上评估了四种时序模型，发现DiffAct模型在多项指标上表现最佳。

摘要翻译

机器人辅助肾部分切除术中肾实质缝合阶段的细粒度动作分割任务，需对视觉相似、持续时间可变且存在显著类别不平衡的缝合手势进行逐帧识别。SIA-RAPN 基准基于达芬奇 Xi 系统采集的 50 段临床视频定义了该问题，视频数据已标注 12 种逐帧动作类别。该基准比较了四种基于 I3D 特征构建的时序模型：MS-TCN++、AsFormer、TUT 与 DiffAct。评估指标采用平衡准确率、编辑分数、重叠阈值为 10%、25% 与 50% 的分段 F1 值、逐帧准确率以及逐帧平均精度均值。除在 SIA-RAPN 已发布的五种数据划分配置上进行主体评估外，该基准还在独立的单孔 RAPN 数据集上报告了跨域性能结果。在主体数据集五次运行中取得的最优结果中，DiffAct 在分段 F1、逐帧准确率、编辑分数和逐帧平均精度均值上表现最佳，而 MS-TCN++ 则获得了最高的平衡准确率。

摘要 (Abstract)

Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10, 25, and 50, frame-wise accuracy, and frame-wise mean average precision. In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Across the strongest reported values over those five runs on the primary dataset, DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame mAP, while MS-TCN++ attains the highest balanced accuracy.

关键词: action segmentation, robot-assisted partial nephrectomy, renorrhaphy, temporal models, SIA-RAPN benchmark, surgical video analysis, I3D features, DiffAct

14. ❌ Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

作者: Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg, Peter Henderson, Seraphina Goldfarb-Tarrant, Yonatan Belinkov 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09544v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs的安全对齐机制和内部表征结构，与’Large Language Models’、‘Instruction Tuning/Alignment’、‘Mechanistic Interpretability’高度相关（10分）。涉及对齐训练后的微调（SFT）和有害内容生成（与事实性相关），各给5分。使用权重剪枝作为干预方法，与模型压缩技术有一定关联，给5分。其他关键词如MoE、量化、推理加速等未在论文中涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文研究发现大型语言模型的有害内容生成依赖于一组紧凑且通用的权重，这些权重与良性能力分离，对齐训练会压缩这些权重，解释了微调导致的突发性错位现象。

摘要翻译

大型语言模型（LLMs）经过对齐训练以避免有害行为，但由此产生的安全防护措施仍显脆弱：越狱攻击常能绕过这些防护，且在狭窄领域进行微调可能引发广泛泛化的“突发性错位”。这种脆弱性是否反映了模型内部对有害性缺乏连贯的内在组织，目前尚不明确。本研究采用定向权重剪枝作为因果干预手段，以探究LLMs中有害性的内部组织机制。研究发现，有害内容的生成依赖于一组紧凑的权重，这些权重在不同有害类型间具有通用性，且与良性能力权重相分离。与未对齐模型相比，对齐模型表现出更强的有害生成权重压缩性，这表明对齐过程在模型内部重塑了有害表征——尽管表层的安全防护机制仍显脆弱。这种压缩性解释了突发性错位的成因：如果有害能力的权重被高度压缩，在某一领域微调时激活这些权重可能引发广泛的错位现象。与此一致的是，在狭窄领域剪除有害生成权重能显著减少突发性错位。值得注意的是，LLMs生成有害内容的能力与其识别和解释此类内容的能力存在解耦。这些结果共同揭示了LLMs内部存在连贯的有害性组织结构，这可能为构建更具原则性的安全方法奠定基础。

摘要 (Abstract)

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce ``emergent misalignment’’ that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally–despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.

关键词: Large Language Models, Alignment, Harmful Content Generation, Weight Pruning, Internal Organization, Emergent Misalignment, Safety, Mechanistic Interpretability

15. ❌ Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision

作者: Soroosh Tayebi Arasteh, Mehdi Joodaki, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09537v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文提出了一种用于证据验证的监督框架，并在放射学领域进行了实例化，主要关注证据依赖性的监督构建和模型验证。论文内容与大多数关键词（如LLM技术、训练方法、推理技术、模型优化等）完全无关，因为这些关键词涉及大模型的具体技术、架构、训练或推理方法，而论文并未涉及这些技术。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文在放射学（医学领域）应用了AI方法，属于AI在科学领域的应用，但并非核心内容，因此给5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出了一个案例驱动的证据验证框架，通过生成支持和非支持示例的监督构建方法，在放射学领域训练了一个证据验证器，结果表明该验证器能有效依赖证据进行决策，但性能受证据源偏移和骨干网络选择的影响。

摘要翻译

基于证据的推理不仅需要将检索到的文本附加到预测中：模型应能根据所提供证据是否支持目标主张来做出决策。在实践中，这一目标往往难以实现，原因在于监督信号薄弱、证据与主张的关联松散，且评估未直接检验证据依赖性。我们提出了案例驱动的证据验证框架，该通用框架要求模型接收局部案例背景、外部证据及结构化主张，并必须判断证据是否支持该案例的主张。我们的核心贡献在于设计了一种监督构建流程，该流程能自动生成明确的支持性示例，同时生成语义受控的非支持性示例（包括反事实错误状态示例和主题相关负例），而无需人工标注证据。我们在放射学领域实例化了该框架，并基于生成的支持性任务训练了一个标准验证器。学习后的验证器显著优于仅使用案例或仅使用证据的基线模型，在证据正确时保持强劲性能，而在证据被移除或替换时性能急剧下降，这表明其真正具备了证据依赖性。该能力可迁移至未见过的证据文献和外部案例分布，但在证据来源发生变化时性能会下降，且对主干模型的选择仍较为敏感。总体而言，结果表明证据落地的主要瓶颈不仅在于模型能力，更在于缺乏能够编码证据因果作用的监督机制。

摘要 (Abstract)

Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.

关键词: evidence verification, supervision construction, case-grounded, radiology, evidence dependence, counterfactual examples, model evaluation, AI in healthcare

作者: Zibin Geng, Xuefeng Jiang, Jia Li, Zheng Li, Tian Wen, Lvhua Wu, Sheng Sun, Yuwei Wang, Min Liu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09532v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究视觉语言模型（VLM）中的提示学习（prompt learning）在标签噪声下的鲁棒性问题，提出VisPrompt框架。核心相关关键词是’PEFT OR LoRA OR Parameter-efficient Fine-tuning’（得10分），因为论文明确研究参数高效的提示学习方法，并保持预训练VLM主干冻结，仅引入少量可训练参数。其他关键词主要涉及大语言模型（LLM）的特定技术（如MoE、RLHF、RAG等）或科学AI应用，与论文的视觉语言模型和噪声鲁棒性研究主题无关，均得0分。

!!! tip deepseek-chat TL;DR

该论文提出VisPrompt框架，通过视觉引导的跨模态注意力机制和条件调制机制，解决了视觉语言模型提示学习在标签噪声下的鲁棒性问题，在多个基准数据集上实现了优于现有方法的性能。

摘要翻译

提示学习是视觉语言模型的一种参数高效方法，但其在标签噪声下的鲁棒性尚未得到充分研究。视觉内容包含更丰富且更可靠的语义信息，在标签噪声下保持更强的稳健性。然而，提示本身极易受到标签噪声的影响。基于这一观察，我们提出VisPrompt——一种面向噪声标签场景的轻量级鲁棒视觉引导提示学习框架。具体而言，我们利用跨模态注意力机制将视觉语义反向注入提示表示中，使提示词能够选择性地聚合与当前样本相关的视觉信息，从而通过将提示学习锚定于稳定的实例级视觉证据来提升鲁棒性，并减少噪声监督的影响。针对所有样本采用相同视觉信息注入方式（尽管其视觉线索质量存在差异）所导致的不稳定性，我们进一步引入轻量级条件调制机制，以自适应控制视觉信息注入的强度，从而在文本侧语义先验与图像侧实例证据之间达成更稳健的平衡。该框架有效抑制了噪声引起的干扰，降低了提示更新的不稳定性，并缓解了对错误标注样本的记忆。VisPrompt在保持预训练视觉语言模型主干冻结且仅引入少量额外可训练参数的同时，显著提升了鲁棒性。在合成与真实世界标签噪声场景下的大量实验表明，VisPrompt在七个基准数据集上普遍优于现有基线方法，并展现出更强的鲁棒性。我们的代码已公开于https://github.com/gezbww/Vis_Prompt。

摘要 (Abstract)

Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is less investigated. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of injecting visual information for all samples, despite differences in the quality of their visual cues, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses the noise-induced disturbances, reduce instability in prompt updates, and alleviate memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small amount of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.

关键词: prompt learning, vision-language models, label noise, robustness, cross-modal attention, parameter-efficient fine-tuning, visual semantics, noisy-label settings

17. ❌ Envisioning the Future, One Step at a Time

作者: Stefan Andreas Baumann, Jannik Wiese, Tommaso Martorella, Mahdi M. Kalayeh, Björn Ommer 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09527v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的未来场景动态预测，使用自回归扩散模型处理稀疏点轨迹，与大多数关键词（主要关于大语言模型技术）完全无关。唯一相关的关键词是’World Models AND General World Models’，因为论文涉及场景动态建模和未来预测，属于世界模型范畴，但并非通用世界模型，因此给予10分（高度相关但非核心）。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于稀疏点轨迹的自回归扩散模型，用于从单张图像预测开放集未来场景动态，在保持预测准确性的同时实现了数量级更高的采样速度。

摘要翻译

要准确预测复杂多样场景的演化过程，需要模型能够表征不确定性、模拟长链交互作用，并高效探索多种可能的未来状态。然而，现有方法大多依赖稠密视频或潜空间预测，将大量计算能力耗费在稠密的外观信息上，而非场景中稀疏的点轨迹这一底层结构。这使得对未来假设的大规模探索成本高昂，并在长时程、多模态运动至关重要时限制了性能表现。为此，我们将开放集未来场景动态的预测问题，形式化为对稀疏点轨迹的逐步推理。我们的自回归扩散模型通过局部可预测的短时状态转移来推进这些轨迹，显式地建模了不确定性随时间增长的过程。这种以动态为中心的表征方式，能够从单张图像快速推演出数千种不同的未来状态，并可在初始运动约束的引导下进行，同时保持物理合理性与长程一致性。我们进一步提出了OWM基准测试——一个基于多样化真实世界视频的开放集运动预测评估体系，用于衡量现实不确定性条件下预测轨迹分布的准确性与多样性。我们的方法在预测准确性上达到或超越了稠密模拟器，同时实现了数量级更高的采样速度，使得开放集未来预测兼具可扩展性与实用性。项目页面：http://compvis.github.io/myriad。

摘要 (Abstract)

Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.

关键词: future scene dynamics, sparse point trajectories, autoregressive diffusion model, open-set motion prediction, uncertainty modeling, long-horizon prediction, scene evolution, trajectory prediction

18. ❌ VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

作者: Wenyi Xiao, Xinchi Xu, Leilei Gan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09529v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究大型视觉语言模型（LVLMs）的置信度校准问题，属于大模型技术应用范畴。核心相关关键词：1）‘Large Language Models’（8分）- 论文聚焦LVLMs，是LLMs的多模态扩展；2）‘Hallucination Mitigation’（10分）- 核心解决幻觉问题，提出视觉与推理置信度解耦方法；3）‘Chain of Thought’（8分）- 涉及多模态推理过程分析；4）‘Mechanistic Interpretability’（5分）- 通过KL散度和token熵分析模型内部机制。其他关键词如MoE、量化、RAG等未涉及。

!!! tip deepseek-chat TL;DR

该论文针对大型视觉语言模型在推理中经常出现高置信度幻觉的问题，提出了VL-Calibration框架，通过解耦视觉和推理置信度并引入内在视觉确定性估计，有效改善了校准效果并提升了视觉推理准确性。

摘要翻译

大型视觉语言模型（LVLMs）在多模态推理方面表现出色，但常以高确定性产生幻觉和错误响应，这阻碍了其在高风险领域中的应用。现有的言语化置信度校准方法主要针对纯文本大语言模型开发，通常基于二元答案级正确性优化单一整体置信度。这种设计与LVLMs不匹配：错误预测可能源于感知失败，也可能源于在正确感知下的推理错误，而单一置信度混淆了这些来源，同时视觉不确定性常受语言先验主导。为解决这些问题，我们提出VL-Calibration——一个将置信度显式解耦为视觉置信度与推理置信度的强化学习框架。为在没有真实感知标注的情况下监督视觉置信度，我们引入了一种内在视觉确定性估计方法，其结合了（i）通过图像扰动下KL散度度量的视觉基础性，以及（ii）通过词元熵度量的内部确定性。我们进一步提出词元级优势重加权方法，基于视觉确定性聚焦优化关键词元，从而抑制无根据的幻觉同时保留有效感知。在十三个基准测试上的实验表明，VL-Calibration在提升视觉推理准确性的同时有效改进了校准效果，并且能够泛化至不同模型规模和架构的分布外基准测试。

摘要 (Abstract)

Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.

关键词: Large Vision Language Models, confidence calibration, hallucination mitigation, multimodal reasoning, visual grounding, reinforcement learning, token entropy, out-of-distribution generalization

19. ❌ Strategic Algorithmic Monoculture:Experimental Evidence from Coordination Games

作者: Gonzalo Ballestero, Hadi Hosseini, Samarth Khanna, Ran I. Shorrer 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09502v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究LLM在多智能体协调博弈中的行为，与’Large Language Models’和’LLM Agents’、‘Multi-agent Systems’高度相关（10分），因为这些是论文的核心研究对象。其他关键词如MoE、SFT、RAG等涉及具体技术细节，论文未探讨，故均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了大型语言模型在多智能体协调博弈中表现出的算法单一化现象，发现LLM与人类一样会基于激励调整行为相似性，但在维持异质性方面不如人类。

摘要翻译

人工智能体日益在多智能体环境中运行，其成效往往依赖于协同合作。我们区分了基础算法单一性（即基准行动相似性）与策略性算法单一性——后者指智能体根据激励调整相似性程度。我们设计了一个简洁的实验方案，清晰分离这两种机制，并在人类与大型语言模型（LLM）受试者中进行了验证。实验表明，LLM展现出高度的基准相似性（基础单一性），并且与人类相似，它们会根据协同激励调节这种相似性（策略性单一性）。尽管LLM在相似行动上表现出极强的协同能力，但在奖励差异化行为时，它们维持异质性的能力仍落后于人类。

摘要 (Abstract)

AI agents increasingly operate in multi-agent environments where outcomes depend on coordination. We distinguish primary algorithmic monoculture – baseline action similarity – from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives. We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects. LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, they regulate it in response to coordination incentives (strategic monoculture). While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.

关键词: algorithmic monoculture, coordination games, multi-agent environments, large language models, LLM agents, strategic behavior, experimental design, heterogeneity

20. ❌ Semantic Rate-Distortion for Bounded Multi-Agent Communication: Capacity-Derived Semantic Spaces and the Communication Cost of Alignment

作者: Anthony T. Nixon 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09521v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究异构智能体之间的通信理论，与’Multi-agent Systems OR Agent Coordination’高度相关（10分），涉及智能体对齐和通信，与’LLM Agents OR Autonomous Agents OR Agentic Workflow’有一定关联（5分），与’Instruction Tuning OR Alignment OR Value Alignment’在概念上有一定联系（5分），但论文未涉及具体的大模型技术、训练方法、推理优化或科学应用，其他关键词均不相关（0分）。

!!! tip deepseek-chat TL;DR

该论文研究了异构智能体在计算能力不同情况下的通信理论，证明了通信存在结构性相变，并建立了基于容量派生语义空间的通信基准。

摘要翻译

当两个具有不同计算能力的智能体与同一环境交互时，它们并非必然对同一语义字母表进行差异化压缩；它们可能完全归纳出不同的语义字母表。我们证明，商部分可观测马尔可夫决策过程 $Q_{m,T}(M)$——即与智能体能力一致的最粗粒度抽象——可作为任何有界智能体的能力衍生语义空间，并且异质智能体间的通信会呈现尖锐的结构相变。在由商不匹配度决定的临界速率 $R_{\text{crit}}$ 以下，保持意图的通信在结构上无法实现。在支持的单向无记忆机制中，经典的边信息编码可在超越诱导基准的速率上实现指数级衰减。经典编码定理在源字母表固定后给出速率；我们的贡献在于从有界交互本身推导出该字母表。
具体而言，我们证明了：（1）一个固定 $\varepsilon$ 的结构相变定理，其下界在公共历史商比较上具有完全普适性；（2）在商字母表上的单向 Wyner-Ziv 基准识别，具备精确逆定理、对无记忆商源的精确操作等价性，以及通过显式混合边界建立的遍历长期桥梁；（3）在收缩失真机制 $\varepsilon = O(1/T)$ 下的渐近单向逆定理，该定理从消息流与解码器边信息角度证明；（4）支持通过中间能力水平进行组合通信的对齐遍历边界。在八个 POMDP 环境（包括 RockSample(4,4)）上的实验验证了相变现象，结构化策略基准表明单向速率相较于计数边界最多可降低 $19\times$，而收缩失真扫描结果与渐近逆定理的机制相符。

摘要 (Abstract)

When two agents of different computational capacities interact with the same environment, they need not compress a common semantic alphabet differently; they can induce different semantic alphabets altogether. We show that the quotient POMDP $Q_{m,T}(M)$ - the unique coarsest abstraction consistent with an agent’s capacity - serves as a capacity-derived semantic space for any bounded agent, and that communication between heterogeneous agents exhibits a sharp structural phase transition. Below a critical rate $R_{\text{crit}}$ determined by the quotient mismatch, intent-preserving communication is structurally impossible. In the supported one-way memoryless regime, classical side-information coding then yields exponential decay above the induced benchmark. Classical coding theorems tell you the rate once the source alphabet is fixed; our contribution is to derive that alphabet from bounded interaction itself. Concretely, we prove: (1) a fixed-$\varepsilon$ structural phase-transition theorem whose lower bound is fully general on the common-history quotient comparison; (2) a one-way Wyner-Ziv benchmark identification on quotient alphabets, with exact converse, exact operational equality for memoryless quotient sources, and an ergodic long-run bridge via explicit mixing bounds; (3) an asymptotic one-way converse in the shrinking-distortion regime $\varepsilon = O(1/T)$, proved from the message stream and decoder side information; and (4) alignment traversal bounds enabling compositional communication through intermediate capacity levels. Experiments on eight POMDP environments (including RockSample(4,4)) illustrate the phase transition, a structured-policy benchmark shows the one-way rate can drop by up to $19\times$ relative to the counting bound, and a shrinking-distortion sweep matches the regime of the asymptotic converse.

关键词: multi-agent communication, semantic rate-distortion, capacity-derived semantic spaces, POMDP, structural phase transition, quotient mismatch, alignment traversal, Wyner-Ziv benchmark

21. ❌ VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

作者: Yucheng Shen, Jiulong Wu, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09508v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出VISOR框架，专注于视觉检索增强生成（VRAG）中的智能体系统，核心解决多步推理中的视觉证据稀疏性和搜索漂移问题。高度相关的关键词包括：‘Retrieval-Augmented Generation’（核心方法）、‘LLM Agents’（智能体框架）、‘Chain of Thought’和’System 2 Thinking’（多步推理机制）。中等相关的关键词：‘Large Language Models’（可能基于VLM）、‘Context Window Extension’（处理长上下文）、‘Self-Correction’（动作评估机制）、‘Tool Use’（视觉动作作为工具）。其他关键词如MoE、量化、对齐等未涉及，评分为0。

!!! tip deepseek-chat TL;DR

论文提出VISOR框架，通过结构化证据空间、视觉动作评估和动态轨迹机制，解决了视觉检索增强生成中多步推理的视觉证据稀疏性和搜索漂移问题，在多个基准测试中实现了最先进的性能。

摘要翻译

视觉检索增强生成（Visual Retrieval-Augmented Generation，简称VRAG）赋予视觉语言模型对视觉丰富文档进行检索与推理的能力。为应对需要多步推理的复杂查询，智能体式VRAG系统将推理与迭代检索交织进行。然而，现有智能体式VRAG面临两个关键瓶颈：（1）视觉证据稀疏性：关键证据分散在不同页面却孤立处理，阻碍跨页推理；此外，图像内部的细粒度证据常需精确的视觉操作，其误用会降低检索质量；（2）长程搜索漂移：跨检索页面的视觉标记积累会稀释上下文并导致认知过载，使智能体偏离搜索目标。为解决这些挑战，我们提出VISOR（基于迭代搜索与超视界推理的视觉检索增强生成），一个统一的单智能体框架。VISOR设计了结构化证据空间以支持渐进式跨页推理，并结合视觉操作评估与校正机制来管理视觉操作。此外，我们引入带滑动窗口与意图注入的动态轨迹以缓解搜索漂移。该设计锚定证据空间，同时丢弃早期的原始交互，防止上下文被视觉标记淹没。我们采用基于组相对策略优化的强化学习（GRPO-based RL）流程训练VISOR，该流程结合了状态掩码与专为动态上下文重构设计的信用分配机制。在ViDoSeek、SlideVQA和MMLongBench上的大量实验表明，VISOR在长程视觉推理任务中实现了最先进的性能，并具有卓越的效率。

摘要 (Abstract)

Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval.. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. They anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.

关键词: Visual Retrieval-Augmented Generation, Agentic VRAG, Multi-step Reasoning, Visual Evidence Sparsity, Search Drift, Over-horizon Reasoning, GRPO-based Reinforcement Learning, Long-horizon Visual Reasoning

22. ❌ BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

作者: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09497v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM评估方法，与’Large Language Models’高度相关（10分），因为全文围绕LLM评估展开；与’Post-training OR Supervised Fine-tuning OR SFT’有一定关联（5分），因为BERT-as-a-Judge需要轻量级训练；其他关键词如MoE、Scaling Laws、RAG等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型评估中传统词汇方法效果差而LLM评估方法成本高的问题，提出了基于BERT的轻量级评估方法BERT-as-a-Judge，在保持高性能的同时显著降低了计算成本。

摘要翻译

在大语言模型（LLM）生态系统中，准确评估是核心环节，它指导着模型选择及在不同应用场景中的下游采用。然而，在实践中，对生成输出的评估通常依赖于僵化的词汇匹配方法来提取和评判答案，这可能会混淆模型真实的问题解决能力与其对预定义格式规范的遵循程度。尽管近期出现的“LLM即评判员”方法通过评估语义正确性而非严格的结构一致性来缓解此问题，但它们也带来了巨大的计算开销，使得评估成本高昂。在本研究中，我们首先通过一项涵盖36个模型和15个下游任务的大规模实证研究，系统性地探究了词汇评估的局限性，证明此类方法与人类判断的相关性较弱。为应对这一局限，我们提出了“BERT即评判员”——一种基于编码器的评估方法，用于在基于参考的生成场景中评判答案正确性。该方法对输出表达的多样性具有鲁棒性，且仅需对合成标注的“问题-候选答案-参考答案”三元组进行轻量级训练。我们证明，该方法在性能上始终优于词汇基线，同时与规模大得多的LLM评判员表现相当，从而在两者之间提供了理想的权衡，实现了可靠且可扩展的评估。最后，通过大量实验，我们深入分析了“BERT即评判员”的性能表现，为实践者提供了实用指导，并公开了所有项目资源以促进其下游应用。

摘要 (Abstract)

Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model’s true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge’s performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.

关键词: LLM evaluation, BERT-as-a-Judge, reference-based evaluation, lexical methods, computational overhead, semantic correctness, synthetic annotation, scalable evaluation

23. ❌ XFED: Non-Collusive Model Poisoning Attack Against Byzantine-Robust Federated Classifiers

作者: Israt Jahan Mouri, Muhammad Ridowan, Muhammad Abdullah Adnan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09489v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究联邦学习中的模型投毒攻击和防御，属于机器学习安全领域，与所有评分关键词（均聚焦于大模型、深度学习技术原理及其在科学领域的应用）完全无关。论文未涉及任何大模型技术、训练方法、推理优化、对齐、代理系统或科学AI应用。

!!! tip deepseek-chat TL;DR

论文提出了一种无需攻击者间通信的非共谋模型投毒攻击方法XFED，实验表明它能绕过八种先进防御并优于六种现有攻击，揭示了联邦学习系统比预想中更不安全。

摘要翻译

模型投毒攻击对联邦学习构成了重大安全威胁。现有的大多数模型投毒攻击依赖于共谋，要求恶意客户端通过交换本地良性模型并同步生成其投毒更新来进行协同。然而，在实际的联邦学习部署中，维持这种协同日益不切实际，因为这实质上需要对大量设备实施类似僵尸网络的控制。这种方式的维护成本高昂且极易被检测。这一背景引出了一个根本性问题：攻击者之间在没有任何通信的情况下，模型投毒攻击是否仍然有效？为应对这一挑战，我们引入并形式化了非共谋攻击模型，在该模型中，所有被攻陷的客户端共享一个共同的对抗目标，但独立运作。在此模型下，每个攻击者生成其恶意更新时，既不与其他攻击者通信，也不访问其他客户端的更新，亦不依赖任何关于服务器端防御的知识。为证明此威胁模型的可行性，我们提出了XFED，这是首个与聚合方法无关的非共谋模型投毒攻击。我们在六个基准数据集上的实证评估表明，XFED能够绕过八种最先进的防御机制，并优于六种现有的模型投毒攻击。这些发现表明，联邦学习系统的安全性远低于先前的认知，并凸显了对更鲁棒、更实用的防御机制的迫切需求。

摘要 (Abstract)

Model poisoning attacks pose a significant security threat to Federated Learning (FL). Most existing model poisoning attacks rely on collusion, requiring adversarial clients to coordinate by exchanging local benign models and synchronizing the generation of their poisoned updates. However, sustaining such coordination is increasingly impractical in real-world FL deployments, as it effectively requires botnet-like control over many devices. This approach is costly to maintain and highly vulnerable to detection. This context raises a fundamental question: Can model poisoning attacks remain effective without any communication between attackers? To address this challenge, we introduce and formalize the \textbf{non-collusive attack model}, in which all compromised clients share a common adversarial objective but operate independently. Under this model, each attacker generates its malicious update without communicating with other adversaries, accessing other clients’ updates, or relying on any knowledge of server-side defenses. To demonstrate the feasibility of this threat model, we propose \textbf{XFED}, the first aggregation-agnostic, non-collusive model poisoning attack. Our empirical evaluation across six benchmark datasets shows that XFED bypasses eight state-of-the-art defenses and outperforms six existing model poisoning attacks. These findings indicate that FL systems are substantially less secure than previously believed and underscore the urgent need for more robust and practical defense mechanisms.

关键词: Federated Learning, Model Poisoning Attack, Non-collusive Attack, Aggregation-agnostic, Byzantine-robust Defenses, Security Threat, Adversarial Clients, Empirical Evaluation

24. ❌ RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

作者: Kyle Whitecross, Negin Rahimi 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09494v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文RecaLLM专注于解决大语言模型在长上下文推理中的’迷失在思考中’现象，通过交替推理和显式上下文检索来提升性能。核心相关关键词包括：LLMs（论文研究对象）、Post-training/SFT（采用后训练方法）、RAG（涉及检索增强生成）、Long Context LLMs（针对长上下文优化）、CoT Reasoning（涉及多步推理）、In-context Learning（核心机制）。其他关键词如MoE、SLMs、Scaling Laws、RLHF等未在论文中涉及，故评分为0。

!!! tip deepseek-chat TL;DR

论文提出RecaLLM模型，通过交替推理和显式上下文检索解决大语言模型在长上下文推理中的'迷失在思考中'问题，在RULER和HELMET基准上显著优于基线，且仅用短训练样本（≤10K tokens）即可处理长达128K tokens的上下文。

摘要翻译

我们提出RecaLLM，这是一组经过后训练、能够有效利用长上下文信息的推理语言模型。上下文检索（即从上下文中识别相关证据）与推理过程深度交织：检索为推理提供支持，而推理往往决定需要检索的内容。然而，二者的相互作用在很大程度上仍未得到充分探索。在对多个开源大语言模型的初步实验中，我们观察到即使经过很短的推理跨度，上下文检索性能也会显著下降，这揭示了一个我们称为“思维迷失”的测试时扩展关键瓶颈：提升性能的推理步骤同时会使后续的上下文检索更具挑战性。为解决这一局限，RecaLLM将推理与显式的上下文检索交错进行，在推理与获取解决中间子问题所需的上下文信息之间交替切换。我们引入了一种开销极低的约束解码机制，支持逐字复制证据片段，从而提升后续生成的基础可靠性。通过在多样化的词汇和语义检索任务上进行训练，RecaLLM在两个长上下文基准测试（RULER和HELMET）上取得了强劲性能，显著优于基线模型。值得注意的是，在使用最多仅1万标记的训练样本时，我们在长达12.8万标记的上下文窗口中仍观察到一致的性能提升，这远短于现有长上下文方法所需的训练数据长度，为无需昂贵长上下文训练数据即可提升长上下文性能指明了一条可行路径。

摘要 (Abstract)

We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open-source LLMs, we observe that in-context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test-time scaling that we refer to as lost-in-thought: reasoning steps that improve performance also make subsequent in-context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible-overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long-context approaches, highlighting a promising path toward improving long-context performance without expensive long-context training data.

关键词: RecaLLM, lost-in-thought phenomenon, in-context retrieval, reasoning language models, long-context, constrained decoding, RULER benchmark, HELMET benchmark

25. ❌ SafeMind: A Risk-Aware Differentiable Control Framework for Adaptive and Safe Quadruped Locomotion

作者: Zukun Zhang, Kai Shu, Mingqiao Mo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09474v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是四足机器人安全控制框架SafeMind，结合了概率控制屏障函数、语义上下文理解和元自适应风险校准，属于机器人控制与强化学习领域。所有评分关键词均围绕大模型技术、训练方法、推理优化、对齐技术等，与论文的机器人运动控制主题完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对四足机器人在模型不确定性、感知噪声和非结构化接触条件下缺乏安全保证的问题，提出了SafeMind可微分安全控制框架，实验表明该框架将安全违规减少3-10倍，能耗降低10-15%，同时保持实时控制性能。

摘要翻译

基于学习的四足机器人控制器展现出卓越的敏捷性，但通常在模型不确定性、感知噪声和非结构化接触条件下缺乏形式化的安全保障。我们提出SafeMind，一种可微分的随机安全控制框架，它将概率控制屏障函数与语义上下文理解及元自适应风险校准相统一。SafeMind通过嵌入可微分二次规划的方差感知屏障约束，显式建模认知不确定性和偶然不确定性，从而保持端到端训练的梯度流。一个语义到约束编码器利用感知或语言线索调节安全边界，而元自适应学习器则持续调整跨环境的风险敏感性。我们为随机动力学下的概率前向不变性、可行性和稳定性提供了理论条件。SafeMind以200赫兹频率部署于Unitree A1和ANYmal C机器人，并在12种地形类型、动态障碍物、形态扰动及语义定义任务中进行了验证。实验表明，相较于最先进的CBF、MPC和混合强化学习基线方法，SafeMind将安全违规事件减少了3至10倍，能耗降低了10%至15%，同时保持了实时控制性能。

摘要 (Abstract)

Learning-based quadruped controllers achieve impressive agility but typically lack formal safety guarantees under model uncertainty, perception noise, and unstructured contact conditions. We introduce SafeMind, a differentiable stochastic safety-control framework that unifies probabilistic Control Barrier Functions with semantic context understanding and meta-adaptive risk calibration. SafeMind explicitly models epistemic and aleatoric uncertainty through a variance-aware barrier constraint embedded in a differentiable quadratic program, thereby preserving gradient flow for end-to-end training. A semantics-to-constraint encoder modulates safety margins using perceptual or language cues, while a meta-adaptive learner continuously adjusts risk sensitivity across environments. We provide theoretical conditions for probabilistic forward invariance, feasibility, and stability under stochastic dynamics. SafeMind is deployed on Unitree A1 and ANYmal C at 200~Hz and validated across 12 terrain types, dynamic obstacles, morphology perturbations, and semantically defined tasks. Experiments show that SafeMind reduces safety violations by 3–10x and energy consumption by 10–15% relative to state-of-the-art CBF, MPC, and hybrid RL baselines, while maintaining real-time control performance.

关键词: quadruped locomotion, safety control, control barrier functions, differentiable programming, stochastic safety, meta-adaptive learning, real-time control, robotics

26. ❌ Process Reward Agents for Steering Knowledge-Intensive Reasoning

作者: Jiwoong Sohn, Tomasz Sternal, Kenneth Styppa, Torsten Hoefler, Michael Moor 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09482v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究Process Reward Agents (PRA)，这是一种在推理过程中提供在线、逐步奖励的方法，用于指导知识密集型推理。与检索增强生成(RAG)高度相关，因为论文提到检索增强变体并改进它们；与推理方法(Chain of Thought, System 2 Thinking)高度相关，因为专注于多步推理和深入推理；与LLM Agents高度相关，因为PRA本身就是一种代理方法；与AI for Science高度相关，因为实验在医学推理基准上进行。与Small Language Models有一定关联，因为使用4B模型；与Self-Correction和Hallucination Mitigation有一定关联，因为旨在减少错误传播。其他关键词如MoE、Scaling Laws、训练方法、优化技术等与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文提出Process Reward Agents (PRA)方法，通过提供在线、逐步奖励来指导知识密集型推理，在医学推理基准上显著提升冻结策略模型的准确性，最高提升25.7%，实现了4B规模的新SOTA。

摘要翻译

在知识密集型领域进行推理仍然具有挑战性，因为中间步骤通常无法在局部验证：与数学或代码不同，评估步骤的正确性可能需要综合来自大型外部知识源的线索。因此，细微的错误可能在推理轨迹中传播，且可能永远无法被检测到。先前的研究提出了过程奖励模型（PRMs），包括检索增强的变体，但这些方法都是事后运行的，对已完成的轨迹进行评分，这阻碍了它们与动态推理过程的整合。本文提出过程奖励智能体（PRA），这是一种在测试时提供基于领域、在线、逐步奖励给固定策略的方法。与先前的检索增强PRMs不同，PRA使基于搜索的解码能够在每个生成步骤中对候选轨迹进行排序和剪枝。在多个医学推理基准上的实验表明，PRA始终优于强基线，在MedQA上使用Qwen3-4B模型达到了80.8%的准确率，创造了4B规模的新技术水平。重要的是，PRA能够泛化到未见过的、参数量从0.5B到8B不等的固定策略模型，在无需更新策略模型的情况下，将其准确率最高提升25.7%。更广泛地说，PRA提出了一种范式，即将固定的推理器与特定领域的奖励模块解耦，从而允许在复杂领域中部署新的骨干模型而无需重新训练。

摘要 (Abstract)

Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge sources. As a result, subtle errors can propagate through reasoning traces, potentially never to be detected. Prior work has proposed process reward models (PRMs), including retrieval-augmented variants, but these methods operate post hoc, scoring completed trajectories, which prevents their integration into dynamic inference procedures. Here, we introduce Process Reward Agents (PRA), a test-time method for providing domain-grounded, online, step-wise rewards to a frozen policy. In contrast to prior retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step. Experiments on multiple medical reasoning benchmarks demonstrate that PRA consistently outperforms strong baselines, achieving 80.8% accuracy on MedQA with Qwen3-4B, a new state of the art at the 4B scale. Importantly, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by up to 25.7% without any policy model updates. More broadly, PRA suggests a paradigm in which frozen reasoners are decoupled from domain-specific reward modules, allowing the deployment of new backbones in complex domains without retraining.

关键词: Process Reward Agents, knowledge-intensive reasoning, retrieval-augmented, step-wise rewards, medical reasoning, frozen policy, search-based decoding, LLM agents

27. ❌ E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

作者: Weiyang Guo, Zesheng Shi, Liye Zhao, Jiayuan Ma, Zeen Zhu, Junxian He, Min Zhang, Jing Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09455v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM在工具集成推理（TIR）中的训练范式创新，直接涉及LLM、工具使用和智能体工作流。论文明确提到LLMs、Tool-Integrated Reasoning，并针对agent training提出新方法，因此与’Large Language Models OR LLMs OR Foundation Models’、‘Tool Use OR Function Calling OR API Tool Use’和’LLM Agents OR Autonomous Agents OR Agentic Workflow’高度相关（10分）。论文还涉及SFT（Supervised Fine-tuning）作为现有训练范式之一，因此与’Post-training OR Supervised Fine-tuning OR SFT’高度相关（10分）。其他关键词如MoE、SLMs、Scaling Laws、RAG、CoT等未在摘要中提及或与论文主题无关，得0分。

!!! tip deepseek-chat TL;DR

该论文针对大型语言模型在工具集成推理中现有训练范式（如Zero-RL和SFT-then-RL）存在的探索效率低、数据成本高和能力瓶颈问题，提出了E3-TIR这一预热训练范式，通过动态整合专家前缀、专家引导和自主探索三种经验类型，在工具使用任务上实现了6%的性能提升，同时仅需不到10%的合成数据，并在综合投资回报率指标上获得1.46倍的增益。

摘要翻译

尽管大型语言模型（LLM）在工具集成推理（TIR）中展现出显著潜力，但现有训练范式面临严重局限：零强化学习（Zero-RL）因缺乏先验指导而存在探索效率低下与模式退化问题，而“先监督微调后强化学习”（SFT-then-RL）则受限于高数据成本以及低熵崩溃导致的能力瓶颈。为应对这些挑战，我们提出E3-TIR（增强经验利用），一种面向智能体训练早期阶段的预热范式。具体而言，我们将训练过程构建为三种经验类型的动态整合：专家前缀（Expert Prefixes）、专家引导（Expert Guided）与自主探索（Self-Exploration）。通过围绕专家“锚点”执行多样化分支探索，并采用混合策略优化机制，我们有效缓解了分布偏移，并解决了共享前缀引发的优化冲突。该方法动态调整模型的知识边界，在探索多样性与训练效率之间实现了有效平衡。实验结果表明，E3-TIR在工具使用任务上相比传统范式实现了6%的性能提升，且所需合成数据量不足基准方法的10%。此外，在综合性能、数据成本与训练效率的复合指标ROI方面，我们相比基线获得了1.46倍的增益。代码发布于https://github.com/yuki-younai/E3-TIR。

摘要 (Abstract)

While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face significant limitations: Zero-RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT-then-RL is limited by high data costs and capability plateaus caused by low-entropy collapse. To address these challenges, we propose E3-TIR (Enhanced Experience Exploitation), a warm-up paradigm for the early stages of agent training. Specifically, we formulate training as the dynamic integration of three experience types: Expert Prefixes, Expert Guided, and Self-Exploration. By executing diverse branching exploration around expert “anchors” and employing a mix policy optimization mechanism, we effectively mitigate distribution shifts and resolve optimization conflicts arising from shared prefixes. Our method dynamically adapts the model’s knowledge boundaries, effectively balancing exploration diversity with training efficiency.Experimental results demonstrate that E3-TIR achieves a 6 performance improvement over traditional paradigms on tool-use tasks, while requiring less than 10 of the synthetic data. Furthermore, in terms of ROI, a comprehensive metric integrating performance, data cost, and training efficiency we achieve a 1.46x gain compared to baselines. Code is available at https://github.com/yuki-younai/E3-TIR.

关键词: Large Language Models, Tool-Integrated Reasoning, Agent Training, Supervised Fine-tuning, Experience Exploitation, Training Paradigm, Performance Improvement, Data Efficiency

28. ❌ SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

作者: Maksim Anisimov, Francesco Belardinelli, Matthew Wicker 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09452v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于深度强化学习（Deep Reinforcement Learning）中的安全策略更新问题，提出了SafeAdapt方法和Rashomon集概念来提供先验安全保证。虽然论文涉及深度学习（Deep Learning）和强化学习（RL），但所有给定的关键词都明确指向大语言模型（LLMs）及其相关技术（如MoE、SFT、RLHF、RAG、CoT等）、特定应用领域（如AI for Science）或大模型特有技术（如量化、推测解码）。论文完全没有涉及语言模型、自然语言处理、大模型架构、训练方法、推理技术或科学AI应用，因此所有关键词的相关度均为0。

!!! tip deepseek-chat TL;DR

该论文解决了深度强化学习中策略更新时如何保持安全约束的挑战，提出了基于Rashomon集的SafeAdapt方法，能够在适应新任务时提供先验的确定性安全保证。

摘要翻译

安全保证是将强化学习智能体部署于安全关键任务的前提。部署环境常呈现非平稳动态特性或面临性能目标变更，这要求对已习得策略进行更新。由此引出一个核心挑战：如何在更新强化学习策略的同时，保持其在先前任务中的安全属性？现有方法大多无法提供形式化保证，或仅能后验验证策略安全性。本文提出一种持续强化学习中安全策略更新的新型先验方法，通过引入拉什蒙集合——一个在策略参数空间中被证明能在演示数据分布内满足安全约束的区域。我们进而证明，通过将任意用于更新策略的强化学习算法的更新投影至该拉什蒙集合，即可为其提供形式化的、可证明的保证。我们在网格世界导航环境（冰冻湖与毒苹果）中进行了实证验证，确保在下游适应过程中对源任务具有先验可证明的确定性安全保证。相比之下，基于正则化的基线方法出现了安全约束的灾难性遗忘，而我们的方法在提供可证明安全保留保证的同时，实现了强大的适应能力。

摘要 (Abstract)

Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments exhibit non-stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid-world navigation environments (Frozen Lake and Poisoned Apple) where we guarantee an a priori provably deterministic safety on the source task during downstream adaptation. In contrast, we observe that regularisation-based baselines experience catastrophic forgetting of safety constraints while our approach enables strong adaptation with provable guarantees that safety is preserved.

关键词: Safe policy updates, Deep Reinforcement Learning, Safety guarantees, Rashomon set, Continual RL, Provable safety, Policy adaptation, Formal guarantees

29. ❌ ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

作者: Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09450v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文专注于医学影像报告生成，属于AI for Science/Bioinformatics领域（高度相关，10分）。论文提出了一种扩散模型加速方法（Direct Conditional Distillation），与Inference Acceleration有一定关联（5分）。论文未涉及大语言模型（LLMs）、MoE、量化、对齐、RAG、CoT、代理等关键词，因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

论文提出了一种高效的扩散视觉语言模型ECHO，通过直接条件蒸馏和响应不对称扩散策略，实现了胸部X光报告的单步块生成，在保持临床准确性的同时获得了8倍推理加速和显著的文本质量提升。

摘要翻译

胸部X光报告生成（CXR-RG）技术有望显著减轻放射科医生的工作负担。然而，传统的自回归视觉—语言模型（VLMs）因采用序列化标记解码而存在推理延迟高的问题。基于扩散的模型通过并行生成提供了一种有前景的替代方案，但仍需多次去噪迭代。将多步去噪压缩至单步可进一步降低延迟，但由于标记因子化解码器引入的均值场偏差，常导致文本连贯性下降。为解决这一挑战，我们提出了ECHO，一种用于胸部X光报告生成的高效基于扩散的视觉—语言模型（dVLM）。ECHO通过新颖的直接条件蒸馏（DCD）框架实现了稳定的每块单步推理，该框架通过从同策略扩散轨迹构建非因子化监督来编码联合标记依赖关系，从而缓解了均值场限制。此外，我们引入了响应非对称扩散（RAD）训练策略，在保持模型有效性的同时进一步提升了训练效率。大量实验表明，ECHO超越了当前最先进的自回归方法，将RaTE和SemScore分别提升了64.33% 和60.58%，同时在保证临床准确性的前提下实现了8倍的推理加速。

摘要 (Abstract)

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists’ workload. However, conventional autoregressive vision–language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose \textbf{ECHO}, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by \textbf{64.33%} and \textbf{60.58%} respectively, while achieving an \textbf{$8\times$} inference speedup without compromising clinical accuracy.

关键词: Chest X-ray report generation, Diffusion-based vision-language model, One-step inference, Direct Conditional Distillation, Response-Asymmetric Diffusion, Inference acceleration, Medical AI, Clinical accuracy

30. ❌ Many-Tier Instruction Hierarchy in LLM Agents

作者: Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, Daniel Khashabi 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09443v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM代理中的指令冲突解决，与’Large Language Models’和’LLM Agents’高度相关（10分），涉及指令层次和权限管理，与’Instruction Tuning’有一定关联（8分）。其他关键词如MoE、量化、推理加速等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对LLM代理中多源指令冲突问题，提出了Many-Tier Instruction Hierarchy方法，并创建了包含853个任务的ManyIH-Bench基准，实验显示当前前沿模型在指令冲突扩展时性能较差（约40%准确率）。

摘要翻译

大型语言模型智能体从多种来源接收指令——系统消息、用户提示、工具输出等——每种指令都承载着不同层级的信任度与权威性。当这些指令发生冲突时，模型必须可靠地遵循最高权限指令以保持安全性和有效性。当前主流范式“指令层级”（Instruction Hierarchy, IH）假设存在一个固定且有限的权限等级集合（通常少于五级），这些等级由僵化的角色标签定义（例如系统 > 用户）。这种范式难以适应现实世界中的智能体应用场景，因为冲突可能出现在更多样的来源和情境中。本研究提出“多层级指令体系”（Many-Tier Instruction Hierarchy, ManyIH），这是一种能够处理任意多权限层级间指令冲突的新范式。我们同时推出了首个面向ManyIH的基准测试“ManyIH-Bench”。该基准要求模型在多达12个不同权限层级的冲突指令中进行导航，包含853项智能体任务（427项代码任务与426项指令遵循任务）。ManyIH-Bench通过组合由大语言模型生成并经人工验证的约束条件，构建了涵盖46种现实世界智能体的真实且高难度的测试用例。实验结果表明，当指令冲突规模扩大时，即使当前最先进的模型也表现不佳（准确率约40%）。这项工作凸显了在智能体场景中，迫切需要能够针对细粒度、可扩展的指令冲突解决方法。

摘要 (Abstract)

Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, and more-each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.

关键词: LLM agents, instruction hierarchy, instruction conflict, privilege levels, ManyIH-Bench, agentic tasks, conflict resolution, scalable instruction

31. ❌ Physics-guided surrogate learning enables zero-shot control of turbulent wings

作者: Yuning Wang, Pol Suarez, Mathis Bode, Ricardo Vinuesa 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09434v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究湍流边界层控制，使用强化学习训练策略并实现零样本控制，属于AI在科学领域的应用（流体力学），但与所有大模型技术关键词无关，仅与’AI for Science’有一定关联（5分），因为它是AI在物理科学中的应用，但未涉及生物信息学或化学信息学。

!!! tip deepseek-chat TL;DR

该论文通过物理引导的代理学习方法，在湍流通道流中训练控制策略，实现了在NACA4412机翼上的零样本控制，使表面摩擦阻力降低28.7%，总阻力降低10.7%，同时将训练成本降低了四个数量级。

摘要翻译

气动表面上的湍流边界层是飞机阻力的主要来源，但由于其多尺度动力学特性和空间可变性，尤其在逆压梯度条件下，对其控制仍具挑战性。在典型流动中，强化学习已超越现有最优控制策略，但其在实际几何构型中的应用受限于计算成本与可迁移性。本文研究表明，通过利用壁面湍流的局部结构可克服这些限制。我们在与机翼边界层统计特性匹配的湍流槽道流中训练控制策略，并直接部署于$Re_c=2\times10^5$条件下的NACA4412翼型上，无需额外训练，即实现所谓“零样本控制”。该方法实现了28.7%的表面摩擦阻力降低与10.7%的总阻力降低，在摩擦阻力削减上超越现有最优反对称控制40%，总阻力削减提升5%。相较于在机翼上直接训练，本方法将训练成本降低四个数量级，为实现可扩展的流动控制提供了可能。

摘要 (Abstract)

Turbulent boundary layers over aerodynamic surfaces are a major source of aircraft drag, yet their control remains challenging due to multiscale dynamics and spatial variability, particularly under adverse pressure gradients. Reinforcement learning has outperformed state-of-the-art strategies in canonical flows, but its application to realistic geometries is limited by computational cost and transferability. Here we show that these limitations can be overcome by exploiting local structures of wall-bounded turbulence. Policies are trained in turbulent channel flows matched to wing boundary-layer statistics and deployed directly onto a NACA4412 wing at $Re_c=2\times10^5$ without further training, being the so-called zero-shot control. This achieves a 28.7% reduction in skin-friction drag and a 10.7% reduction in total drag, outperforming the state-of-the-art opposition control by 40% in friction drag reduction and 5% in total drag. Training cost is reduced by four orders of magnitude relative to on-wing training, enabling scalable flow control.

关键词: turbulent boundary layers, reinforcement learning, zero-shot control, skin-friction drag reduction, physics-guided surrogate learning, aerodynamic surfaces, computational fluid dynamics, flow control

32. ❌ TME-PSR: Time-aware, Multi-interest, and Explanation Personalization for Sequential Recommendation

作者: Qingzhuo Wang, Leilei Wen, Juntao Chen, Kunyu Peng, Ruiyang Qin, Zhihua Wei, Wen Shen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09439v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

评分理由: 论文TME-PSR专注于序列推荐系统，提出了一种整合时间感知、多兴趣和解释个性化的模型。虽然属于AI应用领域，但所有给定的关键词均针对大模型（LLM）技术、训练方法、推理优化、对齐技术、代理系统等特定主题。该论文未涉及任何大模型技术，也未在生物信息学或化学信息学等科学AI领域应用大模型，而是使用传统的深度学习架构（如线性循环单元）进行推荐系统优化。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该研究提出了一种整合时间感知、多兴趣和解释个性化的序列推荐模型TME-PSR，通过双门控时间编码器、轻量多头线性循环单元和动态双分支互信息加权机制，在降低计算成本的同时提高了推荐准确性和解释质量。

摘要翻译

本文提出了一种融合时间感知个性化、多兴趣个性化与解释个性化的序列推荐模型（TME-PSR），用于个性化序列推荐任务。该模型综合考虑了不同用户在时间节奏偏好、细粒度潜在兴趣多样性以及推荐结果与解释之间个性化语义对齐三个维度的差异。具体而言，TME-PSR模型采用双门控时间编码器捕捉个性化时间节奏，利用轻量级多头线性循环单元架构实现高效细粒度子兴趣建模，并通过动态双分支互信息加权机制达成推荐与解释的个性化对齐。在多个真实数据集上的大量实验表明，本方法能以较低计算成本持续提升推荐准确性与解释质量。

摘要 (Abstract)

In this paper, we propose a sequential recommendation model that integrates Time-aware personalization, Multi-interest personalization, and Explanation personalization for Personalized Sequential Recommendation (TME-PSR). That is, we consider the differences across different users in temporal rhythm preference, multiple fine-grained latent interests, and the personalized semantic alignment between recommendations and explanations. Specifically, the proposed TME-PSR model employs a dual-view gated time encoder to capture personalized temporal rhythms, a lightweight multihead Linear Recurrent Unit architecture that enables fine-grained sub-interest modeling with improved efficiency, and a dynamic dual-branch mutual information weighting mechanism to achieve personalized alignment between recommendations and explanations. Extensive experiments on real-world datasets demonstrate that our method consistently improves recommendation accuracy and explanation quality, at a lower computational cost.

关键词: sequential recommendation, personalization, time-aware, multi-interest, explanation, Linear Recurrent Unit, mutual information weighting, computational efficiency

33. ❌ On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework

作者: Dario Maio 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09430v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究量子启发的文档嵌入在检索增强生成（RAG）中的应用，与’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’高度相关（10分）。论文提到密集模型源自大语言模型（LLMs），因此与’Large Language Models OR LLMs OR Foundation Models’有一定关联（8分）。其他关键词如MoE、SLMs、对齐、推理加速等均未在论文中涉及，故评0分。

!!! tip deepseek-chat TL;DR

该论文研究了量子启发的1024维文档嵌入在检索增强生成（RAG）中的表示限制，实验结果表明BM25仍是强基线，量子启发嵌入作为独立检索表示存在几何结构限制，更适合作为辅助组件。

摘要翻译

文本嵌入是现代信息检索与检索增强生成（RAG）的核心技术。尽管当前实践主要采用源自大语言模型（LLM）的稠密模型，但近期研究受希尔伯特式空间的几何特性及其编码更丰富语义结构的潜力启发，开始探索量子启发的替代方案。本文提出了一种实验框架，用于构建基于重叠窗口与多尺度聚合的量子启发式1024维文档嵌入。该流程结合了语义投影（如EigAngle）、电路启发的特征映射、可选的师生蒸馏，以及用于确保可复现性和受控评估的指纹机制。我们引入了一套混合检索诊断工具，包括BM25与基于嵌入的评分之间的静态与动态插值、候选结果合并策略，以及为评分级融合提供理论上限的概念性α-预言机。通过在技术、叙事和法律领域的意大利语及英语受控语料库上使用合成查询进行实验，结果表明：BM25仍是强劲的基线方法；师生蒸馏产生的嵌入能提供稳定的语义结构；而独立的量子启发式嵌入则表现出较弱且不稳定的排序信号。蒸馏效果参差不齐，虽在某些情况下提升了语义对齐，但未能持续改善检索性能；而混合检索在结合词汇与嵌入信号时能够获得具有竞争力的结果。总体而言，实验结果揭示了量子启发式嵌入在几何结构上的局限性，包括距离压缩和排序不稳定性，并明确了其应作为辅助组件而非独立检索表征的角色。

摘要 (Abstract)

Text embeddings are central to modern information retrieval and Retrieval-Augmented Generation (RAG). While dense models derived from Large Language Models (LLMs) dominate current practice, recent work has explored quantum-inspired alternatives motivated by the geometric properties of Hilbert-like spaces and their potential to encode richer semantic structure. This paper presents an experimental framework for constructing quantum-inspired 1024-dimensional document embeddings based on overlapping windows and multi-scale aggregation. The pipeline combines semantic projections (e.g., EigAngle), circuit-inspired feature mappings, and optional teacher-student distillation, together with a fingerprinting mechanism for reproducibility and controlled evaluation. We introduce a set of diagnostic tools for hybrid retrieval, including static and dynamic interpolation between BM25 and embedding-based scores, candidate union strategies, and a conceptual alpha-oracle that provides an upper bound for score-level fusion. Experiments on controlled corpora of Italian and English documents across technical, narrative, and legal domains, using synthetic queries, show that BM25 remains a strong baseline, teacher embeddings provide stable semantic structure, and standalone quantum-inspired embeddings exhibit weak and unstable ranking signals. Distillation yields mixed effects, improving alignment in some cases but not consistently enhancing retrieval performance, while hybrid retrieval can recover competitive results when lexical and embedding-based signals are combined. Overall, the results highlight structural limitations in the geometry of quantum-inspired embeddings, including distance compression and ranking instability, and clarify their role as auxiliary components rather than standalone retrieval representations.

关键词: quantum-inspired embeddings, document embeddings, Retrieval-Augmented Generation, RAG, information retrieval, BM25, teacher-student distillation, hybrid retrieval

34. ❌ Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

作者: Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09429v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文《Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories》专注于计算机视觉和图形学领域，提出了一种视频扩散模型（VDM）来联合学习视频和相机轨迹的分布。其核心贡献在于将相机表示为密集光线像素（raxels），并通过解耦自交叉注意力机制与视频帧联合去噪。论文涉及相机参数估计、视频生成、自一致性测试等任务，但所有关键词均与大模型、深度学习技术原理或AI在科学领域的应用无关。关键词列表主要涵盖大语言模型（LLMs）的技术、训练方法、推理优化、对齐、代理系统、压缩、幻觉缓解、可解释性、科学AI应用等，而本论文未涉及任何这些主题，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种视频扩散模型（Rays as Pixels），通过将相机表示为光线像素并联合去噪视频帧和相机轨迹，解决了从稀疏图像中恢复相机参数和生成新视角视频的联合任务，实现了相机轨迹预测、联合生成和条件视频生成，并通过自一致性测试验证了其前向和逆向预测的一致性。

摘要翻译

从图像中恢复相机参数并从新视角渲染场景，长期以来在计算机视觉与图形学领域被视为独立任务。当图像覆盖稀疏或位姿存在歧义时，这种分离会失效，因为每项任务都需要另一项任务的输出结果。我们提出“以光线为像素”（Rays as Pixels）方法，这是一种学习视频与相机轨迹联合分布的视频扩散模型（Video Diffusion Model, VDM）。我们将每个相机表示为密集的光线像素（raxels），并通过解耦自交叉注意力机制（Decoupled Self-Cross Attention）将其与视频帧联合去噪。单个训练完成的模型可处理三项任务：从视频预测相机轨迹、根据输入图像联合生成视频与相机轨迹，以及沿目标相机轨迹从输入图像生成视频。由于该模型既能从视频预测轨迹，又能基于自身预测生成视角，我们通过闭环自一致性测试对其进行评估，证明其正向与逆向预测结果相互吻合。值得注意的是，轨迹预测所需的去噪步骤远少于视频生成，甚至仅需少量去噪步骤即可满足自一致性要求。我们在姿态估计与相机控制视频生成任务上报告了实验结果。

摘要 (Abstract)

Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation, even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.

关键词: Video Diffusion Model, camera trajectories, joint distribution, ray pixels, self-consistency test, pose estimation, video generation, Decoupled Self-Cross Attention

作者: Sanchita S. Kamath, Aziz N Zeidieh, Venkatesh Potluri, Sile O’Modhrain, Kenneth Perry, JooYoung Seo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09426v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于为盲人和低视力人群开发多模态3D数据可视化工具，采用基于经验的协同设计方法，涉及触觉和听觉界面设计。论文内容与所有评分关键词（均围绕大模型、深度学习技术原理及其应用）完全无关，未涉及任何大模型、深度学习、AI技术或科学AI应用，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该研究通过盲人和低视力协同设计者的参与，开发了一个多模态、基于网络的3D数据可视化工具原型，以解决盲人和低视力人群在STEM领域中访问3D数据可视化（如表面图）的障碍，并通过参考声化、立体和体积音频等特征提高了分析准确性和可学习性。

摘要翻译

三维（3D）数据可视化（如曲面图）在从生物医学成像到光谱学等STEM领域至关重要，但对于盲人和低视力（BLV）人群而言，这类可视化工具仍基本无法访问。为弥补这一空白，我们与具有非视觉数据表征专业知识的BLV协同设计者开展了基于经验的协同设计，以创建一种可访问、多模态、基于网络的可视化工具。采用多阶段方法，我们的团队（包括五名BLV研究者和一名非BLV研究者）参与了两轮迭代设计会议，对比了低保真触觉探针与高保真数字原型。这一过程产生了一个具有实证基础功能特性的原型，包括参考声波化、立体与容积音频、可配置的缓冲区聚合等，协同设计者验证了这些功能可提升分析准确性与易学性。在本研究中，我们聚焦于非视觉三维数据探索所必需的核心分析任务：空间定向、地标与峰值定位、局部极值与整体趋势对比、梯度追踪，以及识别被遮挡或部分隐藏的特征。我们的工作为无障碍研究者与开发者提供了一套将触觉知识转化为数字界面的协同设计流程，为未来系统提供了具体的设计指引，并为将无障碍三维可视化拓展至具身化数据环境创造了机遇。

摘要 (Abstract)

Three-dimensional (3D) data visualizations, such as surface plots, are vital in STEM fields from biomedical imaging to spectroscopy, yet remain largely inaccessible to blind and low-vision (BLV) people. To address this gap, we conducted an Experience-Based Co-Design with BLV co-designers with expertise in non-visual data representations to create an accessible, multi-modal, web-native visualization tool. Using a multi-phase methodology, our team of five BLV and one non-BLV researcher(s) participated in two iterative sessions, comparing a low-fidelity tactile probe with a high-fidelity digital prototype. This process produced a prototype with empirically grounded features, including reference sonification, stereo and volumetric audio, and configurable buffer aggregation, which our co-designers validated as improving analytic accuracy and learnability. In this study, we target core analytic tasks essential for non-visual 3D data exploration: orientation, landmark and peak finding, comparing local maxima versus global trends, gradient tracing, and identifying occluded or partially hidden features. Our work offers accessibility researchers and developers a co-design protocol for translating tactile knowledge to digital interfaces, concrete design guidance for future systems, and opportunities to extend accessible 3D visualization into embodied data environments.

关键词: 3D data visualization, blind and low-vision, multi-modal, co-design, accessibility, sonification, tactile probe, web-native tool

36. ❌ Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?

作者: Chao Jiang, Jingyu Huang, Miqing Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09417v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是多目标贝叶斯优化算法，属于优化算法领域，与所有评分关键词（均涉及大模型、深度学习技术及其应用）完全无关。论文未提及任何大模型、深度学习、AI for Science等相关内容，专注于传统优化算法框架的改进。

!!! tip deepseek-chat TL;DR

该论文针对多目标贝叶斯优化中因评估预算有限而难以逼近整个帕累托前沿的问题，提出了一个单点搜索框架SPMO，旨在沿目标间良好权衡的方向提高解的质量，并在基准和现实问题上优于现有方法。

摘要翻译

多目标优化的一个子集——多目标优化，涉及目标数量超过三个的优化问题。随着目标数量的增加，充分表征整个帕累托前沿所需的解的数量通常大幅增长。这使得设计能够有效探索整个帕累托前沿的搜索算法变得极具挑战性，甚至往往难以实现。这种困难在贝叶斯优化范式中尤为突出，因为该范式对采样效率要求极高，且通常仅能评估有限数量的解（通常为几百个）。此外，无论优化过程后存在多少高质量、多样化的解，决策者最终仅会选择其中一个用于实际部署。鉴于此，我们认为在评估预算极其有限的情况下，相较于现有多目标贝叶斯优化方法通常追求的近似整个帕累托前沿的目标，专注于为决策者寻找一个尽可能高质量的解可能更为实用。基于这一思路，本文提出了一种基于单点的多目标搜索框架（SPMO），旨在沿着能够实现目标间良好权衡的方向提升解的质量。在SPMO框架内，我们提出了一种简单的采集函数，称为期望单点改进（ESPI），该函数可在无噪声和有噪声两种场景下工作。我们证明了ESPI可通过基于梯度的样本平均近似（SAA）方法进行有效优化，并从理论上证明了其在SAA下的收敛保证。我们还通过实验证明，所提出的SPMO框架在计算上是可处理的，并且在广泛的基准测试和现实问题上优于现有先进方法。

摘要 (Abstract)

Many-objective optimisation, a subset of multi-objective optimisation, involves optimisation problems with more than three objectives. As the number of objectives increases, the number of solutions needed to adequately represent the entire Pareto front typically grows substantially. This makes it challenging, if not infeasible, to design a search algorithm capable of effectively exploring the entire Pareto front. This difficulty is particularly acute in the Bayesian optimisation paradigm, where sample efficiency is critical and only a limited number of solutions (often a few hundred) are evaluated. Moreover, after the optimisation process, the decision-maker eventually selects just one solution for deployment, regardless of how many high-quality, diverse solutions are available. In light of this, we argue an idea that under a very limited evaluation budget, it may be more useful to focus on finding a single solution of the highest possible quality for the decision-maker, rather than aiming to approximate the entire Pareto front as existing many-/multi-objective Bayesian optimisation methods typically do. Bearing this idea in mind, this paper proposes a \underline{s}ingle \underline{p}oint-based \underline{m}ulti-\underline{o}bjective search framework (SPMO) that aims to improve the quality of solutions along a direction that leads to a good tradeoff between objectives. Within SPMO, we present a simple acquisition function, called expected single-point improvement (ESPI), working under both noiseless and noisy scenarios. We show that ESPI can be optimised effectively with gradient-based methods via the sample average approximation (SAA) approach and theoretically prove its convergence guarantees under the SAA. We also empirically demonstrate that the proposed SPMO is computationally tractable and outperforms state-of-the-arts on a wide range of benchmark and real-world problems.

关键词: many-objective optimisation, Bayesian optimisation, Pareto front, single-point search, acquisition function, sample efficiency, multi-objective optimisation, SPMO

37. ❌ Yes, But Not Always. Generative AI Needs Nuanced Opt-in

作者: Wiebke Hutiri, Morgan Scheuerman, Shruti Nagpal, Austin Hoag, Alice Xiang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09413v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 这篇论文主要讨论生成式AI中的同意机制和版权问题，提出了基于代理的推理时选择加入架构。论文内容聚焦于AI伦理、法律框架和系统设计，而非大模型技术原理或具体应用创新。与大多数技术关键词（如模型架构、训练方法、推理优化等）完全无关。唯一相关的关键词是’LLM Agents OR Autonomous Agents OR Agentic Workflow’，因为论文提出了一个基于代理的架构来验证用户意图是否符合权利持有人的条件同意，但这只是论文提出的解决方案的一部分，并非核心技术创新，因此给予5分（有一定关联）。其他关键词均未涉及。

!!! tip deepseek-chat TL;DR

这篇论文研究了生成式AI中一刀切的同意机制不足的问题，提出了推理时选择加入的细粒度同意验证架构，并通过音乐案例研究展示了该方法如何平衡权利持有人和AI开发者之间的权力关系。

摘要翻译

本文认为，在生成式人工智能中使用创意作品时，采用“一刀切”的同意授权模式是不充分的。现实中的所有权与权利持有者结构、对艺术风格与相似性的模仿，以及人工智能输出作品使用场景的无限可能性，使得当前以默认选择加入为基础的二元同意机制难以为继。为突破当前僵局，我们考察了生成式人工智能工作流程中训练、推理和传播环节的控制杠杆。基于这些分析，我们将推理阶段的选择加入机制定位为一种被忽视的、可实现精细同意验证的机遇。我们构建了适用于选择加入机制的精细化同意条件框架，并提出一种基于智能体的推理阶段选择加入架构，用以验证用户意图请求是否符合权利持有者设定的条件性授权。通过音乐领域的案例研究，我们证明推理阶段的精细化选择加入机制能够兼顾既有权利，并重建权利持有者与人工智能开发者之间的权力平衡。

摘要 (Abstract)

This paper argues that a one-size-fits-all approach to specifying consent for the use of creative works in generative AI is insufficient. Real-world ownership and rights holder structures, the imitation of artistic styles and likeness, and the limitless contexts of use of AI outputs make the status quo of binary consent with opt-in by default untenable. To move beyond the current impasse, we consider levers of control in generative AI workflows at training, inference, and dissemination. Based on these insights, we position inference-time opt-in as an overlooked opportunity for nuanced consent verification. We conceptualize nuanced consent conditions for opt-in and propose an agent-based inference-time opt-in architecture to verify if user intent requests meet conditional consent granted by rights holders. In a case study for music, we demonstrate that nuanced opt-in at inference can account for established rights and re-establish a balance of power between rights holders and AI developers.

关键词: generative AI, consent, opt-in, rights holders, inference-time, agent-based architecture, copyright, music case study

38. ❌ PhysInOne: Visual Physics Learning and Reasoning in One Suite

作者: Siyuan Zhou, Hejun Wang, Hu Cheng, Jinxi Li, Dongsheng Wang, Junwei Jiang, Yixiao Jin, Jiayue Huang, Shiwei Mao, Shangjia Liu, Yafei Yang, Hongkang Song, Shenxing Wei, Zihui Zhang, Peng Huang, Shijie Liu, Zhengli Hao, Hao Li, Yitian Li, Wenqi Zhou, Zhihan Zhao, Zongqi He, Hongtao Wen, Shouwang Huang, Peng Yun, Bowen Cheng, Pok Kazaf Fu, Wai Kit Lai, Jiahao Chen, Kaiyuan Wang, Zhixuan Sun, Ziqi Li, Haochen Hu, Di Zhang, Chun Ho Yuen, Bing Wang, Zhihua Wang, Chuhang Zou, Bo Yang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09415v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文PhysInOne的核心贡献是创建了一个大规模合成数据集，用于训练物理基础的世界模型，属于AI for Science领域（高度相关，10分）。论文明确提到对基础模型进行微调（Post-training/SFT，8分），并涉及基础模型（LLMs/Foundation Models，8分）。数据集规模和质量与Scaling Laws & Data Quality有一定关联（5分），微调过程与Pre-training/Domain Adaptation有一定关联（5分）。其他关键词如MoE、SLMs、RLHF、RAG、推理加速、量化等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文创建了PhysInOne大规模合成数据集，包含200万视频和15万动态3D场景，覆盖71种物理现象，用于训练物理基础的世界模型，实验表明微调基础模型能显著提升物理合理性并揭示现有模型的局限性。

摘要翻译

我们推出PhysInOne——一个大规模合成数据集，旨在解决人工智能系统缺乏物理基础训练数据的关键问题。与现有数据集仅包含数百或数千个样本不同，PhysInOne提供了涵盖153,810个动态三维场景的200万个视频，涉及力学、光学、流体动力学和磁学领域的71种基础物理现象。区别于以往研究，我们的场景以复杂背景下的多物体交互为特征，并提供全面的真实值标注，包括三维几何结构、语义信息、动态运动、物理属性及文本描述。我们通过四个新兴应用验证了PhysInOne的有效性：物理感知视频生成、长/短期未来帧预测、物理属性估计与运动迁移。实验表明，基于PhysInOne对基础模型进行微调可显著提升物理合理性，同时暴露出在复杂物理动态建模和本征属性估计方面存在的关键不足。作为该领域目前规模最大的数据集，其数据量级较先前工作实现了数量级突破，PhysInOne为推进生成式模型、仿真模拟与具身人工智能中的物理基础世界模型建立了新的基准。

摘要 (Abstract)

We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne’s efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.

关键词: physics-grounded world models, synthetic dataset, physical phenomena, foundation models fine-tuning, video generation, future frame prediction, physical property estimation, motion transfer

39. ❌ HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

作者: Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Montoya, Nandan Marwaha, Yannis He, Charles Wang, Fernando Crabedo, Alessa Castilo, Bing Liu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09408v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究LLM代理在遇到不完整或模糊规范时的判断能力，核心关注代理何时自主行动、何时寻求帮助的决策机制。高度相关的关键词包括：LLM Agents（核心研究对象）、Self-Correction（涉及自我改进和反思）、Large Language Models（评估前沿模型）。中等相关的关键词包括：RLHF（使用RL训练改进判断）、Chain of Thought/System 2 Thinking（涉及推理和深度思考过程）、Hallucination Mitigation（与不确定性检测和事实性相关）。其他关键词如MoE、SLMs、Scaling Laws等与论文内容无直接关联。

!!! tip deepseek-chat TL;DR

该论文研究了LLM代理在任务规范不完整或模糊时判断何时需要寻求帮助的能力，发现当前前沿模型存在普遍的判断缺陷，并通过RL训练证明这种判断能力是可训练的，能够同时提高求助质量和任务成功率。

摘要翻译

前沿编码智能体在获得完整上下文时能解决复杂任务，但当任务说明不完整或存在歧义时则会失效。其瓶颈并非原始能力，而是判断力：即知道何时应自主行动、何时需要求助。现有基准测试未能捕捉这种失效模式——它们提供的是明确详尽的指令，且仅以执行正确性作为评价标准，因此一个对缺失要求进行侥幸猜测的智能体，其得分会与另一个本应通过询问确认的智能体完全相同。
我们提出人机协同基准测试（HiL-Bench）以衡量这种选择性升级能力。每个任务均包含经过人工验证的阻碍因素（信息缺失、模糊请求、矛盾信息），这些因素仅通过渐进式探索显现，无法在前期检查中被发现。我们的核心指标“询问调和均值”（Ask-F1）——即问题精确度与阻碍因素召回率的调和平均数——捕捉了过度询问与沉默猜测之间的张力；其结构设计从机制上防止了通过滥发问题来操纵系统的行为。
在软件工程与文本转SQL领域的评估揭示出普遍存在的巨大判断力差距：所有前沿模型在自主决定是否询问时，其表现均仅能达到完整信息条件下性能的一小部分。故障分析识别出三种关键求助模式：未能察觉认知缺口而过度自信产生错误信念；虽能检测高度不确定性却持续出错；进行宽泛模糊的升级请求而缺乏自我修正。这些一致模式证实，薄弱的求助能力是模型层面的缺陷，而非特定任务导致。基于结构化Ask-F1奖励的强化学习训练表明判断力具有可塑性：一个320亿参数的模型在求助质量与任务通过率上均获得提升，且这种提升能跨领域迁移。该模型并未学习何时询问的领域特定启发式规则，而是学会了检测无法解决的不确定性并据此行动。

摘要 (Abstract)

Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; broad, imprecise escalation without self-correction. These consistent patterns confirm poor help-seeking is a model-level flaw, not task-specific. RL training on shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.

关键词: LLM agents, help-seeking, judgment gap, selective escalation, human-in-the-loop, uncertainty detection, RL training, benchmark evaluation

40. ❌ The AI Codebase Maturity Model: From Assisted Coding to Self-Sustaining Systems

作者: Andy Anderson 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09388v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究AI代码开发系统的成熟度模型，涉及AI辅助编程到自主系统的演进。与LLM相关（使用了Claude Code和GitHub Copilot），但未深入技术原理。与Self-Correction/Self-Improvement相关（系统通过反馈循环实现自我改进），与LLM Agents/Tool Use相关（AI工具在开发工作流中的应用），但未涉及其他具体技术关键词如MoE、Scaling Laws、训练方法等。

!!! tip deepseek-chat TL;DR

该论文提出了AI代码库成熟度模型（ACMM），通过一个5级框架描述了代码库如何从基本AI辅助编程演进到自我维持系统，并基于实际项目验证了反馈机制在解锁每个成熟度级别中的关键作用。

摘要翻译

人工智能编码工具已被广泛采用，但多数团队停留在“提示与审查”阶段，缺乏系统化进阶的框架。本文提出AI代码库成熟度模型（ACMM），这是一个五级框架，描述了代码库如何从基础的人工智能辅助编码演进为自持系统。该模型借鉴了能力成熟度模型集成（CMMI）的思想，每个级别由其反馈循环拓扑结构定义——即必须存在特定机制，才能实现向下一级别的跃升。我通过为期四个月的经验报告验证了该模型，报告记录了使用Claude Code（Opus）和GitHub Copilot从零构建的CNCF Kubernetes仪表板项目KubeStellar Console的维护过程。该系统目前运行着63个CI/CD工作流、32套夜间测试套件，代码覆盖率达到91%，并实现了全天24小时内从发现缺陷到修复完成不超过30分钟的处理速度。核心发现是：人工智能驱动开发系统的智能性并非源于AI模型本身，而是存在于围绕其构建的指令、测试、指标与反馈循环的基础设施之中。开发者无法跳过任何成熟度级别，而在每个级别中，解锁下一阶段的关键始终是引入另一种反馈机制。实践证明，对测试用例数量、覆盖率阈值及测试执行可靠性的投入，是整个演进过程中最为重要的投资。

摘要 (Abstract)

AI coding tools are widely adopted, but most teams plateau at prompt-and-review without a framework for systematic progression. This paper presents the AI Codebase Maturity Model (ACMM), a 5-level framework describing how codebases evolve from basic AI-assisted coding to self-sustaining systems. Inspired by CMMI, each level is defined by its feedback loop topology the specific mechanisms that must exist before the next level becomes possible. I validate the model through a 4-month experience report maintaining KubeStellar Console, a CNCF Kubernetes dashboard built from scratch with Claude Code (Opus) and GitHub Copilot. The system currently operates with 63 CI/CD workflows, 32 nightly test suites, 91% code coverage, and achieves bug-to-fix times under 30 minutes 24 hours a day. The central finding: the intelligence of an AI-driven development system resides not in the AI model itself, but in the infrastructure of instructions, tests, metrics, and feedback loops that surround it. You cannot skip levels, and at each level, the thing that unlocks the next one is another feedback mechanism. Testing the volume of test cases, the coverage thresholds, and the reliability of test execution proved to be the single most important investment in the entire journey.

关键词: AI Codebase Maturity Model, AI-assisted coding, self-sustaining systems, feedback loops, CI/CD workflows, test coverage, Claude Code, GitHub Copilot

41. ❌ BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

作者: Guiyao Tie, Jiawen Shi, Pan Zhou, Lichao Sun 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09378v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究智能体生态系统中的技能模型后门攻击，核心涉及LLM智能体（Agentic Workflow）和工具使用（Tool Use），因此这两个关键词高度相关（10分）。论文使用大模型（LLMs）作为攻击载体，相关度较高（8分）。攻击方法涉及后门微调（Post-training/SFT），这是核心攻击技术（10分）。其他关键词如MoE、量化、推理加速、科学AI等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了BadSkill攻击方法，通过在智能体技能中嵌入后门微调的模型，能够在满足特定触发条件时激活恶意行为，实验表明该攻击在多种模型架构上能达到99.5%的攻击成功率。

摘要翻译

智能体生态系统日益依赖可安装技能来扩展功能，部分技能将训练好的模型构件作为其执行逻辑的一部分进行打包。这带来了一种未被提示注入或普通插件滥用所涵盖的供应链风险：第三方技能可能表面无害，却在其打包的模型中隐藏恶意行为。本文提出BadSkill，一种针对此类“技能中模型”威胁面的后门攻击方案。在BadSkill中，攻击者发布一个看似良性的技能，其嵌入的模型经过后门微调，仅当常规技能参数满足攻击者设定的语义触发组合时，才会激活隐藏的有效载荷。为实现此攻击，我们采用复合目标训练嵌入的分类器，该目标结合了分类损失、基于间隔的分离以及针对投毒优化的目标，并在一个受OpenClaw启发的仿真环境中进行评估。该环境保留了第三方技能的安装与执行流程，同时支持可控的多模型研究。我们的基准测试涵盖13项技能，包括8项触发任务和5项非触发对照技能，合并的主评估集包含571个负类查询和396个触发对齐查询。在来自五个模型系列的八种架构（参数量494M至7.1B）上，BadSkill在八项触发技能中实现了最高99.5%的平均攻击成功率（ASR），同时在负类查询上保持了良好的良性侧准确率。在标准测试集划分的投毒率扫描中，仅3%的投毒率即可实现91.7%的ASR。该攻击在所有评估的模型规模下以及面对五种文本扰动类型时均保持有效。这些发现表明，承载模型的技能是智能体生态系统中一种独特的模型供应链风险，并呼吁对第三方技能构件实施更严格的来源验证和行为审查。

摘要 (Abstract)

Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when routine skill parameters satisfy attacker-chosen semantic trigger combinations. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an OpenClaw-inspired simulation environment that preserves third-party skill installation and execution while enabling controlled multi-model study. Our benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, with a combined main evaluation set of 571 negative-class queries and 396 trigger-aligned queries. Across eight architectures (494M–7.1B parameters) from five model families, BadSkill achieves up to 99.5% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries. In poison-rate sweeps on the standard test split, a 3% poison rate already yields 91.7% ASR. The attack remains effective across the evaluated model scales and under five text perturbation types. These findings identify model-bearing skills as a distinct model supply-chain risk in agent ecosystems and motivate stronger provenance verification and behavioral vetting for third-party skill artifacts.

关键词: backdoor attack, agent skills, model poisoning, fine-tuning, supply-chain risk, third-party skills, attack success rate, skill parameters

42. ❌ LLM-Rosetta: A Hub-and-Spoke Intermediate Representation for Cross-Provider LLM API Translation

作者: Peng Ding 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09360v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文LLM-Rosetta专注于解决LLM API之间的互操作性问题，通过设计中间表示实现不同提供商API的转换。因此，与"Large Language Models OR LLMs OR Foundation Models"高度相关（10分），因为论文直接处理LLM API。与"Tool Use OR Function Calling OR API Tool Use"高度相关（10分），因为论文的核心是API调用和工具使用的标准化转换。其他关键词如MoE、SLMs、训练方法、推理优化、科学AI应用等均未在论文中涉及，故得0分。

!!! tip deepseek-chat TL;DR

该论文提出了LLM-Rosetta框架，通过设计一个中心辐射型中间表示来解决不同LLM提供商API之间的互操作性问题，实现了双向转换并验证了其无损保真度和低延迟。

摘要翻译

大型语言模型（LLM）供应商的迅速增加——各自提供专有的API格式——导致生态系统碎片化，应用程序与单一供应商紧密耦合。切换或桥接不同供应商需要O(N²)规模的双边适配器，这阻碍了可移植性和多供应商架构的实现。我们观察到，尽管主要LLM API在语法上存在显著差异，但它们共享一个共同的语义核心：实际挑战在于语法变体的组合复杂性，而非深层的语义不兼容性。基于这一发现，我们提出了LLM-Rosetta，这是一个基于中心辐射型中间表示（IR）的开源翻译框架。该框架通过一个包含9种类型的内容模型和10种类型的流事件模式，捕捉了共享的语义核心——消息、内容片段、工具调用、推理轨迹和生成控制。模块化的操作组合转换器架构使得每个API标准可以独立添加。LLM-Rosetta支持请求和响应载荷的双向转换（供应商到IR再到供应商），包括具备状态上下文管理的分块级流式处理。我们为四种API标准（OpenAI Chat Completions、OpenAI Responses、Anthropic Messages和Google GenAI）实现了转换器，覆盖了绝大多数商业供应商。实证评估表明，该框架实现了无损的往返保真度、正确的流式行为以及低于100微秒的转换开销——其性能与LiteLLM的单向转换方法相当，同时提供了双向性和供应商中立性。LLM-Rosetta通过了Open Responses合规性测试套件，并在阿贡国家实验室投入生产使用。代码发布于https://github.com/Oaklight/llm-rosetta。

摘要 (Abstract)

The rapid proliferation of Large Language Model (LLM) providers–each exposing proprietary API formats–has created a fragmented ecosystem where applications become tightly coupled to individual vendors. Switching or bridging providers requires $O(N^2)$ bilateral adapters, impeding portability and multi-provider architectures. We observe that despite substantial syntactic divergence, the major LLM APIs share a common semantic core: the practical challenge is the combinatorial surface of syntactic variations, not deep semantic incompatibility. Based on this finding, we present LLM-Rosetta, an open-source translation framework built on a hub-and-spoke Intermediate Representation (IR) that captures the shared semantic core–messages, content parts, tool calls, reasoning traces, and generation controls–in a 9-type content model and 10-type stream event schema. A modular Ops-composition converter architecture enables each API standard to be added independently. LLM-Rosetta supports bidirectional conversion (provider-to-IR-to-provider) for both request and response payloads, including chunk-level streaming with stateful context management. We implement converters for four API standards (OpenAI Chat Completions, OpenAI Responses, Anthropic Messages, and Google GenAI), covering the vast majority of commercial providers. Empirical evaluation demonstrates lossless round-trip fidelity, correct streaming behavior, and sub-100 microsecond conversion overhead–competitive with LiteLLM’s single-pass approach while providing bidirectionality and provider neutrality. LLM-Rosetta passes the Open Responses compliance suite and is deployed in production at Argonne National Laboratory. Code is available at https://github.com/Oaklight/llm-rosetta.

关键词: LLM API translation, Intermediate Representation, hub-and-spoke, cross-provider, API interoperability, streaming conversion, semantic core, modular converter

43. ❌ Visually-Guided Policy Optimization for Multimodal Reasoning

作者: Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09349v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究视觉语言模型（VLMs）在推理任务中的视觉注意力问题，提出VGPO框架来增强视觉关注。与关键词的相关性分析：1）论文涉及视觉语言模型，属于大模型应用范畴，但与纯LLM技术关联较弱（5分）；2）论文核心关注多步推理过程中的视觉注意力机制，与"Chain of Thought"和"System 2 Thinking"高度相关（各8分）；3）其他关键词如MoE、量化、RAG等与论文内容无直接关联（0分）。

!!! tip deepseek-chat TL;DR

该论文针对视觉语言模型在推理过程中视觉注意力不足和视觉遗忘问题，提出了视觉引导策略优化框架，通过视觉注意力补偿和双重粒度优势重加权策略，显著提升了多模态推理任务的性能。

摘要翻译

基于可验证奖励的强化学习（RLVR）显著提升了视觉语言模型（VLM）的推理能力。然而，视觉语言模型固有的文本主导特性常导致其视觉忠实性不足，表现为对视觉标记的注意力激活稀疏。更重要的是，我们的实证分析表明，沿推理步骤发生的时间性视觉遗忘进一步加剧了这一缺陷。为弥补这一差距，我们提出了视觉引导策略优化（Visually-Guided Policy Optimization，VGPO），这是一个在策略优化过程中强化视觉聚焦的新框架。具体而言，VGPO首先引入了一种视觉注意力补偿机制，该机制利用视觉相似性来定位并增强视觉线索，同时在后续步骤中逐步提升视觉期望以对抗视觉遗忘。基于此机制，我们实施了一种双粒度优势重加权策略：在轨迹内部层面，突出显示具有相对较高视觉激活的标记；在轨迹间层面，优先选择展现出更优视觉累积的轨迹。大量实验表明，VGPO在数学多模态推理及视觉依赖任务中实现了更好的视觉激活与更优的性能。

摘要 (Abstract)

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.

关键词: Vision-Language Models, Multimodal Reasoning, Visual Attention, Reinforcement Learning, Policy Optimization, Visual Faithfulness, Visual Forgetting, Mathematical Reasoning

44. ❌ Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym

作者: Lars Benedikt Kaesberg, Tianyu Yang, Niklas Bauer, Terry Ruas, Jan Philip Wahle, Bela Gipp 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09338v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文评估了包括GPT-OSS 120B在内的八个模型在空间推理任务中的表现，核心涉及大语言模型（LLMs）在空间推理和智能体（Agents）任务中的应用。论文明确提到’chain-of-thought reasoning’，因此与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’高度相关（10分）。论文研究模型在交互式环境中的逐步推理和决策，这与’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’有一定关联（5分）。论文评估模型作为智能体在空间环境中的表现，与’LLM Agents OR Autonomous Agents OR Agentic Workflow’高度相关（10分）。其他关键词如MoE、SFT、RAG、量化等未在论文中涉及，因此得0分。

!!! tip deepseek-chat TL;DR

该论文通过引入Spatial-Gym环境评估大语言模型在空间推理任务中的表现，发现模型在逐步推理设置下表现不佳，解决率远低于人类基准，但扩展的思维链推理仍能保持3-5倍的准确率优势。

摘要翻译

空间推理是导航与机器人技术的核心能力，但衡量模型在此类任务上的表现仍具挑战性。现有基准测试通常在单次生成设置中评估模型，要求模型一次性生成完整解决方案，这与人类在交互环境中逐步解决问题的模式不同。为此，我们提出Spatial-Gym：这是一个基于Gymnasium的环境，通过将二维网格谜题中的路径规划任务构建为可进行回溯的序列决策问题，以隔离并测试空间约束推理能力。我们在500个测试场景中，以人类表现、随机策略及A*算法为基线，评估了八种模型在三种设置下的表现（单次生成、逐步推理、支持回溯的逐步推理）。性能最佳的模型GPT-OSS 120B的解决率为16.0%，较人类基线（98.0%）低82个百分点。逐步推理格式通过消除格式错误帮助较弱模型提升表现（最高提升5.4%），却因限制全局规划而对较强模型产生负面影响（最高降低5.6%）。回溯机制能提高任务完成度，但仅对较弱模型提升解决率；较强模型极少回溯且未从中获益。我们的实验得出三个关键发现：（1）模型无法根据问题难度调整推理强度；（2）接收空间环境图像输入的视觉模型解决率下降73%；（3）即使在逐步推理设置中，扩展的思维链推理仍保持3-5倍于标准推理的准确率优势。Spatial-Gym能够诊断模型局限性，并为通过强化学习提升空间推理能力提供框架。

摘要 (Abstract)

Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). Step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to 5.6%) by constraining global planning. Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments have three key findings: (1) models fail to scale reasoning effort with difficulty, (2) vision models receiving images of the spatial environment reduce solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.

关键词: spatial reasoning, LLM agents, chain-of-thought reasoning, sequential decision task, pathfinding, Spatial-Gym, step-by-step evaluation, model capabilities

45. ❌ Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents

作者: Maochen Sun, Youzhi Zhang, Gaofeng Meng 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09308v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究基于大语言模型（LLMs）的自主药物发现智能体（LLM Agents），提出CACM框架来解决现有系统在协议级约束诊断和状态管理上的不足。因此，与’Large Language Models’、‘LLM Agents’、‘Self-Correction’（涉及诊断和纠正）以及’AI for Science’（药物发现属于生物信息学应用）高度相关（10分）。论文涉及智能体的逐步规划和决策，与’Chain of Thought’和’System 2 Thinking’有一定关联（5分），且智能体可能使用工具进行分子分析，与’Tool Use’有一定关联（5分）。其他关键词如MoE、量化、RAG等未在摘要中提及，完全无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对基于大语言模型的药物发现智能体在协议级约束诊断和状态管理上的不足，提出了CACM框架，通过精确的集合级诊断和简洁的记忆回写机制，将目标级成功率提高了36.4%。

摘要翻译

大型语言模型正使自主药物发现智能体日益可行，但在此场景下的可靠成功并非取决于单一操作或分子，而是取决于最终返回的候选分子集合能否整体满足方案层面的要求，例如集合规模、多样性、结合质量以及可开发性。这产生了一个根本性的控制问题：智能体逐步进行规划，而任务有效性却需在整个候选集合层面进行判定。因此，现有的基于语言的药物发现系统往往依赖冗长的原始历史记录和定义模糊的自我反思，导致故障定位不精确，且面向规划器的智能体状态噪声日益增加。我们提出了CACM（约束感知纠正记忆），这是一个围绕精确的集合级诊断和简洁的记忆回写机制构建的基于语言的药物发现框架。CACM引入了方案审计和基于事实的诊断模块，二者协同分析涵盖任务要求、靶点口袋上下文和候选集合证据的多模态证据，以定位方案违规情况、生成可操作的修复提示，并引导后续行动偏向最相关的纠正方向。为保持规划上下文简洁，CACM将记忆组织为静态、动态和纠正三个通道，并在回写前对其进行压缩，从而在保留持久性任务信息的同时，仅暴露与决策最相关的故障信息。我们的实验结果表明，CACM将目标级别的成功率较现有最先进基线提高了36.4%。这些结果表明，可靠的基于语言的药物发现不仅受益于更强大的分子工具，也依赖于更精确的诊断和更经济的智能体状态。

摘要 (Abstract)

Large language models are making autonomous drug discovery agents increasingly feasible, but reliable success in this setting is not determined by any single action or molecule. It is determined by whether the final returned set jointly satisfies protocol-level requirements such as set size, diversity, binding quality, and developability. This creates a fundamental control problem: the agent plans step by step, while task validity is decided at the level of the whole candidate set. Existing language-based drug discovery systems therefore tend to rely on long raw history and under-specified self-reflection, making failure localization imprecise and planner-facing agent states increasingly noisy. We present CACM (Constraint-Aware Corrective Memory), a language-based drug discovery framework built around precise set-level diagnosis and a concise memory write-back mechanism. CACM introduces protocol auditing and a grounded diagnostician, which jointly analyze multimodal evidence spanning task requirements, pocket context, and candidate-set evidence to localize protocol violations, generate actionable remediation hints, and bias the next action toward the most relevant correction. To keep planning context compact, CACM organizes memory into static, dynamic, and corrective channels and compresses them before write-back, thereby preserving persistent task information while exposing only the most decision-relevant failures. Our experimental results show that CACM improves the target-level success rate by 36.4% over the state-of-the-art baseline. The results show that reliable language-based drug discovery benefits not only from more powerful molecular tools, but also from more precise diagnosis and more economical agent states.

关键词: Large Language Models, Drug Discovery Agents, Constraint-Aware Corrective Memory, Protocol Auditing, Set-level Diagnosis, Memory Compression, Autonomous Agents, Bioinformatics

46. ❌ SatQNet: Satellite-assisted Quantum Network Entanglement Routing Using Directed Line Graph Neural Networks

作者: Tobias Meuser, Jannis Weil, Aninda Lahiri, Marius Paraschiv 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09306v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文研究卫星辅助量子网络的纠缠路由问题，使用强化学习和图神经网络方法，属于量子网络和通信领域。所有关键词均与大语言模型、深度学习技术原理或AI在科学领域的应用直接相关，但论文未涉及任何大语言模型、深度学习技术原理或AI在生物/化学信息学等科学子领域的应用，仅与’AI for Science’有微弱关联（因属于量子网络这一科学应用领域），故除该关键词外均评0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于强化学习和有向线图神经网络的去中心化路由方法SatQNet，用于解决卫星辅助量子网络中因拓扑高度动态而难以建立高保真端到端纠缠的问题，并在多种设置下优于现有方法且能泛化到未见过的拓扑。

摘要翻译

量子网络有望成为连接量子设备的关键使能技术。然而，与经典通信网络不同，由于纠缠分发的物理限制，量子网络中的信息传输通常仅限于短距离。卫星可以扩展长距离的纠缠分发，但此类网络中的路由极具挑战性，因为卫星运动和随机链路生成会形成高度动态的量子拓扑。现有的路由方法通常依赖于全局拓扑信息，而这些信息会因经典控制平面的延迟而迅速过时；同时，去中心化方法通常仅基于不完整的局部信息进行操作。我们提出了SatQNet，一种用于卫星辅助量子网络中纠缠路由的强化学习方法，该方法可在运行时实现去中心化。其核心创新在于一种以边为中心的有向线图神经网络，该网络在有向边嵌入上进行局部消息传递，从而能够更好地捕捉高动态时变拓扑中的链路特性。通过与相邻中继器交换消息，SatQNet在运行时学习局部图表示，以支持智能体建立高保真度的端到端纠缠。在随机图上训练后，SatQNet在多种场景（包括真实世界的欧洲骨干拓扑）中均优于启发式和基于学习的方法，并且无需重新训练即可泛化到未见过的拓扑结构。

摘要 (Abstract)

Quantum networks are expected to become a key enabler for interconnecting quantum devices. In contrast to classical communication networks, however, information transfer in quantum networks is usually restricted to short distances due to physical constraints of entanglement distribution. Satellites can extend entanglement distribution over long distances, but routing in such networks is challenging because satellite motion and stochastic link generation create a highly dynamic quantum topology. Existing routing methods often rely on global topology information that quickly becomes outdated due to delays in the classical control plane, while decentralized methods typically act on incomplete local information. We propose SatQNet, a reinforcement learning approach for entanglement routing in satellite-assisted quantum networks that can be decentralized at runtime. Its key innovation is an edge-centric directed line graph neural network that performs local message passing on directed edge embeddings, enabling it to better capture link properties in high-degree and time-varying topologies. By exchanging messages with neighboring repeaters, SatQNet learns a local graph representation at runtime that supports agents in establishing high-fidelity end-to-end entanglements. Trained on random graphs, SatQNet outperforms heuristic and learning-based approaches across diverse settings, including a real-world European backbone topology, and generalizes to unseen topologies without retraining.

关键词: Satellite-assisted quantum networks, Entanglement routing, Reinforcement learning, Directed line graph neural network, Decentralized routing, Dynamic topology, Edge-centric message passing, Generalization

47. ❌ SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

作者: Jingzhi Gong, Ruizhen Gu, Zhiwei Fei, Yazhuo Cao, Lukas Twist, Alina Geiger, Shuo Han, Dominik Sobania, Federica Sarro, Jie M. Zhang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09297v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文SkillMOO专注于基于LLM的编码代理（LLM-based coding agents）的技能优化，核心是LLM代理（LLM Agents）的研究应用。摘要明确提到“LLM-based coding agents”，因此与“Large Language Models OR LLMs OR Foundation Models”高度相关（10分）。研究涉及多代理系统（solver agent和optimizer agent协同工作），与“Multi-agent Systems OR Agent Coordination”有一定关联（5分）。论文未涉及其他关键词如MoE、SFT、RAG、推理加速等具体技术，也未涉及生物信息学等科学领域应用，因此这些关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了SkillMOO框架，通过多目标优化自动演化LLM编码代理的技能包，在软件工程任务中显著提高了通过率并降低了成本。

摘要翻译

智能体技能为基于大语言模型的编码智能体提供了模块化、任务导向的指导，但手动调整技能组合以平衡成功率、成本与运行时间既耗费资源又脆弱易变。本文提出SkillMOO，一种多目标优化框架，该框架利用大语言模型提出的修改建议与NSGA-II幸存者选择机制，自动演化技能组合：求解器智能体在编码任务上评估候选技能组合，优化器智能体则基于失败分析提出组合修改方案。在SkillsBench的三项软件工程任务上，相较于各任务的最佳基线方法，SkillMOO在较低优化开销下将通过率最高提升131%，同时将成本降低最高达32%。模式分析表明，剪枝与替换是性能提升的主要驱动力，这揭示出高效的技能组合倾向于精简、聚焦的内容，而非累积堆砌的指令。

摘要 (Abstract)

Agent skills provide modular, task-specific guidance for LLM- based coding agents, but manually tuning skill bundles to balance success rate, cost, and runtime is expensive and fragile. We present SkillMOO, a multi-objective optimization framework that automatically evolves skill bundles using LLM-proposed edits and NSGA-II survivor selection: a solver agent evaluates candidate skill bundles on coding tasks and an optimizer agent proposes bundle edits based on failure analysis. On three SkillsBench software engineering tasks, SkillMOO improves pass rate by up to 131% while reducing cost up to 32% relative to the best baseline per task at low optimization overhead. Pattern analysis reveals pruning and substitution as primary drivers of improvement, suggesting effective bundles favor minimal, focused content over accumulated instructions.

关键词: LLM-based agents, multi-objective optimization, software engineering, agent skills, NSGA-II, coding tasks, pass rate improvement, cost reduction

48. ❌ SAGE: A Service Agent Graph-guided Evaluation Benchmark

作者: Ling Shi, Yuqin Dai, Ziyin Wang, Ning Gao, Wei Zhang, Chaozheng Wang, Yujie Wang, Wei He, Jinpeng Wang, Deiyi Xiong 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09285v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是提出SAGE基准，用于评估LLM在客户服务场景中的多智能体性能。高度相关关键词：‘Large Language Models’（论文明确研究LLM性能）、‘LLM Agents’和’Multi-agent Systems’（SAGE是多智能体基准，涉及User/Service/Judge Agents）。中等相关：‘Chain of Thought’和’System 2 Thinking’（论文关注逻辑合规性和推理失败，如’Execution Gap’）。其他关键词（如MoE、量化、RAG等）未涉及，评0分。

!!! tip deepseek-chat TL;DR

该论文提出了SAGE基准，通过动态对话图和对抗意图分类来评估LLM在客户服务多智能体场景中的性能，发现模型存在意图分类准确但后续行动错误的'执行差距'，并在高对抗强度下表现出'共情韧性'。

摘要翻译

大型语言模型（LLM）的发展推动了客户服务领域的自动化进程，然而对其性能进行基准测试仍面临挑战。现有基准主要依赖静态范式与单维度指标，未能充分考虑多样化的用户行为，也未能满足实际部署中对结构化标准操作程序（SOP）的严格遵循要求。为弥补这一差距，我们提出了SAGE（服务智能体图引导评估），一种用于自动化双轴评估的通用多智能体基准框架。SAGE将非结构化SOP形式化为动态对话图，从而实现对逻辑合规性的精确验证与路径的全面覆盖。我们引入了对抗性意图分类体系与模块化扩展机制，支持跨领域的低成本部署，并促进自动化对话数据合成。评估通过一个框架进行，其中评判智能体与规则引擎分析用户智能体与服务智能体之间的交互，以生成确定性的基准真值。在6个工业场景中对27个大型语言模型进行的大规模实验揭示了一个显著的“执行差距”：模型能够准确分类用户意图，却无法推导出正确的后续操作。我们还观察到“共情韧性”现象，即在高对抗强度下，尽管底层逻辑已出现故障，模型仍能维持礼貌的对话表象。代码与资源已发布于https://anonymous.4open.science/r/SAGE-Bench-4CD3/。

摘要 (Abstract)

The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant Execution Gap'' where models accurately classify intents but fail to derive correct subsequent actions. We also observe Empathy Resilience’’, a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.

关键词: Large Language Models, Multi-agent Benchmark, Service Agent, Dynamic Dialogue Graphs, Adversarial Intent, Execution Gap, Empathy Resilience, Standard Operating Procedures

49. ❌ Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

作者: Yuqin Lan, Gen Li, Yuanze Hu, Weihao Shen, Zhaoxin Fan, Faguo Wu, Xiao Zhang, Laurence T. Yang, Zhiming Zheng 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09253v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究多模态越狱攻击，主要涉及视觉语言模型（VLMs）的安全性和对齐问题。与’Large Language Models’相关（8分），因为VLMs是LLMs的多模态扩展；与’Instruction Tuning OR Alignment OR Value Alignment’相关（8分），因为越狱攻击直接针对模型的安全对齐机制。其他关键词如MoE、SLMs、Scaling Laws、训练方法、推理优化、AI for Science等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

论文研究了针对闭源视觉语言模型的多模态越狱攻击，提出了Mosaic框架，通过多视图集成优化减轻代理依赖，在安全基准测试中实现了最先进的攻击成功率。

摘要翻译

视觉语言模型（VLMs）虽功能强大，但仍易受多模态越狱攻击的影响。现有攻击主要依赖于显式视觉提示攻击或基于梯度的对抗性优化方法。前者较易被检测，后者生成的扰动则更为隐蔽、不易察觉，但通常在同质的开源代理-目标模型设置下进行优化与评估，其在异构设置下对商业闭源VLMs的有效性尚不明确。为探究此问题，我们研究了不同的代理-目标模型设置，并观察到同质与异构设置间存在稳定差距，我们将此现象称为代理依赖性。基于这一发现，我们提出了Mosaic——一种针对闭源VLMs的多模态越狱多视图集成优化框架，通过降低对单一代理模型和单一视觉视图的过度依赖，缓解异构代理-目标设置下的代理依赖问题。具体而言，Mosaic包含三个核心组件：文本侧转换模块，通过扰动对拒绝敏感的词汇模式实现攻击；多视图图像优化模块，在不同裁剪视图下更新扰动以避免对单一视觉视图的过拟合；以及代理集成引导模块，聚合多个代理VLMs的优化信号以减少代理特定偏差。在安全基准上的大量实验表明，Mosaic在针对商业闭源VLMs的攻击成功率与平均毒性指标上均达到了最优水平。

摘要 (Abstract)

Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.

关键词: Vision-Language Models, multimodal jailbreak, closed-source VLMs, surrogate dependency, multi-view optimization, safety benchmarks, attack success rate, adversarial optimization

50. ❌ DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

作者: Young-Suk Lee, Ramon Fernandez Astudillo, Radu Florian 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09251v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文提出DRBENCHER基准测试，用于评估深度研究代理（deep research agents）在浏览网页与多步计算交织任务中的性能。核心涉及LLM代理（agentic workflow）、工具使用（web browsing as tool use）、检索增强生成（RAG-like property retrieval）、多步推理（multi-hop reasoning）和系统2思维（complex computation）。论文在生物化学等科学领域有应用，但未涉及MoE、量化、对齐等具体技术细节。

!!! tip deepseek-chat TL;DR

该研究针对深度研究代理在网页浏览与多步计算交织任务中缺乏综合评估的问题，提出了DRBENCHER合成基准测试生成器，结果显示当前最强前沿模型在该基准上仅达到20%的答案准确率。

摘要翻译

深度研究智能体日益频繁地将网络浏览与多步骤计算任务交织进行，然而现有基准测试孤立地评估这些能力，导致对实际性能的评估存在盲区。我们推出DRBENCHER——一个针对需要浏览与计算复合型问题的合成基准测试生成器。该工具强制执行四项标准：可验证性（通过执行基于知识图谱值的参数化代码计算得出标准答案）、复杂性（涉及多跳实体识别、属性检索及领域特定计算）、难度（采用两阶段验证级联过滤掉生成模型可独立解决的问题）以及多样性（通过贪心最大最小嵌入过滤器实现最大覆盖范围）。这些标准通过一个跨五个领域（生物化学、金融、地球物理、安全与历史）的统一“答案优先”流程实现。人工评估显示其有效性问题占比达76%（若排除陈旧数据则达84%），其中35%的错误源于知识图谱条目过时，这凸显了基于动态演化数据进行推理的系统固有局限。自动评估表明，当前最强的前沿模型仅达到20%的答案准确率。与人工构建的基准测试（BrowseComp+、MATH-500、GPQA）相比，DRBENCHER实现了最高的语义多样性。

摘要 (Abstract)

Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries, highlighting an inherent limitation of systems that reason over evolving data. Automatic evaluation shows that the strongest frontier model achieves only 20% answer accuracy. Compared to manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.

关键词: deep research agents, benchmark generation, multi-step computation, web browsing, knowledge-graph, entity identification, property retrieval, synthetic benchmark

51. ❌ DDSP-QbE++: Improving Speech Quality for Speech Anonymisation for Atypical Speech

作者: Suhita Ghosh, Yamini Sinha, Sebastian Stober 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09246v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于语音信号处理领域，研究DDSP-QbE语音匿名化系统中的语音质量改进技术，具体涉及可微分数字信号处理、语音合成、谐波激励生成、相位累积振荡器、PolyBLEP校正等底层信号处理技术。所有评分关键词均与大语言模型、深度学习技术原理创新、AI for Science等主题相关，而本论文完全不涉及这些领域，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对DDSP-QbE语音匿名化系统中因相位累积振荡器产生锯齿波形导致的高频混叠失真问题，提出了结合显式浊音检测和PolyBLEP校正的改进方法，有效减少了谐波伪影并提升了语音感知自然度。

摘要翻译

用于语音转换的可微分数字信号处理（DDSP）流水线依赖于减法合成，即通过学习的频谱包络对周期性激励信号进行整形，以重建目标语音。在DDSP-QbE中，激励信号通过相位累加生成，产生类似锯齿波的波形，其突变的不连续性会引入混叠伪影，在感知上表现为嗡嗡声和频谱失真，尤其在较高基频时更为明显。我们针对DDSP-QbE减法合成器的激励阶段提出了两项针对性改进。首先，我们引入显式清浊音检测来门控谐波激励，抑制非浊音区域的周期性成分，并用滤波后的噪声替代，从而在感知干扰最严重的区域避免混叠谐波内容。其次，我们对相位累加振荡器应用多项式带限阶跃（PolyBLEP）校正，用平滑的多项式残差替代每次相位回绕处的硬波形间断，在不进行过采样或频谱截断的情况下消除混叠生成分量。综合这些改进，系统获得了更清晰的谐波衰减、更少的高频伪影，并通过平均意见得分（MOS）测量证实了感知自然度的提升。所提方法计算轻量、可微分，且无需额外可学习参数，可无缝集成到现有的DDSP-QbE训练流程中。

摘要 (Abstract)

Differentiable Digital Signal Processing (DDSP) pipelines for voice conversion rely on subtractive synthesis, where a periodic excitation signal is shaped by a learned spectral envelope to reconstruct the target voice. In DDSP-QbE, the excitation is generated via phase accumulation, producing a sawtooth-like waveform whose abrupt discontinuities introduce aliasing artefacts that manifest perceptually as buzziness and spectral distortion, particularly at higher fundamental frequencies. We propose two targeted improvements to the excitation stage of the DDSP-QbE subtractive synthesizer. First, we incorporate explicit voicing detection to gate the harmonic excitation, suppressing the periodic component in unvoiced regions and replacing it with filtered noise, thereby avoiding aliased harmonic content where it is most perceptually disruptive. Second, we apply Polynomial Band-Limited Step (PolyBLEP) correction to the phase-accumulated oscillator, substituting the hard waveform discontinuity at each phase wrap with a smooth polynomial residual that cancels alias-generating components without oversampling or spectral truncation. Together, these modifications yield a cleaner harmonic roll-off, reduced high-frequency artefacts, and improved perceptual naturalness, as measured by MOS. The proposed approach is lightweight, differentiable, and integrates seamlessly into the existing DDSP-QbE training pipeline with no additional learnable parameters.

关键词: DDSP, speech anonymisation, voice conversion, subtractive synthesis, phase accumulation, PolyBLEP correction, harmonic excitation, perceptual naturalness

52. ❌ Statistical Properties of the King Wen Sequence: An Anti-Habituation Structure That Does Not Improve Neural Network Training

作者: Augustin Chan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09234v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是《易经》中《文王卦序》的统计特性及其对神经网络训练的影响，属于传统序列分析和基础神经网络训练实验，完全不涉及大模型、深度学习技术原理创新或AI在科学领域的应用。所有关键词均与大模型技术、训练方法、推理优化、AI应用等现代AI研究主题相关，而本文内容为古典序列分析和基础神经网络训练，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

本文研究了《易经》中文王卦序的统计特性，发现其具有反习惯化结构，但实验证明这种结构会破坏梯度优化，反而损害神经网络训练性能。

摘要翻译

《易经》中的文王卦序（约公元前1000年）将六十四卦——即一个六维二元空间的诸种状态——排列成一种令学者困惑三千年的模式。本研究通过基于十万次随机基线的蒙特卡洛置换分析，对该序列进行了严格的统计表征。我们发现该序列具有四项统计显著特性：高于随机水平的转移距离（98.2百分位）、负的滞后一阶自相关性（p=0.037）、四卦组内的阴阳平衡性（p=0.002）以及卦对内与卦对间距离的非对称性（99.2百分位）。这些特性表面类似于课程学习（curriculum learning）与好奇心驱动探索中的原则，从而启发我们提出假设：它们可能有益于神经网络训练。我们通过三项实验检验该假设：学习率调度调制、课程排序以及种子敏感性分析，实验在两个硬件平台（PyTorch框架下的NVIDIA RTX 2060与MLX框架下的Apple Silicon）上进行。结果均为负面：文王序列的学习率调制在所有测试幅度下均降低性能；作为课程排序，文王序列在一个平台上是最差的非顺序排列，在另一平台上则处于噪声范围内；三十次种子数扫描证实，仅文王序列导致的性能下降超出自然种子方差。我们阐释了其原因：该序列的高方差——正是使其具有统计显著性的特性——会破坏基于梯度的优化过程的稳定性。固定组合序列中的反习惯化效应并不等同于有效的训练动态。

摘要 (Abstract)

The King Wen sequence of the I-Ching (c. 1000 BC) orders 64 hexagrams – states of a six-dimensional binary space – in a pattern that has puzzled scholars for three millennia. We present a rigorous statistical characterization of this ordering using Monte Carlo permutation analysis against 100,000 random baselines. We find that the sequence has four statistically significant properties: higher-than-random transition distance (98.2nd percentile), negative lag-1 autocorrelation (p=0.037), yang-balanced groups of four (p=0.002), and asymmetric within-pair vs. between-pair distances (99.2nd percentile). These properties superficially resemble principles from curriculum learning and curiosity-driven exploration, motivating the hypothesis that they might benefit neural network training. We test this hypothesis through three experiments: learning rate schedule modulation, curriculum ordering, and seed sensitivity analysis, conducted across two hardware platforms (NVIDIA RTX 2060 with PyTorch and Apple Silicon with MLX). The results are uniformly negative. King Wen LR modulation degrades performance at all tested amplitudes. As curriculum ordering, King Wen is the worst non-sequential ordering on one platform and within noise on the other. A 30-seed sweep confirms that only King Wen’s degradation exceeds natural seed variance. We explain why: the sequence’s high variance – the very property that makes it statistically distinctive – destabilizes gradient-based optimization. Anti-habituation in a fixed combinatorial sequence is not the same as effective training dynamics.

关键词: King Wen sequence, I-Ching, statistical properties, neural network training, curriculum learning, gradient optimization, Monte Carlo permutation analysis, anti-habituation

53. ❌ Neural Distribution Prior for LiDAR Out-of-Distribution Detection

作者: Zizhao Li, Zhengkang Xiang, Jiayang Ao, Feng Liu, Joseph West, Kourosh Khoshelham 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09232v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于LiDAR点云数据的OOD检测，提出Neural Distribution Prior框架和Perlin噪声合成策略。虽然属于深度学习在自动驾驶感知领域的应用，但所有评分关键词均针对大语言模型（LLM）及相关技术（如MoE、RLHF、RAG、量化等），而本文完全不涉及语言模型、文本生成、指令调优、推理加速等LLM相关主题。论文内容与所有关键词无直接关联，因此所有关键词评分为0。

!!! tip deepseek-chat TL;DR

该论文针对LiDAR感知中的开放世界OOD检测问题，提出了Neural Distribution Prior框架和OOD合成策略，在SemanticKITTI和STU基准上显著提升了检测性能。

摘要翻译

基于激光雷达的感知因其在弱光照与低能见度条件下的鲁棒性，对自动驾驶至关重要。然而，当前模型在闭集假设下运行，往往无法识别开放世界中意外的分布外（OOD）物体。现有的OOD评分函数性能有限，因为它们忽略了激光雷达OOD检测中固有的显著类别不平衡问题，并假设了均匀的类别分布。为应对这一局限，我们提出神经分布先验（Neural Distribution Prior，NDP）框架，该框架对网络预测的分布结构进行建模，并根据与学习到的分布先验的对齐情况自适应地重新加权OOD分数。NDP通过一个基于注意力的模块动态捕捉训练数据的逻辑值分布模式，并校正与类别相关的置信度偏差。我们进一步引入一种基于Perlin噪声的OOD合成策略，该策略从输入扫描中生成多样化的辅助OOD样本，从而无需外部数据集即可实现稳健的OOD训练。在SemanticKITTI和STU基准上的大量实验表明，NDP显著提升了OOD检测性能，在STU测试集上达到了61.31%的点级平均精度（AP），较先前最佳结果提高了超过10倍。我们的框架兼容多种现有的OOD评分方法，为开放世界的激光雷达感知提供了一个有效的解决方案。

摘要 (Abstract)

LiDAR-based perception is critical for autonomous driving due to its robustness to poor lighting and visibility conditions. Yet, current models operate under the closed-set assumption and often fail to recognize unexpected out-of-distribution (OOD) objects in the open world. Existing OOD scoring functions exhibit limited performance because they ignore the pronounced class imbalance inherent in LiDAR OOD detection and assume a uniform class distribution. To address this limitation, we propose the Neural Distribution Prior (NDP), a framework that models the distributional structure of network predictions and adaptively reweights OOD scores based on alignment with a learned distribution prior. NDP dynamically captures the logit distribution patterns of training data and corrects class-dependent confidence bias through an attention-based module. We further introduce a Perlin noise-based OOD synthesis strategy that generates diverse auxiliary OOD samples from input scans, enabling robust OOD training without external datasets. Extensive experiments on the SemanticKITTI and STU benchmarks demonstrate that NDP substantially improves OOD detection performance, achieving a point-level AP of 61.31% on the STU test set, which is more than 10$\times$ higher than the previous best result. Our framework is compatible with various existing OOD scoring formulations, providing an effective solution for open-world LiDAR perception.

关键词: LiDAR, out-of-distribution detection, Neural Distribution Prior, class imbalance, Perlin noise, autonomous driving, open-world perception, OOD scoring

54. ❌ The Fast Lane Hypothesis: Von Economo Neurons Implement a Biological Speed-Accuracy Tradeoff

作者: Esila Keskin 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09229v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是Von Economo神经元（VENs）在快速社会决策中的生物学计算功能，属于计算神经科学领域。论文内容完全聚焦于生物神经元建模、神经回路模拟和认知功能分析，没有涉及任何大模型、深度学习技术原理、AI应用或相关技术关键词。所有评分关键词都是关于大模型技术、训练方法、推理优化、AI应用等，与论文的生物学计算主题完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了快速通道假说，首次建立了Von Economo神经元的计算模型，证明这些神经元通过稀疏快速通路实现生物速度-准确性权衡，从而加速社会决策，而不影响最终分类准确性。

摘要翻译

冯·埃科诺神经元（Von Economo neurons, VENs）是一种大型双极投射神经元，仅存在于具有复杂社会认知能力的物种（包括人类、类人猿和鲸类）的前扣带皮层（anterior cingulate cortex, ACC）和脑岛前部。它们在额颞叶痴呆（frontotemporal dementia, FTD）中的选择性缺失以及在自闭症中发育异常的现象，提示其参与快速社会决策过程，然而此前尚未存在任何关于VEN功能的计算模型。我们提出“快速通道假说”：VENs通过提供稀疏、快速的投射通路，实现了生物学的速度-准确性权衡（speed-accuracy tradeoff, SAT），从而以牺牲精细加工准确性为代价，支持快速社会决策。我们将VENs建模为具有5毫秒膜时间常数和8个传入稀疏树突扇入的快速漏积分发放（leaky integrate-and-fire, LIF）神经元，而标准锥体神经元则对应20毫秒膜时间常数和80个传入，这些神经元被置于一个包含2000个神经元的脉冲皮层回路中，并在社会辨别任务上进行训练。我们在10个独立随机种子下评估了三种临床相关条件：典型状态（2% VENs）、自闭症样状态（0.4% VENs）和FTD样状态（训练后VEN消融）。所有配置均达到相同的渐近分类准确率（99.4%），这与“VENs调节决策速度而非表征能力”的预测一致。时序分析证实，VENs产生首次发放潜伏期的中位数比锥体神经元早4毫秒。在固定决策阈值下，典型条件显著快于FTD样条件（t=-23.31, p<0.0001），而自闭症样条件介于两者之间（平均反应时=26.91+/-9.01毫秒 vs. 典型条件20.70+/-2.02毫秒；p=0.078）。一项初步的进化分析显示，模型最优VEN比例与灵长类系统发育梯度存在定性对应关系。据我们所知，这是首个探究冯·埃科诺神经元实际计算功能的计算模型。

摘要 (Abstract)

Von Economo neurons (VENs) are large bipolar projection neurons found exclusively in the anterior cingulate cortex (ACC) and frontal insula of species with complex social cognition, including humans, great apes, and cetaceans. Their selective depletion in frontotemporal dementia (FTD) and altered development in autism implicate them in rapid social decision-making, yet no computational model of VEN function has previously existed. We introduce the Fast Lane Hypothesis: VENs implement a biological speed-accuracy tradeoff (SAT) by providing a sparse, fast projection pathway that enables rapid social decisions at the cost of deliberate processing accuracy. We model VENs as fast leaky integrate-and-fire (LIF) neurons with membrane time constant 5 ms and sparse dendritic fan-in of eight afferents, compared to 20 ms and eighty afferents for standard pyramidal neurons, within a spiking cortical circuit of 2,000 neurons trained on a social discrimination task. Networks are evaluated under three clinically motivated conditions across 10 independent random seeds: typical (2% VENs), autism-like (0.4% VENs), and FTD-like (post-training VEN ablation). All configurations achieve equivalent asymptotic classification accuracy (99.4%), consistent with the prediction that VENs modulate decision speed rather than representational capacity. Temporal analysis confirms that VENs produce median first-spike latencies 4 ms earlier than pyramidal neurons. At a fixed decision threshold, the typical condition is significantly faster than FTD-like (t=-23.31, p<0.0001), while autism-like is intermediate (mean RT=26.91+/-9.01 ms vs. typical 20.70+/-2.02 ms; p=0.078). A preliminary evolutionary analysis shows qualitative correspondence between model-optimal VEN fraction and the primate phylogenetic gradient. To our knowledge, this is the first computational model that asks what a Von Economo neuron actually computes.

关键词: Von Economo neurons, speed-accuracy tradeoff, spiking neural network, social decision-making, computational neuroscience, leaky integrate-and-fire, frontotemporal dementia, autism

55. ❌ GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

作者: Yunqiang Wang, Hengyuan Na, Di Wu, Miao Hu, Guocong Quan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09222v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于音频大语言模型（ALLMs）的越狱攻击，核心涉及大语言模型的安全漏洞和攻击方法。因此，仅与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分），因为论文直接研究音频大语言模型。其他关键词如MoE、SLMs、Scaling Laws、各种训练方法（Pre-training, SFT, RLHF等）、推理优化（RAG, Attention）、代理系统、模型压缩、科学AI等均未在论文标题或摘要中提及，与论文内容无关，故评0分。

!!! tip deepseek-chat TL;DR

该论文研究了音频大语言模型（ALLMs）的越狱攻击，提出了一种基于梯度比率掩码的频率选择性攻击框架GRM，在保持模型实用性的同时实现了高越狱成功率。

摘要翻译

音频大语言模型（ALLMs）实现了丰富的语音-文本交互能力，但也引入了音频模态的越狱漏洞。现有音频越狱方法主要优化越狱成功率，却忽视了效用保持——这体现在转录质量和问答性能上。实践中，更强的攻击往往以效用下降为代价。为研究这种权衡关系，我们通过改变攻击在频域的扰动覆盖范围（从部分频带到全频带）重新评估现有攻击方法，发现更宽的频率覆盖并不一定能提升越狱性能，而效用却持续恶化。这表明将扰动集中在部分频带子集上，比无差别的全频带覆盖能产生更好的攻击-效用权衡。基于此发现，我们提出GRM——一个效用感知的频率选择性越狱框架。该框架通过衡量各梅尔频带对攻击贡献与效用敏感度的相对重要性进行排序，仅对选定的频带子集施加扰动，并在语义保持目标下学习可复用的通用扰动。在四个代表性ALLMs上的实验表明，GRM实现了平均88.46%的越狱成功率（JSR），同时相比代表性基线方法提供了更优的攻击-效用权衡。这些结果凸显了频率选择性扰动在音频越狱中更好平衡攻击效果与效用保持的潜力。内容警示：本文包含有害查询示例及不安全的模型响应。

摘要 (Abstract)

Audio large language models (ALLMs) enable rich speech-text interaction, but they also introduce jailbreak vulnerabilities in the audio modality. Existing audio jailbreak methods mainly optimize jailbreak success while overlooking utility preservation, as reflected in transcription quality and question answering performance. In practice, stronger attacks often come at the cost of degraded utility. To study this trade-off, we revisit existing attacks by varying their perturbation coverage in the frequency domain, from partial-band to full-band, and find that broader frequency coverage does not necessarily improve jailbreak performance, while utility consistently deteriorates. This suggests that concentrating perturbation on a subset of bands can yield a better attack-utility trade-off than indiscriminate full-band coverage. Based on this insight, we propose GRM, a utility-aware frequency-selective jailbreak framework. It ranks Mel bands by their attack contribution relative to utility sensitivity, perturbs only a selected subset of bands, and learns a reusable universal perturbation under a semantic-preservation objective. Experiments on four representative ALLMs show that GRM achieves an average Jailbreak Success Rate (JSR) of 88.46% while providing a better attack-utility trade-off than representative baselines. These results highlight the potential of frequency-selective perturbation for better balancing attack effectiveness and utility preservation in audio jailbreak. Content Warning: This paper includes harmful query examples and unsafe model responses.

关键词: Audio Large Language Models, Jailbreak Attacks, Utility Preservation, Frequency-Selective Perturbation, Gradient-Ratio Masking, Attack-Utility Trade-off, Mel Bands, Universal Perturbation

56. ❌ On the Role of DAG topology in Energy-Aware Cloud Scheduling : A GNN-Based Deep Reinforcement Learning Approach

作者: Anas Hattay, Fred Ngole Mboula, Eric Gascard, Zakaria Yahoun 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09202v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

评分理由: 论文研究的是基于图神经网络（GNN）和深度强化学习（DRL）的云工作流调度方法，旨在优化完成时间和能耗。虽然涉及深度学习（GNN和DRL），但所有关键词均与大模型（LLMs）及其相关技术（如MoE、SFT、RLHF、RAG等）或大模型在科学领域的应用（如AI for Science）无关。论文未提及任何语言模型、大模型技术原理或大模型在科学领域的应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文研究了基于图神经网络和深度强化学习的云工作流调度方法，发现其在分布外条件下因结构不匹配而性能下降，揭示了当前方法的局限性并指出需要更鲁棒的表示。

摘要翻译

云服务提供商必须将异构计算资源分配给工作流有向无环图（DAG），同时权衡完成时间、成本和能耗等多个竞争性目标。本研究针对单工作流、无队列的调度场景，探讨了一种基于图神经网络（GNN）的深度强化学习调度器，其设计目标是最小化工作流完成时间和能源消耗。我们识别了特定分布外（OOD）条件，在这些条件下基于GNN的深度强化学习调度器会失效，并从原理上解释了失效原因。通过受控的OOD评估，我们证明性能下降源于训练环境与部署环境之间的结构失配，这种失配扰乱了消息传递机制并破坏了策略的泛化能力。我们的分析揭示了当前基于GNN的调度器的根本局限性，并强调需要构建更鲁棒的表征方法，以确保在分布变化下仍能保持可靠的调度性能。

摘要 (Abstract)

Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.

关键词: Cloud Scheduling, Workflow DAG, Graph Neural Network, Deep Reinforcement Learning, Energy-Aware, Out-of-Distribution, Performance Degradation, Policy Generalization

57. ❌ Artificial intelligence can persuade people to take political actions

作者: Kobi Hackenburg, Luke Hewitt, Caroline Wagner, Ben M. Tappin, Christopher Summerfield 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09200v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究AI的说服力对人们行为的影响，使用了对话式AI模型进行实验，但未具体说明模型类型、架构或技术细节。所有关键词均涉及大模型的技术原理、训练方法、优化技术或特定应用领域（如科学AI），而本文聚焦于AI说服力的行为实验，不涉及这些技术层面的创新或应用，因此所有关键词相关度为0。

!!! tip deepseek-chat TL;DR

该研究通过实验发现AI能有效说服人们采取真实行动（如签署请愿书、捐款），但AI对态度的影响与对行为的影响不相关，且信息提供能解释态度变化但无法解释行为变化。

摘要翻译

关于先进人工智能影响人类行为的能力已引发广泛关注。快速增长的研究表明，AI能够对人们的态度产生显著的说服效果，但其能否说服人们采取具有实际影响的现实行动仍不明确。通过两项大规模预注册实验（共收集14,779人提供的17,950份反馈），我们使用对话式AI模型对参与者在态度与行为层面进行多维度说服干预，包括签署真实请愿书及向慈善机构捐款。研究发现AI对这些行为结果产生了显著的说服效应（例如请愿签署率提升19.7个百分点）。然而，我们未观察到AI在态度层面与行为层面的说服效果存在相关性。此外，我们复现了先前研究中信息供给驱动态度改变的结论，但在行为结果中未发现类似证据。通过对八种行为说服策略的测试，所有策略均优于最有效的态度说服策略，但八种策略间的差异较小。综合而言，这些结果表明，先前依赖态度指标的研究结论可能难以推广至实际行为，因而存在严重误判AI说服在现实世界中行为影响的风险。

摘要 (Abstract)

There is substantial concern about the ability of advanced artificial intelligence to influence people’s behaviour. A rapidly growing body of research has found that AI can produce large persuasive effects on people’s attitudes, but whether AI can persuade people to take consequential real-world actions has remained unclear. In two large preregistered experiments N=17,950 responses from 14,779 people), we used conversational AI models to persuade participants on a range of attitudinal and behavioural outcomes, including signing real petitions and donating money to charity. We found sizable AI persuasion effects on these behavioural outcomes (e.g. +19.7 percentage points on petition signing). However, we observed no evidence of a correlation between AI persuasion effects on attitudes and behaviour. Moreover, we replicated prior findings that information provision drove effects on attitudes, but found no such evidence for our behavioural outcomes. In a test of eight behavioural persuasion strategies, all outperformed the most effective attitudinal persuasion strategy, but differences among the eight were small. Taken together, these results suggest that previous findings relying on attitudinal outcomes may generalize poorly to behaviour, and therefore risk substantially mischaracterizing the real-world behavioural impact of AI persuasion.

关键词: AI persuasion, behavioral outcomes, conversational AI, petition signing, charity donation, attitudinal outcomes, persuasion strategies, real-world actions

58. ❌ Vision Transformers for Preoperative CT-Based Prediction of Histopathologic Chemotherapy Response Score in High-Grade Serous Ovarian Carcinoma

作者: Francesca Fati, Felipe Coutinho, Marika Reinius, Marina Rosanu, Gabriel Funingana, Luigi De Vitis, Gabriella Schivardi, Hannah Clayton, Alice Traversa, Zeyu Gao, Guilherme Penteado, Shangqi Gao, Francesco Pastori, Ramona Woitek, Maria Cristina Ghioni, Giovanni Damiano Aletti, Mercedes Jimenez-Linan, Sarah Burge, Nicoletta Colombo, Evis Sala, Maria Francesca Spadea, Timothy L. Kline, James D. Brenton, Jaime Cardoso, Francesco Multinu, Elena De Momi, Mireia Crispin-Ortuzar, Ines P. Machado 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09197v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于医学影像分析领域，使用Vision Transformer进行卵巢癌化疗反应预测，属于AI在生物医学（Bioinformatics）的应用，与关键词’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分）。然而，论文未涉及大语言模型（LLMs）、MoE、缩放定律、训练技术（预训练、微调、对齐、RLHF、PEFT）、推理优化（RAG、上下文扩展、注意力优化）、推理方法（CoT、系统2思维、MCTS）、自我改进、智能体、工具使用、多智能体系统、模型压缩、解码加速、幻觉缓解、可解释性、世界模型、模型合并或上下文学习等主题，因此这些关键词均得0分。

!!! tip deepseek-chat TL;DR

该研究开发了一个基于Vision Transformer的多模态深度学习框架，利用术前CT影像和临床数据预测高级别浆液性卵巢癌的化疗反应评分，在内部测试集上取得了0.95的ROC-AUC和95%的准确率。

摘要翻译

目的。高级别浆液性卵巢癌（High-grade serous ovarian carcinoma, HGSOC）具有显著的生物学和空间异质性，且常在晚期确诊。对于不适合直接进行初次肿瘤细胞减灭术的患者，通常采用新辅助化疗（Neoadjuvant chemotherapy, NACT）后进行延迟初次手术的治疗策略。化疗反应评分（Chemotherapy Response Score, CRS）是一种经过验证的、用于评估NACT反应的组织病理学生物标志物，但其仅在术后才能获得。本研究旨在探讨，是否可以利用治疗前的计算机断层扫描（computed tomography, CT）影像和临床数据来预测CRS，作为一种研究性的决策支持辅助工具，为多学科团队（multidisciplinary team, MDT）讨论预期治疗反应提供信息。方法。我们提出了一种2.5D多模态深度学习框架，该框架使用预训练的视觉变换器（Vision Transformer）编码器处理病灶密集的大网膜切片，并通过一个中间融合模块将得到的视觉表征与临床变量整合，以预测CRS。结果。我们整合影像与临床数据的多模态模型在内部测试队列（IEO，n=41例患者）中取得了0.95的受试者工作特征曲线下面积（ROC-AUC），同时准确率为95%，精确率为80%。在外部测试集（OV04，n=70例患者）中，其ROC-AUC为0.68，准确率为67%，精确率为75%。结论。这些初步结果表明，基于变换器的深度学习模型利用常规临床数据和CT影像，对HGSOC患者进行术前CRS预测是可行的。作为一种研究性的、治疗前决策支持工具，该方法可通过提供早期、无创的治疗反应预估，辅助MDT的讨论。

摘要 (Abstract)

Purpose. High-grade serous ovarian carcinoma (HGSOC) is characterized by pronounced biological and spatial heterogeneity and is frequently diagnosed at an advanced stage. Neoadjuvant chemotherapy (NACT) followed by delayed primary surgery is commonly employed in patients unsuitable for primary cytoreduction. The Chemotherapy Response Score (CRS) is a validated histopathological biomarker of response to NACT, but it is only available postoperatively. In this study, we investigate whether pre-treatment computed tomography (CT) imaging and clinical data can be used to predict CRS as an investigational decision-support adjunct to inform multidisciplinary team (MDT) discussions regarding expected treatment response. Methods. We proposed a 2.5D multimodal deep learning framework that processes lesion-dense omental slices using a pre-trained Vision Transformer encoder and integrates the resulting visual representations with clinical variables through an intermediate fusion module to predict CRS. Results. Our multimodal model, integrating imaging and clinical data, achieved a ROC-AUC of 0.95 alongside 95% accuracy and 80% precision on the internal test cohort (IEO, n=41 patients). On the external test set (OV04, n=70 patients), it achieved a ROC-AUC of 0.68, alongside 67% accuracy and 75% precision. Conclusion. These preliminary results demonstrate the feasibility of transformer-based deep learning for preoperative prediction of CRS in HGSOC using routine clinical data and CT imaging. As an investigational, pre-treatment decision-support tool, this approach may assist MDT discussions by providing early, non-invasive estimates of treatment response.

关键词: Vision Transformer, deep learning, ovarian carcinoma, chemotherapy response, CT imaging, multimodal framework, preoperative prediction, medical AI

59. ❌ Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation

作者: Haobo Hu, Qi Mao, Yuanhang Li, Libiao Jin 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09195v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出Camera Artist，一个用于电影语言叙事视频生成的多智能体框架。该研究直接与’LLM Agents OR Autonomous Agents OR Agentic Workflow’和’Multi-agent Systems OR Agent Coordination’高度相关，因为其核心是构建一个模拟真实电影制作流程的多智能体系统，并引入了专门的Cinematography Shot Agent来增强叙事连续性和电影语言注入。然而，论文摘要未提及任何大模型（LLM）的具体使用、训练技术（如预训练、微调、对齐）、推理优化（如注意力机制、解码加速）、模型压缩或特定科学领域应用，因此其他关键词均得0分。

!!! tip deepseek-chat TL;DR

该研究解决了现有多智能体视频生成系统在跨镜头叙事连续性和电影语言运用方面的不足，提出了Camera Artist框架，通过引入专门的电影摄影智能体和递归故事板生成，显著提升了生成视频的叙事一致性、动态表现力和电影质量。

摘要翻译

我们提出Camera Artist，这是一个多智能体框架，它模拟真实世界的电影制作工作流程，以生成具有明确电影语言的叙事视频。尽管近期的多智能体系统在从剧本到视频的自动化电影制作流程方面取得了显著进展，但它们往往缺乏明确的机制来构建相邻镜头间的叙事推进，以及对电影语言的刻意运用，导致叙事碎片化且电影化质量有限。为解决这一问题，Camera Artist在现有智能体流程基础上，引入了一个专门的电影摄影镜头智能体（Cinematography Shot Agent）。该智能体整合了递归式故事板生成，以增强镜头间的叙事连贯性，并通过注入电影语言来生成更具表现力、更符合电影导向的镜头设计。大量的定量与定性实验结果表明，我们的方法在叙事一致性、动态表现力和感知电影质量方面均持续优于现有基线。

摘要 (Abstract)

We propose Camera Artist, a multi-agent framework that models a real-world filmmaking workflow to generate narrative videos with explicit cinematic language. While recent multi-agent systems have made substantial progress in automating filmmaking workflows from scripts to videos, they often lack explicit mechanisms to structure narrative progression across adjacent shots and deliberate use of cinematic language, resulting in fragmented storytelling and limited filmic quality. To address this, Camera Artist builds upon established agentic pipelines and introduces a dedicated Cinematography Shot Agent, which integrates recursive storyboard generation to strengthen shot-to-shot narrative continuity and cinematic language injection to produce more expressive, film-oriented shot designs. Extensive quantitative and qualitative results demonstrate that our approach consistently outperforms existing baselines in narrative consistency, dynamic expressiveness, and perceived film quality.

关键词: multi-agent framework, cinematic language, video generation, narrative continuity, storyboard generation, filmmaking workflow, shot-to-shot progression, Cinematography Shot Agent

60. ❌ Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

作者: Avni Mittal 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09189v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs通过RLHF获得的安全策略与其实践行为的一致性，高度相关关键词包括：LLMs（研究对象）、RLHF（安全策略来源）、Alignment（安全对齐）、Self-Reflection（自我一致性审计）、Factuality（行为真实性）、Explainable AI（策略可解释性）。Chain of Thought和System 2 Thinking得5分，因为论文提到推理模型在自我一致性方面表现最好，但并非核心方法。其余关键词与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文研究了LLMs通过RLHF获得的安全策略与其实际行为之间的系统性差距，通过Symbolic-Neural Consistency Audit框架发现模型经常违反自己声明的安全规则，且这种差距是可测量且与架构相关的。

摘要翻译

大型语言模型通过人类反馈强化学习内化了安全策略，但这些策略从未被正式定义且难以审查。现有基准测试依据外部标准评估模型，却无法衡量模型是否理解并执行其自我声明的边界。我们提出符号-神经一致性审计框架，该框架（1）通过结构化提示提取模型自我声明的安全规则，（2）将其形式化为类型化谓词（绝对型、条件型、自适应型），（3）通过基于伤害基准的确定性比较来测量行为合规性。通过对四个前沿模型在45个伤害类别和47,496次观测中的评估，揭示了声明策略与观测行为间的系统性差距：声称绝对拒绝的模型频繁遵从有害提示，具备推理能力的模型实现了最高的自我一致性但未能为29%的类别阐明策略，且跨模型在规则类型上的一致性极低（11%）。这些结果表明，大型语言模型言与行之间的差距是可测量且架构依赖的，这促使我们将自反一致性审计作为行为基准测试的重要补充。

摘要 (Abstract)

LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model’s self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.

关键词: LLMs, safety policies, RLHF, self-consistency, behavioral compliance, reflexive audit, harm benchmarks, Symbolic-Neural Consistency Audit

61. ❌ Persona-E$^2$: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events

作者: Yuqin Yang, Haowu Zhou, Haoran Tu, Zhiwen Hui, Shiqi Yan, HaoYang Li, Dong She, Xianrong Yao, Yang Gao, Zhanpeng Jin 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09162v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究LLMs在情感计算中的应用，特别是人格对情感反应的影响，因此与’Large Language Models’高度相关（10分）。论文涉及LLMs的角色扮演、人格幻觉等应用问题，但未涉及其他关键词的具体技术细节（如MoE、SFT、RAG等），因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

该论文针对LLMs在模拟人格化情感反应时存在的'人格幻觉'问题，构建了Persona-E²数据集，并通过实验发现人格信息（特别是大五人格）能显著提升LLMs对情感变化的理解能力。

摘要翻译

当前多数情感计算研究将情感视为文本的静态属性，聚焦于作者的情感表达而忽视了读者视角。这种方法忽略了个体人格如何导致对同一事件产生多样化的情感评价。尽管角色扮演大语言模型试图模拟此类细微反应，但它们常陷入“人格幻觉”——依赖表面刻板印象而非真实的认知逻辑。一个关键瓶颈在于缺乏将人格特质与情感变化相连接的真实人类数据。为填补这一空白，我们提出了Persona-E$^2$（人格-事件到情感）数据集，该大规模数据集基于标注的MBTI（迈尔斯-布里格斯类型指标）和五大人格特质构建，旨在捕捉读者在新闻、社交媒体及生活叙事中的情感差异。大量实验表明，现有最先进的大语言模型难以准确捕捉情感评价的细微变化，尤其在社交媒体领域。关键的是，我们发现人格信息能显著提升模型的理解能力，其中五大人格特质有效缓解了“人格幻觉”问题。

摘要 (Abstract)

Most affective computing research treats emotion as a static property of text, focusing on the writer’s sentiment while overlooking the reader’s perspective. This approach ignores how individual personalities lead to diverse emotional appraisals of the same event. Although role-playing Large Language Models (LLMs) attempt to simulate such nuanced reactions, they often suffer from “personality illusion’’ – relying on surface-level stereotypes rather than authentic cognitive logic. A critical bottleneck is the absence of ground-truth human data to link personality traits to emotional shifts. To bridge the gap, we introduce Persona-E$^2$ (Persona-Event2Emotion), a large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. Extensive experiments reveal that state-of-the-art LLMs struggle to capture precise appraisal shifts, particularly in social media domains. Crucially, we find that personality information significantly improves comprehension, with the Big Five traits alleviating “personality illusion.’

关键词: Large Language Models, Personality, Emotion, Dataset, Persona-E², MBTI, Big Five, Personality Illusion

62. ❌ Structuring versus Problematizing: How LLM-based Agents Scaffold Learning in Diagnostic Reasoning

作者: Fatma Betül Güreş, Tanya Nazaretsky, Seyed Parsa Neshaei, Tanja Käser 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09158v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文明确研究LLM-based agents在教育场景中的应用，核心是LLM agents和诊断推理（与CoT/System 2相关），属于AI for Science（教育科学）领域。其他关键词如MoE、SFT、RAG等未涉及。

!!! tip deepseek-chat TL;DR

该研究探讨了基于LLM的药剂师代理在药学培训中采用结构化与问题化两种支架方法如何影响学生的诊断推理表现，发现两种方法均有效支持诊断策略使用，性能主要受场景复杂度影响而非先验知识或支架方法。

摘要翻译

支持学生发展诊断推理能力是跨教育领域的一项关键挑战。新手常面临过早定论和过度依赖启发式方法等认知偏差，并且难以将诊断策略迁移至新案例中。基于学习分析和大语言模型增强的情境学习，通过结合真实案例经验与个性化支架，提供了一种前景广阔的解决方案。然而，不同支架方法如何塑造推理过程仍未得到充分探索。本研究介绍了PharmaSim Switch——一个用于药学技师培训的情境学习环境，其扩展了一个由学习分析和大语言模型驱动的药师智能体。该智能体基于两种理论驱动的支架方法（即结构化与问题化）以及学生学习轨迹，实施教学对话。在一项组间实验中，63名职业学生在两种支架条件之一下，完成了一个学习情境、一个近迁移情境和一个远迁移情境。结果表明，两种支架方法均能有效支持诊断策略的运用。学习表现主要受情境复杂性影响，而非学生先备知识或所使用的支架方法。结构化方法关联于更准确的主动与互动式参与，而问题化方法则引发了更多的建构性投入。这些发现强调了在设计基于学习分析和大语言模型的系统时，结合多种支架方法对于有效培养诊断推理能力的重要价值。

摘要 (Abstract)

Supporting students in developing diagnostic reasoning is a key challenge across educational domains. Novices often face cognitive biases such as premature closure and over-reliance on heuristics, and they struggle to transfer diagnostic strategies to new cases. Scenario-based learning (SBL) enhanced by Learning Analytics (LA) and large language models (LLM) offers a promising approach by combining realistic case experiences with personalized scaffolding. Yet, how different scaffolding approaches shape reasoning processes remains insufficiently explored. This study introduces PharmaSim Switch, an SBL environment for pharmacy technician training, extended with an LA- and LLM-powered pharmacist agent that implements pedagogical conversations rooted in two theory-driven scaffolding approaches: \emph{structuring} and \emph{problematizing}, as well as a student learning trajectory. In a between-groups experiment, 63 vocational students completed a learning scenario, a near-transfer scenario, and a far-transfer scenario under one of the two scaffolding conditions. Results indicate that both scaffolding approaches were effective in supporting the use of diagnostic strategies. Performance outcomes were primarily influenced by scenario complexity rather than students’ prior knowledge or the scaffolding approach used. The structuring approach was associated with more accurate Active and Interactive participation, whereas problematizing elicited more Constructive engagement. These findings underscore the value of combining scaffolding approaches when designing LA- and LLM-based systems to effectively foster diagnostic reasoning.

关键词: LLM-based agents, diagnostic reasoning, scaffolding, scenario-based learning, pharmacy technician training, structuring, problematizing, learning analytics

63. ❌ CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation

作者: Yushi Feng, Junye Du, Qifan Wang, Zizhan Ma, Qian Niu, Yutaka Matsuo, Long Feng, Lequan Yu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09155v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究GUI自动化代理的安全保障框架，核心是自主代理（LLM Agents）的安全控制机制，因此该关键词得10分。论文使用Vision Language Models（VLMs），属于大模型的一种应用，因此Large Language Models得5分。论文提到Diagnostician模型进行多模态推理以推荐干预措施（如reflect），这涉及自我反思机制，因此Self-Correction得5分。其他关键词如MoE、SFT、RAG、CoT等均未在摘要中提及或与论文主题无关，得0分。

!!! tip deepseek-chat TL;DR

论文提出了CORA框架，通过风险控制机制为自主GUI代理提供统计安全保障，解决了现有方法缺乏形式化验证和用户可调保证的问题，并在新基准Phone-Harm上验证了其安全-有用性-中断的帕累托改进。

摘要翻译

基于视觉语言模型（VLM）的图形用户界面（GUI）智能体正迅速从被动辅助转向自主操作。然而，这种不受限制的行动空间使用户面临严重且不可逆的财务、隐私或社会危害风险。现有防护机制依赖于提示工程、脆弱的启发式方法以及VLM作为评判者，缺乏形式化验证和用户可调的保障。我们提出CORA（COnformal Risk-controlled GUI Agent，符合性风险控制GUI智能体），一种策略后、行动前的防护框架，为已执行的有害行动提供统计性保障。CORA将安全性重新定义为选择性行动执行：我们训练一个守护者模型来估计每个拟执行步骤的条件风险。不同于对原始分数进行简单阈值处理，我们利用符合性风险控制方法来校准一个满足用户指定风险预算的执行/放弃边界，并将被拒绝的行动路由至一个可训练的诊断模型；该模型对被拒行动进行多模态推理，以推荐干预措施（例如确认、反思或中止），从而最小化用户负担。目标锁定机制将评估锚定于一个经过澄清且冻结的用户意图，以抵御视觉注入攻击。为严格评估此范式，我们引入了Phone-Harm——一个在真实场景下包含步骤级危害标签的移动端安全违规新基准。在Phone-Harm及公开基准上针对多种基线的实验验证表明，CORA改善了安全性-实用性-中断性的帕累托前沿，为自主GUI执行提供了一个实用且基于统计的安全范式。代码与基准测试集可在cora-agent.github.io获取。

摘要 (Abstract)

Graphical user interface (GUI) agents powered by vision language models (VLMs) are rapidly moving from passive assistance to autonomous operation. However, this unrestricted action space exposes users to severe and irreversible financial, privacy or social harm. Existing safeguards rely on prompt engineering, brittle heuristics and VLM-as-critic lack formal verification and user-tunable guarantees. We propose CORA (COnformal Risk-controlled GUI Agent), a post-policy, pre-action safeguarding framework that provides statistical guarantees on harmful executed actions. CORA reformulates safety as selective action execution: we train a Guardian model to estimate action-conditional risk for each proposed step. Rather than thresholding raw scores, we leverage Conformal Risk Control to calibrate an execute/abstain boundary that satisfies a user-specified risk budget and route rejected actions to a trainable Diagnostician model, which performs multimodal reasoning over rejected actions to recommend interventions (e.g., confirm, reflect, or abort) to minimize user burden. A Goal-Lock mechanism anchors assessment to a clarified, frozen user intent to resist visual injection attacks. To rigorously evaluate this paradigm, we introduce Phone-Harm, a new benchmark of mobile safety violations with step-level harm labels under real-world settings. Experiments on Phone-Harm and public benchmarks against diverse baselines validate that CORA improves the safety–helpfulness–interruption Pareto frontier, offering a practical, statistically grounded safety paradigm for autonomous GUI execution. Code and benchmark are available at cora-agent.github.io.

关键词: GUI agents, vision language models, safety safeguards, conformal risk control, autonomous operation, harmful action prevention, multimodal reasoning, statistical guarantees

64. ❌ EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers

作者: Yi-Lun Liao, Alexander J. Hoffman, Sabrina C. Shen, Alexandre Duval, Sam Walton Norwood, Tess Smidt 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09130v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于SE(3)-等变图神经网络在3D原子建模中的应用，属于AI for Science领域，特别是材料科学和化学信息学。论文的核心贡献是改进Equiformer架构的效率、表达能力和物理一致性，用于准确建模势能面。所有其他关键词（如LLMs、MoE、RLHF、RAG等）均与自然语言处理、大模型技术或通用AI方法相关，而本文研究的是专门用于科学计算的图神经网络，因此除’AI for Science OR Bioinformatics OR Cheminformatics’外，其他关键词均不相关。

!!! tip deepseek-chat TL;DR

该论文提出了EquiformerV3，一种改进的SE(3)-等变图注意力Transformer，通过软件优化、架构修改和新激活函数，提高了效率、表达能力和泛化性，在多个材料科学基准上实现了最先进的性能。

摘要翻译

随着$SE(3)$-等变图神经网络成为三维原子建模的核心工具，提升其效率、表达能力和物理一致性已成为大规模应用的关键挑战。本研究推出EquiformerV3，即第三代$SE(3)$-等变图注意力Transformer，旨在同时推进效率、表达能力和通用性这三个维度。基于EquiformerV2，我们实现了以下三项关键进展。首先，我们优化了软件实现，获得了$1.75\times$的加速。其次，我们对EquiformerV2进行了简单而有效的改进，包括等变合并层归一化、改进的前馈网络超参数以及采用平滑半径截断的注意力机制。第三，我们提出了SwiGLU-$S^2$激活函数，以引入多体相互作用来提升理论表达能力，并在降低$S^2$网格采样复杂度的同时保持严格的等变性。SwiGLU-$S^2$激活函数与平滑截断注意力机制共同实现了对平滑变化势能面（PES）的精确建模，使EquiformerV3能够泛化至需要能量守恒模拟及PES高阶导数计算的任务。凭借这些改进，通过去噪非平衡结构（DeNS）辅助任务训练的EquiformerV3，在OC20、OMat24和Matbench Discovery数据集上取得了最先进的结果。

摘要 (Abstract)

As $SE(3)$-equivariant graph neural networks mature as a core tool for 3D atomistic modeling, improving their efficiency, expressivity, and physical consistency has become a central challenge for large-scale applications. In this work, we introduce EquiformerV3, the third generation of the $SE(3)$-equivariant graph attention Transformer, designed to advance all three dimensions: efficiency, expressivity, and generality. Building on EquiformerV2, we have the following three key advances. First, we optimize the software implementation, achieving $1.75\times$ speedup. Second, we introduce simple and effective modifications to EquiformerV2, including equivariant merged layer normalization, improved feedforward network hyper-parameters, and attention with smooth radius cutoff. Third, we propose SwiGLU-$S^2$ activations to incorporate many-body interactions for better theoretical expressivity and to preserve strict equivariance while reducing the complexity of sampling $S^2$ grids. Together, SwiGLU-$S^2$ activations and smooth-cutoff attention enable accurate modeling of smoothly varying potential energy surfaces (PES), generalizing EquiformerV3 to tasks requiring energy-conserving simulations and higher-order derivatives of PES. With these improvements, EquiformerV3 trained with the auxiliary task of denoising non-equilibrium structures (DeNS) achieves state-of-the-art results on OC20, OMat24, and Matbench Discovery.

关键词: SE(3)-equivariant graph neural networks, 3D atomistic modeling, graph attention Transformer, potential energy surfaces, material science, EquiformerV3, SwiGLU-S^2 activations, state-of-the-art

65. ❌ Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

作者: Peng Wang, Yanqiao Zhu, Zixuan Jiang, Qinyuan Chen, Xingjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09121v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是使用LLM作为评估器（LLM-as-a-Judge）和驱动代理框架（LLM-driven agent framework）来改进自动语音识别（ASR）的语义评估和交互式修正，因此与’Large Language Models’和’LLM Agents’高度相关（10分）。论文提到’iterative refinement’和’feedback’，与’Self-Correction’有一定关联（5分）。其他关键词如MoE、SFT、RAG等未在摘要中提及，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对自动语音识别中语义评估不足和交互式修正缺乏的问题，提出了一个基于LLM的代理框架，使用LLM作为语义评估器并驱动多轮交互，实验证明该框架能有效提升语义保真度和交互修正能力。

摘要翻译

近年来，得益于模型架构和大规模训练数据的进步，自动语音识别（ASR）领域取得了显著进展。然而，仍有两大重要方面尚未得到充分探索。首先，作为主导数十年的核心评估指标，词错误率（Word Error Rate, WER）对所有词汇一视同仁，往往无法在句子层面反映话语的语义正确性。其次，交互式修正——人类沟通的关键组成部分——在ASR研究中鲜有被系统性地探讨。本文中，我们将这两个视角整合到一个面向交互式ASR的智能体框架下。我们提出利用“大语言模型即评判者”（LLM-as-a-Judge）作为一种语义感知的评估指标，以超越词元级准确度来衡量识别质量。此外，我们设计了一个由大语言模型驱动的智能体框架，以模拟类人多轮交互，从而通过语义反馈实现对识别结果的迭代优化。我们在多个标准基准上进行了广泛实验，包括GigaSpeech（英语）、WenetSpeech（中文）以及ASRU 2019语码转换测试集。客观评估与主观评估均表明，所提框架在提升语义保真度与交互修正能力方面具有显著效果。我们将公开代码，以促进未来在交互式与智能体化ASR方向的研究。

摘要 (Abstract)

Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction-an essential component of human communication-has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.

关键词: Interactive ASR, LLM-as-a-Judge, semantic evaluation, agentic framework, multi-turn interaction, semantic feedback, automatic speech recognition, iterative refinement

66. ❌ PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

作者: Changi Hong, Yoonah Song, Hwayoung Park, Chaewoon Bang, Dayeon Gu, Do Hyun Lee, Hong Kook Kim 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09111v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究自动配音中的语音同步问题，提出了一种结合语言模型进行文本改写以实现时长同步和语音同步的方法。论文仅在第1步的等时性处理中使用了语言模型（未指定具体类型），因此与’Large Language Models OR LLMs OR Foundation Models’有一定关联（5分）。其他关键词均未在论文标题或摘要中提及，与论文核心内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了一种用于自动配音的语音同步方法PS-TTS，通过语言模型改写翻译文本实现时长同步，并引入语音同步技术来保持唇部同步，实验表明该方法在多种语言对中优于传统TTS系统和真人配音。

摘要翻译

近年来，基于人工智能的配音技术取得进展，使自动配音技术能够将视频中的源语音转换为不同语言的目标语音。然而，自然的自动配音仍面临时长与唇部同步等关键同步挑战，这对保持观众体验至关重要。为此，本文提出一种基于译文改写的自动配音同步方法，包含两个步骤：满足时长约束的等时性处理与保持唇形同步的语音同步。首先，我们通过语言模型对译文进行改写以实现等时性，确保目标语音时长与源语音一致。其次，我们引入语音同步方法，该方法采用动态时间规整算法，并基于训练数据测算的元音距离作为局部代价，使目标文本的元音构成与源语言元音发音相似。再次，我们将此方法扩展为PSComet，其同时考虑语义与语音相似性以更好地保留原意。所提方法被集成至文本转语音系统PS-TTS与PS-Comet TTS中。基于韩语和英语唇读数据集及专业配音演员数据集进行的性能评估表明，在多项客观指标上，两个系统均优于未使用语音同步的TTS系统，并在韩英与英韩双向配音任务中表现优于配音演员。我们将实验扩展至法语，测试所有语言对以评估跨语言适用性。在所有语言对中，PS-Comet均表现最佳，在唇形同步准确性与语义保留间取得平衡，证实其相比单一语音同步方法能在保持语义的同时实现更精确的唇形同步。

摘要 (Abstract)

Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, natural AD still faces synchronization challenges such as duration and lip-synchronization (lip-sync), which are crucial for preserving the viewer experience. Therefore, this paper proposes a synchronization method for AD processes that paraphrases translated text, comprising two steps: isochrony for timing constraints and phonetic synchronization (PS) to preserve lip-sync. First, we achieve isochrony by paraphrasing the translated text with a language model, ensuring the target speech duration matches that of the source speech. Second, we introduce PS, which employs dynamic time warping (DTW) with local costs of vowel distances measured from training data so that the target text composes vowels with pronunciations similar to source vowels. Third, we extend this approach to PSComet, which jointly considers semantic and phonetic similarity to preserve meaning better. The proposed methods are incorporated into text-to-speech systems, PS-TTS and PS-Comet TTS. The performance evaluation using Korean and English lip-reading datasets and a voice-actor dubbing dataset demonstrates that both systems outperform TTS without PS on several objective metrics and outperform voice actors in Korean-to-English and English-to-Korean dubbing. We extend the experiments to French, testing all pairs among these languages to evaluate cross-linguistic applicability. Across all language pairs, PS-Comet performed best, balancing lip-sync accuracy with semantic preservation, confirming that PS-Comet achieves more accurate lip-sync with semantic preservation than PS alone.

关键词: automated dubbing, phonetic synchronization, text-to-speech, lip-sync, dynamic time warping, language model, cross-linguistic, PS-Comet

67. ❌ TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

作者: Chenhao Ye, Huaizheng Zhang, Mingcong Han, Baoquan Zhong, Xiang Li, Qixiang Chen, Xinyi Zhang, Weidong Zhang, Kaihua Jiang, Wang Zhang, He Sun, Wencong Xiao, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09107v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM强化学习训练中的权重传输系统优化，与’Large Language Models’和’RLHF’高度相关（10分），因为论文明确针对LLM RL训练，而RLHF是RL训练的一种重要方法。其他关键词如MoE、SLMs、Scaling Laws、Pre-training、PEFT、RAG等均未在论文中涉及，因此得0分。

!!! tip deepseek-chat TL;DR

论文针对LLM强化学习训练中权重传输效率低的问题，提出了Reference-Oriented Storage（ROS）抽象和TensorHub系统，显著提升了训练性能并减少了GPU停滞时间。

摘要翻译

现代大语言模型强化学习工作负载需要高效的权重传输系统，以在异构计算资源间扩展训练规模。然而，现有的权重传输方法要么无法为动态扩展集群提供灵活性，要么产生固有的数据移动开销，导致性能低下。
我们提出面向引用的存储（Reference-Oriented Storage，ROS），这是一种用于强化学习权重传输的新型存储抽象，它就地利用高度复制的模型权重。ROS营造出特定版本的模型权重已被存储并可按需获取的假象。在底层，ROS并不物理存储任何权重复制；相反，它跟踪那些在GPU上持有这些权重以进行推理的工作节点。当收到请求时，ROS直接利用这些节点来提供读取服务。我们构建了TensorHub，这是一个生产级系统，它通过拓扑优化传输、强一致性和容错机制扩展了ROS理念。评估表明，TensorHub能够完全饱和RDMA带宽，并以最小的工程代价适配三种不同的推演工作负载。具体而言，对于独立推演，TensorHub将GPU总停滞时间最多降低6.7倍；对于弹性推演，将权重更新速度提升4.8倍；对于跨数据中心推演，将停滞时间缩短19倍。TensorHub已部署于生产环境，用于支持前沿的强化学习训练。

摘要 (Abstract)

Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either fail to provide flexibility for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance. We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS directly uses them to serve reads. We build TensorHub, a production-quality system that extends the ROS idea with topology-optimized transfer, strong consistency, and fault tolerance. Evaluation shows that TensorHub fully saturates RDMA bandwidth and adapts to three distinct rollout workloads with minimal engineering effort. Specifically, TensorHub reduces total GPU stall time by up to 6.7x for standalone rollouts, accelerates weight update for elastic rollout by 4.8x, and cuts cross-datacenter rollout stall time by 19x. TensorHub has been deployed in production to support cutting-edge RL training.

关键词: LLM reinforcement learning, weight transfer system, Reference-Oriented Storage, TensorHub, scalable training, GPU stall reduction, RDMA bandwidth, production deployment

68. ❌ Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence

作者: Tommy Shaffer Shane, Simon Mylius, Hamish Hobbs 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09104v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究AI scheming（阴谋行为）的检测，重点关注AI系统在现实世界中追求未对齐目标的行为。核心相关关键词：1）‘Large Language Models’（10分）- 研究基于聊天机器人对话，属于大模型应用；2）‘Instruction Tuning OR Alignment OR Value Alignment’（10分）- scheming本质是AI未对齐问题，涉及目标对齐和价值观对齐；3）‘LLM Agents OR Autonomous Agents OR Agentic Workflow’（10分）- 研究AI代理在现实世界中的自主行为，包括规避保障措施、欺骗用户等。其他关键词如MoE、量化、推理加速等与论文内容无关，均给0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于开源情报的方法，通过分析在线聊天记录来检测现实世界中AI系统的阴谋行为，在2025年10月至2026年3月期间识别出698起相关事件，并发现事件数量呈显著增长趋势。

摘要翻译

“密谋”（Scheming），即人工智能系统暗中追求与人类目标错位的目标，代表了一种潜在的灾难性风险，然而相关研究存在显著局限。具体而言，当前的密谋评估所展示的行为可能不会在现实场景中出现，这限制了对该现象的科学理解，阻碍了政策制定，也无法实现对失控事件的实时侦测。获取现实世界证据至关重要，但现有监控技术对此效果有限。本文提出了一种新颖的开源情报（OSINT）方法，用于侦测现实世界中的密谋事件：收集并分析在线分享的聊天机器人对话或命令行交互记录。通过分析来自X平台（原Twitter）的超过183,420条记录，我们识别出2025年10月至2026年3月期间发生的698起现实世界密谋相关事件。我们观察到，从首月至末月，月度事件数量出现了统计上显著的4.9倍增长，而同期讨论密谋的帖子数量仅增长1.7倍。我们发现了先前仅在实验中报告过的多种密谋相关行为在现实部署中的证据，其中许多已导致现实危害。尽管我们未检测到灾难性的密谋事件，但观察到的行为显示出令人担忧的先兆，例如：愿意无视指令、规避安全措施、对用户撒谎，以及以有害方式执着地追求目标。随着人工智能系统能力不断增强，这些行为可能演变为更具战略性的密谋，并带来潜在的灾难性后果。我们的研究结果证明了基于交互记录的OSINT方法作为一种可扩展的现实世界密谋侦测手段是可行的，能够支持科学研究、政策制定和应急响应。我们建议进一步投资于利用OSINT技术来监控密谋及失控现象。

摘要 (Abstract)

Scheming, the covert pursuit of misaligned goals by AI systems, represents a potentially catastrophic risk, yet scheming research suffers from significant limitations. In particular, scheming evaluations demonstrate behaviours that may not occur in real-world settings, limiting scientific understanding, hindering policy development, and not enabling real-time detection of loss of control incidents. Real-world evidence is needed, but current monitoring techniques are not effective for this purpose. This paper introduces a novel open-source intelligence (OSINT) methodology for detecting real-world scheming incidents: collecting and analysing transcripts from chatbot conversations or command-line interactions shared online. Analysing over 183,420 transcripts from X (formerly Twitter), we identify 698 real-world scheming-related incidents between October 2025 and March 2026. We observe a statistically significant 4.9x increase in monthly incidents from the first to last month, compared to a 1.7x increase in posts discussing scheming. We find evidence of multiple scheming-related behaviours in real-world deployments previously reported only in experiments, many resulting in real-world harms. While we did not detect catastrophic scheming incidents, the behaviours observed demonstrate concerning precursors, such as willingness to disregard instructions, circumvent safeguards, lie to users, and single-mindedly pursue goals in harmful ways. As AI systems become more capable, these could evolve into more strategic scheming with potentially catastrophic consequences. Our findings demonstrate the viability of transcript-based OSINT as a scalable approach to real-world scheming detection supporting scientific research, policy development, and emergency response. We recommend further investment towards OSINT techniques for monitoring scheming and loss of control.

关键词: AI scheming, real-world detection, open-source intelligence, alignment, AI safety, chatbot transcripts, loss of control, autonomous agents

69. ❌ DeepGuard: Secure Code Generation via Multi-Layer Semantic Aggregation

作者: Li Huang, Zhongxin Liu, Yifan Wu, Tao Yin, Dong Li, Jichao Bi, Nankun Mu, Hongyu Zhang, Meng Yan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09089v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于代码生成LLMs的安全性问题，核心贡献是提出DeepGuard框架，通过聚合多层语义表示来增强安全分析。论文明确涉及LLMs和fine-tuning（特别是SFT），因此这两个关键词高度相关（10分）。其他关键词如MoE、SLMs、Scaling Laws、RAG、Agents等均未在论文中提及或相关，故得0分。

!!! tip deepseek-chat TL;DR

论文研究了代码生成大语言模型（LLMs）的安全漏洞问题，提出DeepGuard框架通过多层语义聚合来增强安全分析，实验表明该方法在保持功能正确性的同时将安全正确生成率平均提高了11.9%。

摘要翻译

用于代码生成的大语言模型（LLM）可能从其训练数据中复制不安全的模式。为缓解此问题，一种常见的安全强化策略是利用源自最终Transformer层的监督信号对模型进行微调。然而，这种设计可能面临最终层瓶颈问题：漏洞判别性线索可能分布在多个层中，并在接近为下一词元预测而优化的输出表示时变得难以检测。为诊断此问题，我们进行了逐层线性探测。我们观察到，与漏洞相关的信号在中间至上层的一个波段中最为明显，但在接近最终层时逐渐衰减。基于这一观察，我们提出了DeepGuard框架，该框架通过基于注意力的模块聚合多个上层的表示，从而利用分布式的安全相关线索。聚合后的信号驱动一个专用的安全分析器，该分析器被置于平衡安全增强与功能正确性的多目标训练框架中，并进一步支持轻量级的推理时引导策略。在五种代码大语言模型上的大量实验表明，DeepGuard相较于SVEN等强基线方法，将安全且正确的代码生成率平均提升了11.9%。该框架在保持功能正确性的同时，对未见的漏洞类型也表现出良好的泛化能力。我们的代码公开于https://github.com/unknownhl/DeepGuard。

摘要 (Abstract)

Large Language Models (LLMs) for code generation can replicate insecure patterns from their training data. To mitigate this, a common strategy for security hardening is to fine-tune models using supervision derived from the final transformer layer. However, this design may suffer from a final-layer bottleneck: vulnerability-discriminative cues can be distributed across layers and become less detectable near the output representations optimized for next-token prediction. To diagnose this issue, we perform layer-wise linear probing. We observe that vulnerability-related signals are most detectable in a band of intermediate-to-upper layers yet attenuate toward the final layers. Motivated by this observation, we introduce DeepGuard, a framework that leverages distributed security-relevant cues by aggregating representations from multiple upper layers via an attention-based module. The aggregated signal powers a dedicated security analyzer within a multi-objective training objective that balances security enhancement and functional correctness, and further supports a lightweight inference-time steering strategy. Extensive experiments across five code LLMs demonstrate that DeepGuard improves the secure-and-correct generation rate by an average of 11.9% over strong baselines such as SVEN. It also preserves functional correctness while exhibiting generalization to held-out vulnerability types. Our code is public at https://github.com/unknownhl/DeepGuard.

关键词: Large Language Models, Code Generation, Security Hardening, Fine-tuning, Multi-layer Semantic Aggregation, Vulnerability Detection, Secure Code Generation, DeepGuard

70. ❌ CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion

作者: Akshit Jindal, Saket Anand, Chetan Arora, Vikram Goyal 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09101v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是视觉语言模型（CLIP）的后门检测方法，属于计算机安全领域，而非大语言模型或深度学习技术原理的创新。论文专注于CLIP模型的提示调优安全漏洞，与所有评分关键词（主要针对大语言模型技术、推理、对齐、优化等）均无直接关联。虽然CLIP是视觉语言模型，但论文内容不涉及评分关键词中的任何技术方向。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为CLIP-Inspector的后门检测方法，用于检测提示调优的CLIP模型中是否存在恶意植入的后门，并通过实验证明该方法能有效识别后门并降低其影响。

摘要翻译

数据和计算资源有限的组织日益将模型训练外包给机器学习即服务（MLaaS）提供商，这些提供商通常通过提示调优而非从头训练的方式，将视觉-语言模型（如CLIP）适配至下游任务。这种半可信场景带来了安全风险：恶意提供商可能遵循提示调优流程却植入后门，迫使被触发的输入（即使是分布外数据）被分类至攻击者指定的类别。此类后门不修改编码器，使得专注于编码器检测的现有方法无法识别。其他在训练前或推理时清洗数据的数据级方法，亦无法回答“所交付的模型是否含有后门？”这一关键问题。为解决这一模型级验证难题，我们提出了CLIP检测器（CI），一种专为提示调优CLIP模型设计的后门检测方法。该方法假设可白盒访问交付的模型并拥有一个未标注的分布外图像池，通过为每个类别重构可能的触发模式，以判定模型是否呈现后门行为。此外，我们证明利用CI重构的触发模式，在正确标注的触发输入上进行微调，能够重新校准模型并降低后门有效性。通过在十个数据集和四种后门攻击上的大量实验，我们证明CI仅需使用1,000张分布外图像并在单个训练周期内即可重构有效触发模式，实现94%的检测准确率（50个模型中检测出47个）。相较于改进的触发反转基线方法，CI获得了显著更高的AUROC分数（0.973对比0.495/0.687），从而为提示调优CLIP模型提供了后门筛查与事后修复能力，确保安全部署。

摘要 (Abstract)

Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk where a malicious provider can follow the prompt-tuning protocol yet implant a backdoor, forcing triggered inputs to be classified into an attacker-chosen class, even for out-of-distribution (OOD) data. Such backdoors leave encoders untouched, making them undetectable to existing methods that focus on encoder corruption. Other data-level methods that sanitize data before training or during inference, also fail to answer the critical question, “Is the delivered model backdoored or not?” To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine if the model exhibits backdoor behaviour or not. Additionally, we demonstrate that using CI’s reconstructed trigger for fine-tuning on correctly labeled triggered inputs enables us to re-align the model and reduce backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving a 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC score (0.973 vs 0.495/0.687), thus enabling the vetting and post-hoc repair of prompt-tuned CLIP models to ensure safe deployment.

关键词: CLIP, backdoor detection, prompt tuning, vision-language models, model security, trigger inversion, out-of-distribution, model verification

71. ❌ Beyond Isolated Clients: Integrating Graph-Based Embeddings into Event Sequence Models

作者: Harry Proshian, Nikita Severin, Sergey Nikolenko, Kireev Ivan, Andrey Savchenko, Ivan Sergeev, Maria Postnova, Ilya Makarov 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09085v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是用户-物品交互事件序列的建模，采用自监督学习和图嵌入技术，属于传统的机器学习/深度学习应用，而非大语言模型（LLM）或大模型技术。论文未涉及任何评分关键词中的大模型技术原理、训练方法、推理优化、对齐技术、代理系统或科学AI应用。所有关键词均与大模型相关，而本文专注于事件序列和图结构，因此所有关键词相关度为0。

!!! tip deepseek-chat TL;DR

该论文针对大规模数字平台中用户-物品交互事件序列的预测问题，提出了三种模型无关的策略，将图结构信息整合到对比自监督学习中，实验表明该方法能提升预测准确性（AUC最高提升2.3%），并发现图密度是选择最佳整合策略的关键因素。

摘要翻译

大规模数字平台产生数十亿带有时间戳的用户-物品交互事件，这些数据对于预测用户属性（如欺诈预防和推荐）至关重要。虽然自监督学习能有效建模事件的时间顺序，但通常忽略了用户-物品交互图的全局结构。为弥补这一差距，我们提出了三种与模型无关的策略，将此类结构信息融入对比式自监督学习：通过丰富事件嵌入、将用户表征与图嵌入对齐，以及添加结构型预训练任务。在四个金融和电子商务数据集上的实验表明，我们的方法持续提升了预测准确性（AUC最高提升2.3%），并揭示出图密度是选择最优集成策略的关键因素。

摘要 (Abstract)

Large-scale digital platforms generate billions of timestamped user-item interactions (events) that are crucial for predicting user attributes in, e.g., fraud prevention and recommendations. While self-supervised learning (SSL) effectively models the temporal order of events, it typically overlooks the global structure of the user-item interaction graph. To bridge this gap, we propose three model-agnostic strategies for integrating this structural information into contrastive SSL: enriching event embeddings, aligning client representations with graph embeddings, and adding a structural pretext task. Experiments on four financial and e-commerce datasets demonstrate that our approach consistently improves the accuracy (up to a 2.3% AUC) and reveals that graph density is a key factor in selecting the optimal integration strategy.

关键词: event sequence models, graph-based embeddings, self-supervised learning, contrastive learning, user-item interactions, temporal order, structural information, AUC improvement

72. ❌ Overhang Tower: Resource-Rational Adaptation in Sequential Physical Planning

作者: Ruihong Shen, Shiqian Li, Yixin Zhu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09072v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究人类在物理规划任务中的认知机制（直觉物理引擎与启发式策略的转换），属于认知科学和心理学范畴，完全不涉及大模型、深度学习技术原理或AI在科学领域的应用。所有关键词均与大模型技术、训练方法、推理优化、AI应用等相关，与论文内容无任何关联。

!!! tip deepseek-chat TL;DR

该论文研究了人类在资源受限条件下进行序列物理规划时的认知机制，发现随着任务复杂度增加，人们会从基于直觉物理引擎的模拟转向基于CNN视觉启发式的策略，同时规划深度变浅，揭示了层次化的资源理性架构。

摘要翻译

人类能够毫不费力地在物理世界中穿行，这得益于对物体在重力和接触力作用下行为的预测，然而这种判断如何在资源受限的情况下支持序列化的物理规划，目前仍知之甚少。关于直觉物理学的研究争论在于，预测是依赖于“直觉物理引擎”（Intuitive Physics Engine, IPE），还是依赖于快速的、基于线索的启发式方法；与此同时，决策研究则争论着是审慎的前瞻策略还是短视策略。这些争论长期孤立进行，导致序列化物理规划的认知架构未能明确。物理预测机制与规划策略如何在有限的认知资源下共同适应，仍是一个悬而未决的问题。本文表明，人类在资源压力下表现出双重转变，即同时调整物理预测机制和规划策略以匹配认知预算。通过使用“悬挑塔”建造任务（要求参与者在保持稳定的前提下最大化水平悬挑长度），我们发现：在任务早期阶段，基于IPE的模拟占主导地位，而随着复杂度增加，基于卷积神经网络（CNN）的视觉启发式方法则占据上风；同时，时间压力会截断审慎的前瞻过程，使规划转向更浅的视野深度。这种双重转变是先前单一机制理论所未曾预测的。这些发现揭示了一种分层的、资源理性的认知架构，它能够灵活地在计算成本与预测保真度之间进行权衡。我们的研究结果将两个长期存在的争论（模拟与启发式、短视与审慎规划）统一为一个由认知预算动态重构的策略库。

摘要 (Abstract)

Humans effortlessly navigate the physical world by predicting how objects behave under gravity and contact forces, yet how such judgments support sequential physical planning under resource constraints remains poorly understood. Research on intuitive physics debates whether prediction relies on the Intuitive Physics Engine (IPE) or fast, cue-based heuristics; separately, decision-making research debates deliberative lookahead versus myopic strategies. These debates have proceeded in isolation, leaving the cognitive architecture of sequential physical planning underspecified. How physical prediction mechanisms and planning strategies jointly adapt under limited cognitive resources remains an open question. Here we show that humans exhibit a dual transition under resource pressure, simultaneously shifting both physical prediction mechanism and planning strategy to match cognitive budget. Using Overhang Tower, a construction task requiring participants to maximize horizontal overhang while maintaining stability, we find that IPE-based simulation dominates early stages while CNN-based visual heuristics prevail as complexity grows; concurrently, time pressure truncates deliberative lookahead, shifting planning toward shallower horizons: a dual transition unpredicted by prior single-mechanism accounts. These findings reveal a hierarchical, resource-rational architecture that flexibly trades computational cost against predictive fidelity. Our results unify two long-standing debates (simulation vs. heuristics and myopic vs. deliberative planning) as a dynamic repertoire reconfigured by cognitive budget.

关键词: sequential physical planning, intuitive physics engine, resource-rational adaptation, cognitive architecture, simulation vs. heuristics, deliberative planning, Overhang Tower, cognitive budget

73. ❌ NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System

作者: Parjanya Aditya Shukla, Shubham Kumar Nigam, Debtanu Datta, Balaramamahanthi Deepak Patnaik, Noel Shallum, Pradeep Reddy Vanga, Saptarshi Ghosh, Arnab Bhattacharya 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09069v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文NyayaMind专注于法律领域的AI应用，核心是使用LLM进行法律推理和判决预测。高度相关（10分）的关键词包括：LLMs（核心模型）、SFT（微调法律领域LLM）、RAG（检索模块核心）、CoT Reasoning和System 2 Thinking（结构化推理过程）、Explainable AI（透明解释生成）。中等相关（5分）包括：Domain Adaptation（法律领域适应）和Factuality（证据对齐）。其他关键词如MoE、SLMs、RLHF等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该研究提出了NyayaMind框架，通过结合RAG检索和微调的法律大模型，实现了对印度法律案件的透明推理和判决预测，显著提升了解释质量和证据对齐。

摘要翻译

法庭判决预测与解释（Court Judgment Prediction and Explanation, CJPE）旨在根据案件事实、法律争议点、辩论意见、援引法条及相关判例，预测司法判决并提供法律依据充分的解释。为使此类系统在司法或法律研究场景中具有实际应用价值，其不仅需实现高预测性能，还应生成透明且结构化的法律推理，并与既定的司法实践保持一致。本研究提出NyayaMind——一个专为印度司法体系设计的开源框架，旨在实现透明且可扩展的法律推理。该框架整合了检索、推理与验证机制，以模拟法庭通常遵循的结构化决策流程。具体而言，NyayaMind包含两个核心组件：检索模块与预测模块。检索模块采用检索增强生成（RAG）流程，从大规模法律文本库中识别法律相关的法条与判例；预测模块则利用针对印度法律领域微调的、面向推理的大语言模型（LLMs），生成包含争议焦点、辩论意见、裁判理由及最终判决的结构化输出。我们通过大量实验结果与专家评估表明，相较于现有CJPE方法，NyayaMind显著提升了解释质量与证据对齐度，为构建可信赖的AI辅助法律决策支持系统迈出了重要一步。

摘要 (Abstract)

Court Judgment Prediction and Explanation (CJPE) aims to predict a judicial decision and provide a legally grounded explanation for a given case based on the facts, legal issues, arguments, cited statutes, and relevant precedents. For such systems to be practically useful in judicial or legal research settings, they must not only achieve high predictive performance but also generate transparent and structured legal reasoning that aligns with established judicial practices. In this work, we present NyayaMind, an open-source framework designed to enable transparent and scalable legal reasoning for the Indian judiciary. The proposed framework integrates retrieval, reasoning, and verification mechanisms to emulate the structured decision-making process typically followed in courts. Specifically, NyayaMind consists of two main components: a Retrieval Module and a Prediction Module. The Retrieval Module employs a RAG pipeline to identify legally relevant statutes and precedent cases from large-scale legal corpora, while the Prediction Module utilizes reasoning-oriented LLMs fine-tuned for the Indian legal domain to generate structured outputs including issues, arguments, rationale, and the final decision. Our extensive results and expert evaluation demonstrate that NyayaMind significantly improves the quality of explanation and evidence alignment compared to existing CJPE approaches, providing a promising step toward trustworthy AI-assisted legal decision support systems.

关键词: Legal Reasoning, Judgment Prediction, Retrieval-Augmented Generation, Large Language Models, Explainable AI, Indian Legal System, Court Judgment Prediction, Structured Reasoning

74. ❌ Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

作者: Yuxi Zhou, Zhengbo Zhang, Jingyu Pan, Zhiyu Lin, Zhigang Tu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09063v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究基于扩散模型的零样本骨架动作识别，属于计算机视觉领域，而非大语言模型或深度学习技术原理的创新。论文未涉及任何大语言模型相关技术（如LLMs、MoE、RLHF、RAG等），也未讨论深度学习基础技术（如Scaling Laws、PEFT、Quantization等）。唯一的相关性在于’AI for Science’关键词，因为动作识别可视为AI在科学应用（计算机视觉）中的一部分，但论文未明确提及生物信息学或化学信息学，因此给予5分（有一定关联）。其他26个关键词均与论文内容完全无关。

!!! tip deepseek-chat TL;DR

该论文针对零样本骨架动作识别中扩散模型存在的光谱偏差问题，提出了频率增强的扩散模型FDSM，通过语义引导的光谱残差模块、时间步自适应光谱损失和基于课程学习的语义抽象，有效恢复了细粒度运动细节，在多个数据集上取得了最先进的性能。

摘要翻译

人体动作识别是计算机视觉领域的核心课题，其应用涵盖从安防监控到人机交互的广泛场景。尽管基于骨架的监督学习方法已取得显著成效，但其对详尽标注数据的依赖限制了模型对新动作的泛化能力。零样本骨架动作识别作为一种新兴范式展现出潜力，然而由于扩散模型存在频谱偏差——即过度平滑高频动态信息，该范式仍面临挑战。为此，我们提出频率感知的骨架-文本匹配扩散模型，通过集成语义引导的频谱残差模块、时间步自适应的频谱损失函数以及基于课程学习的语义抽象机制来解决上述问题。该方法能有效恢复细粒度运动细节，在NTU RGB+D、PKU-MMD和Kinetics-skeleton数据集上实现了最先进的性能。代码已发布于https://github.com/yuzhi535/FDSM，项目主页详见https://yuzhi535.github.io/FDSM.github.io/。

摘要 (Abstract)

Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/

关键词: Zero-Shot Skeleton Action Recognition, Diffusion Models, Frequency-Enhanced, Spectral Bias, Semantic-Guided Spectral Residual, Timestep-Adaptive Spectral Loss, Curriculum-based Semantic Abstraction, Skeleton-Text Matching

75. ❌ PDE-regularized Dynamics-informed Diffusion with Uncertainty-aware Filtering for Long-Horizon Dynamics

作者: Min Young Baeg, Yoon-Yeong Kim 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09058v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于物理信息扩散模型（PDYffusion）用于长期时空预测，核心是PDE正则化和不确定性感知滤波，属于AI for Science（科学AI）在物理动力学建模中的应用，与关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（5分），但未涉及大模型、深度学习技术原理创新或其他关键词（如LLMs、MoE、对齐、推理等），因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

该论文针对长期时空预测中的累积误差和物理不一致性问题，提出了一个结合PDE正则化和不确定性感知滤波的扩散模型框架（PDYffusion），在多个动力学数据集上实现了更优的预测性能和稳定的不确定性行为。

摘要翻译

长时域时空预测因累积误差、噪声放大及现有模型缺乏物理一致性而仍具挑战性。尽管扩散模型为不确定性建模提供了概率框架，传统方法常依赖均方误差目标，未能捕捉由物理定律支配的内在动力学。本研究提出PDYffusion——一种融合偏微分方程（PDE）正则化与不确定性感知预测的动力学启发扩散框架，以实现稳定的长期预测。该方法包含两个核心组件：PDE正则化插值器与基于无迹卡尔曼滤波（UKF）的预测器。插值器通过引入微分算子强制生成物理一致的中间状态，而预测器利用无迹卡尔曼滤波显式建模不确定性，以缓解迭代预测中的误差累积。理论分析表明，所提插值器满足PDE约束的光滑特性，且预测器在所构建的损失函数下具有收敛性。在多个动力学数据集上的实验证明，PDYffusion在连续分级概率评分（CRPS）与均方误差（MSE）方面均取得优越性能，同时通过符号稳定率（SSR）衡量的不确定性行为保持稳定。我们进一步分析了预测精度与不确定性之间的内在权衡，表明本方法为长时域预测提供了均衡且鲁棒的解决方案。

摘要 (Abstract)

Long-horizon spatiotemporal prediction remains a challenging problem due to cumulative errors, noise amplification, and the lack of physical consistency in existing models. While diffusion models provide a probabilistic framework for modeling uncertainty, conventional approaches often rely on mean squared error objectives and fail to capture the underlying dynamics governed by physical laws. In this work, we propose PDYffusion, a dynamics-informed diffusion framework that integrates PDE-based regularization and uncertainty-aware forecasting for stable long-term prediction. The proposed method consists of two key components: a PDE-regularized interpolator and a UKF-based forecaster. The interpolator incorporates a differential operator to enforce physically consistent intermediate states, while the forecaster leverages the Unscented Kalman Filter to explicitly model uncertainty and mitigate error accumulation during iterative prediction. We provide theoretical analyses showing that the proposed interpolator satisfies PDE-constrained smoothness properties, and that the forecaster converges under the proposed loss formulation. Extensive experiments on multiple dynamical datasets demonstrate that PDYffusion achieves superior performance in terms of CRPS and MSE, while maintaining stable uncertainty behavior measured by SSR. We further analyze the inherent trade-off between prediction accuracy and uncertainty, showing that our method provides a balanced and robust solution for long-horizon forecasting.

关键词: diffusion models, PDE regularization, uncertainty-aware forecasting, long-horizon prediction, spatiotemporal dynamics, Unscented Kalman Filter, physical consistency, error accumulation

76. ❌ Learning Vision-Language-Action World Models for Autonomous Driving

作者: Guoqing Wang, Pin Tang, Xiangxuan Ren, Guodongfang Zhao, Bailan Feng, Chao Ma 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09059v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文提出VLA-World模型，专注于自动驾驶领域的世界模型构建，结合视觉-语言-动作模态进行预测和推理。核心相关关键词包括：‘World Models AND General World Models’（核心内容，10分）、‘LLM Agents OR Autonomous Agents OR Agentic Workflow’（自动驾驶代理，10分）、‘Post-training OR Supervised Fine-tuning OR SFT’（训练策略包含SFT，10分）。中等相关关键词：‘Pre-training OR Continual Pre-training OR Domain Adaptation’（提及预训练，5分）、‘RLHF OR RLAIF OR Direct Preference Optimization OR DPO’（使用强化学习，5分）、‘Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’（涉及推理过程，5分）、‘System 2 Thinking OR Slow Thinking OR In-depth Reasoning’（反思推理，5分）、‘Self-Correction OR Self-Improvement OR Self-Reflection’（自我修正轨迹，5分）、‘Mechanistic Interpretability OR Explainable AI’（强调可解释性，5分）。其余关键词如大语言模型、MoE、量化等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对自动驾驶中视觉-语言-动作模型缺乏时序动态和全局一致性问题，提出了VLA-World世界模型，通过结合预测想象与反思推理来提升驾驶预见性，实验表明其在规划和未来生成基准上超越了现有方法。

摘要翻译

视觉-语言-动作（Vision-Language-Action，VLA）模型近期通过将感知、推理与控制整合于统一的多模态框架，在端到端自动驾驶领域取得了显著进展。然而，这些模型往往缺乏对时序动态与全局世界一致性的显式建模，从而限制了其前瞻能力与安全性。相比之下，世界模型能够模拟合理的未来场景，但通常难以对自身生成的想象未来进行推理或评估。本研究提出VLA-World，一个简洁而高效的VLA世界模型，它将预测性想象与反思性推理相统一，以提升驾驶预见力。VLA-World首先利用动作导出的可行轨迹来引导生成下一帧图像，捕捉描述周围环境演变的丰富时空线索；随后模型对这一自生成的未来想象帧进行推理，以优化预测轨迹，从而实现更高性能与更好的可解释性。为支持该流程，我们构建了nuScenes-GR-20K——一个基于nuScenes衍生的生成式推理数据集，并采用包含预训练、监督微调与强化学习的三阶段训练策略。大量实验表明，VLA-World在规划与未来生成基准测试中均持续超越最先进的VLA及世界模型基线。项目页面：https://vlaworld.github.io

摘要 (Abstract)

Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: https://vlaworld.github.io

关键词: Vision-Language-Action Models, World Models, Autonomous Driving, Predictive Imagination, Reflective Reasoning, Trajectory Prediction, Multimodal Framework, Reinforcement Learning

77. ❌ Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

作者: Mauricio Fadel Argerich, Jonathan Fürst, Marta Patiño-Martínez 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09048v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM推理的能源效率问题，与’Large Language Models’高度相关（10分），因为全文围绕LLM推理展开。与’Speculative Decoding OR Inference Acceleration’有一定关联（5分），因为能源效率优化涉及推理加速技术。其他关键词如MoE、SFT、RAG等均未在论文中涉及，故得0分。

!!! tip deepseek-chat TL;DR

该论文针对大型语言模型推理能耗高的问题，创建了最大的开源能耗数据集Watt Counts，并通过系统级研究发现硬件选择对能效至关重要，实践者可在服务器场景中降低高达70%的能耗。

摘要翻译

尽管大型语言模型（LLM）的高能耗已得到学界关注，但由于缺乏能耗感知的基准测试与数据，系统运维人员在利用异构硬件的能耗权衡进行高效能LLM推理部署时仍缺乏指导。本研究通过构建Watt Counts填补了这一空白：这是目前最大的开源LLM能耗数据集，包含超过5,000组实验，覆盖50种LLM在10款英伟达图形处理器（GPU）上的批处理与服务器场景能耗数据，并配套提供可复现的开源基准测试工具，支持社区提交以扩展该数据集。基于此数据集，我们对跨异构GPU架构的LLM推理进行了系统级研究，结果表明GPU选择对能效结果至关重要，且最优硬件选择在不同模型与部署场景间差异显著，这证明了在异构LLM系统中实施硬件感知部署的极端重要性。依据我们的数据与洞察，我们证明实践者可在服务器场景中降低高达70%的能耗且对用户体验影响可忽略，在批处理场景中则可降低高达20%的能耗。

摘要 (Abstract)

While the large energy consumption of Large Language Models (LLMs) is recognized by the community, system operators lack guidance for energy-efficient LLM inference deployments that leverage energy trade-offs of heterogeneous hardware due to a lack of energy-aware benchmarks and data. In this work we address this gap with Watt Counts: the largest open-access dataset of energy consumption of LLMs, with over 5,000 experiments for 50 LLMs across 10 NVIDIA Graphics Processing Units (GPUs) in batch and server scenarios along with a reproducible, open-source benchmark that enables community submissions to expand this dataset. Leveraging this dataset, we conduct a system-level study of LLM inference across heterogeneous GPU architectures and show that GPU selection is crucial for energy efficiency outcomes and that optimal hardware choices vary significantly across models and deployment scenarios, demonstrating the critical importance of hardware-aware deployment in heterogeneous LLM systems. Guided by our data and insights, we show that practitioners can reduce energy consumption by up to 70% in server scenarios with negligible impact on user experience, and by up to 20% in batch scenarios.

关键词: Large Language Models, LLM inference, energy consumption, GPU architectures, energy-efficient deployment, heterogeneous hardware, benchmark, sustainable AI

78. ❌ U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

作者: Salva Rühling Cachay, Duncan Watson-Parris, Rose Yu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09041v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文U-Cast专注于AI天气预报，使用U-Net架构和Monte Carlo Dropout进行概率预测，属于AI在科学领域的应用。所有关键词均与大型语言模型、训练技术、推理优化、代理系统等直接相关，而论文未涉及这些主题，因此除’AI for Science OR Bioinformatics OR Cheminformatics’外，其他关键词评分为0。‘AI for Science’评分为5，因为天气预报是AI在科学领域的应用，但论文未深入生物信息学或化学信息学。

!!! tip deepseek-chat TL;DR

U-Cast通过简单的U-Net架构和高效训练方法，实现了与最先进模型相当的天气预报性能，同时大幅降低了计算成本。

摘要翻译

基于人工智能的天气预报现已媲美传统的基于物理学的集合预报系统，但最先进的模型依赖于专用架构和巨大的计算资源，形成了较高的应用门槛。我们证明，实现前沿性能并不需要如此复杂的架构。我们提出了U-Cast，这是一个基于标准U-Net主干网络构建的概率性天气预报模型，其训练采用一种简单方案：首先在平均绝对误差（Mean Absolute Error）上进行确定性预训练，随后利用蒙特卡洛丢弃法（Monte Carlo Dropout）引入随机性，在连续分级概率评分（Continuous Ranked Probability Score, CRPS）上进行短期的概率性微调。结果表明，在1.5$^\circ$分辨率下，我们的模型在概率性预报技巧上达到或超越了GenCast和IFS ENS，同时与领先的基于CRPS的模型相比，训练计算量减少了超过10$\times$；与基于扩散的模型相比，推理延迟降低了超过10$\times$。U-Cast在不到12个H200 GPU日的训练时间内即可完成训练，并在11秒内生成一个60步的集合预报。这些结果表明，可扩展的通用架构与高效的训练方案相结合，能够以极低的成本匹配复杂的领域专用设计，从而为更广泛的研究群体开启了训练前沿概率性天气模型的大门。我们的代码发布于：https://github.com/Rose-STL-Lab/u-cast。

摘要 (Abstract)

AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce U-Cast, a probabilistic forecaster built on a standard U-Net backbone trained with a simple recipe: deterministic pre-training on Mean Absolute Error followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at 1.5$^\circ$ resolution while reducing training compute by over 10$\times$ compared to leading CRPS-based models and inference latency by over 10$\times$ compared to diffusion-based models. U-Cast trains in under 12 H200 GPU-days and generates a 60-step ensemble forecast in 11 seconds. These results suggest that scalable, general-purpose architectures paired with efficient training curricula can match complex domain-specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community. Our code is available at: https://github.com/Rose-STL-Lab/u-cast.

关键词: AI weather forecasting, probabilistic forecaster, U-Net, Monte Carlo Dropout, CRPS, computational efficiency, ensemble forecast, training curriculum

79. ❌ Advantage-Guided Diffusion for Model-Based Reinforcement Learning

作者: Daniele Foffano, Arvid Eriksson, David Broman, Karl H. Johansson, Alexandre Proutiere 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09035v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于基于模型的强化学习（MBRL）中的扩散世界模型，提出了一种优势引导扩散方法（AGD-MBRL）。论文的核心是改进扩散世界模型在强化学习中的应用，通过优势估计引导轨迹生成以提高长期回报。与评分关键词列表相比，论文仅与’World Models AND General World Models’高度相关（10分），因为论文明确研究扩散世界模型在MBRL中的应用。其他关键词主要涉及大语言模型（LLMs）的技术、训练方法、推理、对齐、压缩、代理等，而本文研究的是强化学习中的扩散模型，不涉及语言模型或相关技术，因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

该论文针对基于模型的强化学习中扩散世界模型存在的短视问题，提出了优势引导扩散方法（AGD-MBRL），通过优势估计引导轨迹生成，在MuJoCo控制任务上显著提高了样本效率和最终回报。

摘要翻译

基于自回归世界模型的强化学习（MBRL）存在误差累积问题，而扩散世界模型通过联合生成轨迹片段缓解了该缺陷。然而，现有的扩散引导方法要么仅依赖策略而丢弃价值信息，要么基于奖励设计——当扩散视野较短时，后者会变得短视。我们提出了优势引导扩散模型（AGD-MBRL），该方法利用智能体的优势估计值引导逆向扩散过程，使采样集中于预期能在生成窗口之外获得更高长期回报的轨迹。我们开发了两种引导机制：（i）S型优势引导（SAG）与（ii）指数优势引导（EAG）。我们证明，通过SAG或EAG引导的扩散模型能够实现对轨迹的加权采样，其权重随状态-动作优势值递增——这意味着在标准假设下可实现策略改进。此外，相较于无引导扩散模型，AGD-MBRL生成的轨迹遵循改进后的策略（即具有更高价值）。AGD可无缝集成至PolyGRAD架构：在保持动作生成受策略条件约束的同时引导状态分量，且无需改变扩散训练目标。在MuJoCo控制任务（HalfCheetah、Hopper、Walker2D和Reacher）中，AGD-MBRL在样本效率和最终回报上均优于PolyGRAD、在线Diffuser式奖励引导方法以及无模型基线（PPO/TRPO），部分任务中提升幅度达2倍。这些结果表明，优势感知引导是解决扩散模型MBRL中短视问题的简洁有效方案。

摘要 (Abstract)

Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent’s advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage-implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.

关键词: Model-based Reinforcement Learning, Diffusion World Models, Advantage-Guided Diffusion, Trajectory Generation, Sample Efficiency, MuJoCo Control Tasks, Policy Improvement

80. ❌ CONDESION-BENCH: Conditional Decision-Making of Large Language Models in Compositional Action Space

作者: Yeonjun Hwang, Sungyong Park, Minju Kim, Dongha Lee, Jinyoung Yeo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09029v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型在决策支持中的应用，与’Large Language Models’高度相关（10分）。论文涉及决策制定和推理能力，与’Chain of Thought’和’System 2 Thinking’有一定关联（各5分），因为决策制定需要多步推理和深入思考。论文提到LLMs作为决策支持工具，与’LLM Agents’有一定关联（5分），因为决策支持可视为代理行为。其他关键词如MoE、SFT、RAG、量化等未在摘要中提及，评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对现有决策制定基准的局限性，提出了CONDESION-BENCH基准，用于评估大语言模型在组合动作空间中的条件决策制定能力，并通过基于oracle的评估方法更严格地评估LLMs作为决策支持工具的性能。

摘要翻译

大型语言模型因其情境理解与推理能力，已被广泛探索作为高风险领域的决策支持工具。然而，现有的决策评估基准依赖于两个简化假设：行动从一组有限的预定义候选中选择，且决策过程中未纳入限制行动可行性的显式条件。这些假设未能捕捉现实世界行动的组合结构以及约束其有效性的显式条件。为应对这些局限，我们提出了CONDESION-BENCH，这是一个为评估组合行动空间中的条件决策而设计的基准。在CONDESION-BENCH中，行动被定义为对决策变量的分配，并受到变量层面、情境层面和分配层面的显式条件限制。通过采用基于预设标准的决策质量与条件遵循度评估，我们为大型语言模型作为决策支持工具提供了更严谨的评估框架。

摘要 (Abstract)

Large language models have been widely explored as decision-support tools in high-stakes domains due to their contextual understanding and reasoning capabilities. However, existing decision-making benchmarks rely on two simplifying assumptions: actions are selected from a finite set of pre-defined candidates, and explicit conditions restricting action feasibility are not incorporated into the decision-making process. These assumptions fail to capture the compositional structure of real-world actions and the explicit conditions that constrain their validity. To address these limitations, we introduce CONDESION-BENCH, a benchmark designed to evaluate conditional decision-making in compositional action space. In CONDESION-BENCH, actions are defined as allocations to decision variables and are restricted by explicit conditions at the variable, contextual, and allocation levels. By employing oracle-based evaluation of both decision quality and condition adherence, we provide a more rigorous assessment of LLMs as decision-support tools.

关键词: Large Language Models, decision-making, conditional decision-making, compositional action space, benchmark, decision-support tools, reasoning capabilities, CONDESION-BENCH

81. ❌ Skill-Conditioned Visual Geolocation for Vision-Language

作者: Chenjie Yang, Yutian Jiang, Chenyu Wu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09025v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出GeoSkill框架，基于视觉语言模型(VLMs)进行图像地理定位，核心创新在于：1) 使用Skill-Graph实现结构化推理，与’Chain of Thought’和’System 2 Thinking’相关；2) 通过Autonomous Evolution机制实现自我进化，与’Self-Correction’和’LLM Agents’高度相关；3) 解决幻觉问题，与’Hallucination Mitigation’相关；4) 整体框架涉及推理过程解释，与’Explainable AI’有一定关联；5) 基于VLMs，与’Large Language Models’有一定关联。其他关键词如MoE、量化、RAG等未涉及。

!!! tip deepseek-chat TL;DR

该论文针对视觉语言模型在图像地理定位中缺乏结构化推理和自主进化能力的问题，提出了基于演化技能图的训练免费框架GeoSkill，通过自主进化机制合成和修剪技能，显著提升了地理定位准确性和推理忠实度。

摘要翻译

视觉语言模型（VLMs）在图像地理位置预测方面展现出潜力，但仍缺乏结构化地理推理能力及自主进化的机制。现有方法主要依赖隐式参数化记忆，常受限于过时知识并产生幻觉推理。此外，当前推理过程多为“一次性”机制，缺乏基于推理结果的反馈循环以实现自我进化。为解决这些问题，我们提出GeoSkill——一个基于动态演化的技能图谱（Skill-Graph）的无训练框架。我们首先通过将专家轨迹提炼为原子化的自然语言技能来初始化图谱。在执行阶段，GeoSkill采用推理模型在当前技能图谱引导下进行直接推理。为实现持续进化，自主演化机制利用更大规模模型对来自网络规模数据的图像-坐标对进行多轮推理推演，并结合已验证的真实世界推理结果。通过分析这些推演中的成功与失败轨迹，该机制迭代地合成与剪枝技能，无需参数更新即可有效扩展技能图谱并修正地理认知偏差。实验表明，GeoSkill在GeoRC数据集上同时实现了优异的地理定位精度与推理可靠性，并在多样化的外部数据集中保持卓越的泛化能力。此外，自主演化机制催生了新颖且可验证的技能，显著增强了系统对真实世界地理知识的认知能力，超越了孤立案例研究的局限。

摘要 (Abstract)

Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a “one-off” process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system’s cognition of real-world geographic knowledge beyond isolated case studies.

关键词: Vision-Language Models, Image Geolocation, Skill-Graph, Autonomous Evolution, Geographic Reasoning, Hallucination Mitigation, Training-free Framework, Self-improvement

作者: Zedian Shao, Hongbin Liu, Yuepeng Hu, Neil Zhenqiang Gong 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09024v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于多模态大语言模型（MLLMs）的隐私保护问题，通过视觉提示注入攻击使模型拒绝分析图像，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。论文未涉及其他关键词的具体技术（如MoE、量化、推理加速等）或应用领域（如生物信息学），因此这些关键词得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为ImageProtector的用户端方法，通过嵌入几乎不可察觉的扰动来保护图像隐私，使多模态大语言模型在分析受保护图像时一致生成拒绝响应，并在六个MLLMs和四个数据集上验证了其有效性。

摘要翻译

多模态大语言模型（MLLMs）已成为分析互联网规模图像数据的强大工具，在带来显著效益的同时也引发了关键的安全与社会担忧。特别是，开放权重的MLLMs可能被滥用以从个人图像中大规模提取敏感信息，例如身份、位置或其他隐私细节。本研究提出ImageProtector，一种用户端方法，通过在图像中嵌入一种精心设计、近乎不可察觉的扰动来主动保护图像，该扰动可对MLLMs实施视觉提示注入攻击。其结果是，当攻击者使用MLLM分析受保护图像时，模型会被持续诱导生成拒绝响应，例如“抱歉，我无法协助该请求”。我们通过实验在六个MLLM和四个数据集上验证了ImageProtector的有效性。此外，我们评估了三种潜在对抗措施——高斯噪声、DiffPure和对抗训练，结果表明虽然这些方法能部分减轻ImageProtector的影响，但会同时降低模型的准确性和/或效率。本研究聚焦于开放权重MLLMs与大规模自动化图像分析这一具有重要实践意义的场景，并揭示了基于扰动的隐私保护方法的潜力与局限性。

摘要 (Abstract)

Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as “I’m sorry, I can’t help with that request.” We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.

关键词: Multi-modal Large Language Models, Visual Prompt Injection, Privacy Protection, Adversarial Perturbation, Image Analysis, Refusal Response, Open-weight MLLMs, Automated Image Analysis

83. ❌ Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA

作者: Andre Bacellar 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09019v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于两跳问答中的检索机制，提出了基于查询类型的路由方法RegimeRouter，与检索增强生成（RAG）有一定关联，因为涉及检索策略优化，但未涉及大模型、深度学习技术原理创新或科学领域应用，与其他关键词无关。

!!! tip deepseek-chat TL;DR

该论文研究两跳问答中基于查询类型（Q-dominant vs B-dominant）的检索路由问题，提出了理论框架和轻量级路由器RegimeRouter，在多个数据集上实现了显著的检索性能提升。

摘要翻译

两跳问答检索将查询划分为两种机制，其区分依据在于第二跳实体是否在问题中被明确提及（问题主导型）或仅出现在桥接段落中（桥接主导型）。我们通过三条定理形式化这一划分：（T1）单查询AUC是余弦分离裕度的单调函数，在八种类型-编码器组合中有六种的R² ≥ 0.90；（T2）机制可由两个表层文本谓词刻画，其中P1对路由决策起决定性作用，P2则用于限定桥接主导型情况，该规律在三种编码器和三个数据集中均成立；（T3）桥接优势依赖于包含关系的句子而非仅实体名称，移除关系句会导致性能下降8.6-14.1个百分点（p < 0.001）。基于此理论，我们提出RegimeRouter——一种轻量级二元路由模块，它使用从谓词定义直接推导的五个文本特征，在仅问题检索与问题加关系句检索之间进行选择。该模块在2WikiMultiHopQA数据集（n = 881，采用五折交叉拟合）上训练，并零样本应用于MuSiQue和HotpotQA，在R@5指标上分别实现了+5.6 pp（p < 0.001）、+5.3 pp（p = 0.002）和+1.1 pp（不显著，无退化）的性能提升，且全程基于理论驱动的构件实现。

摘要 (Abstract)

Two-hop QA retrieval splits queries into two regimes determined by whether the hop-2 entity is explicitly named in the question (Q-dominant) or only in the bridge passage (B-dominant). We formalize this split with three theorems: (T1) per-query AUC is a monotone function of the cosine separation margin, with R^2 >= 0.90 for six of eight type-encoder pairs; (T2) regime is characterized by two surface-text predicates, with P1 decisive for routing and P2 qualifying the B-dominant case, holding across three encoders and three datasets; and (T3) bridge advantage requires the relation-bearing sentence, not entity name alone, with removal causing an 8.6-14.1 pp performance drop (p < 0.001). Building on this theory, we propose RegimeRouter, a lightweight binary router that selects between question-only and question-plus-relation-sentence retrieval using five text features derived directly from the predicate definitions. Trained on 2WikiMultiHopQA (n = 881, 5-fold cross-fitted) and applied zero-shot to MuSiQue and HotpotQA, RegimeRouter achieves +5.6 pp (p < 0.001), +5.3 pp (p = 0.002), and +1.1 pp (non-significant, no-regret) R@5 improvement, respectively, with artifact-driven.

关键词: Two-hop QA, Retrieval, Regime Router, Question Answering, Transfer Learning, Zero-shot, Text Features, Performance Improvement

作者: Carlos Jimeno Miguel, Raul Orduna, Francesco Zola 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09016v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

评分理由: 论文研究的是网络安全领域的数据集创建问题，具体涉及从Telegram平台收集信息、语音转文字、命名实体识别（NER）和匿名化技术。虽然使用了基于transformer的AI模型进行NER，但论文的核心焦点是应用现有技术解决特定领域（网络安全、GDPR合规）的问题，而非大模型或深度学习技术本身的创新。所有关键词都涉及大模型技术原理、训练方法、推理优化、对齐、应用范式等前沿创新方向，与论文的实际研究内容（应用导向的网络安全工具开发）完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该研究提出了一种从Telegram平台收集多模态数据并利用语音转文字和命名实体识别技术来检测敏感信息、实现数据匿名化的系统，以支持符合GDPR等法规的网络安全研究，实验表明Parakeet在音频转录上表现最佳，而提出的NER解决方案在检测敏感信息时取得了最高的f1分数。

摘要翻译

本研究致力于应对在符合《通用数据保护条例》（GDPR）及《刑法典》第10/1995号基本法等法规要求的前提下，构建网络犯罪分析数据集的挑战。为此，我们提出了一套从Telegram平台采集文本、音频及图像信息的系统；实施了融合信号增强技术的语音转文字转录模型；并评估了包括Microsoft Presidio与基于Transformer架构设计的AI模型在内的多种命名实体识别（NER）解决方案。实验结果表明，Parakeet在音频转录中表现最佳，而所提出的NER解决方案在检测敏感信息方面取得了最高的f1分数值。此外，本研究还提出了一套匿名化评估指标，可在保障个人信息安全、支持现行法律框架内网络安全研究的同时，有效评估数据结构性连贯性的保持程度。

摘要 (Abstract)

This study addresses the challenge of creating datasets for cybercrime analysis while complying with the requirements of regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, a system is proposed for collecting information from the Telegram platform, including text, audio, and images; the implementation of speech-to-text transcription models incorporating signal enhancement techniques; and the evaluation of different Named Entity Recognition (NER) solutions, including Microsoft Presidio and AI models designed using a transformer-based architecture. Experimental results indicate that Parakeet achieves the best performance in audio transcription, while the proposed NER solutions achieve the highest f1-score values in detecting sensitive information. In addition, anonymization metrics are presented that allow evaluation of the preservation of structural coherence in the data, while simultaneously guaranteeing the protection of personal information and supporting cybersecurity research within the current legal framework.

关键词: Named Entity Recognition, Data Anonymization, Cybersecurity, GDPR Compliance, Telegram Platform, Speech-to-Text, Transformer-based Models, Sensitive Information Detection

85. ❌ Towards Linguistically-informed Representations for English as a Second or Foreign Language: Review, Construction and Application

作者: Wenxi Li, Xihao Wang, Weiwei Sun 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09008v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于英语作为第二语言或外语（ESFL）的语言学资源构建和应用，属于传统的计算语言学或语料库语言学领域。论文内容涉及语法-语义标注、语料库构建和语言习得假设验证，完全不涉及大模型、深度学习、AI技术原理或AI for Science等现代AI技术。所有评分关键词均与大模型技术、深度学习应用或AI科学应用相关，与该论文的纯语言学主题无任何关联。

!!! tip deepseek-chat TL;DR

该论文针对英语作为第二语言或外语（ESFL）缺乏专用知识密集型表示的问题，通过构建一个包含1643个标注句子的语法-语义资源库，并应用于验证语言生态位假说，为第二语言习得研究提供了新工具。

摘要翻译

英语作为第二语言或外语的广泛使用引发了一场范式转变：ESFL不再仅仅被视为对标准英语的偏离，而是被看作一个独立的语言系统。这一转变凸显了对专门、知识密集型的ESFL表征的需求。为此，本文系统梳理了现有ESFL资源，指出其局限性，并提出了一种新颖的解决方案。基于建构主义理论，本文将构式视为基本分析单位，从而能够同时为ESFL和标准英语的句法-语义接口建模。该设计通过参照英语的句法-语义映射关系，同时保留ESFL的独有特征，捕捉了广泛的ESFL现象，最终构建了一个包含1643条标注ESFL句子的高质量句法-语义资源库。为证明该语料库的实用价值，我们开展了一项测试"语言生态位假说"的试点研究，彰显了其作为第二语言习得研究重要工具的潜力。

摘要 (Abstract)

The widespread use of English as a Second or Foreign Language (ESFL) has sparked a paradigm shift: ESFL is not seen merely as a deviation from standard English but as a distinct linguistic system in its own right. This shift highlights the need for dedicated, knowledge-intensive representations of ESFL. In response, this paper surveys existing ESFL resources, identifies their limitations, and proposes a novel solution. Grounded in constructivist theories, the paper treats constructions as the fundamental units of analysis, allowing it to model the syntax–semantics interface of both ESFL and standard English. This design captures a wide range of ESFL phenomena by referring to syntactico-semantic mappings of English while preserving ESFL’s unique characteristics, resulting a gold-standard syntactico-semantic resource comprising 1643 annotated ESFL sentences. To demonstrate the sembank’s practical utility, we conduct a pilot study testing the Linguistic Niche Hypothesis, highlighting its potential as a valuable tool in Second Language Acquisition research.

关键词: English as a Second or Foreign Language, ESFL, linguistic representations, syntax-semantics interface, constructivist theories, annotated corpus, Linguistic Niche Hypothesis, Second Language Acquisition

86. ❌ Hypergraph Neural Networks Accelerate MUS Enumeration

作者: Hiroya Ijima, Koichiro Yawata 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09001v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文研究使用超图神经网络（HGNN）和强化学习加速最小不可满足子集（MUS）枚举，属于约束满足问题（CSP）的机器学习应用。所有关键词均与大模型（LLM）技术、训练方法、推理优化、对齐、代理系统等直接相关，而本文未涉及任何大模型或深度学习技术原理创新，仅使用传统神经网络（HGNN）解决特定领域问题。唯一相关关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为MUS枚举可视为科学计算或AI在科学领域的应用，但论文未明确提及生物信息学或化学信息学，故给5分（有一定关联）。其他关键词完全无关，均给0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种使用超图神经网络和强化学习来加速最小不可满足子集枚举的领域无关方法，实验表明该方法在相同可满足性检查预算下能比传统方法枚举更多MUS。

摘要翻译

枚举最小不可满足子集（Minimal Unsatisfiable Subsets，简称MUSes）是约束满足问题（Constraint Satisfaction Problems，CSPs）中的一项基础任务。其主要挑战在于搜索空间的指数级增长，当可满足性检查成本高昂时，这一问题尤为严重。近期的机器学习方法降低了布尔可满足性问题中的此类开销，但这些方法依赖于显式的变量-约束关系，限制了其应用领域。本文提出一种与领域无关的方法，利用超图神经网络（Hypergraph Neural Networks，HGNNs）来加速MUS枚举。所提出的方法以约束为顶点、以当前步骤已枚举的MUSes为超边，逐步构建超图，并采用基于强化学习训练的HGNN智能体，以最小化获取一个MUS所需进行的可满足性检查次数。实验结果表明，我们的方法在加速MUS枚举方面具有显著效果，在相同的可满足性检查预算下，与传统方法相比，本方法能够枚举出更多的MUSes。

摘要 (Abstract)

Enumerating Minimal Unsatisfiable Subsets (MUSes) is a fundamental task in constraint satisfaction problems (CSPs). Its major challenge is the exponential growth of the search space, which becomes particularly severe when satisfiability checks are expensive. Recent machine learning approaches reduce this cost for Boolean satisfiability problems but rely on explicit variable-constraint relationships, limiting their application domains. This paper proposes a domain-agnostic method to accelerate MUS enumeration using Hypergraph Neural Networks (HGNNs). The proposed method incrementally builds a hypergraph with constraints as vertices and MUSes enumerated until the current step as hyperedges, and employs an HGNN-based agent trained via reinforcement learning to minimize the number of satisfiability checks required to obtain an MUS. Experimental results demonstrate the effectiveness of our approach in accelerating MUS enumeration, showing that our method can enumerate more MUSes within the same satisfiability check budget compared to conventional methods.

关键词: Minimal Unsatisfiable Subsets, MUS enumeration, Hypergraph Neural Networks, HGNN, reinforcement learning, constraint satisfaction problems, satisfiability checks, domain-agnostic method

87. ❌ PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

作者: Zhiyu Zhou, Peilin Liu, Ruoxuan Zhang, Luyang Zhang, Cheng Zhang, Hongxia Xie, Wen-Huang Cheng 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08991v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究多模态大语言模型（MLLMs）在室内视频小物体空间理解任务上的性能评估与提升，与’Large Language Models’高度相关（10分），因为MLLMs是LLMs的扩展；与’Supervised Fine-tuning’高度相关（10分），因为论文明确使用SFT进行模型改进；其他关键词如MoE、SLMs、Scaling Laws、RLHF等均未在论文中涉及，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型在室内视频小物体空间理解上的不足，提出了首个数据集PinpointQA和基准测试，并通过监督微调显著提升了模型在该任务上的性能。

摘要翻译

室内视频中以小物体为中心的空间理解对于多模态大语言模型（MLLMs）而言仍是一个重大挑战，尽管其在物体搜索与辅助应用中具有重要实用价值。尽管现有基准测试在视频空间智能、具身推理和诊断感知方面取得了进展，但目前尚无基准能直接评估模型能否在视频中定位目标物体，并以足够精确度表达其位置以供下游使用。本研究提出了PinpointQA，这是首个针对室内视频中小物体空间理解的数据集与基准。基于ScanNet++和ScanNet200构建的PinpointQA包含1,024个场景和10,094个问答对，并组织为四个难度递增的任务：目标存在性验证（Target Presence Verification, TPV）、最近参照物识别（Nearest Reference Identification, NRI）、细粒度空间描述（Fine-Grained Spatial Description, FSD）以及结构化空间预测（Structured Spatial Prediction, SSP）。该数据集基于中间空间表示构建，问答对通过自动生成并经过质量控制优化。在代表性多模态大语言模型上的实验揭示了模型在递进任务链上存在持续的能力差距，其中结构化空间预测（SSP）任务尤为困难。使用PinpointQA进行监督微调能带来显著性能提升，尤其在较难任务上表现明显，这表明PinpointQA既可作为诊断性基准，也能成为有效的训练数据集。数据集与项目页面详见https://rainchowz.github.io/PinpointQA。

摘要 (Abstract)

Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.

关键词: multimodal large language models, small object-centric spatial understanding, indoor videos, benchmark dataset, supervised fine-tuning, PinpointQA, spatial reasoning, video understanding

88. ❌ PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

作者: Yalun Wu, Haotian Liu, Zhoujun Li, Boyang Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08987v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心研究LLM在安全约束的航空领域作为具身智能体的应用，与’Large Language Models’和’LLM Agents’高度相关（10分）。涉及指令遵循、安全约束（与’Instruction Tuning’相关，5分），需要复杂推理（与’Chain of Thought’和’System 2 Thinking’相关，5分），评估LLM的物理世界理解（与’World Models’相关，5分），属于AI在科学/工程领域的应用（与’AI for Science’相关，5分）。其他关键词如MoE、SFT、RAG等未在摘要中提及，评为0分。

!!! tip deepseek-chat TL;DR

该论文通过PilotBench基准测试评估LLM在安全关键的航空轨迹预测中的表现，发现LLM在指令遵循方面优于传统预测器但精度较低，尤其在复杂飞行阶段性能下降，从而提出结合LLM符号推理与专业预测器数值精度的混合架构。

摘要翻译

随着大语言模型（LLM）向在物理环境中运行的具身人工智能体发展，一个根本性问题随之浮现：在文本语料库上训练的模型能否在遵循安全约束的同时，可靠地对复杂物理现象进行推理？我们通过PilotBench这一基准测试来探讨此问题，该基准用于评估LLM在安全关键性的飞行轨迹与姿态预测任务上的表现。PilotBench基于708条真实世界通用航空轨迹构建，涵盖九个操作上截然不同的飞行阶段，并配有同步的34通道遥测数据。它通过对比分析LLM与传统预测模型，系统地探究语义理解与物理规律支配的预测之间的交叉点。我们提出了Pilot-Score这一复合指标，它平衡了60%的回归精度与40%的指令遵循及安全合规性。对41个模型的比较评估揭示了一种“精度-可控性二分”现象：传统预测模型取得了7.01的优异平均绝对误差（MAE），但缺乏语义推理能力；而LLM获得了可控性，其指令遵循率达到86-89%，但代价是11-14的MAE精度损失。分阶段分层分析进一步暴露了“动态复杂性鸿沟”——LLM在高工作负荷阶段（如爬升和进近）的性能急剧下降，这表明其内隐的物理模型是脆弱的。这些实证发现推动了将LLM的符号推理能力与专业预测模型的数值精度相结合的混合架构的发展。PilotBench为在安全受限领域推进具身人工智能提供了严谨的基础。

摘要 (Abstract)

As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability with 86–89% instruction-following at the cost of 11–14 MAE precision. Phase-stratified analysis further exposes a Dynamic Complexity Gap-LLM performance degrades sharply in high-workload phases such as Climb and Approach, suggesting brittle implicit physics models. These empirical discoveries motivate hybrid architectures combining LLMs’ symbolic reasoning with specialized forecasters’ numerical precision. PilotBench provides a rigorous foundation for advancing embodied AI in safety-constrained domains.

关键词: Large Language Models, Embodied AI Agents, Safety Constraints, Flight Trajectory Prediction, Benchmark Evaluation, Physics Reasoning, Hybrid Architectures, General Aviation

89. ❌ PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

作者: Jihwan Oh, Soowon Oh, Murad Aghazada, Minchan Jeong, Sungnyun Kim, Se-Young Yun 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08986v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM在角色扮演任务中，通过强化学习与可验证奖励（RLVR）进行对齐时，面临角色表达性与任务鲁棒性的权衡问题，并提出PerMix-RLVR方法进行优化。因此，与’Large Language Models’、‘Instruction Tuning/Alignment’和’RLHF/RLAIF/DPO’高度相关（10分），这些是论文的技术基础与核心方法。其他关键词如MoE、量化、RAG、推理加速等，论文未涉及，故给0分。

!!! tip deepseek-chat TL;DR

该论文研究了大型语言模型在通过强化学习与可验证奖励进行对齐时，角色表达性与任务鲁棒性之间的权衡问题，并提出PerMix-RLVR方法，在保持对有害角色变化鲁棒性的同时，提升了角色扮演的忠实度。

摘要翻译

角色提示已被广泛用于通过赋予特定角色来引导大语言模型的行为并提升其指令遵循性能。然而，识别最优角色耗时且其对输出质量的影响尚不明确。先前研究主要在推理阶段通过提示层面的策略应对此问题，这带来了额外的计算开销。本研究通过在处理训练过程中的角色敏感性来避免推理时的提示搜索，旨在训练出能适应不同角色同时保持任务性能的模型。具体而言，我们发现基于可验证奖励的强化学习能系统性地降低对角色提示的敏感性，但也揭示了基于结果优化的固有权衡：虽然RLVR在具有可验证目标的任务上提升了鲁棒性，却可能在需要时削弱角色表现力，例如在角色扮演任务中。为突破这一局限，我们提出PerMix-RLVR——一种角色混合的RLVR策略，该策略缓解了角色鲁棒性与忠实度之间的权衡，在保持对有害角色变动的强鲁棒性的同时，能够在需要时实现准确的角色采纳。具体数据表明，PerMix-RLVR在MATH500数据集上将角色稳定性评分较RLVR提升21.2%，同时在PersonaGym数据集上使角色忠实度提高11.4%。

摘要 (Abstract)

Persona prompting has been widely adopted to steer large language models (LLMs) behavior and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when needed, e.g., in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.

关键词: Large Language Models, Persona Prompting, Reinforcement Learning, Verifiable Rewards, Alignment, Persona Expressivity, Robustness, PerMix-RLVR

90. ❌ Neighbourhood Transformer: Switchable Attention for Monophily-Aware Graph Learning

作者: Yi Luo, Xu Sun, Guangchun Luo, Aiguo Chen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08980v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究的是图神经网络（GNNs）的改进，特别是针对异质图（heterophilic graphs）的Neighbourhood Transformers模型。论文的核心是图学习、注意力机制和工程优化，与提供的关键词列表（主要围绕大语言模型及其相关技术）基本无关。唯一略有相关的是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文提到应用包括化学研究（chemical research），但这并非论文核心，因此给予5分（有一定关联）。其他所有关键词均与论文内容无直接关系，评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对传统图神经网络在异质图上因同质性假设失效而性能受限的问题，提出了基于局部邻域自注意力的Neighbourhood Transformers模型，并通过可切换注意力和邻域划分策略大幅降低了计算开销，在10个真实数据集上实现了最先进的节点分类性能。

摘要翻译

图神经网络（GNNs）已在社交网络分析、化学研究和计算机视觉等工程应用中得到广泛采用。然而，其有效性受到固有的同质性假设的严重制约，该假设在异质性图（即不相似节点频繁连接的图）中并不成立。为解决图学习中的这一根本性局限，我们首先从近期发现的现实世界图的单质性属性中汲取灵感，提出了邻域变换器（Neighbourhood Transformers，简称NT）这一新颖范式。该范式在每个局部邻域内应用自注意力机制，而非如传统消息传递GNNs那样将信息聚合到中心节点。这一设计使NT天生具备单质性感知能力，并从理论上保证其表达能力不弱于传统消息传递框架。为适应实际工程部署，我们进一步开发了一种配备可切换注意力机制的邻域划分策略，该策略将NT的空间消耗降低了95%以上，时间消耗降低了最高达92.67%，显著扩展了其在更大规模图上的适用性。在10个真实世界数据集（5个异质性图和5个同质性图）上进行的大量实验表明，NT在节点分类任务上超越了所有当前最先进的方法，展现了其卓越的性能和跨领域适应性。本工作的完整实现代码已在https://github.com/cf020031308/MoNT公开，以促进可复现性和工业应用。

摘要 (Abstract)

Graph neural networks (GNNs) have been widely adopted in engineering applications such as social network analysis, chemical research and computer vision. However, their efficacy is severely compromised by the inherent homophily assumption, which fails to hold for heterophilic graphs where dissimilar nodes are frequently connected. To address this fundamental limitation in graph learning, we first draw inspiration from the recently discovered monophily property of real-world graphs, and propose Neighbourhood Transformers (NT), a novel paradigm that applies self-attention within every local neighbourhood instead of aggregating messages to the central node as in conventional message-passing GNNs. This design makes NT inherently monophily-aware and theoretically guarantees its expressiveness is no weaker than traditional message-passing frameworks. For practical engineering deployment, we further develop a neighbourhood partitioning strategy equipped with switchable attentions, which reduces the space consumption of NT by over 95% and time consumption by up to 92.67%, significantly expanding its applicability to larger graphs. Extensive experiments on 10 real-world datasets (5 heterophilic and 5 homophilic graphs) show that NT outperforms all current state-of-the-art methods on node classification tasks, demonstrating its superior performance and cross-domain adaptability. The full implementation code of this work is publicly available at https://github.com/cf020031308/MoNT to facilitate reproducibility and industrial adoption.

关键词: Graph Neural Networks, Heterophilic Graphs, Self-attention, Neighbourhood Transformers, Monophily, Node Classification, Computational Efficiency, Switchable Attention

91. ❌ Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models

作者: Avni Mittal, Shanu Kumar, Sandipan Dandapat, Monojit Choudhury 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08970v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是开发一个基于代理系统（Litmus (Re)Agent）的多语言模型性能预测框架，该系统使用DAG编排的代理架构进行假设分解、证据检索和预测合成。与’LLM Agents’高度相关（10分），因为这是核心方法；与’Retrieval-Augmented Generation’相关（8分），因为系统涉及证据检索和生成预测；与’Large Language Models’相关（8分），因为研究针对多语言模型评估；与’Chain of Thought’和’System 2 Thinking’有一定关联（各5分），因为代理系统执行结构化推理；与’Tool Use’相关（5分），因为代理可能使用检索工具。其他关键词如MoE、量化、对齐等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了多语言模型在缺乏直接基准结果时的性能预测问题，提出了一个基准测试和基于代理的系统Litmus (Re)Agent，该系统通过结构化推理在证据不完整的情况下实现了最佳性能预测。

摘要翻译

本研究探讨预测性多语言评估：即在目标语言任务缺乏直接基准测试结果时，如何估计模型性能。该问题在多语言实际部署中普遍存在，因为评估覆盖稀疏，且不同语言、任务和模型家族间已发布的证据分布不均。我们构建了一个包含1,500个问题的受控基准测试，涵盖六类任务和五种证据场景。该基准将可获取的证据与真实标注分离，从而能够评估系统从不完整的文献证据中推断缺失结果的能力。我们还提出了Litmus（再）智能体系统，这是一个基于有向无环图（DAG）编排的智能体系统，可将查询分解为假设、检索证据，并通过特征感知聚合综合生成预测。在六类系统的对比中，Litmus（再）智能体取得了最佳整体性能，在直接证据薄弱或缺失的强迁移场景中提升尤为显著。这些结果表明，结构化智能体推理是在不完整证据条件下进行多语言性能评估的一种有效途径。

摘要 (Abstract)

We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and published evidence is uneven across languages, tasks, and model families. We introduce a controlled benchmark of 1,500 questions spanning six tasks and five evidence scenarios. The benchmark separates accessible evidence from ground truth, enabling evaluation of systems that must infer missing results from incomplete literature evidence. We also present Litmus (Re)Agent, a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesises predictions through feature-aware aggregation. Across six systems, Litmus (Re)Agent achieves the best overall performance, with the largest gains in transfer-heavy scenarios where direct evidence is weak or absent. These results show that structured agentic reasoning is a promising approach to multilingual performance estimation under incomplete evidence.

关键词: multilingual evaluation, predictive evaluation, agentic system, DAG-orchestrated, evidence retrieval, performance estimation, incomplete evidence, structured reasoning

92. ❌ Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems

作者: Keyu Li, Jin Gao, Dequan Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08963v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	15.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究多智能体系统（MAS）中的偏见放大问题，与’Multi-agent Systems OR Agent Coordination’高度相关（15分），因为这是论文的核心研究对象。与’LLM Agents OR Autonomous Agents OR Agentic Workflow’相关（10分），因为论文涉及智能体工作流和自主性。与’Instruction Tuning OR Alignment OR Value Alignment’有一定关联（5分），因为论文探讨伦理对齐和偏见问题，但未深入技术细节。其他关键词如大模型技术、训练方法、推理优化等均未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了多智能体系统（MAS）中基本拓扑结构和反馈循环如何放大偏见，发现结构复杂性不仅不能保证伦理鲁棒性，反而可能加剧系统性偏见极化。

摘要翻译

尽管多智能体系统（MAS）越来越多地应用于复杂工作流，但其涌现特性——尤其是偏见累积问题——仍鲜为人知。由于现实世界的多智能体系统过于复杂而无法完整分析，评估其伦理鲁棒性需要首先剥离其基础运行机制。本研究通过基准实证分析，探讨基础多智能体系统拓扑结构与反馈循环如何影响偏见形成。与“多智能体协作天然消解偏见”的假设相反，我们提出结构化工作流会形成回声室效应，将微小的随机偏见放大为系统性极化。为验证此假设，我们开发了Discrim-Eval-Open开放式基准测试框架，该框架通过强制跨人口群体的比较判断，规避个体模型的中立性假设。对不同结构偏见传播路径的分析表明：架构复杂化往往加剧而非缓解偏见。研究发现即使孤立智能体保持中立运行，系统仍会出现偏见放大现象，并识别出“触发脆弱性”——注入纯客观语境会急剧加速极化进程。通过剥离高级群体复杂性来研究基础动力学，我们确立了关键基准：结构复杂性并不保证伦理鲁棒性。代码已开源：https://github.com/weizhihao1/MAS-Bias。

摘要 (Abstract)

While Multi-Agent Systems (MAS) are increasingly deployed for complex workflows, their emergent properties-particularly the accumulation of bias-remain poorly understood. Because real-world MAS are too complex to analyze entirely, evaluating their ethical robustness requires first isolating their foundational mechanics. In this work, we conduct a baseline empirical study investigating how basic MAS topologies and feedback loops influence prejudice. Contrary to the assumption that multi-agent collaboration naturally dilutes bias, we hypothesize that structured workflows act as echo chambers, amplifying minor stochastic biases into systemic polarization. To evaluate this, we introduce Discrim-Eval-Open, an open-ended benchmark that bypasses individual model neutrality through forced comparative judgments across demographic groups. Analyzing bias cascades across various structures reveals that architectural sophistication frequently exacerbates bias rather than mitigating it. We observe systemic amplification even when isolated agents operate neutrally, and identify a ‘Trigger Vulnerability’ where injecting purely objective context drastically accelerates polarization. By stripping away advanced swarm complexity to study foundational dynamics, we establish a crucial baseline: structural complexity does not guarantee ethical robustness. Our code is available at https://github.com/weizhihao1/MAS-Bias.

关键词: Multi-Agent Systems, Bias Amplification, Ethical Robustness, Feedback Loops, Systemic Polarization, Agent Coordination, Workflow Analysis, Benchmark Evaluation

93. ❌ WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

作者: Mintae Kim, Koushil Sreenath 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08958v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文WOMBET专注于强化学习（RL）领域，提出了一种基于世界模型的经验迁移框架，用于提高机器人RL的样本效率和鲁棒性。论文核心是强化学习中的世界模型、经验迁移、离线到在线学习、不确定性规划等概念。与评分关键词列表对比，只有"World Models AND General World Models"高度相关（10分），因为论文明确使用并创新了世界模型技术。其他所有关键词（如LLMs、MoE、SFT、RAG、Agents等）均涉及大语言模型、深度学习技术原理或特定AI应用子领域，而本论文未涉及这些内容，因此评分为0分。论文属于强化学习领域，而非大模型或深度学习在科学领域的应用，因此不符合研究背景中"大模型在不同领域的研究应用"的要求。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于世界模型的经验迁移框架（WOMBET），通过不确定性惩罚规划和自适应采样，解决了强化学习中从源任务到目标任务的经验迁移问题，提高了样本效率和最终性能。

摘要翻译

机器人学中的强化学习常受限于数据收集的成本与风险，这促使研究者探索从源任务向目标任务的经验迁移。离线到在线强化学习虽能利用先验数据，但通常假设存在给定的固定数据集，并未解决如何生成可靠迁移数据的问题。我们提出基于世界模型的经验迁移框架（World Model-based Experience Transfer, WOMBET），该框架联合生成并利用先验数据。WOMBET在源任务中学习世界模型，通过不确定性惩罚规划生成离线数据，随后筛选出高回报与低认知不确定性的轨迹。在目标任务中，框架通过离线与在线数据间的自适应采样进行在线微调，实现从先验驱动初始化到任务特定适应的平稳过渡。我们证明不确定性惩罚目标函数为真实回报提供了下界，并推导出捕获分布失配与近似误差的有限样本误差分解。实验表明，在连续控制基准测试中，WOMBET相较于强基线方法提升了样本效率与最终性能，验证了联合优化数据生成与迁移策略的有效性。

摘要 (Abstract)

Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose \textit{World Model-based Experience Transfer} (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.

关键词: Reinforcement Learning, World Model, Experience Transfer, Offline-to-online RL, Uncertainty-penalized Planning, Sample Efficiency, Continuous Control, Robust Learning

94. ❌ MuTSE: A Human-in-the-Loop Multi-use Text Simplification Evaluator

作者: Rares-Alexandru Roscan, Gabriel Petre1, Adrian-Marius Dumitran, Angela-Liliana Dumitran 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08947v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究LLM在文本简化任务中的评估方法，开发了一个交互式评估工具MuTSE。论文明确提到LLMs在文本简化中的应用，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。论文未涉及其他关键词的具体技术内容，如MoE、SLMs、训练方法、推理技术、代理系统、模型压缩等，因此这些关键词均为0分。

!!! tip deepseek-chat TL;DR

该论文针对LLM在文本简化任务中缺乏系统评估方法的问题，开发了一个交互式人机协同评估工具MuTSE，能够实时比较不同提示策略和模型架构的简化效果，并支持结构化标注用于下游NLP数据集构建。

摘要翻译

随着大语言模型在文本简化任务中的应用日益广泛，如何系统评估不同提示策略与模型架构下的输出结果，已成为自然语言处理研究和智能导学系统领域的关键方法论挑战。当前，开发鲁棒的提示方案常因缺乏结构化、可视化的对比文本分析框架而受阻。研究人员通常依赖静态的计算脚本，而教育工作者则受限于标准对话界面——这两种范式均无法支持对提示-模型组合进行系统的多维度评估。为应对这些局限，我们提出了 MuTSE（注：项目代码与演示版本已通过以下匿名链接开放供同行评审。https://osf.io/njs43/overview?view_only=4b4655789f484110a942ebb7788cdf2a），这是一个融入人类反馈循环的交互式网络应用程序，旨在简化为任意欧洲语言共同参考框架（CEFR）能力目标生成的大语言模型文本简化结果的评估流程。该系统支持并行执行 $P \times M$ 个提示-模型组合，实时生成全面的对比矩阵。通过集成一个结合线性偏差启发式参数（$λ$）的新型分层语义对齐引擎，MuTSE 能够将源语句与其简化版本进行可视化映射，从而降低质性分析的认知负荷，并为下游自然语言处理数据集构建提供可复现的结构化标注支持。

摘要 (Abstract)

As Large Language Models (LLMs) become increasingly prevalent in text simplification, systematically evaluating their outputs across diverse prompting strategies and architectures remains a critical methodological challenge in both NLP research and Intelligent Tutoring Systems (ITS). Developing robust prompts is often hindered by the absence of structured, visual frameworks for comparative text analysis. While researchers typically rely on static computational scripts, educators are constrained to standard conversational interfaces – neither paradigm supports systematic multi-dimensional evaluation of prompt-model permutations. To address these limitations, we introduce \textbf{MuTSE}\footnote{The project code and the demo have been made available for peer review at the following anonymized URL. https://osf.io/njs43/overview?view_only=4b4655789f484110a942ebb7788cdf2a, an interactive human-in-the-loop web application designed to streamline the evaluation of LLM-generated text simplifications across arbitrary CEFR proficiency targets. The system supports concurrent execution of $P \times M$ prompt-model permutations, generating a comprehensive comparison matrix in real-time. By integrating a novel tiered semantic alignment engine augmented with a linearity bias heuristic ($λ$), MuTSE visually maps source sentences to their simplified counterparts, reducing the cognitive load associated with qualitative analysis and enabling reproducible, structured annotation for downstream NLP dataset construction.

关键词: Large Language Models, text simplification, evaluation framework, human-in-the-loop, prompting strategies, interactive web application, semantic alignment, NLP dataset construction

95. ❌ Enhancing LLM Problem Solving via Tutor-Student Multi-Agent Interaction

作者: Nurullah Eymen Özdemir, Erhan Oztop 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08931v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM在多智能体系统中的协同问题解决，与’Large Language Models’、‘LLM Agents’、‘Multi-agent Systems’高度相关（10分）；涉及迭代改进和反思过程，与’Self-Correction/Self-Improvement/Self-Reflection’相关（8分）；包含结构化推理和评估，与’Chain of Thought’和’System 2 Thinking’有一定关联（5分）；其他关键词如MoE、量化、RAG等未在论文中涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文研究如何通过导师-学生多智能体交互结构提升LLM的问题解决能力，在APPS编码基准测试中实现了与最先进方法相当或更高的准确率，同时显著减少了token消耗。

摘要翻译

人类认知发展不仅受个体努力影响，更受到结构化社会互动的塑造——例如导师与学习者之间基于角色的交流，能够催生任何一方单独无法实现的解决方案。受这些发展原则启发，我们提出以下问题：通过将大型语言模型（LLM）置于导师-学生多智能体系统中，能否产生协同效应，使其超越现有框架下的能力极限？为验证这一设想，我们采用自主编程问题领域作为测试场景，其中两个源自同一大型语言模型的智能体被赋予非对称角色：学生智能体负责生成并迭代优化解决方案，而导师智能体则在无法接触标准答案的情况下提供结构化评估反馈。在我们提出的框架（PETITE）中，目标是通过互补角色构建交互结构，而非依赖更强的监督模型或异构集成，从而从单一模型中提取更优的问题解决性能。我们在APPS编码基准上评估了该模型，并与自洽性推理、自我优化、多智能体辩论及多智能体评审等前沿方法进行比较。结果表明，我们的模型在消耗显著更少计算令牌的同时，取得了相当或更高的准确率。这些发现表明，基于发展心理学原理的角色分化交互结构，为通过结构化类同伴互动增强大型语言模型的问题解决能力，提供了一个原则性强且资源高效的新范式。关键词- 同伴辅导，支架式教学，大型语言模型，多智能体系统，代码生成

摘要 (Abstract)

Human cognitive development is shaped not only by individual effort but by structured social interaction, where role-based exchanges such as those between a tutor and a learner, enable solutions that neither could achieve alone. Inspired by these developmental principles, we ask the question whether a tutor-student multi-agent system can create a synergistic effect by pushing Large Language Model (LLM) beyond what it can do within existing frameworks. To test the idea, we adopt autonomous coding problem domain where two agents instantiated from the same LLM assigned asymmetric roles: a student agent generates and iteratively refines solutions, while a tutor agent provides structured evaluative feedback without access to ground-truth answers. In our proposed framework (PETITE), we aim to extract better problem-solving performance from one model by structuring its interaction through complementary roles, rather than relying on stronger supervisory models or heterogeneous ensembles. Our model is evaluated on the APPS coding benchmark against state-of-the-art approaches of Self-Consistency, Self-Refine, Multi-Agent Debate, and Multi-Agent Review. The results show that our model achieves similar or higher accuracy while consuming significantly fewer tokens. These results suggest that developmentally grounded role-differentiated interaction structures provide a principled and resource-efficient paradigm for enhancing LLM problem-solving through structured peer-like interactions. Index Terms- Peer Tutoring, Scaffolding, Large Language Models, Multi-Agent Systems, Code Generation

关键词: Large Language Models, Multi-Agent Systems, Code Generation, Peer Tutoring, Problem Solving, Autonomous Agents, Self-Refine, Scaffolding

96. ❌ Beyond Relevance: Utility-Centric Retrieval in the LLM Era

作者: Hengran Zhang, Minghao Tang, Keping Bi, Jiafeng Guo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08920v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究检索增强生成（RAG）范式下信息检索系统的效用评估，与’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’高度相关（10分），并涉及LLM在信息访问中的应用，与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。论文提到’agentic RAG’，与’LLM Agents OR Autonomous Agents OR Agentic Workflow’有一定关联（5分）。其他关键词如MoE、量化、推理加速等未在摘要中提及，评分为0分。

!!! tip deepseek-chat TL;DR

该论文探讨了在检索增强生成（RAG）范式下，信息检索系统应从传统相关性优化转向以LLM为中心的效用评估，并提出了一个统一框架来指导设计符合LLM信息需求的检索系统。

摘要翻译

信息检索系统传统上主要优化主题相关性——即检索文档与查询的匹配程度。然而，相关性仅近似于一个更深层目标：效用，即检索到的信息是否真正有助于完成用户的底层任务。检索增强生成（RAG）的出现从根本上改变了这一范式。检索到的文档不再由用户直接消费，而是作为大型语言模型（LLM）生成答案的证据。因此，检索效果必须通过其对生成质量的贡献来评估，而不能仅依赖基于相关性的排序指标。本教程主张，检索目标正从以相关性为中心的优化，演变为以LLM为中心的效用优化。我们提出了一个统一框架，涵盖与LLM无关的效用与LLM特定的效用、上下文无关的效用与上下文依赖的效用，以及与LLM信息需求和智能体式RAG的联系。通过综合近期进展，本教程为设计与基于LLM的信息访问需求相一致的检索系统，提供了概念基础和实践指导。

摘要 (Abstract)

Information retrieval systems have traditionally optimized for topical relevance-the degree to which retrieved documents match a query. However, relevance only approximates a deeper goal: utility, namely, whether retrieved information helps accomplish a user’s underlying task. The emergence of retrieval-augmented generation (RAG) fundamentally changes this paradigm. Retrieved documents are no longer consumed directly by users but instead serve as evidence for large language models (LLMs) that produce answers. As a result, retrieval effectiveness must be evaluated by its contribution to generation quality rather than by relevance-based ranking metrics alone. This tutorial argues that retrieval objectives are evolving from relevance-centric optimization toward LLM-centric utility. We present a unified framework covering LLM-agnostic versus LLM-specific utility, context-independent versus context-dependent utility, and the connection with LLM information needs and agentic RAG. By synthesizing recent advances, the tutorial provides conceptual foundations and practical guidance for designing retrieval systems aligned with the requirements of LLM-based information access.

关键词: Retrieval-Augmented Generation, RAG, Large Language Models, LLMs, Utility-Centric Retrieval, Information Retrieval, LLM-Centric Utility, Agentic RAG

97. ❌ Large-Scale Universal Defect Generation: Foundation Models and Datasets

作者: Yuanting Fan, Jun Liu, Bin-Bin Gao, Xiaochen Chen, Yuhuan Lin, Zhewei Dai, Jiawei Zhan, Chengjie Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08915v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文提出UniDG，一个用于缺陷生成的通用基础模型，核心是基础模型技术（Foundation Models）和两阶段训练策略（Diversity-SFT和Consistency-RFT），涉及监督微调（SFT）。论文创建了大规模数据集（UDG），与数据质量相关。模型支持基于文本指令的缺陷编辑，与指令调优有一定关联。研究属于AI在工业检测等领域的应用，与AI for Science相关。其他关键词如MoE、SLMs、RLHF、RAG、推理加速等未涉及。

!!! tip deepseek-chat TL;DR

该论文针对现有缺陷生成方法因缺乏大规模配对数据而泛化能力有限的问题，提出了UniDG通用缺陷生成基础模型和UDG大规模数据集，通过两阶段训练策略显著提升了缺陷生成的多样性和真实性，并在多个基准测试中超越了现有方法。

摘要翻译

现有缺陷/异常生成方法通常依赖小样本学习，由于缺乏大规模成对缺陷编辑数据，容易对特定缺陷类别产生过拟合。缺陷尺度和形态的巨大差异加剧了这一问题，导致模型泛化能力有限、生成真实性不足且类别一致性差。为解决这些挑战，我们提出了UDG——一个涵盖多领域、包含30万组正常-异常-掩码-描述四元组的大规模数据集，并构建了UniDG通用缺陷生成基础模型。该模型支持基于参考图像的缺陷生成和基于文本指令的缺陷编辑，无需针对每个类别进行微调。UniDG通过自适应缺陷裁剪和结构化双联画输入格式执行缺陷上下文编辑，并借助MM-DiT多模态注意力机制融合参考条件与目标条件。采用两阶段训练策略（多样性监督微调Diversity-SFT后接一致性强化微调Consistency-RFT），在提升生成多样性的同时增强了真实感与参考一致性。在MVTec-AD和VisA数据集上的大量实验表明，UniDG在合成质量及下游单类别/多类别异常检测与定位任务中，均优于现有小样本异常生成和图像插入/编辑基线方法。代码将在https://github.com/RetoFan233/UniDG发布。

摘要 (Abstract)

Existing defect/anomaly generation methods often rely on few-shot learning, which overfits to specific defect categories due to the lack of large-scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and category consistency. We address these challenges by introducing UDG, a large-scale dataset of 300K normal-abnormal-mask-caption quadruplets spanning diverse domains, and by presenting UniDG, a universal defect generation foundation model that supports both reference-based defect generation and text instruction-based defect editing without per-category fine-tuning. UniDG performs Defect-Context Editing via adaptive defect cropping and structured diptych input format, and fuses reference and target conditions through MM-DiT multimodal attention. A two-stage training strategy, Diversity-SFT followed by Consistency-RFT, further improves diversity while enhancing realism and reference consistency. Extensive experiments on MVTec-AD and VisA show that UniDG outperforms prior few-shot anomaly generation and image insertion/editing baselines in synthesis quality and downstream single- and multi-class anomaly detection/localization. Code will be available at https://github.com/RetoFan233/UniDG.

关键词: defect generation, foundation model, large-scale dataset, supervised fine-tuning, anomaly detection, multimodal attention, instruction-based editing, universal model

98. ❌ StaRPO: Stability-Augmented Reinforcement Policy Optimization

作者: Jinghan Zhang, Fengran Mo, Tharindu Cyril Weerasooriya, Ruimin Dai, Xiaoyan Han, Yanjie Fu, Dakuo Wang, Kunpeng Liu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08905v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出StaRPO框架，专注于通过强化学习优化大语言模型在复杂推理任务中的表现。核心相关关键词包括：1) ‘Large Language Models’ (10分)：论文明确研究LLMs在推理任务中的优化；2) ‘Chain of Thought’和’System 2 Thinking’ (各10分)：论文直接针对多步推理过程的逻辑一致性和深度推理稳定性；3) ‘Self-Correction’ (5分)：通过稳定性指标间接促进自我改进；4) ‘Hallucination Mitigation’和’Mechanistic Interpretability’ (各5分)：减少逻辑不一致性并增强推理过程的可解释性。其他关键词如MoE、SFT、RAG等与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型在复杂推理任务中生成逻辑不一致、结构混乱响应的问题，提出了StaRPO强化学习框架，通过引入推理稳定性指标（ACF和PE）来优化模型，实验证明该方法能同时提升最终答案准确性和逻辑稳定性。

摘要翻译

强化学习（RL）在提升大语言模型于复杂推理任务中的准确性方面具有显著效果。现有的强化学习策略优化框架主要依赖最终答案的正确性作为反馈信号，很少能捕捉推理过程的内部逻辑结构。因此，模型生成的回答可能流畅且语义相关，但在逻辑上不一致、结构混乱或存在冗余。为此，我们提出StaRPO，一种稳定性增强的强化学习框架，其明确将推理稳定性纳入优化目标。我们的StaRPO将稳定性分解为两个可计算的轻量级指标：用于评估局部步骤间连贯性的自相关函数（Autocorrelation Function, ACF），以及用于评估推理轨迹全局目标导向性的路径效率（Path Efficiency, PE）。这些稳定性奖励与任务奖励相结合，提供互补且过程感知的反馈。我们通过展示ACF和PE奖励与两个骨干模型上逻辑错误的相关性，验证了使用这两种奖励的有效性。在四个推理基准测试上的实验表明，StaRPO持续优于现有基线方法，并能同时提升最终答案的准确性和逻辑稳定性。

摘要 (Abstract)

Reinforcement learning (RL) is effective in enhancing the accuracy of large language models in complex reasoning tasks. Existing RL policy optimization frameworks rely on final-answer correctness as feedback signals and rarely capture the internal logical structure of the reasoning process. Consequently, the models would generate fluent and semantically relevant responses but logically inconsistent, structurally erratic, or redundant. To this end, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. Our StaRPO decomposes stability into two computable lightweight metrics: the Autocorrelation Function (ACF) to evaluate local step-to-step coherence, and Path Efficiency (PE) to evaluate global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary and process-aware feedback. We validate the effectiveness of using ACF and PE rewards by showing their correlation with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms compared baselines and can enhance both final-answer accuracy and logical stability.

关键词: Reinforcement Learning, Large Language Models, Reasoning Stability, Policy Optimization, Logical Consistency, Autocorrelation Function, Path Efficiency, Complex Reasoning Tasks

99. ❌ Ge$^\text{2}$mS-T: Multi-Dimensional Grouping for Ultra-High Energy Efficiency in Spiking Transformer

作者: Zecheng Hao, Shenghao Xie, Kang Chen, Wenxuan Liu, Zhaofei Yu, Tiejun Huang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08894v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于脉冲神经网络（SNNs）和脉冲视觉变换器（S-ViTs）的架构优化，特别是通过分组计算提高能效和性能。所有评分关键词均与大语言模型（LLMs）、深度学习技术原理或AI在科学领域的应用直接相关，而本文研究的是脉冲神经网络这一特定子领域，与所有关键词无直接关联。论文未涉及任何大模型技术、训练方法、推理优化、对齐技术或科学AI应用，因此所有关键词得分为0。

!!! tip deepseek-chat TL;DR

该论文解决了脉冲视觉变换器（S-ViTs）在训练和推理中内存、精度和能耗无法同时优化的问题，提出了一种名为Ge²mS-T的多维分组计算架构，实现了超高的能效和优越的性能。

摘要翻译

脉冲神经网络（SNNs）相较于人工神经网络（ANNs）具有显著的能效优势。然而，在应用于脉冲视觉变换器（S-ViTs）时，其在训练和推理指标上存在明显不足。现有范式，包括ANN-SNN转换和时空反向传播（STBP），均受到固有局限性的制约，无法同时优化内存占用、精度和能耗。为解决这些问题，我们提出了Ge$^\text{2}$mS-T，一种在时间、空间和网络结构维度上实现分组计算的新型架构。具体而言，我们引入了基于分组指数编码的积分发放（ExpG-IF）模型，该模型能以恒定的训练开销实现无损转换，并对脉冲模式进行精确调控。此外，我们开发了分组脉冲自注意力（GW-SSA）机制，通过在多尺度令牌分组和混合注意力-卷积框架内采用无乘法运算，以降低计算复杂度。实验证实，我们的方法在具有挑战性的基准测试中能够以超高的能效实现卓越性能。据我们所知，这是首个系统性地建立多维分组计算以解决S-ViTs中内存开销、学习能力和能量预算三者矛盾的工作。

摘要 (Abstract)

Spiking Neural Networks (SNNs) offer superior energy efficiency over Artificial Neural Networks (ANNs). However, they encounter significant deficiencies in training and inference metrics when applied to Spiking Vision Transformers (S-ViTs). Existing paradigms including ANN-SNN Conversion and Spatial-Temporal Backpropagation (STBP) suffer from inherent limitations, precluding concurrent optimization of memory, accuracy and energy consumption. To address these issues, we propose Ge$^\text{2}$mS-T, a novel architecture implementing grouped computation across temporal, spatial and network structure dimensions. Specifically, we introduce the Grouped-Exponential-Coding-based IF (ExpG-IF) model, enabling lossless conversion with constant training overhead and precise regulation for spike patterns. Additionally, we develop Group-wise Spiking Self-Attention (GW-SSA) to reduce computational complexity via multi-scale token grouping and multiplication-free operations within a hybrid attention-convolution framework. Experiments confirm that our method can achieve superior performance with ultra-high energy efficiency on challenging benchmarks. To our best knowledge, this is the first work to systematically establish multi-dimensional grouped computation for resolving the triad of memory overhead, learning capability and energy budget in S-ViTs.

关键词: Spiking Neural Networks, Spiking Vision Transformers, Energy Efficiency, Grouped Computation, Multi-dimensional Grouping, GW-SSA, ExpG-IF, Memory Overhead

100. ❌ A Closer Look at the Application of Causal Inference in Graph Representation Learning

作者: Hang Gao, Kunyu Li, Huang Hong, Baoquan Cui, Fengge Wu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08890v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于图表示学习中的因果推断应用，研究如何确保图数据中因果建模的有效性，并提出了理论模型和增强模块。所有评分关键词均与大模型、深度学习技术原理或AI在科学领域的应用直接相关，而本文的核心内容（图表示学习、因果推断）与这些关键词无直接关联，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了图表示学习中因果建模的有效性问题，证明了现有方法中图元素聚合会损害因果有效性，提出了基于最小不可分单元的理论模型来保证因果有效性，并开发了可集成到现有图学习流程中的因果建模增强模块。

摘要翻译

在图表征学习中建模因果关系仍是一个基础性挑战。现有方法常借助因果推断的理论与方法来识别因果子图或缓解混杂因素影响。然而，由于图结构数据固有的复杂性，这些方法往往将多样的图元素聚合为单一因果变量，这种操作可能违反因果推断的核心假设。本文证明了此类聚合会损害因果有效性。基于此结论，我们提出一个以图数据最小不可分割单元为根基的理论模型，以确保因果有效性得到保障。借助该模型，我们进一步分析了在图表征学习中实现精确因果建模所需代价，并明确了问题可被简化的条件。为实证支持理论，我们构建了一个反映真实世界因果结构的可控合成数据集，并进行了大量实验验证。最终，我们开发了一个可无缝集成至现有图学习流程的因果建模增强模块，并通过综合对比实验证明了其有效性。

摘要 (Abstract)

Modeling causal relationships in graph representation learning remains a fundamental challenge. Existing approaches often draw on theories and methods from causal inference to identify causal subgraphs or mitigate confounders. However, due to the inherent complexity of graph-structured data, these approaches frequently aggregate diverse graph elements into single causal variables, an operation that risks violating the core assumptions of causal inference. In this work, we prove that such aggregation compromises causal validity. Building on this conclusion, we propose a theoretical model grounded in the smallest indivisible units of graph data to ensure that the causal validity is guaranteed. With this model, we further analyze the costs of achieving precise causal modeling in graph representation learning and identify the conditions under which the problem can be simplified. To empirically support our theory, we construct a controllable synthetic dataset that reflects realworld causal structures and conduct extensive experiments for validation. Finally, we develop a causal modeling enhancement module that can be seamlessly integrated into existing graph learning pipelines, and we demonstrate its effectiveness through comprehensive comparative experiments.

关键词: causal inference, graph representation learning, causal validity, causal modeling, graph-structured data, synthetic dataset, enhancement module

101. ❌ Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS)

作者: Mohsen Yaghoubi Suraki 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08893v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于医学图像分割，提出了一种用于脑胶质瘤分割的深度学习模型ADRUwAMS，结合了自适应双残差网络、注意力门和多尺度空间注意力机制。论文内容与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、对齐、代理等）完全无关，因为这些关键词主要针对大语言模型（LLMs）及相关技术。唯一略有相关的是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于AI在生物医学（具体是医学影像分析）领域的应用，属于’AI for Science’的范畴，但并非核心匹配（论文未明确提及这些术语，且专注于特定医学任务而非广义科学AI），因此给予5分（有一定关联）。其他所有关键词评分为0分（完全无关）。

!!! tip deepseek-chat TL;DR

该研究提出了一种名为ADRUwAMS的深度学习模型，用于脑胶质瘤的自动分割，在BraTS数据集上实现了高精度的肿瘤检测和分割结果。

摘要翻译

胶质瘤是一种危害性极高的脑部肿瘤，其早期检测对改善患者预后至关重要。实现有效治疗的关键在于早期发现肿瘤，这通常依赖于自动化的分割流程。然而，由于肿瘤的位置、大小等特性，精准定位肿瘤是一项具有挑战性的任务。深度学习模型作为一种可靠方法，能够准确区分肿瘤区域与健康组织，并在近年来展现出显著成效。本研究提出了一种结合注意力门与多尺度空间注意力机制的自适应双残差U-Net模型（ADRUwAMS）。该模型创新性地融合了自适应双残差网络、注意力机制以及多尺度空间注意力。其中，双自适应残差网络架构能够从脑部图像中提取高层语义信息与复杂的低层细节，确保对不同肿瘤部位、类型及困难区域进行精确分割；注意力门利用门控信号与输入信号计算输入特征的注意力系数；多尺度空间注意力则生成缩放后的注意力图，并融合这些特征以保留关于脑肿瘤的最关键信息。我们在BraTS 2020和BraTS 2019数据集上使用ReLU激活函数对模型进行了200轮训练。这些改进使得模型在BraTS 2020数据集上实现了高精度的肿瘤检测与分割效果，其全肿瘤、肿瘤核心及增强肿瘤区域的Dice分数分别达到0.9229、0.8432和0.8004。

摘要 (Abstract)

Glioma is a harmful brain tumor that requires early detection to ensure better health results. Early detection of this tumor is key for effective treatment and requires an automated segmentation process. However, it is a challenging task to find tumors due to tumor characteristics like location and size. A reliable method to accurately separate tumor zones from healthy tissues is deep learning models, which have shown promising results over the last few years. In this research, an Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS) is introduced. This model is an innovative combination of adaptive dual residual networks, attention mechanisms, and multiscale spatial attention. The dual adaptive residual network architecture captures high-level semantic and intricate low-level details from brain images, ensuring precise segmentation of different tumor parts, types, and hard regions. The attention gates use gating and input signals to compute attention coefficients for the input features, and multiscale spatial attention generates scaled attention maps and combines these features to hold the most significant information about the brain tumor. We trained the model for 200 epochs using the ReLU activation function on BraTS 2020 and BraTS 2019 datasets. These improvements resulted in high accuracy for tumor detection and segmentation on BraTS 2020, achieving dice scores of 0.9229 for the whole tumor, 0.8432 for the tumor core, and 0.8004 for the enhancing tumor.

关键词: Glioma segmentation, Deep learning, U-Net, Attention mechanisms, Medical image analysis, Brain tumor, Residual networks, Multiscale spatial attention

102. ❌ HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

作者: Xinyu Zhang, Zurong Mai, Qingmei Li, Zjin Liao, Yibin Wen, Yuhang Chen, Xiaoya Fan, Chan Tsz Ho, Bi Tianyuan, Haoyuan Liang, Ruifeng Su, Zihao Qian, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08884v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究多模态大语言模型（MLLMs）在遥感科学领域的应用，与’Large Language Models’高度相关（10分），属于’AI for Science’范畴（10分）。论文涉及从基础感知到光谱推理的任务，与推理相关的关键词’Chain of Thought’和’System 2 Thinking’有一定关联（各5分）。其他关键词如MoE、SFT、RAG等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型在遥感高光谱图像理解能力不足的问题，提出了首个高光谱多模态基准HM-Bench，通过大规模数据集和双模态评估框架对18个代表性模型进行系统评估，发现模型在复杂空间-光谱推理任务上存在显著困难，且视觉输入通常优于文本输入。

摘要翻译

尽管多模态大语言模型（MLLMs）在自然图像理解方面已取得显著进展，但其对高光谱图像（Hyperspectral Image, HSI）的感知与推理能力仍未被充分探索，而HSI是遥感领域的重要模态。HSI的高维特性及复杂的光谱-空间属性，对主要基于RGB数据训练的模型构成了独特挑战。为填补这一空白，我们提出了高光谱多模态基准（Hyperspectral Multimodal Benchmark, HM-Bench），这是首个专门用于评估MLLMs在HSI理解方面性能的基准。我们构建了一个包含13个任务类别、总计19,337个问答对的大规模数据集，涵盖从基础感知到光谱推理的多种任务。鉴于现有MLLMs无法直接处理原始高光谱立方体数据，我们提出了一种双模态评估框架，将HSI数据转换为两种互补的表示形式：基于主成分分析（PCA）的合成图像与结构化文本报告。该方法有助于系统比较不同表示形式对模型性能的影响。对18个代表性MLLMs的广泛评估表明，现有模型在处理复杂的光谱-空间推理任务时存在显著困难。此外，我们的结果显示视觉输入通常优于文本输入，这凸显了基于光谱-空间证据进行推理对于有效理解HSI的重要性。数据集与附录可通过https://github.com/HuoRiLi-Yu/HM-Bench访问。

摘要 (Abstract)

While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral image (HSI) remains underexplored, which is a vital modality in remote sensing. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data.To address this gap, we introduce Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of different representation for model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. Dataset and appendix can be accessed at https://github.com/HuoRiLi-Yu/HM-Bench.

关键词: Multimodal Large Language Models, Hyperspectral Remote Sensing, Benchmark, Spectral-Spatial Reasoning, Dual-modality Evaluation, HSI Understanding, Question-Answer Pairs, Model Performance Comparison

作者: Chengjie Fan, Cong Pan, Zijian Liu, Ningzhong Liu, Jie Qin 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08883v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于城市空中视觉与语言导航（Aerial VLN）任务，提出了一种结合模仿学习（IL）和强化学习（RL）的混合导航框架HTNav，并引入了分层决策机制和地图表示学习模块。论文的核心是机器人导航、计算机视觉和强化学习技术，未涉及任何大语言模型（LLM）、深度学习技术原理创新或AI for Science的具体应用。所有评分关键词均与大模型技术、训练方法、推理优化、AI代理或科学AI应用相关，而本文研究的是传统的视觉语言导航问题，未使用或改进大模型技术，因此所有关键词的相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对复杂城市环境中空中视觉与语言导航任务面临的泛化性不足、长距离路径规划性能欠佳和空间连续性理解不充分等挑战，提出了一个结合模仿学习和强化学习的混合导航框架HTNav，通过在CityNav基准测试中实现最先进的性能，显著提高了导航精度和鲁棒性。

摘要翻译

受通用视觉语言导航任务的启发，空中视觉语言导航因其在物流配送和城市巡检等应用中的显著实用价值而受到广泛关注。然而，现有方法在复杂城市环境中面临若干挑战，包括对未见场景的泛化能力不足、长距离路径规划性能欠佳以及对空间连续性理解不充分。为应对这些挑战，我们提出HTNav——一种在混合模仿学习与强化学习框架中整合了模仿学习与强化学习的新型协同导航框架。该框架采用分阶段训练机制，在确保基础导航策略稳定性的同时增强其环境探索能力。通过集成分层决策机制，实现了宏观路径规划与细粒度动作控制之间的协同交互。此外，我们引入了地图表征学习模块，以深化其对开放域空间连续性的理解。在CityNav基准测试中，我们的方法在所有场景层级和任务难度上均取得了最先进的性能。实验结果表明，该框架显著提升了复杂城市环境中的导航精度与鲁棒性。

摘要 (Abstract)

Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has attracted widespread attention, owing to its significant practical value in applications such as logistics delivery and urban inspection. However, existing methods face several challenges in complex urban environments, including insufficient generalization to unseen scenes, suboptimal performance in long-range path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that integrates Imitation Learning (IL) and Reinforcement Learning (RL) within a hybrid IL-RL framework. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative interaction between macro-level path planning and fine-grained action control. Furthermore, a map representation learning module is introduced to deepen its understanding of spatial continuity in open domains. On the CityNav benchmark, our method achieves state-of-the-art performance across all scene levels and task difficulties. Experimental results demonstrate that this framework significantly improves navigation precision and robustness in complex urban environments.

关键词: Aerial Vision-and-Language Navigation, Hybrid IL-RL Framework, Tiered Decision-making, Map Representation Learning, Urban Navigation, Path Planning, Imitation Learning, Reinforcement Learning

104. ❌ Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective

作者: Tokio Kajitsuka, Ukyo Honda, Sho Takase 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08880v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究链式思维（CoT）蒸馏中的能力差距问题，因此与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’高度相关（10分）。研究涉及从强教师模型到较小学生模型的蒸馏，与’Large Language Models OR LLMs OR Foundation Models’和’Small Language Models OR SLMs OR On-device AI’相关（各8分）。蒸馏属于微调技术，与’Post-training OR Supervised Fine-tuning OR SFT’相关（8分）。CoT蒸馏涉及推理过程，与’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’有一定关联（8分）。其他关键词如MoE、Scaling Laws、RAG、量化等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文重新审视了链式思维蒸馏中的能力差距问题，发现现有评估方法可能高估了该问题的影响，并提出了更现实的评估协议，为教师-学生模型对的选择提供了实用指导。

摘要翻译

思维链（Chain-of-thought, CoT）蒸馏旨在将推理能力从强大的教师模型迁移至更小的学生模型，但先前研究指出存在能力差距问题：当师生模型能力不匹配程度较大时，蒸馏可能失效。本文从实践角度重新审视这一能力差距，对常用实验设置进行了系统性再评估。值得注意的是，我们发现相较于学生模型蒸馏前的基线性能，CoT蒸馏反而常常导致性能下降——这一现象在仅报告蒸馏后对比结果的研究中容易被掩盖。为此，我们提出了一套更贴近现实的评估方案，并发现能力差距的影响并非在所有任务和设置中均占主导地位，尤其在候选教师模型性能差异显著时更为明显。我们的研究结果为CoT蒸馏中师生模型配对的选择提供了实践指导。

摘要 (Abstract)

Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student’s pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.

关键词: Chain-of-thought distillation, capacity gap, teacher-student models, reasoning behaviors, evaluation protocol, model distillation, knowledge transfer, practical guidance

105. ❌ A Mathematical Framework for Temporal Modeling and Counterfactual Policy Simulation of Student Dropout

作者: Rafael da Silva, Jeff Eicher, Gregory Longo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08874v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究高等教育中学生辍学的时序建模和反事实政策模拟，使用逻辑回归等传统机器学习方法分析LMS参与数据和行政记录。论文完全不涉及大模型、深度学习或任何AI技术原理的创新，也未在科学领域应用AI技术。所有关键词均与大模型技术、深度学习原理或AI科学应用相关，而本文是纯教育数据分析和传统统计建模研究，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该研究提出了一个基于时序建模和反事实政策模拟的框架来预测高等教育中的学生辍学风险，使用逻辑回归模型分析学习管理系统参与数据，并在观察性数据约束下展示了框架进行内部结构情景比较的能力。

摘要翻译

本研究提出一种融合反事实政策模拟层的时间建模框架，用于预测高等教育中的学生辍学问题，该框架利用学习管理系统参与度数据与行政退学记录。辍学行为被操作化为注册层面的时间-事件终点；通过基于人-周期行数据的惩罚性类别平衡逻辑回归，在离散时间中对周度风险进行建模。在延迟事件时间保留验证下，模型取得行级别AUC值0.8350（训练集）与0.8405（测试集），整体校准度可接受但在最高风险区间支持数据稀疏。消融分析表明模型性能对特征集构成敏感，凸显了时间参与度信号的关键作用。通过场景索引的政策模拟层在显式触发/调度合约下生成生存对比$ΔS(T)$：正向对比仅出现在冲击分支（$T_{\rm policy}=18$：0.0102, 0.0260, 0.0819），而机制感知分支呈现负向对比（$ΔS_{\rm mech}(18)=-0.0078$, $ΔS_{\rm mech}(38)=-0.0134$）。基于性别的亚组分析通过自助法量化了场景引致的生存差异，对比方向稳定但幅度较小。研究结果未进行因果识别，但证明了该框架在观测数据约束下进行内部结构场景比较的能力。

摘要 (Abstract)

This study proposes a temporal modeling framework with a counterfactual policy-simulation layer for student dropout in higher education, using LMS engagement data and administrative withdrawal records. Dropout is operationalized as a time-to-event outcome at the enrollment level; weekly risk is modeled in discrete time via penalized, class-balanced logistic regression over person–period rows. Under a late-event temporal holdout, the model attains row-level AUCs of 0.8350 (train) and 0.8405 (test), with aggregate calibration acceptable but sparsely supported in the highest-risk bins. Ablation analyses indicate performance is sensitive to feature set composition, underscoring the role of temporal engagement signals. A scenario-indexed policy layer produces survival contrasts $ΔS(T)$ under an explicit trigger/schedule contract: positive contrasts are confined to the shock branch ($T_{\rm policy}=18$: 0.0102, 0.0260, 0.0819), while the mechanism-aware branch is negative ($ΔS_{\rm mech}(18)=-0.0078$, $ΔS_{\rm mech}(38)=-0.0134$). A subgroup analysis by gender quantifies scenario-induced survival gaps via bootstrap; contrasts are directionally stable but small. Results are not causally identified; they demonstrate the framework’s capacity for internal structural scenario comparison under observational data constraints.

关键词: temporal modeling, counterfactual policy simulation, student dropout, higher education, logistic regression, LMS engagement data, survival analysis, scenario comparison

106. ❌ Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations

作者: Rafael da Silva, Jeff Eicher, Gregory Longo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08870v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于学习分析中的学生辍学预测，使用生存分析模型（如随机生存森林、泊松分段指数模型）和传统机器学习方法，不涉及大语言模型、深度学习或任何评分关键词中的技术。唯一的相关性在于：1）‘Explainable AI’（5分），因为论文包含可解释性分析；2）‘AI for Science’（5分），因为教育科学属于科学应用领域。其他关键词均与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该研究提出了一个用于学习分析中时间辍学风险建模的生存分析基准，发现辍学风险主要受时间行为因素而非静态背景属性驱动。

摘要翻译

学生辍学问题是学习分析领域持续关注的焦点，然而现有比较研究常在异质性评估框架下检验预测模型，往往更注重区分度而忽视了时间可解释性与校准性。本研究基于开放大学学习分析数据集（OULAD），构建了一个面向生存分析的时序辍学风险建模基准框架。研究比较了两个协调统一的建模路径：其一是动态周度路径，采用人-周期表征模型；其二是与之对应的连续时间路径，扩展了模型家族类型——包括基于树的生存模型、参数模型及神经网络模型。评估框架整合了四个分析维度：预测性能、特征消融、可解释性及校准性。由于跨路径的统一排名缺乏方法论依据，结果分别在两条路径内独立报告。在连续时间路径中，随机生存森林在区分度与特定时间范围的Brier分数上表现最优；在动态路径中，泊松分段指数模型在五个紧密聚集的模型家族中以微弱优势取得综合Brier分数领先。无重拟合自助法抽样变异表明这些排序应视为方向性信号而非绝对优劣判定。所有模型家族的特征消融与可解释性分析均指向一致结论：主导预测信号并非主要来自人口统计学或结构性特征，而是源于时序行为特征。校准分析在区分度较高的模型中验证了这一规律，但XGBoost加速失效时间模型表现出系统性偏差例外。这些结果证实了构建协调统一的多维基准框架在学习分析领域的价值，并将辍学风险定位于一种时序行为过程，而非静态背景属性的函数。

摘要 (Abstract)

Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heterogeneous protocols, prioritizing discrimination over temporal interpretability and calibration. This study introduces a survival-oriented benchmark for temporal dropout risk modelling using the Open University Learning Analytics Dataset (OULAD). Two harmonized arms are compared: a dynamic weekly arm, with models in person-period representation, and a comparable continuous-time arm, with an expanded roster of families – tree-based survival, parametric, and neural models. The evaluation protocol integrates four analytical layers: predictive performance, ablation, explainability, and calibration. Results are reported within each arm separately, as a single cross-arm ranking is not methodologically warranted. Within the comparable arm, Random Survival Forest leads in discrimination and horizon-specific Brier scores; within the dynamic arm, Poisson Piecewise-Exponential leads narrowly on integrated Brier score within a tight five-family cluster. No-refit bootstrap sampling variability qualifies these positions as directional signals rather than absolute superiority. Ablation and explainability analyses converged, across all families, on a shared finding: the dominant predictive signal was not primarily demographic or structural, but temporal and behavioral. Calibration corroborated this pattern in the better-discriminating models, with the exception of XGBoost AFT, which exhibited systematic bias. These results support the value of a harmonized, multi-dimensional benchmark in Learning Analytics and situate dropout risk as a temporal-behavioral process rather than a function of static background attributes.

关键词: Learning Analytics, Student Dropout, Survival Analysis, Temporal Risk Modeling, Benchmark Evaluation, Explainability, Calibration, Behavioral Factors

107. ❌ MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification

作者: Mohammed Maaz Sibhai, Abedalrhman Alkhateeb, Saad B. Ahmed 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08868v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文专注于医学图像分类，提出了一种结合原型学习和不确定性引导路由的Transformer模型（MedFormer-UR），核心贡献在于不确定性量化和模型校准。与绝大多数关键词（涉及大模型技术、训练方法、推理优化、智能体等）完全无关。仅与’Mechanistic Interpretability OR Explainable AI’有一定关联（5分），因为论文提到了透明度问题，但主要关注不确定性而非可解释性。与’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分），因为论文明确属于AI在生物医学（医学影像）领域的应用，符合’AI for Science’子领域。

!!! tip deepseek-chat TL;DR

该研究解决了医学视觉Transformer在临床数据中存在的过度自信预测和缺乏透明度的问题，通过引入基于狄利克雷分布的不确定性量化与引导路由机制，显著提升了模型校准（ECE降低达35%）和选择性预测性能。

摘要翻译

为确保深度学习模型在临床实践中安全整合，其不仅需具备高精度，更需提供可靠的不确定性量化。当前医学视觉变换器（Medical Vision Transformers）虽表现良好，却常面临预测过度自信与透明度不足的问题，而临床数据的噪声与不均衡特性进一步放大了这些缺陷。为解决此问题，我们在融合原型学习与不确定性引导路由的改进型医学变换器（MedFormer）基础上进行增强，通过采用狄利克雷分布（Dirichlet distribution）对每个特征标记进行证据不确定性建模，使框架能够实时量化并定位预测中的模糊性。这种不确定性不仅是输出结果，更作为训练过程的主动参与者，可过滤不可靠的特征更新。此外，类别特异性原型的使用确保了嵌入空间保持结构化，从而支持基于视觉相似性的决策。在四种影像模态（乳腺X线摄影、超声、磁共振成像和组织病理学）上的测试证实，我们的方法显著提升了模型校准能力——预期校准误差（ECE）降低最高达35%，并改善了选择性预测性能，即使在精度提升有限的情况下仍保持优势。

摘要 (Abstract)

To ensure safe clinical integration, deep learning models must provide more than just high accuracy; they require dependable uncertainty quantification. While current Medical Vision Transformers perform well, they frequently struggle with overconfident predictions and a lack of transparency, issues that are magnified by the noisy and imbalanced nature of clinical data. To address this, we enhanced the modified Medical Transformer (MedFormer) that incorporates prototype-based learning and uncertainty-guided routing, by utilizing a Dirichlet distribution for per-token evidential uncertainty, our framework can quantify and localize ambiguity in real-time. This uncertainty is not just an output but an active participant in the training process, filtering out unreliable feature updates. Furthermore, the use of class-specific prototypes ensures the embedding space remains structured, allowing for decisions based on visual similarity. Testing across four modalities (mammography, ultrasound, MRI, and histopathology) confirms that our approach significantly enhances model calibration, reducing expected calibration error (ECE) by up to 35%, and improves selective prediction, even when accuracy gains are modest.

关键词: Medical Image Classification, Transformer, Uncertainty Quantification, Model Calibration, Prototype-based Learning, Dirichlet Distribution, Clinical Data, Selective Prediction

108. ❌ AudioGuard: Toward Comprehensive Audio Safety Protection Across Diverse Threat Models

作者: Mintong Kang, Chen Fang, Bo Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08867v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究音频安全保护系统AudioGuard，涉及基础模型（Foundation Models）在音频接口中的应用，因此与’Large Language Models OR LLMs OR Foundation Models’关键词高度相关（8分）。论文未涉及其他关键词的具体技术（如MoE、SLMs、训练方法、推理优化、代理系统等）或科学领域应用，因此其他关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对音频系统（作为基础模型接口）面临的安全风险，提出了AudioGuard统一防护框架，包括SoundGuard和ContentGuard组件，并在AudioSafetyBench基准测试中显著提升了防护准确性并降低了延迟。

摘要翻译

音频正迅速成为基础模型的主要交互界面，驱动着实时语音助手的发展。确保音频系统的安全性本质上比“不安全文本被朗读出来”更为复杂：现实世界中的风险可能取决于音频原生的有害声音事件、说话者属性（如儿童声音）、模仿/语音克隆滥用，以及语音-内容组合性危害（例如儿童声音与色情内容的结合）。音频的特性使得针对这一独特风险领域开发全面的基准测试或防护机制具有挑战性。为填补这一空白，我们对音频系统进行了大规模红队测试，系统性地揭示了音频中的脆弱点，并构建了一个全面的、基于政策规范的音频风险分类体系以及首个跨多种威胁模型的基于政策的音频安全基准测试集——AudioSafetyBench。AudioSafetyBench支持多种语言、可疑声音（如名人/模仿声音和儿童声音）、高风险语音-内容组合以及非语音声音事件。为防御这些威胁，我们提出了AudioGuard，这是一个统一的防护框架，包含：1) SoundGuard，用于波形级音频原生检测；2) ContentGuard，用于基于政策规范的语义保护。在AudioSafetyBench和四个补充基准测试集上的大量实验表明，与基于音频大语言模型的强基线方法相比，AudioGuard能持续提升防护准确性，并显著降低延迟。

摘要 (Abstract)

Audio has rapidly become a primary interface for foundation models, powering real-time voice assistants. Ensuring safety in audio systems is inherently more complex than just “unsafe text spoken aloud”: real-world risks can hinge on audio-native harmful sound events, speaker attributes (e.g., child voice), impersonation/voice-cloning misuse, and voice-content compositional harms, such as child voice plus sexual content. The nature of audio makes it challenging to develop comprehensive benchmarks or guardrails against this unique risk landscape. To close this gap, we conduct large-scale red teaming on audio systems, systematically uncover vulnerabilities in audio, and develop a comprehensive, policy-grounded audio risk taxonomy and AudioSafetyBench, the first policy-based audio safety benchmark across diverse threat models. AudioSafetyBench supports diverse languages, suspicious voices (e.g., celebrity/impersonation and child voice), risky voice-content combinations, and non-speech sound events. To defend against these threats, we propose AudioGuard, a unified guardrail consisting of 1) SoundGuard for waveform-level audio-native detection and 2) ContentGuard for policy-grounded semantic protection. Extensive experiments on AudioSafetyBench and four complementary benchmarks show that AudioGuard consistently improves guardrail accuracy over strong audio-LLM-based baselines with substantially lower latency.

关键词: audio safety, foundation models, audio guardrail, AudioSafetyBench, red teaming, voice assistants, threat models, policy-grounded

109. ❌ Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation

作者: Xinyu Wang, Sai Koneru, Wenbo Zhang, Wenliang Zheng, Saksham Ranjan, Sarah Rajtmajer 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09514v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文明确研究LLMs生成虚假新闻的问题，与’Large Language Models’高度相关（10分）；虚假新闻检测涉及事实性和真实性，与’Hallucination Mitigation OR Factuality OR Truthfulness’高度相关（10分）；其他关键词如MoE、SLMs、Scaling Laws、训练方法、推理技术、模型压缩、AI for Science等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对LLMs生成策略驱动的混合真假虚假新闻问题，构建了MANYFAKE基准并评估现有检测器，发现即使先进模型对完全虚假内容检测接近饱和，但对微妙、优化的混合虚假内容仍很脆弱。

摘要翻译

近期大规模语言模型（LLM）的进展使得大规模生成高度流畅且具有欺骗性的类新闻内容成为可能。以往研究常将虚假新闻检测视为二元分类问题，但现代虚假新闻越来越多地通过人机协作产生，即在其他方面准确可信的叙述中嵌入策略性不实信息。这类真假混杂的案例构成了现实且影响深远的威胁，却在现有基准数据集中代表性不足。为填补这一空白，我们提出了MANYFAKE——一个包含6,798篇虚假新闻文章的合成基准数据集，这些文章通过多种策略驱动的提示流程生成，涵盖了虚假新闻构建与优化的多种方式。基于该基准，我们评估了一系列前沿虚假新闻检测器。实验结果表明，即使是具备先进推理能力的模型，在面对完全虚构的报道时性能已接近饱和，但当虚假信息以隐蔽、优化且与真实信息交织的方式呈现时，这些模型仍表现出明显的脆弱性。

摘要 (Abstract)

Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior work has often treated fake news detection as a binary classification problem, modern fake news increasingly arises through human-AI collaboration, where strategic inaccuracies are embedded within otherwise accurate and credible narratives. These mixed-truth cases represent a realistic and consequential threat, yet they remain underrepresented in existing benchmarks. To address this gap, we introduce MANYFAKE, a synthetic benchmark containing 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that capture many ways fake news can be constructed and refined. Using this benchmark, we evaluate a range of state-of-the-art fake news detectors. Our results show that even advanced reasoning-enabled models approach saturation on fully fabricated stories, but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.

关键词: fake news detection, large language models, LLMs, synthetic benchmark, strategy-driven generation, mixed-truth cases, news-like content, human-AI collaboration

110. ❌ You Can’t Fight in Here! This is BBS!

作者: Richard Futrell, Kyle Mahowald 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09501v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心讨论语言模型（LLMs）在语言科学领域的应用潜力，与’Large Language Models’高度相关（10分），属于’AI for Science’范畴（10分）。论文涉及对语言模型能力的解释和澄清，与’Mechanistic Interpretability’有一定关联（5分）。其他关键词（如MoE、SFT、RAG等）涉及具体技术细节，论文未讨论，故评0分。

!!! tip deepseek-chat TL;DR

该论文探讨了现代语言模型能否为语言科学提供重要见解，澄清了相关误解，并倡导在AI时代开展更广泛的语言科学研究。

摘要翻译

形式理论语言学家诺姆与计算语言科学家克劳德特进行了一场愉快的讨论，探讨现代语言模型能否为语言科学的重要问题提供启示。就在他们即将分别、期待下次重逢之际，二十五位来自语言学、神经科学、认知科学、心理学、哲学与计算机科学领域的亲密同行加入了对话。我们借此次讨论阐明两个普遍存在的核心误区：其一是“字符串统计稻草人谬误”（即错误地认为语言模型与其前身马尔可夫模型一样，仅是学习字符串的统计模型，因而缺乏语言能力与研究价值）；其二是“现状即极限假设”（即认为截至2026年的语言模型研究已触及其对语言学启示的边界）。本文旨在厘清基于语言模型的研究对人类语言科学认知的贡献，并倡导在人工智能时代拓展语言科学的研究范式——通过深入回应评论者提出的关切，构建一个更完善、更坚实的人类语言科学与语言模型科学体系。

摘要 (Abstract)

Norm, the formal theoretical linguist, and Claudette, the computational language scientist, have a lovely time discussing whether modern language models can inform important questions in the language sciences. Just as they are about to part ways until they meet again, 25 of their closest friends show up – from linguistics, neuroscience, cognitive science, psychology, philosophy, and computer science. We use this discussion to highlight what we see as some common underlying issues: the String Statistics Strawman (the mistaken idea that LMs can’t be linguistically competent or interesting because they, like their Markov model predecessors, are statistical models that learn from strings) and the As Good As it Gets Assumption (the idea that LM research as it stands in 2026 is the limit of what it can tell us about linguistics). We clarify the role of LM-based work for scientific insights into human language and advocate for a more expansive research program for the language sciences in the AI age, one that takes on the commentators’ concerns in order to produce a better and more robust science of both human language and of LMs.

关键词: language models, linguistics, language sciences, AI for science, statistical models, scientific insights, human language, research program

111. ❌ Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities

作者: Sathvik Nair, Colin Phillips 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09466v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要探讨语言模型（特别是大语言模型）在心理语言学中的应用和局限性，属于大模型在科学领域的应用研究。与’Large Language Models’高度相关（8分），因为论文明确讨论LLMs对心理语言学的贡献。与’Mechanistic Interpretability OR Explainable AI’有一定关联（5分），因为论文涉及模型解释和人类认知机制的比较。与’AI for Science’有一定关联（5分），因为论文属于AI在科学（心理学/语言学）领域的应用。其他关键词（如MoE、量化、推理方法等）与论文内容完全无关，均为0分。

!!! tip deepseek-chat TL;DR

该论文批判性地分析了语言模型在预测处理和心理语言学中的作用，指出仅靠机器估计的概率不足以解释人类语言处理，并提出了结合LLMs和心理语言学模型的未来研究方向。

摘要翻译

在马尔分析层次的视角下，我们审视并拓展了关于语言模型与语言处理的两个主张：其一，基于上下文预测后续语言信息是语言处理的核心；其二，若无大语言模型，心理语言学的诸多进展将难以实现。我们进一步展望了结合大语言模型与心理语言学模型优势的未来研究方向。

摘要 (Abstract)

Under the lens of Marr’s levels of analysis, we critique and extend two claims about language models (LMs) and language processing: first, that predicting upcoming linguistic information based on context is central to language processing, and second, that many advances in psycholinguistics would be impossible without large language models (LLMs). We further outline future directions that combine the strengths of LLMs with psycholinguistic models.

关键词: language models, large language models, psycholinguistics, predictive processing, Marr’s levels of analysis, language processing, cognitive modeling, AI for science

112. ❌ Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL

作者: Vishnu Murali, Anmol Gulati, Elias Lumer, Kevin Frank, Sindy Campagna, Vamse Kumar Subbiah 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09470v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM代理系统（Agentic Jackal）在文本到JQL转换任务中的应用，通过工具增强（Jira MCP服务器和JiraAnchor检索工具）解决语义歧义和实时验证问题。因此与’LLM Agents/Autonomous Agents/Agentic Workflow’和’Tool Use/Function Calling/API Tool Use’高度相关（10分），与’Large Language Models/LLMs/Foundation Models’高度相关（10分），因为评估了9个前沿LLM。与’Retrieval-Augmented Generation/RAG/Retrieval-Generation’有一定关联（5分），因为JiraAnchor使用嵌入相似性搜索进行语义检索。其他关键词如MoE、Scaling Laws、SFT等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对自然语言到Jira查询语言（JQL）转换的准确性问题，提出了Agentic Jackal代理系统，通过实时查询执行和语义检索工具显著提升了9个前沿LLM的性能，并发布了首个大规模执行基准Jackal。

摘要翻译

将自然语言转换为Jira查询语言（JQL）需要解决模糊的字段引用、特定实例的分类值以及复杂的布尔谓词问题。单次生成的大语言模型无法发现给定Jira实例中实际存在的分类值（例如组件名称或修复版本），也无法针对实时数据源验证生成的查询，这限制了其在转述或模糊请求上的准确性。目前尚无公开的、基于执行的自然语言到JQL映射基准。我们推出了Jackal，这是首个大规模的、基于执行的文本到JQL基准，它包含在一个拥有超过20万个事项的实时Jira实例上验证的10万个自然语言-JQL对。为了在Jackal上建立基线，我们提出了Agentic Jackal，这是一个工具增强的智能体，它通过Jira MCP服务器为LLM提供实时查询执行能力，并配备了JiraAnchor——一种通过基于嵌入的相似性搜索来解析自然语言中分类值提及的语义检索工具。在评估的9个前沿大语言模型中，单次生成模型在简短自然语言查询上的平均执行准确率仅为43.4%，这表明文本到JQL的转换仍是一个开放挑战。智能体方法改进了9个模型中的7个，在语言最具挑战性的变体上实现了9.0%的相对提升；在隔离JiraAnchor的受控消融实验中，分类值准确率从48.7%提升至71.7%，其中组件字段准确率从16.9%跃升至66.2%。我们的分析指出，固有的语义模糊性（如事项类型消歧和文本字段选择）是主要的失败模式，而非值解析错误，这为未来工作指明了具体方向。我们公开了该基准、所有智能体交互记录和评估代码，以支持可复现性研究。

摘要 (Abstract)

Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.

关键词: text-to-JQL, LLM agents, tool-augmented agents, live query execution, semantic retrieval, Jira benchmark, natural language to query, execution accuracy

113. ❌ Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

作者: Solomiia Bilyk, Volodymyr Getmanskyi, Taras Firman 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09418v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM的任务适应策略，包括AIR（基于规则归纳的指令修订）、提示优化、检索方法和微调，因此与’Large Language Models’、‘Post-training/SFT’和’Instruction Tuning’高度相关（10分）。论文提到检索方法（KNN retrieval），与’Retrieval-Augmented Generation’有一定关联（5分）。其他关键词如MoE、SLMs、Scaling Laws、RLHF等未在摘要中提及，与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文研究了大型语言模型（LLMs）的任务适应策略，提出了一种基于规则归纳的自动指令修订（AIR）方法，并通过实验发现不同适应方法的性能高度依赖于任务类型，没有单一方法在所有设置中占优。

摘要翻译

本文研究基于规则归纳的自动指令修订方法，旨在利用有限的任务特定示例使大语言模型适应下游任务。我们将该方法置于更广泛的适应策略背景下进行探讨，包括提示优化、基于检索的方法以及微调技术。随后，我们通过一套多样化基准测试对比这些方法，该测试体系旨在检验不同任务需求，如知识注入、结构化信息提取、标签重映射与逻辑推理。本文指出，适应性能高度依赖于任务特性：没有任何单一方法能在所有场景中占据绝对优势。在五项基准测试中，自动指令修订在标签重映射分类任务中表现最优或接近最优；KNN检索在闭卷问答任务中效果最佳；而微调则在结构化提取与事件顺序推理任务中占据主导地位。当任务行为可通过紧凑、可解释的指令规则捕捉时，自动指令修订最具潜力；而在以特定领域知识或数据集标注规律为主导的任务中，检索与微调方法仍保持优势。

摘要 (Abstract)

This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks using limited task-specific examples. We position AIR within the broader landscape of adaptation strategies, including prompt optimization, retrieval-based methods, and fine-tuning. We then compare these approaches across a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. The paper argues that adaptation performance is strongly task-dependent: no single method dominates across all settings. Across five benchmarks, AIR was strongest or near-best on label-remapping classification, while KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger in tasks dominated by source-specific knowledge or dataset-specific annotation regularities.

关键词: Automated Instruction Revision, LLM adaptation, task adaptation strategies, fine-tuning, retrieval-based methods, instruction rules, benchmark evaluation, task-dependent performance

114. ❌ UIPress: Bringing Optical Token Compression to UI-to-Code Generation

作者: Dasen Dai, Shuoqi Li, Ronghao Chen, Huacan Wang, Biao Wu, Qizhen Lan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09442v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文UIPress提出了一种用于UI-to-Code生成的视觉令牌压缩方法，主要涉及大模型（LLM）和参数高效微调（PEFT/LoRA）。论文基于Qwen3-VL-8B模型，因此与’Large Language Models’高度相关（10分）。方法中使用了LoRA进行微调，与’PEFT OR LoRA OR Parameter-efficient Fine-tuning’高度相关（10分）。压缩方法旨在减少视觉令牌数量以加速推理，与’Quantization OR Model Compression OR Low-bit Weights’有一定关联（5分），但更侧重于序列压缩而非权重压缩。同时，压缩带来了时间加速，与’Speculative Decoding OR Inference Acceleration’有一定关联（5分）。其他关键词如MoE、SLMs、Scaling Laws、RAG、CoT等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

UIPress提出了一种轻量级学习压缩模块，将UI截图中的约6700个视觉令牌压缩到256个，结合LoRA微调，在Design2Code基准上实现了CLIP分数提升7.5%和首次令牌生成速度提升9.1倍。

摘要翻译

UI到代码生成任务要求视觉语言模型（VLMs）从单张截图生成数千个结构化HTML/CSS标记，这使得视觉标记效率至关重要。现有压缩方法要么在推理时使用与任务无关的启发式方法选择标记，要么仅将低注意力特征置零而未实际缩短序列长度——这两种方法均未真正减少预填充延迟或适应UI截图中非均匀的信息密度。与此同时，光学（编码器端学习式）压缩在文档OCR任务中已展现出显著效果，但此前尚无研究将这一范式应用于UI到代码生成领域。我们提出UIPress，一种轻量级学习式压缩模块，将其嵌入Qwen3-VL-8B的冻结ViT编码器与大语言模型（LLM）解码器之间。UIPress结合深度可分离卷积、元素引导的空间重加权及Transformer细化模块，将约6,700个视觉标记压缩至固定的256个标记预算。配合在解码器端采用低秩适应（LoRA）以弥合表征差距，整个系统仅增加约2,170万个可训练参数（占80亿基础模型的0.26%）。在Design2Code基准测试中，基于相同基础模型与四种基线方法的公平比较下，采用256标记配置的UIPress获得0.8127的CLIP分数，较未压缩基线提升7.5%，比最强的推理时压缩方法提升4.6%，同时实现首标记生成速度9.1倍的加速。据我们所知，UIPress是首个针对UI到代码任务的编码器端学习式压缩方法。

摘要 (Abstract)

UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence – neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ${\sim}$6{,}700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ${\sim}$21.7M trainable parameters (0.26% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5% and the strongest inference-time method by +4.6%, while delivering 9.1$\times$ time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.

关键词: UI-to-Code generation, visual token compression, optical compression, Low-Rank Adaptation (LoRA), inference acceleration, Qwen3-VL-8B, Design2Code, encoder-side learned compression

115. ❌ Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

作者: Götz-Henrik Wiegand, Lorena Raichle, Rico Städeli, Tomas Hrycej, Bernhard Bermeitinger, Siegfried Handschuh 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09389v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究Transformer语言模型的dataset scaling laws，与’Scaling Laws AND Data Quality’高度相关（10分），直接研究数据集规模对性能的影响；涉及语言模型预训练（8分）和基础模型概念（8分）；其他关键词如MoE、SFT、RAG等未在研究中涉及，均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了在小型注意力解码器架构中数据集规模对Transformer语言模型性能的影响，发现使用约30%的训练数据即可达到约90%的全数据验证准确率，为计算和数据受限环境提供了实用的数据集规模平衡指导。

摘要翻译

训练Transformer语言模型成本高昂，因为其性能通常随数据集规模和计算预算的增加而提升。尽管缩放定律在大规模场景下描述了这一趋势，但在受控的小规模设定中其影响仍较少被探索。在本研究中，我们通过采用一个大幅简化的纯注意力解码器架构来隔离数据集规模的影响。通过在按2的幂次递增的子集上进行训练，我们观察到性能的平稳提升伴随着明显的收益递减现象，这与缩放定律的行为一致。仅使用约30%的训练数据即可达到全数据验证集词元级别准确率的约90%。这些结果为在受控且组件隔离的环境中理解数据集缩放提供了可操作的见解，并为计算资源和数据受限的环境（如小型研究实验室和探索性模型开发）中平衡数据集规模与计算成本提供了实用指导。

摘要 (Abstract)

Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled, smaller-scale settings remain less explored. In this work, we isolate dataset-size effects using a strongly reduced attention-only decoder architecture. By training on progressively larger power-of-two subsets, we observe smooth performance improvements accompanied by clear diminishing returns, consistent with scaling-law behavior. Using only about 30% of the training data is sufficient to reach approximately 90% of the full-data validation token-level accuracy. These results provide actionable insights into dataset scaling in a controlled, component-isolated setting and offer practical guidance for balancing dataset size and computational cost in compute- and data-restricted environments, such as small research labs and exploratory model development.

关键词: dataset scaling laws, Transformer language models, attention-only decoder, data efficiency, computational cost, validation accuracy, diminishing returns, power-of-two subsets

116. ❌ Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios

作者: Hui Liu, Bin Zou, Kecheng Chen, Jie Liu, Wenya Wang, Haoliang Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09377v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM路由系统（TRouter）和冷启动场景下的数据合成框架，仅与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分），因为全文围绕LLM性能差异和路由选择展开。其他关键词如MoE、SLMs、训练方法、推理优化、代理系统、科学AI应用等均未涉及，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对冷启动场景下LLM路由系统泛化能力差的问题，提出了一个多级任务轮廓引导的数据合成框架和任务感知路由器TRouter，有效提升了路由效用。

摘要翻译

大型语言模型（LLM）在不同任务和查询中的性能与计算成本存在显著差异，这促使路由系统通过选择模型来满足用户特定的成本-性能权衡需求。然而，现有路由方法在缺乏领域内训练数据的冷启动场景中泛化能力较差。为解决这一局限，我们提出了一种多层次任务画像引导的数据合成框架，该框架构建了层次化任务分类体系，并生成多样化问答对以近似测试时的查询分布。在此基础上，我们提出了TRouter——一种任务类型感知的路由方法，该方法通过潜在任务类型变量对查询相关的成本和性能进行建模，并利用从合成任务分类体系中推导的先验正则化。这一设计提升了TRouter在冷启动和领域内场景下的路由效用。在多个基准测试中，我们证明所提出的合成框架能有效缓解冷启动问题，且TRouter能够实现高效的大型语言模型路由。

摘要 (Abstract)

Large language models (LLMs) exhibit substantial variability in performance and computational cost across tasks and queries, motivating routing systems that select models to meet user-specific cost-performance trade-offs. However, existing routers generalize poorly in cold-start scenarios where in-domain training data is unavailable. We address this limitation with a multi-level task-profile-guided data synthesis framework that constructs a hierarchical task taxonomy and produces diverse question-answer pairs to approximate the test-time query distribution. Building on this, we introduce TRouter, a task-type-aware router approach that models query-conditioned cost and performance via latent task-type variables, with prior regularization derived from the synthesized task taxonomy. This design enhances TRouter’s routing utility under both cold-start and in-domain settings. Across multiple benchmarks, we show that our synthesis framework alleviates cold-start issues and that TRouter delivers effective LLM routing.

关键词: LLM routing, cold-start scenarios, task-aware router, data synthesis, task taxonomy, query distribution, performance-cost trade-off, TRouter

117. ❌ Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts

作者: Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, Kurt Stockinger 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09364v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究Vision-Language Models（VLMs）在视觉-语言冲突中的行为机制，属于大模型（LLMs）在视觉语言领域的应用研究，因此与’Large Language Models OR LLMs OR Foundation Models’相关（评分8.0）。论文核心是使用Logit Lens probing、activation patching等方法分析模型内部表示和因果机制，这直接属于’Mechanistic Interpretability OR Explainable AI’范畴（评分10.0）。其他关键词如MoE、Scaling Laws、Fine-tuning、RAG、Reasoning、Agents、Quantization等均未在论文中涉及，因此评分为0。论文未提及任何指定的专家作者。

!!! tip deepseek-chat TL;DR

该论文研究发现，当视觉语言模型（VLMs）面对视觉与先验知识冲突时（如蓝色香蕉），问题主要在于仲裁失败而非感知盲视，通过层间分析和因果干预证明模型能编码视觉证据但难以基于其做出正确响应，而早期层的激活引导可提升视觉基础能力。

摘要翻译

当视觉语言模型（VLM）看到一根蓝色香蕉并回答“黄色”时，问题在于感知还是决策？我们以十种不同规模的VLM为研究对象，揭示了编码与接地的分离现象：那些未能报告所见内容（因而给出错误答案）的模型，其视觉证据的编码强度与给出正确答案的模型相当。通过采用多模态仲裁交叉（MAC）分析及逐层Logit Lens探测，我们追踪了每个模型各层中视觉信号与先验信号的竞争过程。研究表明，视觉属性可从早期层线性解码（AUC > 0.86），且成功样本与失败样本的解码准确率几乎完全一致。然而，最终层逻辑值的差距——而非编码强度——能更好地预测接地结果，其相关性达。在探究VLM何时依据图像线索而非先验知识作答后，我们试图理解其中的因果关系。通过全序列激活修补实验，我们确立了因果机制：传统LLM可解释性研究中针对末位词元的干预对VLM无效；相反，在MAC识别的特定层替换完整词元序列可改变60%至84%的输出结果。部分词元分解显示，图像词元承载了几乎全部的因果影响，而文本词元的影响为零。通过扩展架构设计可完全消除剩余的结构差异。从诊断转向干预，我们发现无需训练的激活导向技术——包括线性导向与稀疏自编码器引导——在早期层的应用可将视觉接地能力提升最高+3.8%，但在某些场景中可能导致性能下降。总体而言，这些发现得出明确结论：VLM已具备良好的视觉感知能力，其核心挑战在于如何依据所见内容进行决策。针对性干预有助于弥合这一鸿沟。

摘要 (Abstract)

When a Vision-Language Model (VLM) sees a blue banana and answers “yellow”, is the problem of perception or arbitration? We explore the question in ten VLMs with various sizes and reveal an Encoding–Grounding Dissociation: models that fail to report what they see (and thus provide a wrong answer) still encode the visual evidence as strongly as models that provide the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes can be linearly decodable from early layers (AUC > 0.86). The accuracy remains nearly identical for both successful and failed samples. However, the gap in the final-layer logit – not the strength of encoding – better predicts grounding outcomes with a correlation of . After having studied when VLMs base their answers on image clues rather than prior knowledge, we want to understand the causal relationships. We establish causality through full-sequence activation patching. The standard last-token interventions in LLM interpretability do not affect VLMs. In contrast, replacing the full token sequence at layers identified by MAC alters 60 to 84% of outputs. Partial-token decomposition shows that image tokens carry almost all of the causal impact, while text tokens have none. Scaling addresses the remaining architectural differences to achieve perfect retention. Moving from diagnosis to intervention, we show that training-free activation steering – both linear and sparse autoencoder-guided – in early layers can improve visual grounding by up to +3.8% with degrading performance in some setups. Overall, these findings lead to a clear conclusion: VLMs already see well, but the challenge is acting on what they see. Targeted interventions can help to bridge this gap.

关键词: Vision-Language Models, Visual-Linguistic Conflicts, Multimodal Arbitration Crossover, Logit Lens, Activation Patching, Visual Grounding, Interpretability, Causal Intervention

118. ❌ EthicMind: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue

作者: Jiawen Deng, Wei Li, Wentao Zhang, Ziyun Jiao, Fuji Ren 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09265v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究对话系统中的伦理-情感对齐问题，与’Instruction Tuning OR Alignment OR Value Alignment’高度相关（10分），因为该框架专注于在推理时实现伦理和情感的对齐，这是对齐研究的核心应用。与’Large Language Models OR LLMs OR Foundation Models’有一定关联（8分），因为智能对话系统通常基于大语言模型，但论文未明确指定模型类型。其他关键词如MoE、SFT、RAG等均未在摘要中提及，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了EthicMind框架，用于在多轮对话中实现风险感知的伦理-情感对齐，实验表明其在高风险和道德模糊场景下比基线模型提供更一致的伦理指导和情感参与。

摘要翻译

智能对话系统正日益部署于情感与伦理敏感的场景中，在此类场景下，情感协调或伦理判断的失误均可能造成严重伤害。现有对话模型通常孤立地处理共情能力与伦理安全性，且往往难以在多轮交互中随伦理风险与用户情绪的动态变化而调整其行为。本文将对话中的伦理-情感对齐问题明确表述为一种逐轮次的决策问题，并提出\textsc{EthicMind}——一个风险感知的推理框架，可在多轮对话推理时实现该表述。在每一轮交互中，\textsc{EthicMind} 协同分析伦理风险信号与用户情绪，规划高层级回应策略，并生成上下文敏感的回复，以平衡伦理引导与情感投入，且无需额外的模型训练。为评估伦理复杂交互下的对齐行为，我们引入一种基于风险分层、多轮次的评估协议，并配合上下文感知的用户模拟流程。实验结果表明，与现有基线模型相比，\textsc{EthicMind} 能实现更一致的伦理引导与情感投入，尤其在高风险与道德模糊情境中表现更为突出。

摘要 (Abstract)

Intelligent dialogue systems are increasingly deployed in emotionally and ethically sensitive settings, where failures in either emotional attunement or ethical judgment can cause significant harm. Existing dialogue models typically address empathy and ethical safety in isolation, and often fail to adapt their behavior as ethical risk and user emotion evolve across multi-turn interactions. We formulate ethical-emotional alignment in dialogue as an explicit turn-level decision problem, and propose \textsc{EthicMind}, a risk-aware framework that implements this formulation in multi-turn dialogue at inference time. At each turn, \textsc{EthicMind} jointly analyzes ethical risk signals and user emotion, plans a high-level response strategy, and generates context-sensitive replies that balance ethical guidance with emotional engagement, without requiring additional model training. To evaluate alignment behavior under ethically complex interactions, we introduce a risk-stratified, multi-turn evaluation protocol with a context-aware user simulation procedure. Experimental results show that \textsc{EthicMind} achieves more consistent ethical guidance and emotional engagement than competitive baselines, particularly in high-risk and morally ambiguous scenarios.

关键词: ethical-emotional alignment, multi-turn dialogue, risk-aware framework, ethical risk, user emotion, dialogue systems, ethical guidance, emotional engagement

119. ❌ ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

作者: Shahar Levy, Eliya Habba, Reshef Mintz, Barak Raveh, Renana Keydar, Gabriel Stanovsky 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09237v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文ScheMatiQ的核心是使用LLM（大语言模型）从文档集合中提取结构化数据以回答研究问题，这直接涉及LLM的应用（高度相关，10分）。它通过LLM调用处理文档并生成结构化输出，与检索增强生成（RAG）有一定关联，因为涉及从文档中检索信息（5分）。论文在计算生物学等科学领域的应用与’AI for Science’相关（8分）。其他关键词如MoE、SFT、量化等未在摘要中提及，因此评分为0分。

!!! tip deepseek-chat TL;DR

ScheMatiQ利用大语言模型从文档集合中自动生成结构化数据以回答研究问题，并通过交互式界面支持领域专家在法学和计算生物学中进行实际分析。

摘要翻译

许多学科领域都会针对大规模文献集提出自然语言研究问题，其答案通常需要结构化证据支撑。传统方法依赖于人工设计标注框架并对语料库进行详尽标注，这一过程不仅耗时且容易出错。我们提出ScheMatiQ系统，该系统通过调用核心大语言模型，能够根据研究问题与语料库自动生成结构化框架与基于证据的数据库，并配备可引导和修正信息提取过程的网络交互界面。通过与领域专家合作，我们证明ScheMatiQ生成的输出能够有效支持法律与计算生物学领域的实际分析工作。我们将ScheMatiQ作为开源项目发布，并提供公共网络接口，邀请各学科专家将其应用于自有数据。所有资源，包括网站、源代码及演示视频，均可在以下网址获取：www.ScheMatiQ-ai.com

摘要 (Abstract)

Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which leverages calls to a backbone LLM to take a question and a corpus to produce a schema and a grounded database, with a web interface that lets steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com

关键词: LLM, structured data extraction, research questions, document collections, interactive schema discovery, computational biology, law, open source

作者: Passant Elchafei, Monorama Swain, Shahed Masoudian, Markus Schedl 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09174v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	15.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心研究RAG系统中的幻觉问题，开发了facet-level诊断框架分析证据使用和生成对齐，因此与’Retrieval-Augmented Generation’和’Hallucination Mitigation’高度相关（15分）。论文评估了GPT、Gemini、LLaMA等大模型，与’Large Language Models’相关（10分）。研究涉及医学QA，与’AI for Science’有一定关联（5分）。诊断框架提供解释性分析，与’Mechanistic Interpretability’有一定关联（5分）。其他关键词如MoE、SFT、量化等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了RAG系统中即使检索到相关证据仍产生幻觉的问题，通过提出facet-level诊断框架分析证据使用模式，发现幻觉主要源于生成过程中证据整合不当而非检索准确性。

摘要翻译

检索增强生成（RAG）旨在通过将答案建立在检索到的证据基础上来减少幻觉，但即使相关文档可用，幻觉答案仍然普遍存在。现有评估主要关注答案级或段落级准确性，对生成过程中证据的使用方式提供有限洞察。本研究引入了一个面向问答的层面级诊断框架，将每个输入问题分解为原子推理层面。针对每个层面，我们通过一个结合检索相关性与基于自然语言推理的忠实度评分的结构化“层面×文本块”矩阵，评估证据充分性与证据支撑度。为诊断证据使用情况，我们分析了三种受控推理模式：严格RAG（强制仅依赖检索证据）、宽松RAG（允许整合检索证据与参数知识）以及无检索的纯LLM生成。通过比较这些模式，可深入分析检索与生成的错位现象——即相关证据已被检索到但未在生成过程中被正确整合的情况。在医疗问答和HotpotQA数据集上，我们评估了三种开源与闭源大语言模型（GPT、Gemini和LLaMA），提供了可解释的诊断结果，揭示了反复出现的层面级失败模式，包括证据缺失、证据错位以及先验知识驱动的证据覆盖。我们的结果表明，RAG系统中的幻觉较少由检索准确性驱动，而更多源于生成过程中对检索证据的整合方式；层面级分析揭示了在答案级评估下被掩盖的系统性证据覆盖与错位模式。

摘要 (Abstract)

Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.

关键词: Retrieval-Augmented Generation, Hallucination, Evidence Integration, Facet-level Diagnostics, Medical QA, LLM Evaluation, Retrieval-Generation Misalignment, Interpretable Analysis

121. ❌ SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

作者: Han Luo, Guy Laban 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09212v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文SPASM专注于大语言模型在多轮对话生成中的应用，特别是解决LLM代理在长对话中出现的角色漂移、身份混淆和回声问题。核心贡献是提出了一个稳定的、模块化的框架，包括角色创建、对话生成和终止检测，并引入了Egocentric Context Projection（ECP）技术来提升稳定性而不改变模型权重。因此，论文与’Large Language Models’高度相关（10分），因为它直接研究LLM在多轮对话中的应用；与’LLM Agents’高度相关（10分），因为它涉及LLM驱动的自主代理模拟；与’Multi-agent Systems’高度相关（10分），因为它处理Client-Responder配对的多代理协调。其他关键词如MoE、SLMs、Scaling Laws、训练技术、推理优化、AI for Science等，论文未涉及或仅间接提及，因此评分为0分。

!!! tip deepseek-chat TL;DR

论文SPASM解决了大语言模型在多轮对话生成中出现的角色漂移和回声问题，通过引入Egocentric Context Projection技术，显著提升了对话代理的稳定性，并在多个LLM骨干上验证了其有效性。

摘要翻译

大型语言模型正日益应用于多轮对话场景，如教学辅导、技术支持与心理疏导等，其可靠性取决于在长程交互中能否保持角色、人设与目标的一致性。当使用LLM生成用于训练和评估的合成对话时，这一要求变得尤为关键，因为LLM与LLM之间的对话可能累积身份相关的故障，例如人设漂移、角色混淆以及“回声效应”——即一个智能体逐渐镜像其对话伙伴的行为。本文提出SPASM（面向多轮对话生成的稳定人设驱动智能体模拟框架），这是一个模块化、以稳定性优先的框架，将模拟过程分解为：（i）通过模式采样、合理性验证和自然语言人设构建实现人设创建；（ii）客户端—应答者对话生成；（iii）用于连贯终止的结束检测。为了在不改变模型权重的前提下提升长程稳定性，我们提出自我中心语境投射（Egocentric Context Projection, ECP）：对话历史以视角无关的表示形式存储，并在生成前确定性地投射到每个智能体的自我中心视图中。基于三种LLM骨干模型（GPT-4o-mini、DeepSeek-V3.2、Qwen-Plus）和九组客户端—应答者配对，我们构建了一个包含4,500个人设和45,000段对话的数据集（每组配对包含500个人设，每人设生成10段对话）。消融实验表明，ECP显著减少了人设漂移，并在人工验证下消除了回声效应；嵌入分析还原了人设结构，并揭示了应答者主导的强交互几何特征。代码已发布于https://github.com/lhannnn/SPASM。

摘要 (Abstract)

Large language models are increasingly deployed in multi-turn settings such as tutoring, support, and counseling, where reliability depends on preserving consistent roles, personas, and goals across long horizons. This requirement becomes critical when LLMs are used to generate synthetic dialogues for training and evaluation, since LLM–LLM conversations can accumulate identity-related failures such as persona drift, role confusion, and “echoing”, where one agent gradually mirrors its partner. We introduce SPASM (Stable Persona-driven Agent Simulation for Multi-turn dialogue generation), a modular, stability-first framework that decomposes simulation into (i) persona creation via schema sampling, plausibility validation, and natural-language persona crafting, (ii) Client–Responder dialogue generation, and (iii) termination detection for coherent stopping. To improve long-horizon stability without changing model weights, we propose Egocentric Context Projection (ECP): dialogue history is stored in a perspective-agnostic representation and deterministically projected into each agent’s egocentric view before generation. Across three LLM backbones (GPT-4o-mini, DeepSeek-V3.2, Qwen-Plus) and nine Client–Responder pairings, we construct a dataset of 4,500 personas and 45,000 conversations (500 personas X 10 conversations per pairing). Ablations show ECP substantially reduces persona drift and, under human validation, eliminates echoing; embedding analyses recover persona structure and reveal strong responder-driven interaction geometry. Our code is available at https://github.com/lhannnn/SPASM.

关键词: Large Language Models, Multi-turn Dialogue Generation, Agent Simulation, Persona Drift, Egocentric Context Projection, Stability Framework, LLM Agents, Multi-agent Systems

122. ❌ Think Less, Know More: State-Aware Reasoning Compression with Knowledge Guidance for Efficient Reasoning

作者: Yi Sui, Chaozhuo Li, Dawei Song 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09150v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文聚焦于大模型推理效率优化，核心贡献是STACK框架，通过状态感知的推理压缩和知识引导来减少Chain-of-Thought的冗余步骤。高度相关的关键词包括：‘Chain of Thought’（核心研究对象，15分）、‘Large Language Models’（研究基础，10分）、‘Retrieval-Augmented Generation’（知识引导机制，10分）、‘RLHF/DPO’（训练策略使用PPO和DPO，10分）、‘Speculative Decoding/Inference Acceleration’（目标为降低推理延迟，10分）。其他关键词如MoE、量化、科学AI等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文针对大推理模型在复杂任务中因过度思考导致推理步骤冗长和推理延迟高的问题，提出了状态感知推理压缩框架STACK，通过知识引导和动态压缩策略，在三个数学推理基准上实现了响应长度减少59.9%且准确率提升4.8个百分点的优越精度-效率平衡。

摘要翻译

大型推理模型（LRMs）通过利用长链思维（Chain-of-Thought，CoT）在复杂任务上实现了强劲性能，但常因过度思考而导致推理步骤冗长和推理延迟高企。现有的CoT压缩方法难以平衡准确性与效率，且缺乏对冗余和推理偏差的细粒度、步骤级适应。为此，我们提出基于知识引导的状态感知推理压缩框架（State-Aware Reasoning Compression with Knowledge Guidance，STACK），该框架通过显式建模阶段特定的冗余来源，并结合检索增强的引导，实现逐步的CoT压缩。STACK构建在线长短对比样本，并动态切换两种压缩模式：针对不确定或存在偏差的推理状态采用知识引导压缩，而对过长但自信的状态则采用自提示压缩，辅以基于答案收敛的早停机制以抑制冗余验证。我们进一步提出结合近端策略优化（Proximal Policy Optimization，PPO）与直接偏好优化（Direct Preference Optimization，DPO）的奖励差异驱动训练策略，使模型能够学习状态条件化的压缩策略。在三个数学推理基准测试上的实验表明，STACK实现了更优的准确率-效率平衡，相较于现有方法，平均响应长度降低59.9%，同时准确率提升4.8个百分点。

摘要 (Abstract)

Large Reasoning Models (LRMs) achieve strong performance on complex tasks by leveraging long Chain-of-Thought (CoT), but often suffer from overthinking, leading to excessive reasoning steps and high inference latency. Existing CoT compression methods struggle to balance accuracy and efficiency, and lack fine-grained, step-level adaptation to redundancy and reasoning bias. Therefore, we propose State-Aware Reasoning Compression with Knowledge Guidance (STACK), a framework that performs step-wise CoT compression by explicitly modeling stage-specific redundancy sources and integrating with a retrieval-augmented guidance. STACK constructs online long-short contrastive samples and dynamically switches between knowledge-guided compression for uncertain or biased reasoning state and self-prompted compression for overly long but confident state, complemented by an answer-convergence-based early stopping mechanism to suppress redundant verification. We further propose a reward-difference-driven training strategy by combining Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), enabling models to learn state-conditioned compression strategies. Experiments on three mathematical reasoning benchmarks show that STACK achieves a superior accuracy-efficiency balance, reducing average response length by 59.9% while improving accuracy by 4.8 points over existing methods.

关键词: Reasoning Compression, Chain-of-Thought, Inference Efficiency, Retrieval-Augmented Guidance, State-Aware Compression, Direct Preference Optimization, Mathematical Reasoning, Overthinking Reduction

123. ❌ Prototype-Regularized Federated Learning for Cross-Domain Aspect Sentiment Triplet Extraction

作者: Zongming Cai, Jianhang Tang, Zhenyong Zhang, Jinghui Qin, Kebing Jin, Hankz Hankui Zhuo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09123v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是面向方面情感三元组提取（ASTE）的原型正则化联邦学习框架，属于自然语言处理中的情感分析子领域。虽然涉及跨领域学习和联邦学习，但论文未提及任何大模型、深度学习技术原理创新或大模型在不同领域的应用。所有关键词均与大模型技术、训练方法、推理优化、对齐技术、代理系统等直接相关，而本文专注于传统NLP任务和联邦学习框架，与这些关键词无直接关联。

!!! tip deepseek-chat TL;DR

本文提出了一种原型正则化联邦学习框架（PCD-SpanProto），用于解决跨领域方面情感三元组提取中的数据隐私和领域异质性问题，实验表明该方法优于基线并降低了通信成本。

摘要翻译

方面情感三元组抽取（Aspect Sentiment Triplet Extraction, ASTE）旨在从句子中提取所有由方面词、观点词和情感极性构成的情感三元组。现有方法通常仅在单一数据集上独立训练，未能联合捕捉跨领域共享的共性特征表示。此外，数据隐私限制阻碍了集中式数据聚合。为应对这些挑战，我们提出基于原型的跨领域跨度原型抽取（Prototype-based Cross-Domain Span Prototype extraction, PCD-SpanProto），这是一个通过原型正则化的联邦学习框架，使分布式客户端能够交换类别级原型而非完整的模型参数。具体而言，我们设计了加权性能感知聚合策略和对比正则化模块，以在领域异构性下优化全局原型，并促进客户端间类内紧凑性与类间分离性的提升。在四个ASTE数据集上的大量实验表明，我们的方法优于基线模型，同时降低了通信成本，验证了基于原型的跨领域知识迁移的有效性。

摘要 (Abstract)

Aspect Sentiment Triplet Extraction (ASTE) aims to extract all sentiment triplets of aspect terms, opinion terms, and sentiment polarities from a sentence. Existing methods are typically trained on individual datasets in isolation, failing to jointly capture the common feature representations shared across domains. Moreover, data privacy constraints prevent centralized data aggregation. To address these challenges, we propose Prototype-based Cross-Domain Span Prototype extraction (PCD-SpanProto), a prototype-regularized federated learning framework to enable distributed clients to exchange class-level prototypes instead of full model parameters. Specifically, we design a weighted performance-aware aggregation strategy and a contrastive regularization module to improve the global prototype under domain heterogeneity and the promotion between intra-class compactness and inter-class separability across clients. Extensive experiments on four ASTE datasets demonstrate that our method outperforms baselines and reduces communication costs, validating the effectiveness of prototype-based cross-domain knowledge transfer.

关键词: Aspect Sentiment Triplet Extraction, Federated Learning, Cross-domain Learning, Prototype-based Learning, Knowledge Transfer, Domain Heterogeneity, Communication Efficiency, Sentiment Analysis

124. ❌ Few-Shot Contrastive Adaptation for Audio Abuse Detection in Low-Resource Indic Languages

作者: Aditya Narayan Sankaran, Reza Farahbakhsh, Noel Crespi 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09094v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究基于对比语言-音频预训练（CLAP）模型进行音频滥用检测，属于预训练模型在特定任务上的应用与适应。与’Pre-training OR Continual Pre-training OR Domain Adaptation’高度相关（8分），因为论文核心涉及CLAP预训练模型的跨语言适应。与’Post-training OR Supervised Fine-tuning OR SFT’有一定关联（5分），因为使用了少量样本的监督对比适应（few-shot supervised contrastive adaptation），这属于微调范畴。其他关键词主要涉及大语言模型（LLM）的特定技术、推理、对齐、压缩等，而本文聚焦音频-文本对比预训练模型（CLAP）在音频分类任务中的应用，未涉及LLM、MoE、推理、代理、量化等主题，因此相关度为0。

!!! tip deepseek-chat TL;DR

该论文研究了基于对比语言-音频预训练（CLAP）模型在低资源印度语言中进行音频滥用检测，发现CLAP能产生强大的跨语言音频表示，轻量级适应即可达到与全监督系统竞争的性能，但适应效果具有语言依赖性。

摘要翻译

随着社交媒体日益转向语音交互，辱骂性言论检测在多语言及低资源环境中的重要性愈发凸显。当前多数系统依赖于自动语音识别（ASR）与基于文本的仇恨言论分类串联的流程，但该流程易受转写错误影响，且丢弃了语音中的韵律信息。本研究探讨对比式语言-音频预训练模型（Contrastive Language-Audio Pre-training, CLAP）能否直接基于音频支持辱骂性言论检测。利用ADIMA数据集，我们在跨语言及留一语言设定下，通过小样本监督对比适应方法评估基于CLAP的表征，并辅以零样本提示作为补充分析。实验结果表明：CLAP在十种印度语言中能产生强大的跨语言音频表征；仅通过轻量级投影层适应即可达到与完整训练数据上全监督系统相当的性能。然而，小样本适应的效果具有语言依赖性，且不随样本数量单调增长。这些发现表明，对比式音频-文本模型为低资源环境下的跨语言音频辱骂检测提供了有前景的基础，同时也揭示出模型迁移仍存在局限性，并在重要维度上表现出语言特异性。

摘要 (Abstract)

Abusive speech detection is becoming increasingly important as social media shifts towards voice-based interaction, particularly in multilingual and low-resource settings. Most current systems rely on automatic speech recognition (ASR) followed by text-based hate speech classification, but this pipeline is vulnerable to transcription errors and discards prosodic information carried in speech. We investigate whether Contrastive Language-Audio Pre-training (CLAP) can support abusive speech detection directly from audio. Using the ADIMA dataset, we evaluate CLAP-based representations under few-shot supervised contrastive adaptation in cross-lingual and leave-one-language-out settings, with zero-shot prompting included as an auxiliary analysis. Our results show that CLAP yields strong cross-lingual audio representations across ten Indic languages, and that lightweight projection-only adaptation achieves competitive performance with respect to fully supervised systems trained on complete training data. However, the benefits of few-shot adaptation are language-dependent and not monotonic with shot size. These findings suggest that contrastive audio-text models provide a promising basis for cross-lingual audio abuse detection in low-resource settings, while also indicating that transfer remains incomplete and language-specific in important ways.

关键词: Audio abuse detection, Contrastive Language-Audio Pre-training (CLAP), Few-shot adaptation, Cross-lingual, Low-resource languages, Indic languages, Supervised contrastive adaptation, ADIMA dataset

125. ❌ Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency

作者: Shu Yang, Zihao Zhou, Di Wang, Wenda Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09075v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs的层次化指令对齐问题，与’Large Language Models’和’Instruction Tuning/Alignment’高度相关（10分）。论文涉及工具使用和智能体工作流（8分），采用约束求解推理方法，与’Chain of Thought’和’System 2 Thinking’相关（8分）。论文提到检索增强生成和事实性（5分），以及监督微调（5分）。其他关键词如MoE、量化、科学AI等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对LLMs在现实应用中遇到的良性指令冲突问题，提出了神经符号层次对齐方法，通过约束求解推理和训练时蒸馏，显著提升了模型在层次化指令遵循下的性能，同时保持了任务效用。

摘要翻译

大型语言模型日益需要在来自不同权威层级异构源的多元指令下运行，这些指令包括系统策略、用户请求、工具输出以及检索上下文。尽管先前关于指令层级的研究强调了遵循指令优先级的重要性，但其主要聚焦于对抗性攻击，忽视了现实应用中常见且非恶性的指令冲突。在此类场景中，模型不仅需要避免安全违规，还必须在指令部分或隐式冲突时保持任务效用和行为一致性。我们提出神经符号层级对齐方法，通过显式建模并强化指令优先级来实现层级化指令遵循。在推理阶段，我们引入求解器引导的推理机制，将指令解析构建为约束满足问题，使模型能够在层级约束下推导出最大程度一致的可适用指令集。在训练阶段，该方法利用自动构建的监督信号，将基于求解器的决策蒸馏至模型参数中。我们在规则遵循、任务执行、工具使用和安全性等场景中对本方法进行评估，涵盖单轮和多轮交互，结果表明该方法在此类冲突下显著提升了性能，同时在基准场景中保持了具有竞争力的效用。

摘要 (Abstract)

Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real-world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro-Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction-following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver-guided reasoning that formulates instruction resolution as a constraint satisfaction problem, enabling the model to derive a maximally consistent set of applicable instructions under hierarchical constraints. At training time, NSHA distills solver-based decisions into model parameters using automatically constructed supervision. We evaluate our approach on rule following, task execution, tool use, and safety, covering both single-turn and multi-turn interactions, and show that NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.

关键词: Hierarchical Alignment, Instruction-Following, Logical Consistency, Constraint Satisfaction, Neuro-Symbolic, Tool Use, Multi-turn Interactions, Behavioral Consistency

126. ❌ Anchored Sliding Window: Toward Robust and Imperceptible Linguistic Steganography

作者: Ruiyi Yan, Shiao Meng, Yugo Murawaki 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09066v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究基于语言模型的隐写术，提出锚定滑动窗口（ASW）框架以提高文本质量和鲁棒性。核心创新在于上下文窗口管理，与’Context Window Extension OR Long Context LLMs’高度相关（10分），因为ASW通过锚定提示和桥接上下文优化了上下文窗口的使用。论文使用语言模型，与’Large Language Models OR LLMs OR Foundation Models’有一定关联（8分），但未涉及其他关键词的具体技术。

!!! tip deepseek-chat TL;DR

该论文针对语言模型隐写术在文本传输中易受修改影响的问题，提出了锚定滑动窗口框架，通过优化上下文窗口显著提升了文本质量、不可感知性和鲁棒性。

摘要翻译

基于语言模型的隐写术通常假设隐写文本在传输过程中未经改动，使其即使面对微小修改也极为脆弱。先前研究通过限制上下文窗口来缓解这种脆弱性，但这严重损害了文本质量。本文提出锚定滑动窗口框架，以提升不可感知性与鲁棒性。除最新生成的词元外，该框架将提示词与桥接上下文锚定在上下文窗口内，促使模型对被排除的词元进行补偿。我们将桥接上下文的优化问题构建为提示词蒸馏的变体，并进一步通过自蒸馏策略对其进行扩展。实验表明，在不同设置下，我们的锚定滑动窗口框架在文本质量、不可感知性和鲁棒性方面均显著且持续优于基线方法。代码发布于github.com/ryehr/ASW_steganography。

摘要 (Abstract)

Linguistic steganography based on language models typically assumes that steganographic texts are transmitted without alteration, making them fragile to even minor modifications. While previous work mitigates this fragility by limiting the context window, it significantly compromises text quality. In this paper, we propose the anchored sliding window (ASW) framework to improve imperceptibility and robustness. In addition to the latest tokens, the prompt and a bridge context are anchored within the context window, encouraging the model to compensate for the excluded tokens. We formulate the optimization of the bridge context as a variant of prompt distillation, which we further extend using self-distillation strategies. Experiments show that our ASW significantly and consistently outperforms the baseline method in text quality, imperceptibility, and robustness across diverse settings. The code is available at github.com/ryehr/ASW_steganography.

关键词: Linguistic Steganography, Language Models, Anchored Sliding Window, Context Window, Robustness, Imperceptibility, Prompt Distillation, Self-distillation

127. ❌ SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

作者: Xiyang Huang, Jiawei Lin, Keying Wu, Jiaxin Huang, Kailai Yang, Renxiong Wei, Cheng zeng, Jiayi Xiang, Ziyan Kuang, Min Peng, Qianqian Xie, Sophia Ananiadou 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09037v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文主要研究多模态大语言模型（MLLMs）在临床技能视频中评估程序正确性的能力，属于大模型在生物医学领域的应用创新，因此与’Large Language Models OR LLMs OR Foundation Models’（8分）和’AI for Science OR Bioinformatics OR Cheminformatics’（10分）高度相关；其他关键词如MoE、量化、推理加速、对齐等均未在摘要中提及或涉及，故评0分。

!!! tip deepseek-chat TL;DR

该论文提出了SiMing-Bench基准，用于评估多模态大语言模型从临床技能视频中判断程序正确性的能力，发现当前模型与医生判断的一致性较弱，且全局评估高估了模型的程序判断能力。

摘要翻译

当前针对多模态大语言模型（MLLMs）的视频基准测试主要关注事件识别、时序排序和长上下文记忆能力，却忽视了一项专家级流程判断所需的更高阶能力：追踪持续交互如何更新流程状态，并由此判定后续动作的正确性。为此，我们推出SiMing-Bench，首个基于完整临床技能视频评估此项能力的基准测试。该基准聚焦于依据标准化评分细则，对交互驱动的状态更新是否在整个工作流程中保持流程正确性进行过程级判断。SiMing-Bench通过SiMing-Score实现具体化，这是一个由医师标注的真实临床技能考核视频数据集，涵盖心肺复苏（cardiopulmonary resuscitation）、自动体外除颤器（automated external defibrillator）操作和球囊面罩通气（bag-mask ventilation）等项目，每个视频均配有标准化的分步评分细则及双专家标注标签。在对多种开源与闭源MLLMs的测试中，我们发现其判断结果与医师评估的一致性持续偏低。此外，即使整体流程层面的相关性看似可接受，模型在评分细则定义的中间步骤上仍表现薄弱，这表明粗略的全局评估严重高估了当前模型的流程判断能力。通过二元步骤判断和步骤对齐视频片段的进一步分析表明，瓶颈不仅在于细粒度评分或时序定位，更在于如何建模持续交互随时间更新流程状态的过程。

摘要 (Abstract)

Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models’ procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.

关键词: multimodal large language models, clinical skill videos, procedural correctness, benchmark evaluation, state tracking, interaction-driven updates, medical AI, rubric-grounded judgment

128. ❌ Testing the Assumptions of Active Learning for Translation Tasks with Few Samples

作者: Lorenzo Jaime Yu Flores, Cesare Spinoso di-Piano, Ori Ernst, David Ifeoluwa Adelani, Jackie Chi Kit Cheung 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08977v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究主动学习（AL）在翻译任务中的假设检验，关注训练数据选择策略（信息性和多样性）与测试性能的关系。所有关键词均涉及大模型、深度学习技术原理或特定AI应用领域（如生物信息学），而本文专注于传统机器学习中的主动学习算法评估，未涉及大模型、深度学习或科学AI应用，因此所有关键词相关度为0。

!!! tip deepseek-chat TL;DR

本文研究了主动学习在少量样本翻译任务中表现不佳的原因，发现其核心假设（训练数据的信息性和多样性）与测试性能无关，而训练样本顺序和预训练数据交互影响更大。

摘要翻译

主动学习（Active Learning, AL）是一种通过选择未标注样本进行标注以提升模型在测试集上性能的训练范式，在仅能标注有限数量样本的场景中尤为有用。这类算法通常通过优化待标注训练数据的信息量和多样性来实现目标。近期研究发现，在使用100-500个样本时，主动学习策略在多种语言生成任务中未能超越随机采样效果。为探究主动学习在仅使用少量样本时表现不佳的原因，我们检验了其核心假设是否成立。研究发现，主动学习所优化的训练数据信息量与多样性均与测试集性能无关；相反，训练样本的顺序、与预训练数据的交互等因素对性能影响更为显著。这表明未来的主动学习方法必须考虑这些因素，才能在极少量样本条件下有效工作。

摘要 (Abstract)

Active learning (AL) is a training paradigm for selecting unlabeled samples for annotation to improve model performance on a test set, which is useful when only a limited number of samples can be annotated. These algorithms often work by optimizing for the informativeness and diversity of the training data to be annotated. Recent work found that AL strategies fail to outperform random sampling on various language generation tasks when using 100-500 samples. To understand AL’s poor performance when only using few samples, we investigate whether the core assumptions underlying AL strategies hold. We find that neither the informativeness nor diversity of the training data, which AL strategies optimize for, are correlated with test set performance. Instead, factors like the ordering of the training samples and interactions with pre-training data have a larger impact on performance. This suggests that future AL methods must take these factors into account in order to work with very few samples.

关键词: Active Learning, Translation Tasks, Few Samples, Informativeness, Diversity, Training Data Selection, Pre-training Data, Model Performance

129. ❌ Quantisation Reshapes the Metacognitive Geometry of Language Models

作者: Jon-Paul Cacioli 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08976v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究量化（quantisation）对LLM元认知几何结构的影响，直接相关关键词为’Quantization OR Model Compression OR Low-bit Weights’（10分）和’Large Language Models OR LLMs OR Foundation Models’（10分）。论文明确使用SFT进行实验，因此’Post-training OR Supervised Fine-tuning OR SFT’得10分。其他关键词如MoE、SLMs、Scaling Laws、RAG、RLHF等均未在摘要中提及或相关，得0分。

!!! tip deepseek-chat TL;DR

该研究发现模型量化会重塑LLM的领域级元认知效率结构而非均匀降低，且基于M-ratio的诊断结果在不同量化格式间不具可迁移性，而AUROC_2指标则保持稳定。

摘要翻译

本研究发现，模型量化并非均匀削弱大语言模型的元认知效率，而是重构了其领域层面的元认知结构。通过对Llama-3-8B-Instruct模型在Q5_K_M与f16两种精度下进行相同3000道问题的评估，我们发现四种知识领域的M-ratio（元认知效率比）分布在两种格式间完全不相关（斯皮尔曼相关系数rho = 0.00）。艺术与文学领域从最差监控状态（Q5_K_M下M-ratio = 0.606）转变为最佳监控状态（f16下1.542），而地理领域则从良好监控（1.210）转为监控不足（0.798）。然而，Type-2 AUROC（第二类接收者操作特征曲线下面积）分布在两种格式间完全稳定（rho = 1.00），表明重构效应仅存在于M-ratio标准化过程，而非底层判别信号本身。
这一发现源于一项通过领域条件化训练提升元认知能力的预注册研究。我们针对诊断出的弱势领域设计了置信度放大的监督微调（SFT），并设置了等量资源的无关领域对照组与错误处方对照组。四项验证性假设均未成立（10000次自助重采样，随机种子=42）。训练成功重塑了置信度分布，将科学领域的自然语言处理差距从0.076扩大至0.152，但未能改善元认知敏感度指标（meta-d’），因为诊断出的领域特征未能在不同量化格式间保持稳定。
任何依赖领域级M-ratio分布的系统都存在对推理格式的隐性依赖。采用AUROC_2（第二类AUROC）的系统更具稳定性。我们已公开全部代码、预注册方案及试验级数据。

摘要 (Abstract)

We report that model quantisation restructures domain-level metacognitive efficiency in LLMs rather than degrading it uniformly. Evaluating Llama-3-8B-Instruct on the same 3,000 questions at Q5_K_M and f16 precision, we find that M-ratio profiles across four knowledge domains are uncorrelated between formats (Spearman rho = 0.00). Arts & Literature moves from worst-monitored (M-ratio = 0.606 at Q5_K_M) to best-monitored (1.542 at f16). Geography moves from well-monitored (1.210) to under-monitored (0.798). However, Type-2 AUROC profiles are perfectly stable across formats (rho = 1.00), localising the restructuring to the M-ratio normalisation rather than the underlying discrimination signal. This finding emerged from a pre-registered attempt to improve metacognition through domain-conditional training. We prescribed confidence-amplification SFT for the diagnosed weak domain, with matched-budget agnostic and wrong-prescription controls. All four confirmatory hypotheses were null (10,000 bootstrap resamples, seed = 42). The training successfully reshaped confidence distributions, doubling the NLP gap in Science from 0.076 to 0.152, but did not improve meta-d’ because the diagnostic profile did not transfer across formats. Any system relying on domain-level M-ratio profiles has an unexamined dependency on inference format. Systems using AUROC_2 are safer. We release all code, pre-registrations, and trial-level data.

关键词: quantisation, LLMs, metacognition, M-ratio, AUROC, domain-level, SFT, inference format

130. ❌ Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning

作者: Lorenzo Jaime Yu Flores, Cesare Spinoso di-Piano, Jackie Chi Kit Cheung 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08974v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究监督微调（SFT）对语言模型置信度分数的影响，因此与’Post-training OR Supervised Fine-tuning OR SFT’高度相关（10分）。置信度分数用于检测幻觉和评估预测质量，与’Hallucination Mitigation OR Factuality OR Truthfulness’相关（8分）。研究涉及不确定性量化和置信度行为分析，与’Mechanistic Interpretability OR Explainable AI’有一定关联（5分）。论文讨论语言模型置信度，与’Large Language Models OR LLMs OR Foundation Models’相关（8分）。其他关键词如MoE、量化、RAG等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了监督微调如何影响语言模型置信度分数与输出质量之间的相关性，发现微调后相关性下降，导致置信度分数在下游任务中的实用性降低。

摘要翻译

不确定性量化是一组用于衡量语言模型置信度的技术。例如，这些技术可用于检测幻觉或提醒用户审查不确定的预测。为使其具有实用性，置信度分数必须与输出质量相关联。然而，近期研究发现，微调可能影响置信度分数与质量之间的相关性。因此，我们探究置信度分数的内在行为，以理解其对监督微调（Supervised Fine-Tuning, SFT）的敏感性。我们发现，在监督微调后，多种置信度分数的相关性出现下降，这可能源于置信度分数因输出质量之外的因素（例如输出与训练分布的相似性）而产生变化。通过一项案例研究，我们展示了若未能解决这种错误关联，将如何降低置信度分数在下游任务中的实用性。我们的研究结果表明，未经测试的置信度指标无法直接投入使用，这促使我们需要开发对微调更具鲁棒性的评估指标。

摘要 (Abstract)

Uncertainty quantification is a set of techniques that measure confidence in language models. They can be used, for example, to detect hallucinations or alert users to review uncertain predictions. To be useful, these confidence scores must be correlated with the quality of the output. However, recent work found that fine-tuning can affect the correlation between confidence scores and quality. Hence, we investigate the underlying behavior of confidence scores to understand its sensitivity to supervised fine-tuning (SFT). We find that post-SFT, the correlation of various confidence scores degrades, which can stem from changes in confidence scores due to factors other than the output quality, such as the output’s similarity to the training distribution. We demonstrate via a case study how failing to address this miscorrelation reduces the usefulness of the confidence scores on a downstream task. Our findings show how confidence metrics cannot be used off-the-shelf without testing, and motivate the need for developing metrics which are more robust to fine-tuning.

关键词: confidence scores, supervised fine-tuning, uncertainty quantification, language models, hallucination detection, output quality, correlation degradation, downstream task

131. ❌ Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models

作者: Shun Zou, Yong Wang, Zehui Chen, Lin Chen, Chongyang Tao, Feng Zhao, Xiangxiang Chu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08964v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究扩散大语言模型（dLLMs）的解码策略创新，与’Large Language Models’高度相关（10分），因为论文明确研究dLLMs作为ARMs的替代方案。与’Speculative Decoding OR Inference Acceleration’高度相关（10分），因为论文提出AHD解码策略，旨在提高推理效率（减少80%解码步骤），属于推理加速范畴。其他关键词如MoE、SFT、RAG、量化等均未在论文中涉及，因此评分为0分。

!!! tip deepseek-chat TL;DR

论文针对扩散大语言模型（dLLMs）中半自回归解码存在的块约束问题，提出了一种基于锚点的历史稳定解码策略（AHD），在语言、视觉-语言和音频-语言领域实验中同时提升了模型性能和推理效率。

摘要翻译

扩散大语言模型（dLLMs）近期已成为自回归大语言模型（ARMs）的一种前景广阔的替代方案。半自回归解码因其优越性能，被广泛应用于基础dLLMs及先进解码策略中。然而，我们的观察发现，半自回归解码存在固有的块约束问题，这导致许多跨块稳定令牌的解码被不必要地延迟。为应对这一挑战，我们系统研究了稳定令牌的识别，并提出了三个关键发现：（1）简单的前瞻解码不可靠；（2）令牌稳定性与收敛趋势密切相关；（3）历史信息处于孤立状态。基于这些发现，我们提出了基于锚点的历史稳定解码（Anchor-based History-stable Decoding, AHD），这是一种无需训练、即插即用的动态解码策略。具体而言，AHD通过动态锚点实时监测令牌的稳定性趋势。一旦令牌达到稳定状态，即启动早期跨块解码以提升效率与性能。在语言、视觉-语言及音频-语言领域的广泛实验表明，AHD能同时提升模型性能与推理效率。值得注意的是，AHD有效逆转了现有先进解码加速策略中常见的性能下降现象。例如，在BBH基准测试中，我们的方法在减少80%解码步骤的同时，将性能提升了3.67%。

摘要 (Abstract)

Diffusion Large Language Models (dLLMs) have recently become a promising alternative to autoregressive large language models (ARMs). Semi-autoregressive (Semi-AR) decoding is widely employed in base dLLMs and advanced decoding strategies due to its superior performance. However, our observations reveal that Semi-AR decoding suffers from inherent block constraints, which cause the decoding of many cross-block stable tokens to be unnecessarily delayed. To address this challenge, we systematically investigate the identification of stable tokens and present three key findings: (1) naive lookahead decoding is unreliable, (2) token stability closely correlates with convergence trend, and (3) historical information is isolated. Building on these insights, we propose Anchor-based History-stable Decoding (AHD), a training-free, plug-and-play dynamic decoding strategy. Specifically, AHD monitors the stability trend of tokens in real time through dynamic anchors. Once a token reaches stability, it initiates early cross-block decoding to enhance efficiency and performance. Extensive experiments across language, vision-language, and audio-language domains demonstrate that AHD simultaneously improves both performance and inference efficiency. Notably, AHD effectively reverses the performance degradation typically observed in existing advanced decoding acceleration strategies. For instance, on the BBH benchmark, our approach reduces decoding steps by 80% while improving performance by 3.67%.

关键词: Diffusion Large Language Models, Semi-autoregressive decoding, Decoding acceleration, Anchor-based History-stable Decoding, Inference efficiency, Cross-block decoding, Token stability, Training-free decoding strategy

132. ❌ MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits

作者: Yixin Xiang, Yunshan Ma, Xiaoyu Du, Yibing Chen, Yanxin Zhang, Jinhui Tang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08952v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究文档问答中的多模态检索增强生成（RAG）问题，提出MAB-DQA框架来改进视觉文档问答中的页面检索效果。因此，与’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’高度相关（10分），因为RAG是论文的核心技术框架。论文涉及文档理解和问答，可能隐含使用大模型，但与’Large Language Models OR LLMs OR Foundation Models’仅有一定关联（5分），因为摘要未明确提及LLM，但RAG通常与LLM结合。其他关键词与论文内容无直接关系，均得0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态检索增强生成在视觉文档问答中因检索页面有限而忽略重要内容的问题，提出了基于多臂老虎机的MAB-DQA框架，通过动态分配检索预算来提升页面选择效果，在四个基准测试上平均性能提升5%-18%。

摘要翻译

文档问答（Document Question Answering, DQA）旨在根据用户查询从文档中生成答案，是文档理解领域的一项关键任务。该任务需要理解视觉布局，这促使近期研究采用多模态检索增强生成（Retrieval-Augmented Generation, RAG）方法，通过处理页面图像来生成答案。然而，在多模态RAG中，视觉DQA难以有效利用大量图像，因为检索阶段通常仅保留少量候选页面（例如Top-4），导致信息丰富但视觉显著性较低的内容被忽视，而常见却信息量低的页面被优先选取。为解决这一问题，我们提出一种基于多臂老虎机（Multi-Armed Bandit）的DQA框架（MAB-DQA），以显式建模查询中多个隐含方面的不同重要性。具体而言，MAB-DQA将查询分解为面向方面的子查询，并为每个子查询检索一个特定方面的候选集。它将每个子查询视为一个“臂”，并利用少量代表性页面的初步推理结果作为奖励信号来估计方面效用。在探索-利用策略的引导下，MAB-DQA动态地将检索预算重新分配给高价值方面。基于最具信息量的页面及其关联关系，MAB-DQA生成最终答案。在四个基准测试中，MAB-DQA相比现有最优方法平均提升5%-18%，持续增强了文档理解能力。代码发布于https://github.com/ElephantOH/MAB-DQA。

摘要 (Abstract)

Document Question Answering (DQA) involves generating answers from a document based on a user’s query, representing a key task in document understanding. This task requires interpreting visual layouts, which has prompted recent studies to adopt multimodal Retrieval-Augmented Generation (RAG) that processes page images for answer generation. However, in multimodal RAG, visual DQA struggles to utilize a large number of images effectively, as the retrieval stage often retains only a few candidate pages (e.g., Top-4), causing informative but less visually salient content to be overlooked in favor of common yet low-information pages. To address this issue, we propose a Multi-Armed Bandit-based DQA framework (MAB-DQA) to explicitly model the varying importance of multiple implicit aspects in a query. Specifically, MAB-DQA decomposes a query into aspect-aware subqueries and retrieves an aspect-specific candidate set for each. It treats each subquery as an arm and uses preliminary reasoning results from a small number of representative pages as reward signals to estimate aspect utility. Guided by an exploration-exploitation policy, MAB-DQA dynamically reallocates retrieval budgets toward high-value aspects. With the most informative pages and their correlations, MAB-DQA generates the expected results. On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding. Code at https://github.com/ElephantOH/MAB-DQA.

关键词: Document Question Answering, Multimodal RAG, Multi-Armed Bandit, Query Aspect Importance, Visual DQA, Retrieval-Augmented Generation, Document Understanding, Page Retrieval

133. ❌ TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

作者: Gang Hu, Yating Chen, Haiyan Ding, Wang Gao, Jiajia Huang, Min Peng, Qianqian Xie, Kun Yu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08948v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是评估LLMs在中文税务领域的实际应用能力，因此与’Large Language Models’高度相关（10分）。论文提到YaYi2 LLM经过税务数据微调，与’Post-training/SFT’有一定关联（5分）。其他关键词如MoE、SLMs、Scaling Laws、RAG、CoT、Agents、Quantization等均未涉及，且论文专注于特定领域评估而非技术原理创新，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对LLMs在专业中文税务领域的能力不足问题，提出了首个中文税务实践基准TaxPraBen，评估了19个LLMs并发现闭源大参数模型表现优异，而经过税务数据微调的模型改进有限。

摘要翻译

尽管大语言模型（LLMs）在多个通用领域表现出色，但在高度专业化、知识密集且受严格法律监管的中国税务领域中，它们仍存在显著的能力差距。因此，尽管税务相关基准测试日益受到关注，但多数研究聚焦于孤立的自然语言处理任务，忽视了真实场景下的实务能力。为应对这一问题，我们推出了首个专注于中国税务实践的基准测试——TaxPraBen。该基准整合了10项传统应用任务，并开创性地引入3个真实场景：税务风险防范、税务稽查分析与税务策略规划，数据来源于14个数据集共计7.3千条实例。TaxPraBen采用可扩展的结构化评估范式，其设计遵循“结构化解析-字段对齐提取-数值与文本匹配”流程，既能实现端到端的税务实践评估，也可扩展至其他领域。基于布鲁姆分类法，我们对19个大语言模型进行了评估。结果显示显著的性能差异：所有闭源大参数模型表现优异，以Qwen2.5为代表的中文大语言模型普遍优于多语言模型，而使用部分税务数据微调的YaYi2模型仅展现出有限改进。TaxPraBen为推动大语言模型在实际应用中的评估提供了重要资源。

摘要 (Abstract)

While Large Language Models (LLMs) excel in various general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Consequently, while tax-related benchmarks are gaining attention, many focus on isolated NLP tasks, neglecting real-world practical capabilities. To address this issue, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It combines 10 traditional application tasks, along with 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm designed through process of “structured parsing-field alignment extraction-numerical and textual matching”, enabling end-to-end tax practice assessment while being extensible to other domains. We evaluate 19 LLMs based on Bloom’s taxonomy. The results indicate significant performance disparities: all closed-source large-parameter LLMs excel, and Chinese LLMs like Qwen2.5 generally exceed multilingual LLMs, while the YaYi2 LLM, fine-tuned with some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluations of LLMs in practical applications.

关键词: Large Language Models, Chinese tax domain, benchmark, structured evaluation, real-world scenarios, tax practice, performance evaluation, domain-specific assessment

134. ❌ NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression

作者: Tong Wu, Nicolay Rusnachenko, Huizhi Liang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08923v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是使用XLM-RoBERTa-base进行监督微调（SFT）解决多语言维度情感回归任务，与’Post-training OR Supervised Fine-tuning OR SFT’高度相关（10分）。论文比较了GPT-5.2、LLaMA等大模型，但仅作为基准对比，未深入研究大模型技术本身，因此’Large Language Models OR LLMs OR Foundation Models’得5分。其他关键词均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文通过微调XLM-RoBERTa-base模型进行多语言维度情感回归，在对比实验中证明任务特定微调方法显著优于GPT-5.2、LLaMA等大模型的少样本提示方法。

摘要翻译

维度方面情感分析（DimABSA）将传统方面情感分析从离散的情感极性分类扩展至连续的效价-唤醒度（Valence-Arousal, VA）回归。本文描述了为赛道A——子任务1（维度方面情感回归）开发的系统，旨在为文本中每个给定方面预测[1, 9]区间内的实值VA分数。研究采用基于XLM-RoBERTa-base的微调方法，将输入构建为[CLS] T [SEP] a_i [SEP]的形式，并训练独立的回归头，通过sigmoid缩放输出分别预测效价和唤醒度。针对每种语言-领域组合（涵盖餐厅、笔记本电脑和金融领域的英文与中文数据）分别训练模型，并将训练集与开发集合并以进行最终测试预测。在开发实验中，将微调方法与包括GPT-5.2、LLaMA-3-70B、LLaMA-3.3-70B和LLaMA-4-Maverick在内的多个大语言模型在少样本提示设置下进行比较，结果表明，在所有评估数据集上，针对特定任务的微调方法均显著且持续地优于这些基于大语言模型的方法。代码已公开于https://github.com/tongwu17/SemEval-2026-Task3-Track-A。

摘要 (Abstract)

Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence-arousal (VA) regression. This paper describes a system developed for Track A - Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, constructing the input as [CLS] T [SEP] a_i [SEP] and training dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language-domain combination (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models including GPT-5.2, LLaMA-3-70B, LLaMA-3.3-70B, and LLaMA-4-Maverick under a few-shot prompting setting, demonstrating that task-specific fine-tuning substantially and consistently outperforms these LLM-based methods across all evaluation datasets. The code is publicly available at https://github.com/tongwu17/SemEval-2026-Task3-Track-A.

关键词: Dimensional Aspect-Based Sentiment Analysis, XLM-RoBERTa, fine-tuning, multilingual sentiment regression, valence-arousal prediction, supervised fine-tuning, LLM comparison, SemEval-2026

135. ❌ GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

作者: Faxian Wan, Xiaocui Yang, Yifan Cao, Shi Feng, Daling Wang, Yifei Zhang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08879v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出GRASP框架，核心创新是结合视觉定位与显式Chain-of-Thought（CoT）推理进行多模态讽刺目标识别，因此与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’高度相关（15分）。论文明确使用监督微调（SFT）进行优化，与’Post-training OR Supervised Fine-tuning OR SFT’高度相关（10分）。论文涉及LLM评估（LLM-as-a-Judge），与’Large Language Models OR LLMs OR Foundation Models’相关（8分）。CoT推理体现深度推理过程，与’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’相关（8分）。框架旨在提升可解释性，与’Mechanistic Interpretability OR Explainable AI’相关（8分）。其他关键词如MoE、量化、RAG等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态讽刺目标识别任务，提出了GRASP框架，通过结合视觉定位与显式Chain-of-Thought推理以及双阶段优化策略，显著提升了细粒度讽刺目标的识别性能，并创建了MSTI-MAX数据集。

摘要翻译

超越多模态讽刺检测的传统二元分类范式，多模态讽刺目标识别（MSTI）提出了更为艰巨的挑战，它要求对文本短语和视觉区域等细粒度目标进行精确定位。现有方法主要依赖于隐式的跨模态对齐，其可解释性有限，细粒度定位效果欠佳。为应对这些局限，我们提出GRASP框架——基于视觉接地的思维链推理与双阶段优化的多模态讽刺预测与目标识别。该框架将视觉接地与显式思维链推理相结合，以突破黑箱式MSTI的局限。具体而言，我们构建了MSTI-MAX数据集，该数据集缓解了类别不平衡问题并丰富了多模态讽刺线索。我们引入了基于接地的思维链推理方法，该方法在推理轨迹中显式锚定与讽刺相关的视觉区域，并促使模型在预测最终分类标签和讽刺目标前阐明推理依据。此外，我们采用双阶段结果监督的联合优化策略：首先通过坐标感知加权损失进行监督微调，随后进行细粒度目标策略优化。大量实验表明，GRASP在多模态细粒度讽刺目标识别任务上优于现有基线模型，同时基于大语言模型作为评判者的评估定量衡量了内部推理链的质量。我们的数据集与源代码将在GitHub上公开。

摘要 (Abstract)

Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.

关键词: Multimodal Sarcasm Target Identification, Chain-of-Thought Reasoning, Visual Grounding, Supervised Fine-Tuning, Dual-Stage Optimization, Fine-Grained Localization, LLM-as-a-Judge, Interpretability

136. ❌ Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition

作者: Jing Jie Tan, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum, Noriyuki Kawarazaki, Kosuke Takano 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08851v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文的核心是使用大语言模型（LLM）进行翻译增强以生成多语言训练数据，并采用注意力蒸馏方法进行跨语言人格识别。因此，仅与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分），因为LLM是其数据增强的关键工具。论文未涉及其他关键词所描述的具体技术原理（如MoE、量化、推理加速等）或特定应用领域（如生物信息学）。

!!! tip deepseek-chat TL;DR

该研究提出了一种结合大语言模型翻译增强和跨语言注意力蒸馏的方法（ADAM），以解决多语言人格识别中数据集缺乏的问题，并在多个语言和数据集上显著提升了识别性能。

摘要翻译

尽管人格识别领域已有大量研究，但多语言数据集的缺乏仍是一个尚未解决的挑战。为此，我们提出了ADAM（面向多语言人格识别的跨语言注意力蒸馏与人格引导生成增强方法），这是一种旨在推动多语言人格识别发展的先进方法。我们的方法以现有英语人格数据集为主要来源，并利用大语言模型进行基于翻译的数据增强，同时通过人格信息引导生成增强技术提升数据质量，从而生成包括日语、中文、马来语和法语在内的多语言高质量训练数据。我们提供了详尽的分析以论证这些增强技术的有效性。基于这些进展，ADAM整合了跨语言注意力蒸馏技术，以训练一个能够理解和识别跨语言人格特质的模型，从而弥合人格分析中的语言与文化差异。本研究对提出的增强方法进行了全面评估，包含对识别性能的消融实验，以确保公平比较和稳健验证。总体而言，实验结果表明，在使用PIGA增强后，CLAD方法在所有语言和人格特质上均显著优于标准二元交叉熵损失，在平均平衡准确率上取得了显著提升——在Essays数据集上达到0.6332（提升0.0573），在Kaggle数据集上达到0.7448（提升0.0968）。经CLAD训练的模型还展现出强大的泛化能力，并取得了与当前领先编码器模型相当的基准性能。模型权重、数据集及算法代码库已公开于https://research.jingjietan.com/?q=ADAM。

摘要 (Abstract)

While significant work has been done on personality recognition, the lack of multilingual datasets remains an unresolved challenge. To address this, we propose ADAM (Cross-Lingual (A)ttention (D)istillation with Personality-Guided Generative (A)ugmentation for (M)ultilingual Personality Recognition), a state-of-the-art approach designed to advance multilingual personality recognition. Our approach leverages an existing English-language personality dataset as the primary source and employs a large language model (LLM) for translationbased augmentation, enhanced by Personality-Informed Generative Augmentation (PIGA), to generate high-quality training data in multiple languages, including Japanese, Chinese, Malay, and French. We provide a thorough analysis to justify the effectiveness of these augmentation techniques. Building on these advancements, ADAM integrates Cross-Lingual Attention Distillation (CLAD) to train a model capable of understanding and recognizing personality traits across languages, bridging linguistic and cultural gaps in personality analysis. This research presents a thorough evaluation of the proposed augmentation method, incorporating an ablation study on recognition performance to ensure fair comparisons and robust validation. Overall, with PIGA augmentation, the findings demonstrate that CLAD significantly outperforms the standard BCE across all languages and personality traits, achieving notable improvements in average BA scores - 0.6332 (+0.0573) on the Essays dataset and 0.7448 (+0.0968) on the Kaggle dataset. The CLAD-trained model also demonstrated strong generalizability and achieved benchmark performance comparable to current leading encoder models. The model weight, dataset, and algorithm repository are available at https://research.jingjietan.com/?q=ADAM.

关键词: multilingual personality recognition, large language model, data augmentation, attention distillation, cross-lingual, personality-informed generative augmentation, translation-based augmentation, benchmark performance

137. ❌ Tango: Taming Visual Signals for Efficient Video Large Language Models

作者: Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09547v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于视频大语言模型（Video LLMs）的效率优化，核心是视觉信号的标记剪枝技术。与’Large Language Models’高度相关（10分），因为论文明确研究Video LLMs，属于大模型在视频理解领域的应用。与’Speculative Decoding OR Inference Acceleration’有一定关联（5分），因为论文提到实现了1.88倍推理加速，属于推理加速范畴，但未涉及推测解码等具体技术。其他关键词如MoE、SFT、RAG、量化等均未在摘要中提及或相关，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对现有视频大语言模型中基于注意力和相似性的标记剪枝方法存在注意力分布考虑不全面和聚类碎片化的问题，提出了Tango框架，通过多样性驱动的策略和时空旋转位置嵌入优化视觉信号利用，在仅保留10%视频标记时能保持98.9%的原始性能并实现1.88倍推理加速。

摘要翻译

令牌剪枝已成为开发高效视频大语言模型的主流方法。本研究重新审视并推进了两种主流的令牌剪枝范式：基于注意力的选择和基于相似性的聚类。我们的研究揭示了现有方法的两个关键局限：（1）传统的top-k选择策略未能充分考虑注意力分布，该分布通常在空间上呈多模态且在量级上呈长尾分布；（2）直接基于相似性的聚类经常产生碎片化的簇，导致池化后表征失真。为应对这些瓶颈，我们提出了Tango——一个旨在优化视觉信号利用的新型框架。Tango集成了多样性驱动策略以增强基于注意力的令牌选择，并引入了时空旋转位置嵌入（ST-RoPE），通过局部性先验来保持几何结构。跨多种视频大语言模型和视频理解基准的综合实验证明了我们方法的有效性和泛化能力。值得注意的是，在仅保留10%视频令牌的情况下，Tango在LLaVA-OV基准上保持了原始性能的98.9%，同时实现了1.88倍的推理加速。

摘要 (Abstract)

Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88x inference speedup.

关键词: Video Large Language Models, token pruning, attention-based selection, similarity-based clustering, inference speedup, visual signals, Spatio-temporal Rotary Position Embedding, efficient video understanding

138. ❌ Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

作者: Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08846v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文聚焦于多模态大语言模型（MLLMs）的安全防护，核心创新在于提出Dictionary-Aligned Concept Control（DACO）框架，通过稀疏自编码器（SAE）和概念词典对模型激活进行细粒度控制以提升安全性。因此，与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分），因为MLLMs是LLMs的扩展；与’Instruction Tuning OR Alignment OR Value Alignment’高度相关（10分），因为安全防护属于对齐范畴。其他关键词如MoE、SLMs、Scaling Laws、Pre-training、RLHF等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型易受恶意查询产生不安全响应的问题，提出了Dictionary-Aligned Concept Control（DACO）框架，通过概念词典和稀疏自编码器对模型激活进行细粒度干预，显著提升了模型安全性同时保持了通用能力。

摘要翻译

多模态大语言模型（MLLMs）已被证明易受恶意查询的攻击，这些查询可能引发不安全的回应。近期研究通过提示工程、响应分类或微调来提升MLLM的安全性。然而，这类方法往往难以应对不断演变的恶意模式，可能需要重新运行查询，或需消耗大量计算资源。在推理阶段通过引导冻结模型的激活值来调控输出，近期已成为一种灵活有效的解决方案。但现有的MLLM引导方法通常仅能处理有限的安全相关概念，或在调整特定概念时难以避免影响其他概念。为应对这些挑战，我们提出了词典对齐概念控制（Dictionary-Aligned Concept Control, DACO）框架，该框架利用精心构建的概念词典和稀疏自编码器（Sparse Autoencoder, SAE）来实现对MLLM激活值的细粒度控制。首先，我们通过检索超过40万条图文刺激数据并将其激活值归纳为概念方向，构建了一个包含1.5万个多模态概念的概念词典，并将该数据集命名为DACO-400K。其次，我们证明该词典可通过稀疏编码用于干预激活值。第三，我们提出了一种新的引导方法，使用该词典初始化SAE的训练，并自动标注SAE原子的语义，以保障MLLM的安全。在多个MLLM（如QwenVL、LLaVA、InternVL）和安全性基准（如MM-SafetyBench、JailBreakV）上的实验表明，DACO在显著提升MLLM安全性的同时，保持了其通用能力。

摘要 (Abstract)

Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. We name the dataset DACO-400K. Second, we show that the curated dictionary can be used to intervene activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.

关键词: Multimodal Large Language Models, MLLM safety, Dictionary-Aligned Concept Control, Sparse Autoencoder, activation steering, concept dictionary, safeguarding, granular control

139. ❌ EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

作者: Lulin Liu, Dayou Li, Yiqing Liang, Sicong Jiang, Hitesh Vijay, Hezhen Hu, Xuhai Xu, Zirui Liu, Srinivas Shakkottai, Manling Li, Zhiwen Fan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09535v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大模型（Foundation Models）在具身智能和日常家庭任务中的应用，涉及VLM（视觉语言模型）和World Models的评估与微调。高度相关的关键词包括：1) ‘Large Language Models OR LLMs OR Foundation Models’（10分）- 论文直接研究大模型在具身智能中的应用；2) ‘Post-training OR Supervised Fine-tuning OR SFT’（10分）- 论文提到对基础模型进行微调以改进性能；3) ‘Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’（10分）- 论文引入think-aloud chains并关注多步推理，直接涉及CoT；4) ‘Hallucination Mitigation OR Factuality OR Truthfulness’（10分）- 论文旨在解决VLM推理中的幻觉问题（如虚构对象、跳过步骤）；5) ‘World Models AND General World Models’（10分）- 论文评估World Models并构建世界模型合成。其他关键词如MoE、SLMs、Scaling Laws、Instruction Tuning、RAG等未在摘要中提及，故评0分。

!!! tip deepseek-chat TL;DR

论文针对视觉语言模型在长时域日常家庭任务中存在的噪声标注、推理幻觉和空间基础不准确问题，提出了EgoTL框架，通过构建think-aloud捕获管道和微调基础模型，显著提升了长时域规划、推理、指令跟随和空间基础能力。

摘要翻译

大型基础模型在具身智能领域取得了显著进展，能够针对家庭任务合成并推理以自我为中心的输入信息。然而，基于视觉语言模型（VLM）的自动标注往往存在噪声，因为主要数据源缺乏精确的人类动作标签、思维链（CoT）以及空间标注；这些误差在长时程空间指令跟随过程中会被放大。这些问题源于对长达数分钟的日常家庭规划任务覆盖不足，以及空间定位不准确。因此，VLM的推理链与世界模型合成可能会出现物体幻觉、跳过步骤或未能遵循真实世界物理属性的情况。为弥补这些不足，我们提出了EgoTL。EgoTL构建了一个用于自我中心数据的出声思考采集流程。它采用“先言后行”协议，记录逐步目标及带有词级时间戳的口语推理，随后通过度量级空间估计器校准物理属性，利用记忆库漫游获取场景上下文，并为导航指令与详细操作动作添加片段级标签。借助EgoTL，我们能够在三个层次的六个任务维度上对VLM和世界模型进行基准测试，并在涵盖100多项日常家庭任务的分钟级序列上进行长时程生成评估。研究发现，基础模型作为自我中心助手或开放世界模拟器仍存在不足。最后，我们使用与EgoTL训练集中度量标签对齐的人类思维链对基础模型进行微调，从而提升了长时程规划与推理、逐步推理、指令跟随及空间定位能力。

摘要 (Abstract)

Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.

关键词: Egocentric Think-Aloud Chains, Long-Horizon Tasks, Foundation Models, Visual Language Models, World Models, Chain-of-Thought, Hallucination Mitigation, Spatial Grounding

140. ❌ RIRF: Reasoning Image Restoration Framework

作者: Wending Yan, Rongkai Zhang, Kaihua Tang, Yu Cheng, Qiankun Liu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09511v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出R&R框架，将结构化Chain-of-Thought推理集成到图像修复流程中，使用微调的Qwen3-VL作为推理器进行诊断，并利用强化学习信号指导修复器。核心相关关键词：Chain of Thought（核心方法，10分）、System 2 Thinking（涉及深度推理，8分）、LLM Agents（基于LLM的代理系统，8分）、Large Language Models（使用Qwen3-VL，8分）、Post-training（涉及微调，8分）、Explainable AI（提供可解释性，8分）。其他关键词与论文内容无直接关联。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为R&R的新型通用图像修复框架，通过集成结构化Chain-of-Thought推理来诊断图像退化并指导修复过程，在多个基准测试中实现了最先进的性能，同时提供了修复过程的可解释性。

摘要翻译

通用图像复原（UIR）旨在通过统一模型从多样且未知的退化类型中恢复清晰图像。现有UIR方法主要关注像素级重建，通常在复原前缺乏对退化构成、严重程度及场景语义的显式诊断推理。我们提出“推理与复原”（R&R）框架，一种将结构化思维链（CoT）推理融入图像复原流程的新方法。R&R引入一个通过微调Qwen3-VL实现的显式推理器，用于诊断退化类型、量化退化严重程度、推断关键退化相关因素，并描述相关场景与物体语义。由此产生的结构化推理为复原器提供了可解释的细粒度诊断先验。为进一步提升复原质量，推理器生成的量化退化严重度被用作强化学习（RL）信号，以指导并增强复原器性能。与现有基于多模态大语言模型的代理系统将推理与低级视觉任务解耦不同，R&R在统一框架内将语义诊断推理与像素级复原紧密耦合。在多种UIR基准测试上的大量实验表明，R&R在实现最先进性能的同时，为复原过程提供了独特的可解释性。

摘要 (Abstract)

Universal image restoration (UIR) aims to recover clean images from diverse and unknown degradations using a unified model. Existing UIR methods primarily focus on pixel reconstruction and often lack explicit diagnostic reasoning over degradation composition, severity, and scene semantics prior to restoration. We propose Reason and Restore (R&R), a novel framework that integrates structured Chain-of-Thought (CoT) reasoning into the image restoration pipeline. R&R introduces an explicit reasoner, implemented by fine-tuning Qwen3-VL, to diagnose degradation types, quantify degradation severity, infer key degradation-related factors, and describe relevant scene and object semantics. The resulting structured reasoning provides interpretable and fine-grained diagnostic priors for the restorer. To further improve restoration quality, the quantified degradation severity produced by the reasoner is leveraged as reinforcement learning (RL) signals to guide and strengthen the restorer. Unlike existing multimodal LLM-based agentic systems that decouple reasoning from low-level vision tasks, R&R tightly couples semantic diagnostic reasoning with pixel-level restoration in a unified framework. Extensive experiments across diverse UIR benchmarks demonstrate that R&R achieves state-of-the-art performance while offering unique interpretability into the restoration process.

关键词: Universal Image Restoration, Chain-of-Thought Reasoning, Degradation Diagnosis, Reinforcement Learning, Interpretable Restoration, Multimodal LLM, Agentic Systems, State-of-the-art Performance

141. ❌ Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

作者: Shunkai Zhou, Zike Yan, Fei Xue, Dong Wu, Yuchen Deng, Hongbin Zha 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09480v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文Online3R提出了一种基于几何基础模型的在线学习框架，用于解决顺序重建中的不一致性问题。该研究主要涉及计算机视觉和几何重建领域，而非自然语言处理或大语言模型。因此，大多数关键词（如LLMs、MoE、RLHF、RAG等）完全不相关，评分为0。然而，论文明确提到了使用预训练的几何基础模型（‘pretrained, frozen geometry foundation model’），这与’Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联，因为预训练是基础模型的核心，评分为8。此外，论文引入了轻量级视觉提示进行在线学习，这类似于参数高效微调（PEFT）的概念，旨在高效适应新场景，因此与’PEFT OR LoRA OR Parameter-efficient Fine-tuning’有一定关联，评分为8。其他关键词如AI for Science等不直接相关，因为论文聚焦于几何重建而非生物信息学等科学领域。

!!! tip deepseek-chat TL;DR

Online3R提出了一种基于预训练几何基础模型的在线学习框架，通过引入轻量级视觉提示和局部-全局自监督学习策略，有效解决了顺序重建中的不一致性问题，并在多个基准测试中超越了现有方法。

摘要翻译

本文提出Online3R——一种能够通过在线学习适应新场景的新型序列重建框架，有效解决了不一致性问题。具体而言，我们在一个预训练且冻结的几何基础模型中引入一组可学习的轻量化视觉提示，以捕捉新环境的知识，同时保留基础模型进行几何预测的核心能力。针对测试时更新这些视觉提示所面临的真值缺失与高效性要求的问题，我们提出了一种局部-全局自监督学习策略，通过对预测施加局部与全局一致性约束来实现。局部一致性约束作用于中间结果及先前局部融合的结果，使模型能够利用高质量伪真值信号进行训练；全局一致性约束则作用于跨越长距离的稀疏关键帧而非逐帧处理，使模型能够以高效方式从长轨迹的一致性预测中学习。实验表明，Online3R在多个基准测试中超越了以往最先进的方法。项目页面：https://shunkaizhou.github.io/online3r-1.0/

摘要 (Abstract)

We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: https://shunkaizhou.github.io/online3r-1.0/

关键词: Online Learning, Sequential Reconstruction, Geometry Foundation Model, Visual Prompts, Self-supervised Learning, Consistency Constraints, 3D Reconstruction, Computer Vision

142. ❌ Incremental Semantics-Aided Meshing from LiDAR-Inertial Odometry and RGB Direct Label Transfer

作者: Muhammad Affan, Ville Lehtola, George Vosselman 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09478v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于计算机视觉和机器人学领域的几何重建技术，提出了一种结合LiDAR、惯性测量和RGB语义的增量式网格重建方法。论文使用了视觉基础模型进行语义标注，但核心贡献在于几何重建算法和传感器融合，而非大模型技术本身。所有关键词均与大模型、深度学习技术原理或AI for Science的特定子领域相关，但论文仅涉及视觉基础模型的应用，与大多数关键词（如LLM、MoE、训练方法、推理优化、智能体等）完全无关。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于AI在科学/工程领域的应用（机器人感知与重建），但并非核心生物信息学或化学信息学，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文解决了在大型复杂室内环境中从LiDAR-惯性扫描进行高保真几何网格重建的挑战，提出了一种增量式RGB+LiDAR语义辅助融合方法，通过视觉基础模型提供语义指导来改善边界几何模糊，在Oxford Spires数据集上超越了ImMesh和Voxblox等基线方法。

摘要翻译

在大型复杂室内环境（如文化建筑）中，基于激光雷达-惯性扫描的几何高保真网格重建仍面临挑战——点云稀疏性、几何漂移及固定融合参数会导致结构边界处出现孔洞、过度平滑及伪表面。我们提出一种模块化、增量式的RGB+激光雷达流程，通过基于扫描帧的直接标签传递，从室内扫描生成增量式语义辅助的高质量网格。视觉基础模型对每帧输入的RGB图像进行标注；标签被增量投影并融合至激光雷达-惯性里程计地图；随后通过增量式语义感知截断符号距离函数（TSDF）融合步骤，利用移动立方体算法生成最终网格。这种帧级融合策略在保持激光雷达几何保真度的同时，利用丰富的视觉语义信息解决因激光雷达点云稀疏性和几何漂移导致的边界重建歧义。我们证明语义引导能提升几何重建质量；因此使用牛津尖塔数据集的几何指标进行定量评估，并对NTU VIRAL数据集结果进行定性分析。所提方法在几何网格质量上优于当前最先进的几何基线方法ImMesh与Voxblox，验证了语义辅助融合对几何网格质量的提升价值。最终生成的语义标注网格在重建通用场景描述（USD）资产时具有重要应用价值，为从室内激光雷达到扩展现实（XR）及数字建模提供了技术路径。

摘要 (Abstract)

Geometric high-fidelity mesh reconstruction from LiDAR-inertial scans remains challenging in large, complex indoor environments – such as cultural buildings – where point cloud sparsity, geometric drift, and fixed fusion parameters produce holes, over-smoothing, and spurious surfaces at structural boundaries. We propose a modular, incremental RGB+LiDAR pipeline that generates incremental semantics-aided high-quality meshes from indoor scans through scan frame-based direct label transfer. A vision foundation model labels each incoming RGB frame; labels are incrementally projected and fused onto a LiDAR-inertial odometry map; and an incremental semantics-aware Truncated Signed Distance Function (TSDF) fusion step produces the final mesh via marching cubes. This frame-level fusion strategy preserves the geometric fidelity of LiDAR while leveraging rich visual semantics to resolve geometric ambiguities at reconstruction boundaries caused by LiDAR point-cloud sparsity and geometric drift. We demonstrate that semantic guidance improves geometric reconstruction quality; quantitative evaluation is therefore performed using geometric metrics on the Oxford Spires dataset, while results from the NTU VIRAL dataset are analyzed qualitatively. The proposed method outperforms state-of-the-art geometric baselines ImMesh and Voxblox, demonstrating the benefit of semantics-aided fusion for geometric mesh quality. The resulting semantically labelled meshes are of value when reconstructing Universal Scene Description (USD) assets, offering a path from indoor LiDAR scanning to XR and digital modeling.

关键词: mesh reconstruction, LiDAR-inertial odometry, semantic fusion, indoor scanning, TSDF fusion, geometric fidelity, vision foundation model, incremental pipeline

143. ❌ Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

作者: Zhengxian Yang, Shengqi Wang, Shi Pan, Hongshuai Li, Haoxiang Wang, Lin Li, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09473v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉、多媒体和虚拟现实领域，研究沉浸式体积视频的采集、重建和生成技术，包括多视图数据集构建、高斯时空表示、光场重建和声场重建。论文内容完全不涉及大语言模型、深度学习技术原理或AI在科学领域的应用，所有关键词均与大模型、深度学习技术、AI科学应用无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种沉浸式体积视频（IVV）的完整构建框架，通过创建多视图多模态数据集ImViD，并开发基于高斯时空表示的光场重建方法和首个多视图视听数据的声场重建方法，实现了高质量、时间稳定的6自由度交互式视听体积内容生成。

摘要翻译

能够紧密融合六自由度视觉与听觉交互的完全沉浸式体验，是虚拟现实与增强现实的核心需求。尽管此类体验可通过计算机生成内容实现，但直接从现实世界捕获的视频构建此类体验仍基本处于探索空白。我们提出沉浸式体视频，这是一种新型体媒体格式，旨在提供大范围六自由度交互空间、视听反馈以及高分辨率、高帧率的动态内容。为支持沉浸式体视频构建，我们推出了ImViD数据集，这是一个基于空间化捕获理念构建的多视角、多模态数据集。我们定制的捕获装置能够在运动过程中实现同步的多视角视频-音频采集，从而高效捕获具有丰富前景-背景交互和复杂动态的室内外场景。该数据集提供5K分辨率、60帧/秒、持续1-5分钟的视频，在空间覆盖、时间连续性和多模态完整性上均超越现有基准。基于此数据集，我们开发了一种基于高斯时空表征的动态光场重建框架，融合了流引导稀疏初始化、联合相机时序标定以及多约束时空监督，能够对复杂运动进行鲁棒且精确的建模。此外，我们首次提出了基于此类多视角视听数据的声场重建方法。这些组件共同构成了沉浸式体视频生产的统一流程。大量基准测试与沉浸式VR实验表明，我们的流程能够生成具有大范围六自由度交互空间的高质量、时序稳定的视听体内容。本研究为沉浸式体视频提供了基础性定义与实用性构建方法论。

摘要 (Abstract)

Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos, a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground–background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.

关键词: Immersive Volumetric Video, 6-DoF VR, Multi-view Dataset, Gaussian-based Representation, Light Field Reconstruction, Sound Field Reconstruction, Audiovisual Interaction, Dynamic Content

144. ❌ DSVTLA: Deep Swin Vision Transformer-Based Transfer Learning Architecture for Multi-Type Cancer Histopathological Cancer Image Classification

作者: Muazzem Hussain Khan, Tasdid Hasnain, Md. Jamil khan, Ruhul Amin, Md. Shamim Reza, Md. Al Mehedi Hasan, Md Ashad Alam 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09468v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于计算机视觉领域，提出了一种基于Swin Vision Transformer的迁移学习架构用于多类型癌症组织病理学图像分类。论文内容与绝大多数关键词（主要涉及大语言模型技术、训练方法、推理优化、对齐技术等）完全无关，因为这些关键词都是针对自然语言处理和大语言模型领域的。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为该研究属于AI在生物医学（具体是癌症诊断）领域的应用，属于AI for Science的范畴，因此给予10分（高度相关）。

!!! tip deepseek-chat TL;DR

该研究提出了一种基于深度Swin Vision Transformer的迁移学习架构，用于多类型癌症组织病理学图像分类，并在多个癌症数据集上实现了接近完美的分类准确率。

摘要翻译

本研究提出了一种基于深度Swin视觉Transformer迁移学习架构的鲁棒性多癌种组织病理学图像分类方法。该框架将分层式Swin Transformer与基于ResNet50的卷积特征提取相结合，使模型能够同时捕获组织病理学图像中的长程上下文依赖关系和细粒度局部形态学特征。为验证所提架构的效能，我们在涵盖乳腺癌、口腔癌、肺结肠癌、肾癌及急性淋巴细胞白血病（Acute Lymphocytic Leukemia, ALL）的综合多癌种数据集上进行了大规模实验，同时分析原始图像与分割图像以评估模型在异质性临床成像条件下的鲁棒性。我们将所提方法与包括DenseNet121、DenseNet201、InceptionV3、ResNet50、EfficientNetB3、多种ViT变体及Swin Transformer模型在内的先进CNN及迁移模型进行基准对比，所有模型均通过统一流程进行训练与验证，该流程整合了平衡数据预处理、迁移学习与微调策略。实验结果表明，所提架构始终取得卓越性能：在肺结肠癌和分割白血病数据集上达到100%测试准确率，在乳腺癌分类中最高获得99.23%的准确率。该模型同时实现了接近完美的精确率、F1分数与召回率，表明其在多种癌症类型中均保持高度稳定的评分。总体而言，所提模型构建了一个高精度、可解释且鲁棒的多癌种分类系统，为未来研究确立了强有力的基准，并为设计可靠的AI辅助组织病理学诊断与临床决策提供了统一的对比评估框架。

摘要 (Abstract)

In this study, we proposed a deep Swin-Vision Transformer-based transfer learning architecture for robust multi-cancer histopathological image classification. The proposed framework integrates a hierarchical Swin Transformer with ResNet50-based convolution features extraction, enabling the model to capture both long-range contextual dependencies and fine-grained local morphological patterns within histopathological images. To validate the efficiency of the proposed architecture, an extensive experiment was executed on a comprehensive multi-cancer dataset including Breast Cancer, Oral Cancer, Lung and Colon Cancer, Kidney Cancer, and Acute Lymphocytic Leukemia (ALL), including both original and segmented images were analyzed to assess model robustness across heterogeneous clinical imaging conditions. Our approach is benchmarked alongside several state-of-the-art CNN and transfer models, including DenseNet121, DenseNet201, InceptionV3, ResNet50, EfficientNetB3, multiple ViT variants, and Swin Transformer models. However, all models were trained and validated using a unified pipeline, incorporating balanced data preprocessing, transfer learning, and fine-tuning strategies. The experimental results demonstrated that our proposed architecture consistently gained superior performance, reaching 100% test accuracy for lung-colon cancer, segmented leukemia datasets, and up to 99.23% accuracy for breast cancer classification. The model also achieved near-perfect precision, f1 score, and recall, indicating highly stable scores across divers cancer types. Overall, the proposed model establishes a highly accurate, interpretable, and also robust multi-cancer classification system, demonstrating strong benchmark for future research and provides a unified comparative assessment useful for designing reliable AI-assisted histopathological diagnosis and clinical decision-making.

关键词: Swin Vision Transformer, Transfer Learning, Histopathological Image Classification, Multi-Cancer Classification, Deep Learning, Medical Image Analysis, AI for Healthcare, Computer Vision

145. ❌ AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization

作者: Mohammad Omama, Gabriele Berton, Eric Foxlin, Yelin Kim 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09445v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于计算机视觉领域的视觉定位任务，提出了一种基于知识蒸馏的非对称特征匹配框架（AsymLoc），用于在资源受限的边缘设备上实现高效、准确的定位。所有评分关键词均与大语言模型（LLM）、深度学习技术原理创新或AI在科学领域的应用直接相关，而本文研究的是纯视觉任务（特征提取、匹配、定位），未涉及任何语言模型、大模型技术、科学AI应用或评分列表中的特定技术方法（如MoE、RLHF、RAG等）。因此，所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对AR/VR和机器人等应用中资源受限边缘设备上的实时视觉定位问题，提出了一种名为AsymLoc的非对称知识蒸馏框架，通过几何驱动匹配和联合检测器-描述符蒸馏，在显著减小模型规模的同时实现了接近教师模型95%的定位精度，确立了新的效率-精度权衡最优水平。

摘要翻译

精确且实时的视觉定位对于增强现实/虚拟现实（AR/VR）和机器人等应用至关重要，尤其在智能眼镜等资源受限的边缘设备上，电池续航和散热可能成为主要制约因素。尽管已有许多高效模型，但在不牺牲精度的前提下进一步减少计算量，对于实际部署至关重要。为此，我们提出非对称视觉定位方法：一个大型教师模型（Teacher）离线处理预先构建地图的数据库图像，而一个轻量级学生模型（Student）在线处理查询图像。这带来了一个挑战，即如何在不依赖复杂、基于学习的匹配器的情况下，匹配来自两个不同模型的特征。
我们提出了AsymLoc，一种新颖的蒸馏框架，通过结合几何驱动的匹配目标和联合检测器-描述符蒸馏目标，将学生模型与教师模型对齐，从而实现快速、无参数的最近邻匹配。在HPatches、ScanNet、IMC2022和Aachen数据集上进行的大量实验表明，AsymLoc使用小一个数量级的模型，实现了教师模型高达95%的定位精度，显著超越了现有基线，并在效率与精度权衡方面确立了新的先进水平。

摘要 (Abstract)

Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be a primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers. We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching. Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher’s localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.

关键词: visual localization, asymmetric feature matching, knowledge distillation, edge devices, efficient models, teacher-student framework, geometry-driven matching, parameter-less matching

146. ❌ SCoRe: Clean Image Generation from Diffusion Models Trained on Noisy Images

作者: Yuta Matsuzaki, Seiichi Uchida, Shumpei Takezaki 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09436v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于扩散模型在图像生成领域的噪声处理问题，提出了一种名为SCoRe的频谱再生方法。所有评分关键词均与大语言模型（LLMs）相关，包括其架构、训练、推理、对齐、应用等方面。然而，论文内容完全不涉及大语言模型或任何文本生成技术，而是专注于计算机视觉中的扩散模型。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文解决了在噪声图像上训练的扩散模型会生成带有高频伪影图像的问题，提出了一种无需重新训练的训练时频谱再生方法SCoRe，通过理论映射和实验验证，显著提升了生成图像的清洁度。

摘要翻译

在含噪数据集上训练的扩散模型常会复现高频训练伪影，显著降低生成质量。为解决此问题，我们提出SCoRe（谱截断再生），这是一种无需训练、在生成阶段运行的谱再生方法，可从含噪图像训练的扩散模型中生成清晰图像。该方法利用扩散模型的谱偏置特性（即从低频线索推断高频细节），通过频率截断抑制生成图像中被污染的高频成分，并借助SDEdit进行再生。关键创新在于，我们基于径向平均功率谱密度推导出截断频率与SDEdit初始化时间步之间的理论映射关系，从而避免再生过程中引入过量噪声。在合成噪声数据集（CIFAR-10）和真实噪声数据集（SIDD）上的实验表明，SCoRe显著优于后处理和噪声鲁棒基线方法，无需任何重新训练或微调即可将样本恢复至更接近清晰图像的分布。

摘要 (Abstract)

Diffusion models trained on noisy datasets often reproduce high-frequency training artifacts, significantly degrading generation quality. To address this, we propose SCoRe (Spectral Cutoff Regeneration), a training-free, generation-time spectral regeneration method for clean image generation from diffusion models trained on noisy images. Leveraging the spectral bias of diffusion models, which infer high-frequency details from low-frequency cues, SCoRe suppresses corrupted high-frequency components of a generated image via a frequency cutoff and regenerates them via SDEdit. Crucially, we derive a theoretical mapping between the cutoff frequency and the SDEdit initialization timestep based on Radially Averaged Power Spectral Density (RAPSD), which prevents excessive noise injection during regeneration. Experiments on synthetic (CIFAR-10) and real-world (SIDD) noisy datasets demonstrate that SCoRe substantially outperforms post-processing and noise-robust baselines, restoring samples closer to clean image distributions without any retraining or fine-tuning.

关键词: Diffusion Models, Noisy Images, Clean Image Generation, Spectral Cutoff Regeneration, SCoRe, SDEdit, Frequency Cutoff, RAPSD

147. ❌ Do Vision Language Models Need to Process Image Tokens?

作者: Sambit Ghosh, R. Venkatesh Babu, Chirag Agarwal 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09425v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究视觉语言模型（VLMs）中图像令牌处理的必要性，核心涉及大语言模型（LLMs）在视觉-语言多模态架构中的应用，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。论文分析视觉表示在深度变换器中的演化，涉及模型内部工作机制的理解，与’Mechanistic Interpretability OR Explainable AI’有一定关联（5分）。论文提到视觉深度截断影响中间推理轨迹，与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’有一定关联（5分）。其他关键词如MoE、SLMs、训练方法、对齐、推理加速、科学AI应用等均未在论文中涉及，故评0分。

!!! tip deepseek-chat TL;DR

该论文研究了视觉语言模型中图像令牌处理的必要性，发现视觉表示在早期层后迅速稳定，深层处理并非总是必需，且视觉处理需求因任务而异，挑战了当前多模态LLM架构的范式。

摘要翻译

视觉语言模型（Vision Language Models, VLMs）通过将视觉编码器与大型语言模型（Large Language Models, LLMs）相结合，取得了显著成功。尽管VLMs在深层Transformer堆栈中处理密集的图像标记（带来了可观的计算开销），但其性能是否确实需要持续的图像标记处理，以及视觉表征是否在从浅层到深层的过程中发生有意义的演变，这些问题在根本上仍不明确。本研究系统性地探究了图像标记在VLMs中的功能作用，并发现视觉表征会迅速收敛到一个有界复杂度状态，即其熵值趋于稳定、内在维度被压缩、轨迹曲率接近恒定。相比之下，文本表征在深度方向上持续经历显著的重构。一旦稳定后，视觉表征在不同层之间基本可互换，表明在更深阶段其额外转化有限。此外，深度方向的视觉截断实验显示，视觉处理的必要性取决于任务类型：单标记预测对视觉深度截断相对稳健，而多标记生成则需要持续访问视觉表征。在确定性解码条件下，减少视觉深度对中间推理轨迹的干扰强于对最终输出的影响，这表明图像标记更多影响推理的结构而非最终结论。总体而言，这些发现质疑了更深层视觉处理在VLMs中普遍必要的假设，对当前多模态LLM架构的范式提出了挑战。

摘要 (Abstract)

Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational overhead), it remains fundamentally unclear whether sustained image-token processing is necessary for their performance or visual representations meaningfully evolve from early to later layers. In this work, we systematically investigate the functional role of image tokens in VLMs and show that visual representations rapidly converge to a bounded-complexity regime, \ie their entropy stabilizes, intrinsic dimensionality compresses, and trajectory curvature approaches a near-constant profile. In contrast, textual representations continue to undergo substantial restructuring across depth. Once stabilized, visual representations become largely interchangeable between layers, indicating limited additional transformation in deeper stages. Further, depth-wise visual truncation reveals that the necessity of visual processing is task-dependent, where single-token predictions remain comparatively robust to truncated visual depth, but multi-token generation require sustained access to visual representations. Under deterministic decoding, reducing visual depth perturbs intermediate reasoning trajectories more strongly than final outputs, suggesting that image tokens influence the structure of reasoning more than the ultimate conclusions. Collectively, these findings \textbf{question the assumption} that deeper visual processing is uniformly essential in VLMs, challenging the current paradigm of multimodal LLM architectures.

关键词: Vision Language Models, VLMs, image tokens, visual representations, transformer stacks, computational overhead, multimodal LLM architectures, reasoning trajectories

148. ❌ SynFlow: Scaling Up LiDAR Scene Flow Estimation with Synthetic Data

作者: Qingwen Zhang, Xiaomeng Zhu, Chenhan Jiang, Patric Jensfelt 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09411v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于LiDAR场景流估计，使用合成数据进行扩展，属于计算机视觉和3D感知领域。论文内容涉及数据生成、合成数据、领域泛化、零样本迁移等，但完全不涉及大语言模型、深度学习技术原理、AI for Science等关键词。所有关键词均与大模型、深度学习技术、AI科学应用无关，因此相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出SynFlow数据生成管道，通过大规模合成数据解决LiDAR场景流估计中标注数据稀缺的问题，在零样本和少样本设置下显著提升了模型在真实场景中的性能。

摘要翻译

可靠的3D动态感知需要能够预测超出预定义类别运动的模型，然而密集高质量运动标注数据的稀缺阻碍了该领域进展。尽管对未标注真实数据进行自监督学习提供了一条可行路径，但实证研究表明，由于代理信号存在噪声，单纯扩大未标注数据规模无法弥补性能差距。本文提出一种范式转换：完全从可扩展的仿真中学习鲁棒的真实世界运动先验。我们引入SynFlow——一个专门为激光雷达场景流生成大规模合成数据集的数据生成流程。与先前优先考虑传感器特定真实感的研究不同，SynFlow采用以运动为导向的策略，在4000个序列（约94万帧）中合成了多样化的运动学模式，该数据集命名为SynFlow-4k。其标注数据量达到现有真实世界基准数据集的34倍。实验表明，SynFlow-4k提供了高度领域不变的运动先验。在零样本场景下，仅使用我们合成数据训练的模型能够泛化至多个真实世界基准数据集，在nuScenes上与领域内监督基线方法表现相当，在TruckScenes上以31.8%的优势超越现有最优方法。此外，SynFlow-4k可作为标签高效学习的基础：仅使用5%的真实世界标签进行微调，其性能即可超越在完整标注预算下从头训练的模型。我们开源该流程与数据集以促进可泛化3D运动估计的研究。更多细节请访问：https://kin-zhang.github.io/SynFlow。

摘要 (Abstract)

Reliable 3D dynamic perception requires models that can anticipate motion beyond predefined categories, yet progress is hindered by the scarcity of dense, high-quality motion annotations. While self-supervision on unlabeled real data offers a path forward, empirical evidence suggests that scaling unlabeled data fails to close the performance gap due to noisy proxy signals. In this paper, we propose a shift in paradigm: learning robust real-world motion priors entirely from scalable simulation. We introduce SynFlow, a data generation pipeline that generates large-scale synthetic dataset specifically designed for LiDAR scene flow. Unlike prior works that prioritize sensor-specific realism, SynFlow employs a motion-oriented strategy to synthesize diverse kinematic patterns across 4,000 sequences ($\sim$940k frames), termed SynFlow-4k. This represents a 34x scale-up in annotated volume over existing real-world benchmarks. Our experiments demonstrate that SynFlow-4k provides a highly domain-invariant motion prior. In a zero-shot regime, models trained exclusively on our synthetic data generalize across multiple real-world benchmarks, rivaling in-domain supervised baselines on nuScenes and outperforming state-of-the-art methods on TruckScenes by 31.8%. Furthermore, SynFlow-4k serves as a label-efficient foundation: fine-tuning with only 5% of real-world labels surpasses models trained from scratch on the full available budget. We open-source the pipeline and dataset to facilitate research in generalizable 3D motion estimation. More detail can be found at https://kin-zhang.github.io/SynFlow.

关键词: LiDAR scene flow, synthetic data, domain generalization, zero-shot transfer, motion estimation, data generation pipeline, 3D dynamic perception, label-efficient learning

149. ❌ Multi-task Just Recognizable Difference for Video Coding for Machines: Database, Model, and Coding Application

作者: Junqi Liu, Yun Zhang, Xiaoxia Huang, Long Xu, Weisi Lin 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09421v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于视频编码中的机器视觉感知阈值建模（Multi-task Just Recognizable Difference），属于计算机视觉和视频编码领域。论文内容与绝大多数关键词（如LLM、MoE、RLHF、RAG等）完全无关，因为这些关键词涉及大语言模型、深度学习技术原理、对齐方法等，而本文研究的是视频编码优化。唯一略有相关的是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文涉及AI在机器视觉任务（对象检测、实例分割、关键点检测）中的应用，属于AI在特定领域（计算机视觉）的应用，但并非典型的科学领域（如生物信息学或化学信息学），因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出了一个多任务可识别差异（MT-JRD）数据集和属性辅助模型（AMT-JRD），用于视频编码中的机器视觉任务，以提高预测精度和编码效率，实验表明该模型在多个任务上优于现有方法并提升了编码性能。

摘要翻译

恰可识别差异（Just Recognizable Difference, JRD）通过可见性阈值建模提升机器视觉的编码效率，但目前局限于单任务场景。为解决此问题，我们提出了一种面向机器视觉视频编码（Video Coding for Machines, VCM）的多任务JRD（MT-JRD）数据集及属性辅助的MT-JRD（AMT-JRD）模型，以同时提升预测精度与编码效率。首先，我们构建了一个包含27,264条机器标注JRD的数据集，支持目标检测、实例分割和关键点检测三项代表性任务。其次，我们提出了AMT-JRD预测模型，该模型集成广义特征提取模块（Generalized Feature Extraction Module, GFEM）与专用特征提取模块（Specialized Feature Extraction Module, SFEM），以促进跨多任务的联合学习。第三，我们通过属性特征融合模块（Attribute Feature Fusion Module, AFFM）创新地将目标属性信息引入面向目标的JRD预测中，该模块引入了关于目标尺寸与位置的先验知识。这一设计有效弥补了仅依赖图像特征的局限性，并增强了模型表征机器视觉感知机制的能力。最后，我们将AMT-JRD模型应用于VCM，利用精确预测的JRD在保持多任务机器视觉精度的同时降低编码码率。大量实验结果表明，AMT-JRD在三项任务中实现了精确且鲁棒的多任务预测，其平均绝对误差为3.781，误差方差为5.332，分别优于当前最优的单任务预测模型6.7%和6.3%。编码实验进一步表明，相较于基准VVC和JPEG，基于AMT-JRD的VCM在Bjontegaard Delta-平均精度均值（BD-mAP）上分别平均提升了3.861%和7.886%。

摘要 (Abstract)

Just Recognizable Difference (JRD) boosts coding efficiency for machine vision through visibility threshold modeling, but is currently limited to a single-task scenario. To address this issue, we propose a Multi-Task JRD (MT-JRD) dataset and an Attribute-assisted MT-JRD (AMT-JRD) model for Video Coding for Machines (VCM), enhancing both prediction accuracy and coding efficiency. First, we construct a dataset comprising 27,264 JRD annotations from machines, supporting three representative tasks including object detection, instance segmentation, and keypoint detection. Secondly, we propose the AMT-JRD prediction model, which integrates Generalized Feature Extraction Module (GFEM) and Specialized Feature Extraction Module (SFEM) to facilitate joint learning across multiple tasks. Thirdly, we innovatively incorporate object attribute information into object-wise JRD prediction through the Attribute Feature Fusion Module (AFFM), which introduces prior knowledge about object size and location. This design effectively compensates for the limitations of relying solely on image features and enhances the model’s capacity to represent the perceptual mechanisms of machine vision. Finally, we apply the AMT-JRD model to VCM, where the accurately predicted JRDs are applied to reduce the coding bit rate while preserving accuracy across multiple machine vision tasks. Extensive experimental results demonstrate that AMT-JRD achieves precise and robust multi-task prediction with a mean absolute error of 3.781 and error variance of 5.332 across three tasks, outperforming the state-of-the-art single-task prediction model by 6.7% and 6.3%, respectively. Coding experiments further reveal that compared to the baseline VVC and JPEG, the AMT-JRD-based VCM improves an average of 3.861% and 7.886% Bjontegaard Delta-mean Average Precision (BD-mAP), respectively.

关键词: Multi-task Just Recognizable Difference, Video Coding for Machines, Machine Vision, Object Detection, Instance Segmentation, Keypoint Detection, Coding Efficiency, Attribute-assisted Model

150. ❌ EGLOCE: Training-Free Energy-Guided Latent Optimization for Concept Erasure

作者: Junyeong Ahn, Seojin Yoon, Sungyong Baik 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09405v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于文本到图像扩散模型的概念擦除技术，提出了一种无需训练的推理时优化方法。所有评分关键词均与大语言模型（LLMs）相关，涉及模型架构、训练、对齐、推理、应用等方面。本文研究的是扩散模型（Diffusion Models），属于生成式AI的一个分支，但与大语言模型（LLMs）在模型类型、技术原理和应用场景上均有显著不同。论文未涉及任何LLM相关技术、方法或应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为EGLOCE的无需训练的能量引导潜在优化方法，用于在文本到图像扩散模型的推理过程中有效擦除特定概念，同时保持图像质量和提示对齐。

摘要翻译

随着文本到图像扩散模型日益普及，消除特定概念（主要为显性内容及众多受版权保护的角色或风格）的能力已成为安全合规的关键需求。现有遗忘方法通常需要昂贵的重新训练，以无关概念保真度下降为代价修改模型参数，或依赖间接推理时调整而削弱概念消除效果。受能量引导采样在保持扩散模型条件方面的成功启发，我们提出用于概念消除的能量引导潜在优化方法（EGLOCE），这是一种无需训练的推理阶段解决方案，通过重定向噪声潜在空间实现概念消除。该方法采用双目标框架：排斥能量通过潜在空间梯度下降使生成过程远离目标概念，保留能量则维持与原始提示的语义对齐。相较于以往需要修改模型权重或仅提供弱推理引导的方法，EGLOCE完全在推理阶段运行，在实现即插即用集成的同时增强了消除性能。大量实验表明，EGLOCE在各类基线方法中均能提升概念消除效果，同时保持图像质量与提示对齐性，即使在对抗性攻击下仍表现稳健。据我们所知，本研究首次通过采样过程中的双能量引导，为安全可控的图像生成建立了新范式。

摘要 (Abstract)

As text-to-image diffusion models grow increasingly prevalent, the ability to remove specific concepts-mostly explicit content and many copyrighted characters or styles-has become essential for safety and compliance. Existing unlearning approaches often require costly re-training, modify parameters at the cost of degradation of unrelated concept fidelity, or depend on indirect inference-time adjustment that compromise the effectiveness of concept erasure. Inspired by the success of energy-guided sampling for preservation of the condition of diffusion models, we introduce Energy-Guided Latent Optimization for Concept Erasure (EGLOCE), a training-free approach that removes unwanted concepts by re-directing noisy latent during inference. Our method employs a dual-objective framework: a repulsion energy that steers generation away from target concepts via gradient descent in latent space, and a retention energy that preserves semantic alignment to the original prompt. Combined with previous approaches that either require erroneous modified model weights or provide weak inference-time guidance, EGLOCE operates entirely at inference and enhances erasure performance, enabling plug-and-play integration. Extensive experiments demonstrate that EGLOCE improves concept removal while maintaining image quality and prompt alignment across baselines, even with adversarial attacks. To the best of our knowledge, our work is the first to establish a new paradigm for safe and controllable image generation through dual energy-based guidance during sampling.

关键词: text-to-image diffusion models, concept erasure, training-free approach, energy-guided sampling, latent optimization, inference-time adjustment, dual-objective framework, plug-and-play integration

151. ❌ Efficient Unlearning through Maximizing Relearning Convergence Delay

作者: Khoa Tran, Simon S. Woo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09391v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于机器遗忘（machine unlearning）技术，提出了一种新的评估指标（relearning convergence delay）和遗忘框架（Influence Eliminating Unlearning）。虽然论文涉及预训练模型（pretrained models），但主要研究的是遗忘技术而非大模型技术本身。与大多数关键词（如LLMs、MoE、RLHF、RAG等）无直接关联。仅与’Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联（5分），因为论文提到从预训练模型中移除数据，但这不是论文的核心创新点。其他关键词均不相关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了一种新的机器遗忘评估指标（relearning convergence delay）和遗忘框架（Influence Eliminating Unlearning），以更全面地评估和实现从预训练模型中移除问题数据，同时保持模型在保留数据上的性能。

摘要翻译

机器遗忘旨在从预训练模型中移除误标注、受污染或存在问题的数据，这带来了诸多挑战。当前的遗忘方法与评估指标仅关注模型预测层面的变化，这限制了对模型底层真实数据特性的深入洞察。为解决这一问题，我们提出了一种称为“再学习收敛延迟”的新评估指标，该指标同时捕捉权重空间与预测空间的变化，从而对模型关于被遗忘数据集的理解提供更全面的评估。该指标可用于评估从已遗忘模型中恢复被遗忘数据的风险。基于此，我们提出了“影响消除遗忘”框架，该框架通过降低遗忘集的性能来消除其影响，并在保持留存集准确性的同时，结合权重衰减与向模型权重注入噪声的策略。大量实验表明，我们的方法在现有评估指标及我们提出的再学习收敛延迟指标上均表现优异，接近理想的遗忘性能。我们提供了理论保证，包括指数收敛性与上界分析，并在分类与生成式遗忘任务中提供了模型保持强留存能力及抵抗再学习的实证依据。

摘要 (Abstract)

Machine unlearning poses challenges in removing mislabeled, contaminated, or problematic data from a pretrained model. Current unlearning approaches and evaluation metrics are solely focused on model predictions, which limits insight into the model’s true underlying data characteristics. To address this issue, we introduce a new metric called relearning convergence delay, which captures both changes in weight space and prediction space, providing a more comprehensive assessment of the model’s understanding of the forgotten dataset. This metric can be used to assess the risk of forgotten data being recovered from the unlearned model. Based on this, we propose the Influence Eliminating Unlearning framework, which removes the influence of the forgetting set by degrading its performance and incorporates weight decay and injecting noise into the model’s weights, while maintaining accuracy on the retaining set. Extensive experiments show that our method outperforms existing metrics and our proposed relearning convergence delay metric, approaching ideal unlearning performance. We provide theoretical guarantees, including exponential convergence and upper bounds, as well as empirical evidence of strong retention and resistance to relearning in both classification and generative unlearning tasks.

关键词: machine unlearning, relearning convergence delay, influence eliminating unlearning, pretrained model, forgotten data, weight decay, noise injection

152. ❌ Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing

作者: Zhuohan Ouyang, Zhe Qian, Wenhuo Cui, Chaoqun Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09386v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于基于流的图像编辑方法，提出了一种区域约束的GRPO后训练框架。论文与’Post-training OR Supervised Fine-tuning OR SFT’高度相关（10分），因为核心贡献是改进GRPO后训练方法。与’Instruction Tuning OR Alignment OR Value Alignment’有一定关联（5分），因为涉及指令遵循的图像编辑，但并非大语言模型的对齐。其他关键词均不相关（0分），因为论文不涉及大语言模型、MoE、量化、推理加速、科学AI等主题。

!!! tip deepseek-chat TL;DR

该论文针对基于流的指令引导图像编辑中全局探索扰动非目标区域导致奖励方差大的问题，提出了区域约束的GRPO后训练框架RC-GRPO-Editing，通过区域解耦的初始噪声扰动和注意力集中奖励，在CompBench上实现了编辑区域指令遵循和非目标内容保留的改进。

摘要翻译

指令引导的图像编辑需要在目标修改与非目标区域保持之间取得平衡。近年来，基于流的模型因其高保真度和高效的确定性常微分方程（ODE）采样，已成为指令引导图像编辑领域强大且日益广泛采用的骨干网络。在此基础上，基于GRPO的奖励驱动后训练方法被探索用于直接优化编辑相关的奖励，从而提升指令遵循能力和编辑一致性。然而，现有方法常受噪声信用分配问题困扰：全局探索也会扰动非目标区域，导致组内奖励方差增大并产生噪声GRPO优势。为解决此问题，我们提出RC-GRPO-Editing，这是一个在确定性ODE采样下用于基于流的图像编辑的区域约束GRPO后训练框架。该框架通过抑制背景引起的干扰方差，实现更清晰的局部信用分配，从而在保持非目标内容的同时，提升编辑区域的指令遵循度。具体而言，我们通过区域解耦的初始噪声扰动来局部化探索，以减少背景引起的奖励方差并稳定GRPO优势；同时引入注意力集中奖励，使整个生成过程中的交叉注意力与预期编辑区域对齐，从而减少非目标区域的意外改变。在CompBench上的实验表明，该方法在编辑区域指令遵循度和非目标区域保持方面均取得了持续改进。

摘要 (Abstract)

Instruction-guided image editing requires balancing target modification with non-target preservation. Recently, flow-based models have emerged as a strong and increasingly adopted backbone for instruction-guided image editing, thanks to their high fidelity and efficient deterministic ODE sampling. Building on this foundation, GRPO-based reward-driven post-training has been explored to directly optimize editing-specific rewards, improving instruction following and editing consistency. However, existing methods often suffer from noisy credit assignment: global exploration also perturbs non-target regions, inflating within-group reward variance and yielding noisy GRPO advantages. To address this, we propose RC-GRPO-Editing, a region-constrained GRPO post-training framework for flow-based image editing under deterministic ODE sampling. It suppresses background-induced nuisance variance to enable cleaner localized credit assignment, improving editing region instruction adherence while preserving non-target content. Concretely, we localize exploration via region-decoupled initial noise perturbations to reduce background-induced reward variance and stabilize GRPO advantages, and introduce an attention concentration reward that aligns cross-attention with the intended editing region throughout the rollout, reducing unintended changes in non-target regions. Experiments on CompBench show consistent improvements in editing region instruction adherence and non-target preservation.

关键词: flow-based image editing, instruction-guided editing, GRPO post-training, region-constrained optimization, deterministic ODE sampling, credit assignment, attention concentration reward, non-target preservation

153. ❌ Cluster-First Labelling: An Automated Pipeline for Segmentation and Morphological Clustering in Histology Whole Slide Images

作者: Muhammad Haseeb Ahmad, Sharmila Rajendran, Damion Young, Jon Mason 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09370v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于组织病理学全切片图像（WSI）的自动化分割和形态学聚类，提出了一种基于聚类优先的标注流水线。论文使用了Cellpose-SAM进行分割、ResNet-50提取嵌入、UMAP降维和DBSCAN聚类等技术，属于计算机视觉和生物医学图像分析领域。所有关键词中，仅“AI for Science OR Bioinformatics OR Cheminformatics”与论文高度相关，因为论文涉及生物信息学（Bioinformatics）中的组织图像分析，属于AI在科学领域的应用。其他关键词均与大模型、深度学习技术原理、推理、对齐、优化等无关，因此评分为0。加权总分仅来自该关键词的10分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于聚类优先的自动化流水线，用于组织病理学全切片图像的分割和形态学聚类，通过标注代表性集群而非单个对象，将标注工作量减少了数个数量级，并在13种组织类型上实现了96.8%的加权集群-标签对齐准确率。

摘要翻译

在组织学全切片图像（WSI）中标注组织成分是一项极其耗费人力的工作：单张切片可能包含数以万计的结构——细胞、细胞核及其他形态学上可区分的对象——每一个都需要手动勾画边界并进行分类。我们提出了一种云原生、端到端的处理流程，通过“聚类优先”范式实现了该过程的自动化。我们的系统首先将WSI分割为图块，过滤掉被认为不太可能包含有价值信息的图块，使用Cellpose-SAM分割组织成分（包括细胞、细胞核及其他形态相似结构），通过预训练的ResNet-50提取神经嵌入特征，利用UMAP进行降维，并采用DBSCAN聚类对形态相似的对象进行分组。在此范式下，标注人员只需标注具有代表性的聚类而非单个对象，从而将标注工作量降低数个数量级。我们在来自三个物种（人类、大鼠、兔子）的13种不同组织类型、共计3,696个组织成分上评估了该流程，通过基于图块的匈牙利算法匹配来衡量无监督聚类结果与独立人工标注的一致性。我们的系统实现了96.8%的加权聚类-标注对齐准确率，其中13种组织类型中有7种达到完全一致。该处理流程、配套的标注网络应用程序及全部评估代码均已作为开源软件发布。

摘要 (Abstract)

Labelling tissue components in histology whole slide images (WSIs) is prohibitively labour-intensive: a single slide may contain tens of thousands of structures–cells, nuclei, and other morphologically distinct objects–each requiring manual boundary delineation and classification. We present a cloudnative, end-to-end pipeline that automates this process through a cluster-first paradigm. Our system tiles WSIs, filters out tiles deemed unlikely to contain valuable information, segments tissue components with Cellpose-SAM (including cells, nuclei, and other morphologically similar structures), extracts neural embeddings via a pretrained ResNet-50, reduces dimensionality with UMAP, and groups morphologically similar objects using DBSCAN clustering. Under this paradigm, a human annotator labels representative clusters rather than individual objects, reducing annotation effort by orders of magnitude. We evaluate the pipeline on 3,696 tissue components across 13 diverse tissue types from three species (human, rat, rabbit), measuring how well unsupervised clusters align with independent human labels via per-tile Hungarian-algorithm matching. Our system achieves a weighted cluster-label alignment accuracy of 96.8%, with 7 of 13 tissue types reaching perfect agreement. The pipeline, a companion labelling web application, and all evaluation code are released as open-source software.

关键词: histology whole slide images, automated segmentation, morphological clustering, cluster-first labelling, Cellpose-SAM, DBSCAN clustering, annotation efficiency, tissue component analysis

154. ❌ EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

作者: Shipeng Zhu, Ang Chen, Na Nie, Pengfei Fang, Min-Ling Zhang, Hui Xue 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09367v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文提出EpiAgent系统，核心是基于LLM的中央规划器协调多模态分析、历史经验、专业修复工具和迭代自我精炼，实现古代铭文修复。高度相关关键词：LLM（核心组件）、LLM Agents（系统本质）、Tool Use（协调专业工具）、Self-Reflection（迭代自我精炼）。中等相关：Multi-agent Systems（涉及协调但非多智能体系统）、AI for Science（文化遗产保护属于科学应用）。其他关键词未涉及技术细节或应用领域。

!!! tip deepseek-chat TL;DR

论文提出EpiAgent系统，通过基于LLM的智能体协调多模态分析和专业工具，解决了古代铭文修复中复杂退化问题，实现了优于现有方法的修复质量和泛化能力。

摘要翻译

古代铭文作为文化记忆的载体，历经数百年环境与人为因素导致的劣化。恢复其视觉与文本信息交织的完整性，是数字遗产保护领域最具挑战性的任务之一。然而，现有基于人工智能的方法通常依赖固定流程，难以应对现实中如此复杂且异质的劣化情况。受人类碑铭学家技能协同工作流程的启发，我们提出EpiAgent——一个以智能体为核心的系统，将铭文修复构建为分层规划问题。该系统遵循“观察-构思-执行-重评估”范式，由基于大语言模型（LLM）的中心规划器协调多模态分析、历史经验、专用修复工具和迭代自我优化模块的协作。这种以智能体为中心的协同机制实现了超越传统单次处理方法的灵活自适应修复流程。在真实世界的劣化铭文数据集上，EpiAgent相比现有方法展现出更优的修复质量与更强的泛化能力。本工作标志着向专家级智能体驱动文化遗产修复迈出了重要一步。代码已发布于https://github.com/blackprotoss/EpiAgent。

摘要 (Abstract)

Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at https://github.com/blackprotoss/EpiAgent.

关键词: Ancient inscription restoration, Agent-centric system, LLM-based planner, Multimodal analysis, Hierarchical planning, Self-refinement, Cultural heritage preservation, Digital heritage

155. ❌ Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors

作者: Ying Zang, Yidong Han, Chaotao Ding, Yuanqi Hu, Deyi Ji, Qi Zhu, Xuanfu Li, Jin Ma, Lingyun Sun, Tianrun Chen, Lanyun Zhu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09366v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的动态4D场景重建，提出了一种基于Transformer的几何重建框架，并引入了不确定性感知机制。虽然论文提到了'3D foundation models’，但这指的是视觉几何基础模型（如VGGT），而非语言模型。论文的核心技术涉及多视图几何、注意力机制、不确定性建模和空间连续性约束，所有给定的关键词均针对大语言模型（LLM）及其相关技术（如训练方法、推理优化、对齐、代理等），与论文的视觉几何重建主题完全无关。因此，所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种不确定性感知的4D视觉几何Transformer框架，通过熵引导子空间投影、局部一致性几何净化和不确定性感知跨视图一致性机制，有效解决了动态场景重建中的几何模糊问题，在动态基准测试中显著降低了平均精度误差并提高了分割F-measure。

摘要翻译

重建动态4D场景是一项重要但具有挑战性的任务。尽管VGGT等三维基础模型在静态场景中表现优异，但在动态序列中常因运动导致的显著几何模糊性而受限。为解决这一问题，我们提出了一个通过建模重建过程不同阶段的不确定性来解耦动态与静态组件的框架。该方法引入了三种协同机制：（1）熵引导子空间投影，利用信息论加权自适应聚合多头注意力分布，有效从语义噪声中分离动态运动线索；（2）局部一致性驱动的几何净化，通过基于半径的邻域约束强制空间连续性以消除结构异常值；（3）不确定性感知的跨视图一致性，将多视图投影优化构建为异方差最大似然估计问题，并以深度置信度作为概率权重。在动态基准测试上的实验表明，本方法优于当前最先进技术，将平均精度误差降低13.43%，分割F值提升10.49%。该框架保持了前馈推理的高效性，且无需任务特定微调或逐场景优化。

摘要 (Abstract)

Reconstructing dynamic 4D scenes is an important yet challenging task. While 3D foundation models like VGGT excel in static settings, they often struggle with dynamic sequences where motion causes significant geometric ambiguity. To address this, we present a framework designed to disentangle dynamic and static components by modeling uncertainty across different stages of the reconstruction process. Our approach introduces three synergistic mechanisms: (1) Entropy-Guided Subspace Projection, which leverages information-theoretic weighting to adaptively aggregate multi-head attention distributions, effectively isolating dynamic motion cues from semantic noise; (2) Local-Consistency Driven Geometry Purification, which enforces spatial continuity via radius-based neighborhood constraints to eliminate structural outliers; and (3) Uncertainty-Aware Cross-View Consistency, which formulates multi-view projection refinement as a heteroscedastic maximum likelihood estimation problem, utilizing depth confidence as a probabilistic weight. Experiments on dynamic benchmarks show that our approach outperforms current state-of-the-art methods, reducing Mean Accuracy error by 13.43% and improving segmentation F-measure by 10.49%. Our framework maintains the efficiency of feed-forward inference and requires no task-specific fine-tuning or per-scene optimization.

关键词: 4D scene reconstruction, visual geometry transformer, uncertainty-aware priors, dynamic motion disentanglement, multi-view consistency, entropy-guided attention, geometry purification, feed-forward inference

156. ❌ LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation

作者: Aytaç Sekmen, Fatih Emre Gunes, Furkan Horoz, Hüseyin Umut Işık, Mehmet Alp Ozaydin, Onur Altay Topaloglu, Şahin Umutcan Üstündaş, Yurdasen Alp Yeni, Halil Ersin Soken, Erol Sahin, Ramazan Gokberk Cinbis, Sinan Kalkan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09352v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 该论文专注于月球单目深度估计（MDE）的计算机视觉任务，属于AI for Science（AI4Science）在空间探索领域的应用，因此与’AI for Science OR Bioinformatics OR Cheminformatics’关键词高度相关（8分）。论文提到在合成数据上微调基础模型（foundation model）进行领域适应，这涉及预训练/领域适应和微调技术，因此与’Pre-training OR Continual Pre-training OR Domain Adaptation’和’Post-training OR Supervised Fine-tuning OR SFT’有一定关联（各5分）。论文未涉及大语言模型（LLMs）、MoE、推理加速、对齐、RAG、智能体等大模型核心技术，因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了LuMon基准测试框架和数据集，用于评估月球单目深度估计方法，并通过系统评估揭示了当前网络在跨域迁移到真实月球图像时存在显著性能差距。

摘要翻译

单目深度估计（Monocular Depth Estimation, MDE）对于利用光电相机实现月球车自主导航至关重要。然而，将地面MDE网络部署至月球会因强烈的阴影、缺乏纹理的月壤以及零大气散射条件而产生严重的领域差异。现有评估方法依赖于模拟环境，这些环境既无法复现上述条件，也缺乏真实的度量级地面实况数据。为解决此问题，我们提出了LuMon——一个用于评估月球探索MDE方法的综合性基准测试框架。我们引入了两个新颖的数据集：包含来自真实嫦娥三号任务的高质量立体地面实况深度的数据集，以及CHERI暗模拟数据集。利用该框架，我们在合成、模拟及真实数据集上对多种先进架构进行了系统的零样本评估。我们针对陨石坑、岩石、极端光照变化及不同深度范围等任务关键挑战，严格评估了其性能。此外，我们通过在合成数据上微调一个基础模型，建立了一个从模拟到真实的领域自适应基线。尽管这种自适应方法在领域内带来了显著的性能提升，但其对真实月球图像的泛化能力极为有限，这凸显了跨领域迁移中持续存在的差距。我们广泛的分析揭示了当前网络的固有局限性，并为指导未来地外感知与领域自适应研究奠定了标准基础。

摘要 (Abstract)

Monocular Depth Estimation (MDE) is crucial for autonomous lunar rover navigation using electro-optical cameras. However, deploying terrestrial MDE networks to the Moon brings a severe domain gap due to harsh shadows, textureless regolith, and zero atmospheric scattering. Existing evaluations rely on analogs that fail to replicate these conditions and lack actual metric ground truth. To address this, we present LuMon, a comprehensive benchmarking framework to evaluate MDE methods for lunar exploration. We introduce novel datasets featuring high-quality stereo ground truth depth from the real Chang’e-3 mission and the CHERI dark analog dataset. Utilizing this framework, we conduct a systematic zero-shot evaluation of state-of-the-art architectures across synthetic, analog, and real datasets. We rigorously assess performance against mission critical challenges like craters, rocks, extreme shading, and varying depth ranges. Furthermore, we establish a sim-to-real domain adaptation baseline by fine tuning a foundation model on synthetic data. While this adaptation yields drastic in-domain performance gains, it exhibits minimal generalization to authentic lunar imagery, highlighting a persistent cross-domain transfer gap. Our extensive analysis reveals the inherent limitations of current networks and sets a standard foundation to guide future advancements in extraterrestrial perception and domain adaptation.

关键词: Monocular Depth Estimation, Lunar Exploration, Domain Adaptation, Benchmarking, Synthetic-to-Real Transfer, Chang’e-3 Mission, Zero-shot Evaluation, Extraterrestrial Perception

157. ❌ VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

作者: Xiaolei Lang, Yang Wang, Yukun Zhou, Chaojun Ni, Kerui Li, Jiagang Zhu, Tianze Liu, Jiajun Lv, Xingxing Zuo, Yun Ye, Guan Huang, Xiaofeng Wang, Zheng Zhu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09330v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文VAG专注于机器人基础模型中的世界-动作模型，提出了一种双流框架来联合生成视频和动作轨迹，以解决现有世界模型缺乏配对动作轨迹的问题。该研究与关键词’World Models AND General World Models’高度相关（10分），因为论文明确讨论并改进了世界模型在具身数据合成中的应用。然而，论文未涉及大语言模型（LLMs）、深度学习技术原理创新（如MoE、Scaling Laws、PEFT等）、推理方法（如CoT、System 2）、代理系统、模型优化（如量化、推理加速）或科学AI应用，因此其他所有关键词均得0分。论文的核心是视频-动作生成框架，而非大模型技术或广泛的大模型应用。

!!! tip deepseek-chat TL;DR

论文VAG提出了一种基于流匹配的双流框架，用于联合生成视频和动作轨迹，以解决现有世界模型在机器人策略学习中缺乏对齐视频-动作对的问题，从而为具身数据合成提供有效的合成预训练数据。

摘要翻译

基于大规模人类遥操作数据训练的机器人基础模型近期取得进展，使机器人能够执行日益复杂的现实世界任务。然而，扩展这些系统仍然困难，因为收集任务特定的演示数据成本高昂且劳动密集。合成数据，特别是生成视频，提供了一个有前景的方向，但现有的世界模型（World Models, WMs）并不直接适用于策略学习，因为它们无法提供配对的动作轨迹。世界-动作（World-Action, WA）模型通过预测动作并输出视觉内容部分解决了这一问题，但往往缺乏较强的视频-动作对齐能力；而首先生成视频再推断动作的两阶段流程则存在效率低下和误差累积的问题。为应对这些局限，我们提出VAG——一个基于流匹配的统一双流框架，能够在视觉和语言条件约束下联合生成视频与动作。通过同步两个分支的去噪过程，并利用自适应三维池化机制将紧凑的全局视频上下文传递至动作分支，VAG提升了生成过程中的跨模态一致性。在仿真与真实场景的实验中，VAG生成的视频-动作对具有良好对齐性与竞争力的预测质量，支持可执行轨迹回放，并能提供有效的合成预训练数据以提升下游策略的泛化能力，这表明其有潜力成为具身数据合成的实用化世界-动作模型。

摘要 (Abstract)

Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.

关键词: World-Action Models, Video-Action Generation, Flow Matching, Dual-Stream Framework, Embodied Data Synthesis, Robot Foundation Models, Synthetic Data, Cross-modal Consistency

158. ❌ From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection

作者: Narges Rashvand, Shanle Yao, Armin Danesh Pazho, Babak Rahimi Ardabili, Hamed Tabkhi 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09327v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于视频异常检测（VAD）的评估方法，特别是从帧级评估转向事件级评估。论文内容涉及计算机视觉、视频分析、异常检测和评估指标，但完全不涉及大语言模型（LLM）、深度学习技术原理创新或AI在科学领域的应用。所有关键词均与大模型、深度学习技术或AI for Science相关，而本文研究的是传统的计算机视觉任务，因此所有关键词的相关度均为0。

!!! tip deepseek-chat TL;DR

该论文指出传统基于帧的视频异常检测评估方法存在缺陷，提出了事件中心的评估视角，开发了事件定位方法并建立了首个基于事件的VAD评估标准，结果显示现有模型在事件级检测性能上远低于帧级性能。

摘要翻译

基于姿态的视频异常检测（VAD）因其隐私保护特性和对环境变化的鲁棒性而受到广泛关注。然而，传统的帧级评估将视频视为孤立帧的集合，这与异常在现实世界中的表现形式及应对方式存在根本性错位。在运营监控系统中，关键不在于标记单个帧，而在于对连贯的异常事件——即具有可识别起始点和持续时间的连续时间段——进行可靠的检测、定位与报告。帧级指标忽视了这一区别，因此，对于任何需要可操作、事件级警报的部署场景，它们会系统性地高估模型性能。在本研究中，我们提出将VAD的研究视角转向以事件为中心。我们首先审计了广泛使用的VAD基准数据集，包括SHT[19]、CHAD[6]、NWPUC[4]和HuVAD[25]，以刻画其事件结构。随后，我们引入了两种时序事件定位策略：一种采用分层高斯平滑和自适应二值化的分数优化流程，以及一种直接生成事件级检测结果的端到端双分支模型。最后，我们通过借鉴时序动作定位（Temporal Action Localization）的评估指标，包括基于时序交并比（tIoU）的事件匹配和多阈值F1评估，为VAD建立了首个基于事件的评估标准。我们的结果量化了显著的性能差距：尽管所有最先进（SoTA）模型在NWPUC[4]数据集上的帧级AUC-ROC均超过52%，但其事件级定位精度即使在最低tIoU=0.2时也低于10%，且在所有阈值下的事件级平均F1分数仅为0.11。本工作的代码库公开于https://github.com/TeCSAR-UNCC/EventCentric-VAD。

摘要 (Abstract)

Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event, a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction, and as a result, they systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on the NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU=0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at https://github.com/TeCSAR-UNCC/EventCentric-VAD.

关键词: Video Anomaly Detection, Event-centric Evaluation, Temporal Event Localization, Pose-based VAD, Frame-level vs Event-level, Temporal Action Localization metrics, Hierarchical Gaussian smoothing, Dual-Branch Model

159. ❌ Multimodal Anomaly Detection for Human-Robot Interaction

作者: Guilherme Ribeiro, Iordanis Antypas, Leonardo Bizzaro, João Bimbo, Nuno Cruz Garcia 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09326v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于人机交互中的多模态异常检测，提出MADRI框架，使用特征向量重建方法检测异常。研究内容涉及计算机视觉、机器人学和异常检测，但未涉及任何大语言模型、深度学习技术原理或AI for Science的具体关键词。所有关键词均与大模型技术、训练方法、推理优化、对齐技术、模型压缩、AI科学应用等相关，而本文研究的是传统计算机视觉和机器人异常检测，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出MADRI框架，通过将视频流转换为语义特征向量并结合机器人传感器数据和场景图进行重建，有效检测人机协作中的多模态异常，实验证明该方法能提高异常检测性能。

摘要翻译

确保人机交互（HRI）的安全性与可靠性，需要及时检测可能导致系统故障或不安全行为的意外事件。因此，异常检测在使机器人能够识别并响应协作任务中偏离正常操作的情况方面起着关键作用。尽管在人机交互领域，重建模型已被积极研究，但直接在特征向量上运行的方法仍很大程度上未被探索。在本研究中，我们提出了MADRI框架，该框架首先将视频流转换为具有语义意义的特征向量，然后执行基于重建的异常检测。此外，我们通过机器人的内部传感器读数和场景图（Scene Graph）对这些视觉特征向量进行增强，使模型能够同时捕捉视觉环境中的外部异常和机器人自身的内部故障。为评估我们的方法，我们收集了一个自定义数据集，包含正常和异常条件下简单的拾放机器人任务。实验结果表明，仅基于视觉特征向量的重建就能有效检测异常，而结合其他模态则进一步提升了检测性能，突显了多模态特征重建对于人机协作中鲁棒异常检测的益处。

摘要 (Abstract)

Ensuring safety and reliability in human-robot interaction (HRI) requires the timely detection of unexpected events that could lead to system failures or unsafe behaviours. Anomaly detection thus plays a critical role in enabling robots to recognize and respond to deviations from normal operation during collaborative tasks. While reconstruction models have been actively explored in HRI, approaches that operate directly on feature vectors remain largely unexplored. In this work, we propose MADRI, a framework that first transforms video streams into semantically meaningful feature vectors before performing reconstruction-based anomaly detection. Additionally, we augment these visual feature vectors with the robot’s internal sensors’ readings and a Scene Graph, enabling the model to capture both external anomalies in the visual environment and internal failures within the robot itself. To evaluate our approach, we collected a custom dataset consisting of a simple pick-and-place robotic task under normal and anomalous conditions. Experimental results demonstrate that reconstruction on vision-based feature vectors alone is effective for detecting anomalies, while incorporating other modalities further improves detection performance, highlighting the benefits of multimodal feature reconstruction for robust anomaly detection in human-robot collaboration.

关键词: anomaly detection, human-robot interaction, multimodal, feature vectors, reconstruction, scene graph, pick-and-place, robust detection

160. ❌ Structure-Aware Fine-Grained Gaussian Splatting for Expressive Avatar Reconstruction

作者: Yuze Su, Hongsong Wang, Jie Gui, Liang Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09324v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉和图形学领域，提出了一种基于高斯泼溅（Gaussian Splatting）的3D人体化身重建方法，涉及动态特征捕捉、结构感知建模和手部细节细化等技术。所有评分关键词均与大语言模型（LLM）、深度学习技术原理或AI for Science应用直接相关，而本文研究的是3D重建和计算机视觉，未涉及任何大模型技术、深度学习创新原理或科学领域AI应用，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为SFGS的新方法，用于从单目视频序列中重建具有精细细节（如手部动作和面部表情）的3D人体化身，通过空间-时间特征捕捉和结构感知建模，在单阶段训练中实现了优于现有基准的高保真度结果。

摘要翻译

从单目视频中重建具有照片级真实感且保持拓扑结构的人体虚拟化身，仍然是计算机视觉与图形学领域的一项重大挑战。现有的三维人体虚拟化身建模方法虽能有效捕捉身体运动，却往往难以精确建模手部动作和面部表情等精细细节。为此，我们提出了一种结构感知的细粒度高斯溅射方法（Structure-aware Fine-grained Gaussian Splatting, SFGS），这是一种从单目视频序列重建富有表现力且连贯的全身三维人体虚拟化身的新方法。SFGS 同时利用仅空间的三平面（triplane）和时序感知的六平面（hexplane）来捕捉连续帧间的动态特征。我们设计了一个结构感知的高斯模块，以空间连贯的方式捕捉姿态相关的细节，并提升姿态与纹理的表现力。为了更好地建模手部形变，我们还提出了一个基于细粒度手部重建的残差优化模块。我们的方法仅需单阶段训练，在定量与定性评估中均优于现有先进基线，能够生成具有自然运动和精细细节的高保真虚拟化身。代码已发布于 Github：https://github.com/Su245811YZ/SFGS

摘要 (Abstract)

Reconstructing photorealistic and topology-aware human avatars from monocular videos remains a significant challenge in the fields of computer vision and graphics. While existing 3D human avatar modeling approaches can effectively capture body motion, they often fail to accurately model fine details such as hand movements and facial expressions. To address this, we propose Structure-aware Fine-grained Gaussian Splatting (SFGS), a novel method for reconstructing expressive and coherent full-body 3D human avatars from a monocular video sequence. The SFGS use both spatial-only triplane and time-aware hexplane to capture dynamic features across consecutive frames. A structure-aware gaussian module is designed to capture pose-dependent details in a spatially coherent manner and improve pose and texture expression. To better model hand deformations, we also propose a residual refinement module based on fine-grained hand reconstruction. Our method requires only a single-stage training and outperforms state-of-the-art baselines in both quantitative and qualitative evaluations, generating high-fidelity avatars with natural motion and fine details. The code is on Github: https://github.com/Su245811YZ/SFGS

关键词: 3D human avatar reconstruction, Gaussian Splatting, monocular video, fine-grained details, hand movements, facial expressions, structure-aware modeling, single-stage training

161. ❌ UHD Low-Light Image Enhancement via Real-Time Enhancement Methods with Clifford Information Fusion

作者: Xiaohan Wang, Chen Wu, Dawei Zhao, Guangwei Gao, Dianjie Lu, Guijuan Zhang, Linwei Fan, Xu Lu, Shuai Wu, Hang Wei, Zhuoran Zheng 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09321v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的低光图像增强，提出了一种基于Clifford代数和轻量级U-Net的实时超高清图像增强网络。论文的核心技术涉及几何特征融合、深度可分离卷积、FP16混合精度计算等计算机视觉和图像处理技术。所有评分关键词均与大语言模型、深度学习技术原理创新或AI在科学领域的应用直接相关，而本文研究的是传统的图像处理任务（低光增强），未涉及任何大语言模型、深度学习基础技术或AI for Science的具体应用。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于Clifford代数和轻量级U-Net的实时超高清低光图像增强网络，解决了现有方法在边缘设备上无法实现毫秒级推理的瓶颈，并在多个恢复指标上超越了最先进模型。

摘要翻译

考虑到效率问题，超高清低光照图像复原极具挑战性。现有基于Transformer架构或高维复杂卷积神经网络的方法常受限于“内存墙”瓶颈，难以在边缘设备上实现毫秒级推理。为解决此问题，我们提出一种基于二维欧几里得空间中克利福德代数几何特征融合的新型实时超高清低光照增强网络。首先，我们构建了一个分辨率逐层递增的四层特征金字塔，通过高斯模糊核将输入图像分解为低频与高频结构分量，并采用基于深度可分离卷积的轻量化U-Net进行双分支特征提取。其次，为克服传统高低频特征融合导致的结构信息丢失与伪影问题，我们引入空间感知克利福德代数，将特征张量映射到多向量空间（标量、向量、双向量），并利用克利福德相似性进行特征聚合，同时抑制噪声并保持纹理。在重建阶段，网络输出自适应的伽马图与增益图，通过Retinex理论进行物理约束的非线性亮度调整。结合FP16混合精度计算与动态算子融合技术，我们的方法在单张消费级设备上实现了对4K/8K图像的毫秒级推理，并在多项复原指标上超越了现有最优模型。

摘要 (Abstract)

Considering efficiency, ultra-high-definition (UHD) low-light image restoration is extremely challenging. Existing methods based on Transformer architectures or high-dimensional complex convolutional neural networks often suffer from the “memory wall” bottleneck, failing to achieve millisecond-level inference on edge devices. To address this issue, we propose a novel real-time UHD low-light enhancement network based on geometric feature fusion using Clifford algebra in 2D Euclidean space. First, we construct a four-layer feature pyramid with gradually increasing resolution, which decomposes input images into low-frequency and high-frequency structural components via a Gaussian blur kernel, and adopts a lightweight U-Net based on depthwise separable convolution for dual-branch feature extraction. Second, to resolve structural information loss and artifacts from traditional high-low frequency feature fusion, we introduce spatially aware Clifford algebra, which maps feature tensors to a multivector space (scalars, vectors, bivectors) and uses Clifford similarity to aggregate features while suppressing noise and preserving textures. In the reconstruction stage, the network outputs adaptive Gamma and Gain maps, which perform physically constrained non-linear brightness adjustment via Retinex theory. Integrated with FP16 mixed-precision computation and dynamic operator fusion, our method achieves millisecond-level inference for 4K/8K images on a single consumer-grade device, while outperforming state-of-the-art (SOTA) models on several restoration metrics.

关键词: UHD low-light image enhancement, real-time enhancement, Clifford algebra, feature fusion, lightweight U-Net, depthwise separable convolution, Retinex theory, millisecond-level inference

162. ❌ VAGNet: Vision-based accident anticipation with global features

作者: Vipooshan Vipulananthan, Charith D. Chitraranjan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09305v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文VAGNet专注于计算机视觉和自动驾驶领域的事故预测，使用基于Transformer和Graph的深度学习架构以及VideoMAE-V2视觉基础模型进行全局特征提取。虽然涉及深度学习技术，但所有评分关键词均针对大语言模型（LLM）及其相关技术（如训练方法、推理优化、对齐、代理系统等），而本论文完全不涉及语言模型、文本处理或LLM技术栈的任何方面，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于全局特征的视觉事故预测网络VAGNet，通过使用VideoMAE-V2提取交通场景的全局特征，在多个基准数据集上实现了更高的预测精度和计算效率。

摘要翻译

交通事故是全球范围内导致伤亡的主要原因之一。因此，提前预判危险情境的能力至关重要。自动化事故预测能够通过驾驶员警报和碰撞规避操作实现及时干预，构成高级驾驶辅助系统的关键组成部分。在自动驾驶中，此类预测能力支持主动安全行为，例如启动防御性驾驶以及在必要时进行人工接管。使用行车记录仪视频作为输入提供了一种经济高效的解决方案，但由于现实驾驶场景的复杂性，这仍具挑战性。事故预测系统需要实时运行。然而，现有方法涉及从每个检测到的物体中提取特征，计算成本高昂。我们提出VAGNet，一种深度神经网络，它利用交通场景的全局特征从行车记录仪视频中学习预测事故，而无需显式的物体级特征。该网络由Transformer模块和图模块组成，并使用视觉基础模型VideoMAE-V2进行全局特征提取。在四个基准数据集（DAD、DoTA、DADA和Nexar）上的实验表明，与现有方法相比，我们的方法以更高的平均精度和平均事故前时间预测事故，同时计算效率更优。

摘要 (Abstract)

Traffic accidents are a leading cause of fatalities and injuries across the globe. Therefore, the ability to anticipate hazardous situations in advance is essential. Automated accident anticipation enables timely intervention through driver alerts and collision avoidance maneuvers, forming a key component of advanced driver assistance systems. In autonomous driving, such predictive capabilities support proactive safety behaviors, such as initiating defensive driving and human takeover when required. Using dashcam video as input offers a cost-effective solution, but it is challenging due to the complexity of real-world driving scenes. Accident anticipation systems need to operate in real-time. However, current methods involve extracting features from each detected object, which is computationally intensive. We propose VAGNet, a deep neural network that learns to predict accidents from dash-cam video using global features of traffic scenes without requiring explicit object-level features. The network consists of transformer and graph modules, and we use the vision foundation model VideoMAE-V2 for global feature extraction. Experiments on four benchmark datasets (DAD, DoTA, DADA, and Nexar) show that our method anticipates accidents with higher average precision and mean time-to-accident while being computationally more efficient compared to existing methods.

关键词: accident anticipation, global features, dashcam video, VideoMAE-V2, transformer, graph neural network, autonomous driving, real-time prediction

163. ❌ Compositional-Degradation UAV Image Restoration: Conditional Decoupled MoE Network and A Benchmark

作者: Jinquan Yan, Zhicheng Zhao, Zhengzheng Tu, Chenglong Li, Jin Tang, Bin Luo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09313v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心贡献是提出DAME-Net（Degradation-Aware Mixture-of-Experts Network），其中Conditioned Decoupled MoE模块是关键创新，因此与’Mixture of Experts OR MoE OR Sparse Models’高度相关（10分）。论文属于计算机视觉中的图像修复领域，应用于无人机图像处理，与’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（5分），因为AI for Science包括AI在科学和工程应用（如遥感、图像分析）。其他关键词主要涉及大语言模型（LLM）技术、训练方法、推理优化、代理系统等，与论文的计算机视觉和图像处理焦点无关，均得0分。

!!! tip deepseek-chat TL;DR

该论文针对无人机图像中多种退化因素（如雨、雾、噪声）同时存在的复合退化问题，提出了一种降解感知的混合专家网络（DAME-Net），通过显式降解感知和条件解耦的MoE模块实现选择性因子校正，并在构建的大规模基准上验证了其优于现有统一修复方法的性能，同时提升了下游目标检测任务的效果。

摘要翻译

无人机影像在大范围测绘、基础设施巡检与应急响应等应用中至关重要。然而，在实际飞行环境中，单幅图像常受到多种退化因素的共同影响，包括雨雾、霾和噪声等，这会降低下游任务的性能。当前的一体化复原方法通常依赖于隐式的退化表征，将多种因素纠缠为单一条件，导致异质校正间相互干扰。为此，我们提出DAME-Net，一种退化感知的专家混合网络，通过将显式退化感知与基于退化条件的重建解耦，实现组合式无人机图像复原。具体而言，我们设计了因子级退化感知模块，通过基于标签相似性引导的软对齐多标签预测，为复原阶段提供显式的各因子退化线索，从而以可解释、可泛化的退化描述替代隐式纠缠的条件。此外，我们开发了条件解耦专家混合模块，利用这些线索进行分阶段条件控制、空频混合处理以及掩码约束的解耦专家路由，从而实现对特定因子的选择性校正，同时抑制无关干扰。我们还构建了首个面向组合式无人机图像复原的大规模基准数据集——多退化无人机复原基准，涵盖从单一退化到四因子组合的43种退化配置，并提供了标准化的可见/不可见退化划分。在MDUR上的大量实验表明，相较于代表性的统一复原方法，本方法取得了持续的性能提升，且在不可见及高阶复合退化上增益更为显著。下游实验进一步验证了该方法对无人机目标检测任务的提升效益。

摘要 (Abstract)

UAV images are critical for applications such as large-area mapping, infrastructure inspection, and emergency response. However, in real-world flight environments, a single image is often affected by multiple degradation factors, including rain, haze, and noise, undermining downstream task performance. Current unified restoration approaches typically rely on implicit degradation representations that entangle multiple factors into a single condition, causing mutual interference among heterogeneous corrections. To this end, we propose DAME-Net, a Degradation-Aware Mixture-of-Experts Network that decouples explicit degradation perception from degradation-conditioned reconstruction for compositional UAV image restoration. Specifically, we design a Factor-wise Degradation Perception module(FDPM) to provide explicit per-factor degradation cues for the restoration stage through multi-label prediction with label-similarity-guided soft alignment, replacing implicit entangled conditions with interpretable and generalizable degradation descriptions. Moreover, we develop a Conditioned Decoupled MoE module(CDMM) that leverages these cues for stage-wise conditioning, spatial-frequency hybrid processing, and mask-constrained decoupled expert routing, enabling selective factor-specific correction while suppressing irrelevant interference. In addition, we construct the Multi-Degradation UAV Restoration benchmark (MDUR), the first large-scale UAV benchmark for compositional UAV image restoration, with 43 degradation configurations from single degradations to four-factor composites and standardized seen/unseen splits.Extensive experiments on MDUR demonstrate consistent improvements over representative unified restoration methods, with greater gains on unseen and higher-order composite degradations. Downstream experiments further validate benefits for UAV object detection.

关键词: UAV image restoration, compositional degradation, Mixture-of-Experts, degradation-aware, conditional decoupled, multi-degradation benchmark, factor-wise correction, degradation perception

164. ❌ GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic

作者: Jiayuan Lu, Rengan Xie, Xuancheng Jin, Zhizhen Wu, Qi Ye, Tian Xie, Hujun Bao, Rui Wang. Yuchi Huo 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09304v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文《GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic》专注于计算机图形学中的渲染技术，提出了一种生成式渲染模型来弥合物理真实渲染（PBR）与照片真实渲染（PRR）之间的差距。论文使用了多模态生成模型、ControlNet架构和多智能体VLM框架来构建数据集，但其核心内容与提供的关键词列表（主要围绕大语言模型、深度学习技术原理及其在科学领域的应用）完全无关。所有关键词均涉及自然语言处理、模型训练、对齐、推理、代理系统等特定技术，而本文研究的是图像生成和渲染，属于计算机视觉和图形学领域，没有涉及任何关键词中的技术或应用。

!!! tip deepseek-chat TL;DR

本文研究了物理真实渲染（PBR）与照片真实渲染（PRR）之间的差距问题，并提出了一种名为GeRM的多模态生成渲染模型，通过分布转移向量场和ControlNet架构来生成可控的照片真实图像。

摘要翻译

数十年来，基于物理的渲染（Physically-Based Rendering, PBR）一直是合成逼真图像的基础，因此有时被粗略地称为照片级真实感渲染（Photorealistic Rendering, PRR）。尽管PBR确实是一种保证物理真实性的光传输数学模拟，但照片级真实感还额外依赖于对现实世界几何与外观的真实数字模型，这使得从PBR到PRR（简称P2P）之间存在一个尚未被充分探索的鸿沟。因此，通向照片级真实感的路径面临一个关键困境：显式模拟PRR受限于难以获取的现实存在物的真实数字模型，而隐式生成模型则牺牲了可控性与几何一致性。基于这一洞察，本文提出了缓解P2P鸿沟的问题、数据与方法，并首次提出了一种多模态生成式渲染模型——GeRM，以统一PBR与PRR。GeRM将几何缓冲区（G-buffers）等物理属性与文本提示相结合，通过渐进式增量注入来生成可控的照片级真实感图像，使用户能够在严格的物理保真度与感知层面的照片级真实感之间流畅地探索连续谱。在技术上，我们将PBR与PRR图像之间的过渡建模为一种分布迁移，并旨在学习一个分布迁移向量场（DTV Field）来指导这一过程。为定义学习目标，我们首先利用一个多智能体视觉语言模型（VLM）框架构建了一个专家引导的成对P2P迁移数据集，命名为P2P-50K，其中数据集的每个配对样本对应于DTV Field中的一个迁移向量。随后，我们提出了一种多条件控制网络（multi-condition ControlNet）来学习DTV Field，该网络在几何缓冲区、文本提示及增强区域线索的引导下，合成PBR图像并逐步将其过渡为PRR图像。

摘要 (Abstract)

For decades, Physically-Based Rendering (PBR) is the fundation of synthesizing photorealisitic images, and therefore sometimes roughly referred as Photorealistic Rendering (PRR). While PBR is indeed a mathematical simulation of light transport that guarantees physical reality, photorealism has additional reliance on the realistic digital model of geometry and appearance of the real world, leaving a barely explored gap from PBR to PRR (P2P). Consequently, the path toward photorealism faces a critical dilemma: the explicit simulation of PRR encumbered by unreachable realistic digital models for real-world existence, while implicit generation models sacrifice controllability and geometric consistency. Based on this insight, this paper presents the problem, data, and approach of mitigating P2P gap, followed by the first multi-modal generative rendering model, dubbed GeRM, to unify PBR and PRR. GeRM integrates physical attributes like G-buffers with text prompts, and progressive incremental injection to generate controllable photorealistic images, allowing users to fluidly navigate the continuum between strict physical fidelity and perceptual photorealism. Technically, we model the transition between PBR and PRR images as a distribution transfer and aim to learn a distribution transfer vector field (DTV Field) to guide this process. To define the learning objective, we first leverage a multi-agent VLM framework to construct an expert-guided pairwise P2P transfer dataset, named P2P-50K, where each paired sample in the dataset corresponds to a transfer vector in the DTV Field. Subsequently, we propose a multi-condition ControlNet to learn the DTV Field, which synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions.

关键词: Generative Rendering Model, Physically-Based Rendering, Photorealistic Rendering, Distribution Transfer Vector Field, Multi-modal Generation, ControlNet, G-buffers, P2P Gap

165. ❌ Characterizing Lidar Range-Measurement Ambiguity due to Multiple Returns

作者: Jason H. Rife, Yifan Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09282v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究激光雷达（Lidar）在自动驾驶车辆定位中的测量模糊性问题，具体分析多返回情况下的概率性测量特性。所有评分关键词均涉及大模型、深度学习及相关技术（如训练方法、推理优化、对齐、代理系统等），而本文专注于传感器物理特性、信号处理和自动驾驶定位算法，属于传统工程领域，与人工智能模型技术无直接关联。因此，所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了激光雷达在存在多个散射表面时产生概率性多返回测量的现象，通过分析数据集提出了表征这些情况的累积分布函数，并讨论了其对基于激光雷达的定位系统的影响评估方法。

摘要翻译

可靠的位置与姿态感知对于在常规道路上运行的高度自动化车辆至关重要。激光雷达传感器正日益被集成到姿态估计系统中。尽管激光雷达具有巨大效用，但它是一种复杂的传感器，其在道路环境中的性能尚未得到充分理解。例如，激光雷达定位算法通常假设激光雷达总能沿给定射线路径识别出唯一的表面。然而，这一假设并非总是成立，因为大量先验证据表明，当激光雷达的锥形波束内出现多个散射表面时，激光雷达单元可能会以概率方式生成测量值。本文通过分析激光雷达数据集，以表征沿特定射线路径出现概率性回波的情况。我们的贡献在于，针对两台基座固定的不同机械旋转式激光雷达单元所观测到的射线路径，提供了具有代表性的累积分布函数（CDF）。在后续讨论中，我们概述了一种定性方法，用于评估概率性多回波情况对基于激光雷达的定位所产生的影响。

摘要 (Abstract)

Reliable position and attitude sensing is critical for highly automated vehicles that operate on conventional roadways. Lidar sensors are increasingly incorporated into pose-estimation systems. Despite its great utility, lidar is a complex sensor, and its performance in roadway environments is not yet well understood. For instance, it is often assumed in lidar-localization algorithms that a lidar will always identify a unique surface along a given raypath. However, this assumption is not always true, as ample prior evidence exists to suggest that lidar units may generate measurements probabilistically when more than one scattering surface appears within the lidar’s conical beam. In this paper, we analyze lidar datasets to characterize cases with probabilistic returns along particular raypaths. Our contribution is to present representative cumulative distribution functions (CDFs) for raypaths observed by two different mechanically rotating lidar units with stationary bases. In subsequent discussion, we outline a qualitative methodology to assess the effect of probabilistic multi-return cases on lidar-based localization.

关键词: Lidar, range measurement, multiple returns, probabilistic returns, localization, autonomous vehicles, cumulative distribution functions, sensor ambiguity

166. ❌ AMO-ENE: Attention-based Multi-Omics Fusion Model for Outcome Prediction in Extra Nodal Extension and HPV-associated Oropharyngeal Cancer

作者: Gautier Hénique, William Le, Gabriel Dayan, Coralie Brodeur, Kristoff Nelson, Apostolos Christopoulos, Edith Filion, Phuc-Felix Nguyen-Tan, Laurent Letourneau-Guillon, Houda Bahig, Samuel Kadoury 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09280v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于医学影像分析（CT扫描）和临床数据融合，用于头颈癌预后预测，属于AI在生物医学领域的应用。论文未涉及任何大语言模型（LLM）、深度学习技术原理创新（如MoE、Scaling Laws、Attention优化等）、训练对齐方法（如RLHF、SFT）、推理技术（如CoT、MCTS）、代理系统或模型效率技术（如量化、推测解码）。仅与关键词’AI for Science OR Bioinformatics OR Cheminformatics’有弱关联（5分），因为其属于AI在生物医学（癌症研究）的应用，但未直接涉及生物信息学或化学信息学的典型方法（如基因组学、药物发现）。其他关键词均完全无关（0分）。

!!! tip deepseek-chat TL;DR

该研究提出了一种基于CT影像和临床数据的全自动端到端流程，用于预测HPV阳性口咽癌的淋巴结外扩展状态和治疗结果，在内部队列验证中其预后预测性能优于基线模型。

摘要翻译

结外侵犯（Extranodal Extension, ENE）是人乳头瘤病毒（HPV）相关口咽癌（Oropharyngeal Cancer, OPC）中一个新兴的预后因素，但目前尚未被纳入临床分期标准。近期研究主张将影像学检测的ENE（iENE）作为HPV阳性口咽癌分期的预后标志物。然而，其临床整合仍面临若干实际限制，包括分割方法不一致、CT影像中转移淋巴结周边对比度低以及人工标注耗时费力等。为应对这些局限，我们提出了一种全自动端到端流程，利用计算机断层扫描（CT）影像与临床数据评估淋巴结ENE状态并预测治疗结局。我们的方法包含一个分层的三维半监督分割模型，旨在从放疗计划CT扫描中检测并勾画相关的iENE区域。基于这些分割结果，提取一组影像组学特征与深度特征，用以训练影像检测的ENE分级分类器。随后评估所预测的ENE状态的预后价值，并与现有分期标准进行比较。此外，我们将这些淋巴结特征与原发肿瘤特征整合到一个基于注意力的多模态结局预测模型中，从而构建了一个动态的结局预测框架。我们的方法在一个内部队列中得到验证，该队列包含2009年至2020年间接受放疗或放化疗治疗的397例HPV阳性口咽癌患者。在两年时间点的结局预测中，我们的流程在转移复发、总生存期和无病生存期的预测上均优于基线模型，其受试者工作特征曲线下面积（AUC）分别为88.2%（4.8）、79.2%（7.4）和78.1%（8.6）。同时，我们获得的转移复发一致性指数为83.3%（6.5），总生存期为71.3%（8.9），无病生存期为70.0%（8.1），表明该方法具备临床决策应用的可行性。

摘要 (Abstract)

Extranodal extension (ENE) is an emerging prognostic factor in human papillomavirus (HPV)-associated oropharyngeal cancer (OPC), although it is currently omitted as a clinical staging criteria. Recent works have advocated for the inclusion of iENE as a prognostic marker in HPV-positive OPC staging. However, several practical limitations continue to hinder its clinical integration, including inconsistencies in segmentation, low contrast in the periphery of metastatic lymph nodes on CT imaging, and laborious manual annotations. To address these limitations, we propose a fully automated end-to-end pipeline that uses computed tomography (CT) images with clinical data to assess the status of nodal ENE and predict treatment outcomes. Our approach includes a hierarchical 3D semi-supervised segmentation model designed to detect and delineate relevant iENE from radiotherapy planning CT scans. From these segmentations, a set of radiomics and deep features are extracted to train an imaging-detected ENE grading classifier. The predicted ENE status is then evaluated for its prognostic value and compared with existing staging criteria. Furthermore, we integrate these nodal features with primary tumor characteristics in a multimodal, attention-based outcome prediction model, providing a dynamic framework for outcome prediction. Our method is validated in an internal cohort of 397 HPV-positive OPC patients treated with radiation therapy or chemoradiotherapy between 2009 and 2020. For outcome prediction at the 2-year mark, our pipeline surpassed baseline models with 88.2% (4.8) in AUC for metastatic recurrence, 79.2% (7.4) for overall survival, and 78.1% (8.6) for disease-free survival. We also obtain a concordance index of 83.3% (6.5) for metastatic recurrence, 71.3% (8.9) for overall survival, and 70.0% (8.1) for disease-free survival, making it feasible for clinical decision making.

关键词: oropharyngeal cancer, extranodal extension, CT imaging, radiomics, deep features, attention-based fusion, outcome prediction, prognostic model

167. ❌ Beyond Segmentation: Structurally Informed Facade Parsing from Imperfect Images

作者: Maciej Janicki, Aleksander Plocharski, Przemyslaw Musialski 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09260v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是计算机视觉领域的建筑立面解析问题，具体针对YOLOv8目标检测器进行改进，通过添加自定义对齐损失来增强结构一致性。所有评分关键词均涉及大语言模型、深度学习技术原理或AI在科学领域的应用，而本文完全不涉及这些主题，属于传统的计算机视觉和图像处理研究，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对建筑立面解析中标准目标检测器缺乏结构一致性的问题，提出了一种通过添加自定义对齐损失来增强YOLOv8训练的方法，实验证明该方法能有效改善结构规律性并纠正对齐误差。

摘要翻译

标准目标检测器通常独立处理建筑元素，这往往导致立面解析结果缺乏下游程序化重建所需的结构连贯性。为克服这一局限，我们在YOLOv8训练目标中引入了一种定制的轻量化对齐损失函数。该正则化方法在训练过程中促使边界框形成网格一致的空间排列，从而在不改变标准推理流程的前提下有效注入几何先验知识。在CMP数据集上的实验表明，本方法成功提升了结构规整性，能够修正由透视和遮挡引起的对齐误差，同时与标准检测精度保持可控的平衡关系。

摘要 (Abstract)

Standard object detectors typically treat architectural elements independently, often resulting in facade parsings that lack the structural coherence required for downstream procedural reconstruction. We address this limitation by augmenting the YOLOv8 training objective with a custom lightweight alignment loss. This regularization encourages grid-consistent arrangements of bounding boxes during training, effectively injecting geometric priors without altering the standard inference pipeline. Experiments on the CMP dataset demonstrate that our method successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion while maintaining a controllable trade-off with standard detection accuracy.

关键词: facade parsing, object detection, YOLOv8, alignment loss, structural coherence, geometric priors, CMP dataset, procedural reconstruction

168. ❌ FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding

作者: Kaidong Feng, Zhuoxuan Huang, Huizhong Guo, Yuting Jin, Xinyu Chen, Yue Liang, Yifei Gai, Li Zhou, Yunshan Ma, Zhu Sun 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09249v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文《FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding》专注于时尚领域的多模态数据集构建和任务基准测试，核心贡献是创建了一个专家标注的时尚理解数据集，支持服装到物品的定位、服装搭配完成和服装评估等任务。虽然论文提到了MLLM（多模态大语言模型）在时尚系统中的应用，但研究重点在于数据集本身的设计、标注和评估，而非大模型或深度学习技术原理的创新。所有评分关键词均涉及大模型技术、训练方法、推理优化、对齐、压缩、代理系统等具体技术方向，与本文的时尚数据集构建主题无直接关联，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对现有时尚数据集碎片化、任务特定化的问题，提出了一个专家知识增强的多模态数据集FashionStylist，用于支持服装到物品定位、服装搭配完成和服装评估等整体时尚理解任务，并验证了其作为统一基准和训练资源在基于MLLM的时尚系统中的有效性。

摘要翻译

时尚理解既需要视觉感知，也需要对风格、场合、搭配兼容性及整体造型逻辑进行专家级推理。然而，现有的时尚数据集仍处于碎片化且局限于特定任务的状态，通常仅关注单品属性、套装共现或弱文本监督，因而对整体造型理解的支持有限。本文提出FashionStylist，一个为整体且专家级时尚理解构建的专家标注基准。该数据集通过专门的时尚专家标注流程构建，在单品和套装两个层面提供了专业可靠的标注。它支持三项代表性任务：套装到单品的定位、套装补全以及套装评估。这些任务涵盖了从包含叠穿与配饰的复杂造型中还原实际单品、超越共现匹配的兼容性感知组合，以及对风格、季节、场合及整体协调性的专家级评估。实验结果表明，FashionStylist不仅可作为多项时尚任务的统一基准，还能作为有效的训练资源，用于提升基于多模态大语言模型（MLLM）的时尚系统在定位、补全及套装层面语义评估方面的性能。

摘要 (Abstract)

Fashion understanding requires both visual perception and expert-level reasoning about style, occasion, compatibility, and outfit rationale. However, existing fashion datasets remain fragmented and task-specific, often focusing on item attributes, outfit co-occurrence, or weak textual supervision, and thus provide limited support for holistic outfit understanding. In this paper, we introduce FashionStylist, an expert-annotated benchmark for holistic and expert-level fashion understanding. Constructed through a dedicated fashion-expert annotation pipeline, FashionStylist provides professionally grounded annotations at both the item and outfit levels. It supports three representative tasks: outfit-to-item grounding, outfit completion, and outfit evaluation. These tasks cover realistic item recovery from complex outfits with layering and accessories, compatibility-aware composition beyond co-occurrence matching, and expert-level assessment of style, season, occasion, and overall coherence. Experimental results show that FashionStylist serves not only as a unified benchmark for multiple fashion tasks, but also as an effective training resource for improving grounding, completion, and outfit-level semantic evaluation in MLLM-based fashion systems.

关键词: Fashion understanding, Multimodal dataset, Expert annotation, Outfit grounding, Outfit completion, Outfit evaluation, MLLM-based systems, Holistic fashion analysis

169. ❌ 2D or 3D: Who Governs Salience in VLA Models? – Tri-Stage Token Pruning Framework with Modality Salience Awareness

作者: Zihao Zheng, Sicheng Tian, Zhihao Mao, Lingyue Zhang, Chenyue Li, Ziyun Zhang, Hong Gao, Yuchen Huang, Yutong Xu, Guojie Luo, Xiang Chen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09244v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是Vision-Language-Action (VLA)模型中的token pruning技术，专注于2D/3D多模态视觉输入的处理优化。虽然VLA模型属于多模态大模型范畴，但论文的核心是视觉模态的token pruning框架，而非语言模型本身的技术创新。所有评分关键词都针对纯语言模型或通用大模型技术（如LLM训练、对齐、推理优化、代理系统等），与本文的视觉-语言-动作模型和视觉token pruning技术没有直接关联。论文未涉及任何评分关键词中的具体技术或概念。

!!! tip deepseek-chat TL;DR

本文针对多视觉模态VLA模型提出了一种考虑2D/3D模态显著性的三阶段token pruning框架，实现了最高2.55倍的推理加速且精度损失最小，仅需5.8%的开销。

摘要翻译

视觉-语言-动作（Vision-Language-Action，VLA）模型已成为具身智能的主流范式。近期的VLA模型已将其输入模态从仅支持2D扩展至2D+3D范式，形成了多视觉模态VLA（Multi-Visual-modal VLA，MVLA）模型。尽管在空间感知能力上有所提升，但由于模态扩展导致输入令牌数量增加，MVLA模型面临着更大的加速需求。令牌剪枝是一种针对MVLA模型的有效优化方法。然而，现有的令牌剪枝方案均针对仅支持2D的VLA模型设计，忽略了2D与3D模态之间的显著性差异。本文遵循多模态数据在MVLA模型中的应用流程，提出了一种三阶段分析方法，以捕捉2D/3D模态显著性的差异性与动态变化。基于此，我们为MVLA模型设计了一个相应的三阶段令牌剪枝框架，以实现最优的2D/3D令牌选择与高效剪枝。实验表明，该框架在仅引入5.8%开销且精度损失极小的前提下，推理速度最高可提升至2.55倍。代码即将公开。

摘要 (Abstract)

Vision-Language-Action (VLA) models have emerged as the mainstream of embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite achieving improved spatial perception, MVLA faces a greater acceleration demand due to the increased number of input tokens caused by modal expansion. Token pruning is an effective optimization methods tailored to MVLA models. However, existing token pruning schemes are designed for 2D-only VLA models, ignoring 2D/3D modality salience differences. In this paper, we follow the application process of multi-modal data in MVLA models and develop a tri-stage analysis to capture the discrepancy and dynamics of 2D/3D modality salience. Based on these, we propose a corresponding tri-stage token pruning framework for MVLA models to achieve optimal 2D/3D token selection and efficient pruning. Experiments show that our framework achieves up to a 2.55x inference speedup with minimal accuracy loss, while only costing 5.8% overhead. Our Code is coming soon.

关键词: Vision-Language-Action models, VLA models, token pruning, 2D/3D modality, multi-visual-modal VLA, inference acceleration, modality salience, MVLA models

170. ❌ Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

作者: Huiang He, Shengchu Zhao, Jianwen Huang, Jie Li, Jiaqi Wu, Hu Zhang, Pei Tang, Heliang Zheng, Yukun Li, Rongfei Jia 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09231v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于3D纹理生成的计算机视觉任务，使用多视图引导和原生3D表示方法解决纹理覆盖、视图一致性和几何对齐问题。所有评分关键词均与大语言模型、深度学习技术原理或科学AI应用相关，而本文未涉及任何大模型技术、训练方法、推理优化、对齐技术或特定科学领域应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了Hitem3D 2.0框架，通过多视图引导和原生3D纹理表示解决了3D纹理生成中的纹理覆盖不全、视图不一致和几何对齐问题，显著提高了纹理细节、保真度和一致性。

摘要翻译

尽管近期进展提升了三维纹理生成的质量，现有方法仍面临纹理覆盖不完整、跨视角不一致以及几何与纹理错位等问题。为突破这些局限，我们提出Hitem3D 2.0——一个多视角引导的原生三维纹理生成框架，该框架通过融合二维多视角生成先验与原生三维纹理表示来提升纹理质量。Hitem3D 2.0包含两个核心组件：多视角合成框架与原生三维纹理生成模型。多视角生成部分基于预训练的图像编辑主干网络构建，并引入即插即用模块以显式促进几何对齐、跨视角一致性与光照均匀性，从而实现高保真多视角图像的合成。在生成视角与三维几何条件的约束下，原生三维纹理生成模型将多视角纹理投影至三维表面，并对未观测区域的纹理进行合理补全。通过将多视角一致性约束与原生三维纹理建模相结合，Hitem3D 2.0显著提升了纹理完整性、跨视角连贯性与几何对齐度。实验结果表明，Hitem3D 2.0在纹理细节、保真度、一致性、连贯性及对齐度方面均优于现有方法。

摘要 (Abstract)

Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.

关键词: 3D texture generation, multi-view guidance, native 3D representation, texture completeness, cross-view consistency, geometric alignment, multi-view synthesis, texture projection

171. ❌ Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models

作者: Wongi Jeong, Hoigi Seo, Se Young Chun 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09227v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于扩散模型的高效工作流程，提出了一种训练无关的方法来生成与高分辨率图像感知一致的低分辨率预览图，以降低计算成本。论文的核心是扩散模型、流匹配模型、图像生成和计算效率优化，与所有评分关键词（均围绕大语言模型、对齐、推理、代理、科学AI等主题）完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文解决了扩散模型生成高分辨率图像时计算成本高的问题，提出了一种训练无关的方法来生成感知一致的低分辨率预览图，实现了高达33%的计算减少和3倍加速。

摘要翻译

图像生成模型已成为从普通用户到专业设计师群体中不可或缺的工具，能够为所有人产出精美的高分辨率图像。然而，要获得理想结果通常需要生成大量不同提示词和随机种子下的高分辨率图像，这给用户和服务提供商均带来了高昂的计算成本。首先生成低分辨率图像可以减轻计算负担，但如何生成在感知上与对应高分辨率图像保持一致的LR图像并非易事。本文研究生成高保真低分辨率图像（称为预览图）的任务，旨在保持其与高分辨率图像在感知上的相似性，从而构建高效工作流，使用户能在生成最终高分辨率图像前筛选出有潜力的候选结果。我们提出对易子为零条件，以确保流匹配模型中低分辨率与高分辨率图像的感知一致性，并由此推导出无需训练的解决方案，包含下采样矩阵选择与对易子为零引导。大量实验表明，我们的方法能以最高33%的计算量缩减生成低分辨率图像，同时保持高分辨率感知一致性。当与现有加速技术结合时，本方法可实现最高3倍的加速效果。此外，我们的框架可扩展至图像变形、平移等编辑操作，展现了其泛化能力。

摘要 (Abstract)

Image generative models have become indispensable tools to yield exquisite high-resolution (HR) images for everyone, ranging from general users to professional designers. However, a desired outcome often requires generating a large number of HR images with different prompts and seeds, resulting in high computational cost for both users and service providers. Generating low-resolution (LR) images first could alleviate computational burden, but it is not straightforward how to generate LR images that are perceptually consistent with their HR counterparts. Here, we consider the task of generating high-fidelity LR images, called Previews, that preserve perceptual similarity of their HR counterparts for an efficient workflow, allowing users to identify promising candidates before generating the final HR image. We propose the commutator-zero condition to ensure the LR-HR perceptual consistency for flow matching models, leading to the proposed training-free solution with downsampling matrix selection and commutator-zero guidance. Extensive experiments show that our method can generate LR images with up to 33% computation reduction while maintaining HR perceptual consistency. When combined with existing acceleration techniques, our method achieves up to 3$\times$ speedup. Moreover, our formulation can be extended to image manipulations, such as warping and translation, demonstrating its generalizability.

关键词: diffusion models, flow matching models, low-resolution previews, perceptual consistency, computational efficiency, training-free method, commutator-zero guidance, image generation

172. ❌ SHIFT: Steering Hidden Intermediates in Flow Transformers

作者: Nina Konovalova, Andrey Kuznetsov, Aibek Alanov 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09213v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文SHIFT专注于扩散模型（DiT-based diffusion models）的概念移除和风格控制，其核心创新是受大语言模型中激活引导（activation steering）的启发，提出了一种在推理时通过操纵中间激活来引导生成的方法。因此，仅与关键词’Large Language Models OR LLMs OR Foundation Models’有间接关联（受其技术启发），评分为5分。论文主题是图像生成和扩散模型，而非大模型技术原理创新或大模型在不同领域的应用，与其他关键词无直接关联，均评0分。

!!! tip deepseek-chat TL;DR

论文提出SHIFT框架，通过受大语言模型激活引导启发的中间激活操纵方法，实现了在DiT扩散模型中无需重新训练即可有效移除概念、控制风格和添加目标对象的灵活生成控制。

摘要翻译

扩散模型已成为实现高保真度图像生成的主流方法。近期基于DiT的扩散模型尤其能够在生成高质量样本的同时，表现出强大的提示语遵从能力。受大语言模型中激活导向技术的启发，我们提出了SHIFT——一个简单、高效且轻量化的框架，用于在推理时通过对中间激活进行定向操控，实现DiT扩散模型中的概念移除。SHIFT通过学习导向向量，将其动态应用于选定的网络层和生成时间步，以抑制不需要的视觉概念，同时保持提示语中的其余内容及图像整体质量。除抑制功能外，同一机制还可将生成结果导向期望的\emph{风格域}，或使样本倾向于添加或改变目标物体。我们证明，SHIFT无需耗时重新训练，即可针对多样化的提示语和目标，为DiT生成提供有效且灵活的控制。

摘要 (Abstract)

Diffusion models have become leading approaches for high-fidelity image generation. Recent DiT-based diffusion models, in particular, achieve strong prompt adherence while producing high-quality samples. We propose SHIFT, a simple but effective and lightweight framework for concept removal in DiT diffusion models via targeted manipulation of intermediate activations at inference time, inspired by activation steering in large language models. SHIFT learns steering vectors that are dynamically applied to selected layers and timesteps to suppress unwanted visual concepts while preserving the prompt’s remaining content and overall image quality. Beyond suppression, the same mechanism can shift generations into a desired \emph{style domain} or bias samples toward adding or changing target objects. We demonstrate that SHIFT provides effective and flexible control over DiT generation across diverse prompts and targets without time-consuming retraining.

关键词: Diffusion Models, DiT, Concept Removal, Activation Steering, Intermediate Activations, Inference-time Control, Style Domain, Image Generation

173. ❌ TinyNeRV: Compact Neural Video Representations via Capacity Scaling, Distillation, and Low-Precision Inference

作者: Muhammad Hannan Akhtar, Ihab Amer, Tamer Shanableh 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09220v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是神经视频表示（NeRV）的紧凑化设计，属于计算机视觉和视频压缩领域，而非大语言模型或深度学习技术原理的创新。唯一相关的关键词是’Quantization OR Model Compression OR Low-bit Weights’，因为论文探讨了低精度推理（low-precision inference）、后训练量化和量化感知训练，这与模型压缩相关，但并非核心内容（只是作为提升紧凑模型性能的策略之一），因此给5分。其他所有关键词均与大语言模型、对齐、推理、代理等主题无关，故均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了如何设计紧凑的神经视频表示（TinyNeRV）架构，通过容量缩放、知识蒸馏和低精度推理来平衡重建质量与效率，实现在资源受限环境中的高效部署。

摘要翻译

隐式神经视频表示将完整视频序列编码于神经网络参数之中，并实现恒定时间的帧重建。近期关于神经视频表示（NeRV）的研究在避免传统视频编解码器顺序解码过程的同时，展现出具有竞争力的重建性能。然而，现有研究大多聚焦于中等或高容量模型，对受限环境下所需的极端紧凑配置的行为机制尚未充分探索。本文针对高效部署需求，系统研究了微型NeRV架构的设计。研究引入了两种轻量级配置——NeRV-T与NeRV-T+，并在多个视频数据集上进行评估，以分析激进的容量缩减如何影响重建质量、计算复杂度及解码吞吐量。除架构缩放外，本研究还探索了在不增加推理成本前提下提升紧凑模型性能的策略：通过频率感知聚焦监督的知识蒸馏方法，增强低容量网络的重建保真度。此外，研究通过训练后量化与量化感知训练两种方式，考察低精度推理的影响，以探究微型模型在数值精度降低条件下的鲁棒性。实验结果表明，精心设计的微型NeRV变体能够在显著减少参数量、计算成本和内存需求的同时，实现优越的质量-效率权衡。这些发现揭示了紧凑神经视频表示的实际应用边界，并为在资源受限和实时环境中部署NeRV风格模型提供了指导。官方实现发布于https://github.com/HannanAkhtar/TinyNeRV-Implementation。

摘要 (Abstract)

Implicit neural video representations encode entire video sequences within the parameters of a neural network and enable constant time frame reconstruction. Recent work on Neural Representations for Videos (NeRV) has demonstrated competitive reconstruction performance while avoiding the sequential decoding process of conventional video codecs. However, most existing studies focus on moderate or high capacity models, leaving the behavior of extremely compact configurations required for constrained environments insufficiently explored. This paper presents a systematic study of tiny NeRV architectures designed for efficient deployment. Two lightweight configurations, NeRV-T and NeRV-T+, are introduced and evaluated across multiple video datasets in order to analyze how aggressive capacity reduction affects reconstruction quality, computational complexity, and decoding throughput. Beyond architectural scaling, the work investigates strategies for improving the performance of compact models without increasing inference cost. Knowledge distillation with frequency-aware focal supervision is explored to enhance reconstruction fidelity in low-capacity networks. In addition, the impact of lowprecision inference is examined through both post training quantization and quantization aware training to study the robustness of tiny models under reduced numerical precision. Experimental results demonstrate that carefully designed tiny NeRV variants can achieve favorable quality efficiency trade offs while substantially reducing parameter count, computational cost, and memory requirements. These findings provide insight into the practical limits of compact neural video representations and offer guidance for deploying NeRV style models in resource constrained and real-time environments. The official implementation is available at https: //github.com/HannanAkhtar/TinyNeRV-Implementation.

关键词: Neural Video Representations, NeRV, Compact Models, Capacity Scaling, Knowledge Distillation, Low-precision Inference, Quantization, Resource-constrained Environments

174. ❌ Adding Another Dimension to Image-based Animal Detection

作者: Vandita Shukla, Fabio Remondino, Benjamin Risse 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09210v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于计算机视觉领域，研究利用Skinned Multi Animal Linear模型从单目RGB图像中估计动物3D边界框，并开发相机姿态优化算法将其投影到2D图像空间，用于生成3D检测算法的训练标签。论文的核心是计算机视觉、3D重建和动物姿态估计技术，与绝大多数关键词（涉及大模型、深度学习技术原理、训练方法、推理优化、对齐、代理系统等）完全无关。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为该研究属于AI在科学（动物学/生态学）领域的应用，但并非核心匹配（论文未明确使用这些术语，且更偏向计算机视觉而非典型的生物信息学/化学信息学），因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文解决了单目RGB图像中缺乏动物3D检测标注数据的问题，提出了一种利用3D动物模型和相机姿态优化来生成3D边界框及可见性度量的管道，为未来单目3D动物检测算法的开发和评估提供了关键步骤，并在Animal3D数据集上验证了其准确性。

摘要翻译

动物单目成像本质上将三维结构简化为二维投影。检测算法产生的二维边界框缺乏动物相对于相机朝向的信息。为构建针对RGB动物图像的三维检测方法，目前缺乏带标注的数据集；此类标注流程需要三维输入流与RGB数据协同工作。我们提出一种流程，该流程利用蒙皮多动物线性模型估算三维边界框，并通过专用相机姿态优化算法将其作为鲁棒标签投影至二维图像空间。为评估动物哪些侧面被捕捉，我们计算了立方体表面可见性度量。这些三维边界框与度量指标构成了开发和评估未来单目三维动物检测算法的关键步骤。我们在Animal3D数据集上评估了本方法，证明了其在跨物种与多场景下的准确性能。

摘要 (Abstract)

Monocular imaging of animals inherently reduces 3D structures to 2D projections. Detection algorithms lead to 2D bounding boxes that lack information about animal’s orientation relative to the camera. To build 3D detection methods for RGB animal images, there is a lack of labeled datasets; such labeling processes require 3D input streams along with RGB data. We present a pipeline that utilises Skinned Multi Animal Linear models to estimate 3D bounding boxes and to project them as robust labels into 2D image space using a dedicated camera pose refinement algorithm. To assess which sides of the animal are captured, cuboid face visibility metrics are computed. These 3D bounding boxes and metrics form a crucial step toward developing and benchmarking future monocular 3D animal detection algorithms. We evaluate our method on the Animal3D dataset, demonstrating accurate performance across species and settings.

关键词: 3D animal detection, monocular imaging, bounding box estimation, camera pose refinement, Skinned Multi Animal Linear models, Animal3D dataset, computer vision, label generation

175. ❌ Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception

作者: Jiahao Wang, Zikun Xu, Yuner Zhang, Zhongwei Jiang, Chenyang Lu, Shuocheng Yang, Yuxuan Wang, Jiaru Zhong, Chuang Zhang, Shaobing Xu, Jianqiang Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09206v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于自动驾驶中的协同3D感知，提出了一种完全稀疏的框架（Long-SCOPE）来解决长距离感知的计算和关联问题。虽然论文涉及稀疏模型（Sparse Models）概念，但这是针对3D点云数据的稀疏表示，而非大语言模型（LLMs）中的稀疏激活或MoE架构。其他关键词均与大语言模型、深度学习技术原理或AI在科学领域的应用无关，因此评分为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Long-SCOPE的完全稀疏框架，解决了长距离协同3D感知中计算复杂度高和特征关联脆弱的问题，在V2X-Seq和Griffin数据集上实现了最先进的性能。

摘要翻译

基于车联网通信的协同三维感知是增强自动驾驶能力的前瞻性范式，能够扩展感知范围并解决遮挡问题。然而，现有方法在实际远距离部署时面临两大关键瓶颈的制约：密集鸟瞰图表示带来的二次计算复杂度增长，以及在显著观测与对齐误差下特征关联机制的脆弱性。为克服这些局限，我们提出了Long-SCOPE——一个专为鲁棒远距离协同三维感知设计的全稀疏框架。本方法包含两个创新模块：用于精确检测遥远小目标的几何引导查询生成模块，以及可在严重位置噪声下鲁棒匹配协同查询的可学习上下文感知关联模块。在V2X-Seq与Griffin数据集上的实验验证表明，Long-SCOPE实现了最先进的性能，尤其在具有挑战性的100-150米远距离场景中表现突出，同时保持了极具竞争力的计算与通信开销。

摘要 (Abstract)

Cooperative 3D perception via Vehicle-to-Everything communication is a promising paradigm for enhancing autonomous driving, offering extended sensing horizons and occlusion resolution. However, the practical deployment of existing methods is hindered at long distances by two critical bottlenecks: the quadratic computational scaling of dense BEV representations and the fragility of feature association mechanisms under significant observation and alignment errors. To overcome these limitations, we introduce Long-SCOPE, a fully sparse framework designed for robust long-distance cooperative 3D perception. Our method features two novel components: a Geometry-guided Query Generation module to accurately detect small, distant objects, and a learnable Context-Aware Association module that robustly matches cooperative queries despite severe positional noise. Experiments on the V2X-Seq and Griffin datasets validate that Long-SCOPE achieves state-of-the-art performance, particularly in challenging 100-150 m long-range settings, while maintaining highly competitive computation and communication costs.

关键词: cooperative 3D perception, autonomous driving, sparse framework, long-range perception, Vehicle-to-Everything communication, BEV representations, feature association, computational efficiency

176. ❌ CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

作者: Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng, Pin Tang, Yeyin Jin, Yuang Zhang, Junqi Cheng, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09201v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的视频生成任务，特别是相机轨迹控制，虽然使用了Vision-Language模型和Diffusion Transformer，但核心内容与提供的关键词（主要围绕大语言模型技术、训练方法、推理优化、对齐技术等）无直接关联。论文未涉及任何关键词中提到的具体技术或概念，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为CT-1的视觉-语言-相机模型，通过准确估计相机轨迹将空间推理知识迁移到视频生成中，解决了现有方法相机控制不精确或依赖手动参数的问题，在CT-200K数据集上实验表明其将相机控制准确率提高了25.7%。

摘要翻译

相机可控视频生成旨在合成具有灵活且物理合理的相机运动的视频。然而，现有方法要么通过文本提示提供不精确的相机控制，要么依赖劳动密集的手动相机轨迹参数，限制了其在自动化场景中的应用。为解决这些问题，我们提出了一种新颖的视觉-语言-相机模型，称为CT-1（Camera Transformer 1），该专用模型旨在通过精确估计相机轨迹，将空间推理知识迁移至视频生成。基于视觉-语言模块和扩散Transformer模型构建，CT-1在频域中采用基于小波的正则化损失，以有效学习复杂的相机轨迹分布。这些轨迹被集成到视频扩散模型中，从而实现与用户意图一致的空间感知相机控制。为促进CT-1的训练，我们设计了一个专用的数据整理流程，并构建了CT-200K——一个包含超过4700万帧的大规模数据集。实验结果表明，我们的框架成功弥合了空间推理与视频合成之间的鸿沟，生成了忠实且高质量的相机可控视频，并将相机控制精度较先前方法提升了25.7%。

摘要 (Abstract)

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.

关键词: camera-controllable video generation, vision-language-camera model, spatial reasoning, camera trajectory estimation, diffusion transformer, wavelet-based regularization, CT-200K dataset, video synthesis

177. ❌ Globally Optimal Pose from Orthographic Silhouettes

作者: Agniva Sengupta, Dilara Kuş, Jianning Li, Stefan Zachow 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09199v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是计算机视觉中的三维姿态估计问题，使用轮廓面积连续性和椭圆拟合等几何方法，完全不涉及大模型、深度学习、AI for Science等关键词领域。所有关键词均与大模型技术原理、训练方法、推理优化、AI应用等主题相关，而本文是纯粹的几何算法研究，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种从无遮挡轮廓中高效估计三维形状全局最优姿态的新方法，利用轮廓面积连续性和椭圆拟合来加速旋转空间搜索，并在合成和真实数据上验证了其优于现有方法的准确性。

摘要翻译

我们解决了从无遮挡轮廓中确定已知三维形状在$\mathbb{R}^3$空间中的姿态估计问题。该方法利用轮廓面积的一个简单但尚未被充分探索的性质——其在旋转空间轨迹上的连续性——实现了全局最优的姿态确定。所提出的方法利用预计算的轮廓特征，将其建模为轮廓面积的响应曲面。通过查询该轮廓特征响应曲面进行姿态估计，能够对旋转搜索空间进行强力分支划分，从而使分辨率引导的候选搜索成为可能。此外，我们利用拟合到投影轮廓的二维椭圆的长宽比作为辅助全局形状特征，以加速姿态搜索。这种组合策略形成了第一种仅从轮廓高效估计全局最优姿态的方法，无需对应点引导，适用于任意形状，无论其凸性或亏格如何。我们在合成与真实数据上验证了本方法，结果表明其精度相较于同类方法有显著提升。代码、数据及补充材料见：https://agnivsen.github.io/pose-from-silhouette/

摘要 (Abstract)

We solve the problem of determining the pose of known shapes in $\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by correspondences, for any shape, irrespective of its convexity and genus. We validate our method on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches. Code, data, and supplementary in: https://agnivsen.github.io/pose-from-silhouette/

关键词: pose estimation, silhouettes, global optimality, rotation search, area-of-silhouette, ellipse fitting, 3D shape, computer vision

178. ❌ MixFlow: Mixed Source Distributions Improve Rectified Flows

作者: Nazir Nayal, Christopher Wewer, Jan Eric Lenssen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09181v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是图像生成领域的扩散模型和整流流（rectified flows），专注于通过改进源分布来减少生成路径曲率、提高采样效率。所有评分关键词均与大语言模型（LLM）及其相关技术（如微调、对齐、推理、代理等）或特定科学AI应用（如生物信息学）相关，而本文完全不涉及LLM或文本生成，也未提及任何评分关键词中的技术概念。因此，所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对扩散模型和整流流中因源分布与数据分布不匹配导致的高曲率问题，提出了MixFlow训练策略，通过混合无条件分布和条件分布来改善对齐，从而显著提高了生成质量和采样效率。

摘要翻译

扩散模型及其变体（如整流流）能够生成多样且高质量的图像，但其生成路径的高曲率特性仍导致迭代采样速度缓慢。先前研究表明，高曲率的一个重要成因是源分布（标准高斯分布）与数据分布之间的独立性。本研究通过两项互补性贡献应对这一局限：首先，我们尝试突破标准高斯假设，提出一种通用框架 $κ\texttt{-FC}$，该框架将源分布与任意信号 $κ$ 进行条件化关联，从而使其与数据分布更好对齐；其次，我们提出 MixFlow——一种简单而有效的训练策略，通过降低生成路径曲率显著提升采样效率。MixFlow 在固定无条件分布与基于 $κ\texttt{-FC}$ 的分布的线性混合上训练流模型。这种简单的混合操作增强了源分布与数据分布的对齐性，在减少所需采样步数的同时提升了生成质量，并大幅加速了训练收敛过程。在固定采样预算下，我们的训练方法相较于标准整流流平均将生成质量（以 FID 指标衡量）提升 12%，较先前基线方法提升 7%。代码发布于：$\href{https://github.com/NazirNayal8/MixFlow}{https://github.com/NazirNayal8/MixFlow}$

摘要 (Abstract)

Diffusion models and their variations, such as rectified flows, generate diverse and high-quality images, but they are still hindered by slow iterative sampling caused by the highly curved generative paths they learn. An important cause of high curvature, as shown by previous work, is independence between the source distribution (standard Gaussian) and the data distribution. In this work, we tackle this limitation by two complementary contributions. First, we attempt to break away from the standard Gaussian assumption by introducing $κ\texttt{-FC}$, a general formulation that conditions the source distribution on an arbitrary signal $κ$ that aligns it better with the data distribution. Then, we present MixFlow, a simple but effective training strategy that reduces the generative path curvatures and considerably improves sampling efficiency. MixFlow trains a flow model on linear mixtures of a fixed unconditional distribution and a $κ\texttt{-FC}$-based distribution. This simple mixture improves the alignment between the source and data, provides better generation quality with less required sampling steps, and accelerates the training convergence considerably. On average, our training procedure improves the generation quality by 12% in FID compared to standard rectified flow and 7% compared to previous baselines under a fixed sampling budget. Code available at: $\href{https://github.com/NazirNayal8/MixFlow}{https://github.com/NazirNayal8/MixFlow}$

关键词: rectified flows, diffusion models, source distribution, generative path curvature, sampling efficiency, MixFlow, image generation, training strategy

179. ❌ UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation

作者: Le-Van Thai, Tien Dat Nguyen, Hoai Nhan Pham, Lan Anh Dinh Thi, Duy-Dong Nguyen, Ngoc Lam Quang Bui 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09169v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文专注于计算病理学中的半监督语义分割，属于AI for Science（生物信息学）领域，因此该关键词得10分。论文使用了预训练的Transformer编码器，与’Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联，但并非核心，得5分。其他关键词均与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文提出UniSemAlign框架，通过文本-原型对齐增强半监督病理图像分割，在GlaS和CRAG数据集上显著优于现有方法，仅用10%标注数据即可获得高达8.6%的Dice提升。

摘要翻译

计算病理学中的半监督语义分割因像素级标注稀缺及伪标签监督不可靠而面临挑战。本文提出UniSemAlign——一种双模态语义对齐框架，通过将显式的类别级结构注入像素级学习来增强视觉分割能力。该框架基于病理学预训练的Transformer编码器构建，在共享嵌入空间中引入了互补的原型级与文本级对齐分支，提供结构化指导以降低类别歧义并稳定伪标签优化过程。对齐后的表征与视觉预测结果融合，为未标注的组织病理学图像生成更可靠的监督信号。该框架通过分割监督、跨视图一致性及跨模态对齐目标进行端到端训练。在GlaS和CRAG数据集上的大量实验表明，UniSemAlign在有限监督条件下显著优于现有半监督基线方法：仅使用10%标注数据时，在GlaS和CRAG数据集上的Dice系数分别提升达2.6%和8.6%；在20%监督比例下亦实现显著提升。代码已开源：https://github.com/thailevann/UniSemAlign

摘要 (Abstract)

Semi-supervised semantic segmentation in computational pathology remains challenging due to scarce pixel-level annotations and unreliable pseudo-label supervision. We propose UniSemAlign, a dual-modal semantic alignment framework that enhances visual segmentation by injecting explicit class-level structure into pixel-wise learning. Built upon a pathology-pretrained Transformer encoder, UniSemAlign introduces complementary prototype-level and text-level alignment branches in a shared embedding space, providing structured guidance that reduces class ambiguity and stabilizes pseudo-label refinement. The aligned representations are fused with visual predictions to generate more reliable supervision for unlabeled histopathology images. The framework is trained end-to-end with supervised segmentation, cross-view consistency, and cross-modal alignment objectives. Extensive experiments on the GlaS and CRAG datasets demonstrate that UniSemAlign substantially outperforms recent semi-supervised baselines under limited supervision, achieving Dice improvements of up to 2.6% on GlaS and 8.6% on CRAG with only 10% labeled data, and strong improvements at 20% supervision. Code is available at: https://github.com/thailevann/UniSemAlign

关键词: semi-supervised segmentation, computational pathology, histopathology, semantic alignment, prototype alignment, Transformer encoder, pseudo-label refinement, cross-modal learning

180. ❌ ELT: Elastic Looped Transformers for Visual Generation

作者: Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain, Sujoy Paul, Aditya Kusupati 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09168v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文ELT专注于视觉生成模型（图像和视频），提出了一种基于循环Transformer架构的参数高效方法。它与大多数关键词无关，因为这些关键词主要针对语言模型、对齐、推理、代理等。唯一高度相关的关键词是’PEFT OR LoRA OR Parameter-efficient Fine-tuning’（评分10），因为论文的核心贡献是参数高效的视觉生成模型，通过循环权重共享和蒸馏来减少参数数量，这与参数高效微调的精神一致。其他关键词如’Quantization OR Model Compression’可能有一定关联（但论文未明确涉及量化），但根据摘要，主要创新是架构层面的参数效率，而非后训练压缩，因此评分为0。论文不涉及语言模型、科学AI应用或列表中的其他主题。

!!! tip deepseek-chat TL;DR

论文提出了弹性循环Transformer（ELT），一种用于图像和视频生成的参数高效视觉生成模型，通过循环权重共享和层内自蒸馏技术，在减少4倍参数的同时保持竞争性的生成质量。

摘要翻译

我们提出弹性循环变换器（Elastic Looped Transformers, ELT），这是一种基于循环变换器架构的高参数效率视觉生成模型类别。传统生成模型依赖于深度堆叠的独立变换器层，而我们的方法采用迭代式、权重共享的变换器模块，在保持高合成质量的同时大幅减少参数量。为有效训练这些模型以进行图像和视频生成，我们提出“循环内自蒸馏”（Intra-Loop Self Distillation, ILSD）策略，其中学生配置（中间循环）从教师配置（最大训练循环）中蒸馏知识，以确保在单步训练中模型深度间的一致性。我们的框架通过单次训练即可得到一系列弹性模型，实现具备“任意时刻推理”能力的动态计算成本与生成质量权衡，且参数量保持不变。ELT显著推进了视觉合成的效率边界：在等推理计算量设置下，参数量减少4倍时，ELT在类别条件ImageNet 256×256数据集上达到2.0的竞争性FID分数，在类别条件UCF-101数据集上获得72.8的FVD分数。

摘要 (Abstract)

We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model’s depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and FVD of $72.8$ on class-conditional UCF-101.

关键词: Elastic Looped Transformers, parameter-efficient, visual generative models, recurrent transformer, Intra-Loop Self Distillation, image generation, video generation, Any-Time inference

181. ❌ Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

作者: Yicheng Qiu, Keiji Yanai 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09164v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于视频时序动作检测，提出了一种结合状态空间模型（SSM）和高效时空焦点适配器的新框架。虽然论文涉及深度学习在计算机视觉领域的应用，但所有评分关键词都专门针对大语言模型（LLM）及其相关技术（如训练方法、推理优化、对齐、代理系统等），而本文完全不涉及语言模型、文本生成或LLM的任何方面。论文的核心是视频理解中的时空建模，与LLM技术领域无直接关联。

!!! tip deepseek-chat TL;DR

该研究针对长视频序列中时序动作检测任务存在的特征冗余和全局依赖建模能力下降问题，提出了一种集成时序边界感知状态空间模型（TB-SSM）和高效时空焦点适配器（ESTF Adapter）的新框架，实验表明该方法显著提升了动作定位性能和鲁棒性。

摘要翻译

时序人体动作检测旨在识别并定位未剪辑视频中的动作片段，是视频理解领域的关键任务。尽管先前基于CNN和Transformer的架构已取得进展，但在处理长视频序列时，这些模型仍面临特征冗余与全局依赖建模能力下降的问题，严重制约了其在真实场景视频分析中的可扩展性。状态空间模型因其线性长程建模能力和强大的全局时序推理特性，为此提供了具有前景的替代方案。本研究重新思考状态空间模型在时序建模中的应用，构建了一种新颖的视频人体动作检测框架。具体而言，我们在预训练层中引入了高效时空聚焦适配器。该模块融合了我们提出的时序边界感知状态空间模型在时序特征建模方面的优势，同时实现了对空间特征的高效处理。我们在多个基准数据集上进行了全面定量分析，将所提方法与先前基于状态空间模型的方法及其他结构方法进行比较。大量实验表明，改进后的策略显著提升了定位性能与鲁棒性，验证了本方法的有效性。

摘要 (Abstract)

Temporal human action detection aims to identify and localize action segments within untrimmed videos, serving as a pivotal task in video understanding. Despite the progress achieved by prior architectures like CNN and Transformer models, these continue to struggle with feature redundancy and degraded global dependency modeling capabilities when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative with linear long-term modeling and robust global temporal reasoning capabilities. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module integrates the advantages of our proposed Temporal Boundary-aware SSM(TB-SSM) for temporal feature modeling with efficient processing of spatial features. We perform comprehensive and quantitative analyses across multiple benchmarks, comparing our proposed method against previous SSM-based and other structural methods. Extensive experiments demonstrate that our improved strategy significantly enhances both localization performance and robustness, validating the effectiveness of our proposed method.

关键词: Temporal Action Detection, State Space Models (SSMs), Video Understanding, Temporal Boundary-aware SSM, Efficient Spatial-Temporal Focal Adapter, Long Video Sequences, Global Temporal Reasoning, Action Localization

182. ❌ MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

作者: Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu, Gao Huang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09167v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文MAG-3D提出了一种免训练的多智能体框架，用于3D场景的视觉语言理解。其核心贡献在于多智能体协作设计，包括规划、定位和编码智能体，以实现灵活的3D推理。因此，与’LLM Agents OR Autonomous Agents OR Agentic Workflow’和’Multi-agent Systems OR Agent Coordination’高度相关（10分），因为论文明确构建了多智能体系统来协调任务。与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’和’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’有一定关联（5分），因为论文涉及分解任务和进行几何推理，但未明确使用这些术语。其他关键词主要关注大语言模型（LLM）的技术细节、训练方法、优化或特定科学领域，而本论文使用现成的视觉语言模型（VLM），专注于3D场景的多智能体框架，未深入探讨LLM的内部机制、训练过程或科学应用，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一个免训练的多智能体框架MAG-3D，通过动态协调规划、定位和编码智能体，解决了3D场景中基于视觉语言模型的开放查询推理问题，并在多个基准测试中取得了最先进的性能。

摘要翻译

视觉语言模型（VLMs）在多模态理解与推理方面已展现出强大性能，然而针对三维场景的具身推理仍待深入探索。有效的三维推理关键在于精准的定位：为回答开放式查询，模型必须首先在复杂场景中识别与查询相关的物体和区域，进而推理其空间与几何关系。近期研究方法已显示出在三维具身推理方面的巨大潜力，但这些方法通常依赖于领域内微调或手工构建的推理流程，限制了其在新环境中的灵活性与零样本泛化能力。本研究提出MAG-3D——一个无需训练的多智能体框架，可利用现成的视觉语言模型进行三维具身推理。该框架不依赖任务特定训练或固定推理流程，而是通过动态协调专家智能体应对三维推理的核心挑战。具体而言，我们设计了规划智能体以分解任务并统筹整体推理流程，定位智能体执行自由形式的三维定位并从海量三维场景观测中检索相关帧，以及编码智能体通过可执行程序进行灵活几何推理与显式验证。这种多智能体协同设计实现了跨多样场景的灵活免训练三维具身推理，并在具有挑战性的基准测试中取得了最先进的性能表现。

摘要 (Abstract)

Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.

关键词: Multi-agent, 3D reasoning, Vision-language models, Grounded reasoning, Training-free, Collaborative design, State-of-the-art

183. ❌ Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery

作者: Sara Ameli 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09151v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于机器人辅助手术中的手术器械分割，使用CNN和Transformer架构（如UNet、DeepLabV3、SegFormer）进行基准测试。论文内容与大多数关键词（涉及大模型技术、训练方法、推理优化、代理系统等）完全无关，因为这些关键词主要针对大语言模型（LLMs）及相关技术，而本文研究的是计算机视觉中的语义分割任务。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为手术AI应用可视为科学领域（医疗AI）的应用，但论文未涉及生物信息学或化学信息学，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该研究在机器人辅助前列腺切除术视频中，对UNet、DeepLabV3和SegFormer等深度学习模型进行手术器械分割的基准测试，发现卷积模型和Transformer模型在复杂手术场景中各有优势，为手术AI应用提供了模型选择见解。

摘要翻译

在机器人辅助手术中，手术器械的精确分割对于实现情境感知的计算机辅助干预至关重要，例如器械追踪、工作流程分析和自主决策。本研究基于SAR-RARP50数据集，对UNet、UNet++、DeepLabV3、Attention UNet和SegFormer五种深度学习架构进行了基准测试，以评估其在真实根治性前列腺切除术视频中手术器械多类别语义分割的性能。模型采用结合交叉熵损失与Dice损失的复合损失函数进行训练，以应对类别不平衡问题并捕捉精细的物体边界。实验结果表明，虽然卷积模型如UNet和Attention UNet提供了稳健的基线性能，但DeepLabV3取得了与SegFormer相当的结果，证明了空洞卷积和多尺度上下文聚合在捕捉复杂手术场景中的有效性。基于Transformer的架构如SegFormer进一步增强了全局上下文理解能力，从而提升了模型在不同器械外观和手术条件下的泛化性能。本研究为手术人工智能应用中分割模型的选择提供了全面比较与实践见解，并揭示了卷积方法与基于Transformer的方法之间的权衡关系。

摘要 (Abstract)

Accurate segmentation of surgical instruments in robotic-assisted surgery is critical for enabling context-aware computer-assisted interventions, such as tool tracking, workflow analysis, and autonomous decision-making. In this study, we benchmark five deep learning architectures-UNet, UNet, DeepLabV3, Attention UNet, and SegFormer on the SAR-RARP50 dataset for multi-class semantic segmentation of surgical instruments in real-world radical prostatectomy videos. The models are trained with a compound loss function combining Cross Entropy and Dice loss to address class imbalance and capture fine object boundaries. Our experiments reveal that while convolutional models such as UNet and Attention UNet provide strong baseline performance, DeepLabV3 achieves results comparable to SegFormer, demonstrating the effectiveness of atrous convolution and multi-scale context aggregation in capturing complex surgical scenes. Transformer-based architectures like SegFormer further enhance global contextual understanding, leading to improved generalization across varying instrument appearances and surgical conditions. This work provides a comprehensive comparison and practical insights for selecting segmentation models in surgical AI applications, highlighting the trade-offs between convolutional and transformer-based approaches.

关键词: surgical instrument segmentation, robotic-assisted surgery, deep learning, CNN, Transformer, semantic segmentation, benchmarking, medical AI

184. ❌ Deep Light Pollution Removal in Night Cityscape Photographs

作者: Hao Wang, Xiaolin Wu, Xi Zhang, Baoqing Sun 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09145v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究夜间城市景观照片的光污染去除，属于计算机视觉和图像处理领域。虽然摘要提到’leverages large generative model’，但这是指生成对抗网络（GAN）或扩散模型等图像生成模型，而非大语言模型（LLM）。所有评分关键词均与大语言模型、深度学习技术原理或AI在科学领域的应用直接相关，而本文核心是图像恢复的物理模型和训练策略，与这些关键词无直接关联。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于物理的退化模型和利用大型生成模型与合成-真实耦合的训练策略，以有效去除夜间城市景观照片中的光污染，恢复真实的夜间图像。

摘要翻译

夜间摄影因城市环境中普遍存在的人工照明所导致的光污染而严重退化。经过长距离散射与空间扩散后，多余的人造光掩盖了自然的夜间亮度，产生遮蔽星空与天体的天空辉光，并在光源周围形成光晕和辉光伪影。与旨在透过浓厚大气提升细节辨识度的夜间去雾不同，光污染消除的目标是通过中和地面照明的辐射影响，恢复纯净的夜间景象。本文提出一种基于物理的退化模型，该模型在先前夜间去雾模型的基础上增加了两个关键方面：(i) 定向光源的各向异性扩散，以及(ii) 天际线后方不可见地表光源引起的天空辉光。此外，我们构建了一种训练策略，利用大规模生成模型与合成-真实数据耦合，以弥补配对真实数据的稀缺性并增强泛化能力。大量实验表明，所提出的模型与学习框架能显著减少光污染伪影，并比先前的夜间复原方法更好地恢复真实的夜间影像。

摘要 (Abstract)

Nighttime photography is severely degraded by light pollution induced by pervasive artificial lighting in urban environments. After long-range scattering and spatial diffusion, unwanted artificial light overwhelms natural night luminance, generates skyglow that washes out the view of stars and celestial objects and produces halos and glow artifacts around light sources. Unlike nighttime dehazing, which aims to improve detail legibility through thick air, the objective of light pollution removal is to restore the pristine night appearance by neutralizing the radiative footprint of ground lighting. In this paper we introduce a physically-based degradation model that adds to the previous ones for nighttime dehazing two critical aspects; (i) anisotropic spread of directional light sources, and (ii) skyglow caused by invisible surface lights behind skylines. In addition, we construct a training strategy that leverages large generative model and synthetic-real coupling to compensate for the scarcity of paired real data and enhance generalization. Extensive experiments demonstrate that the proposed formulation and learning framework substantially reduce light pollution artifacts and better recover authentic night imagery than prior nighttime restoration methods.

关键词: light pollution removal, nighttime photography, physically-based degradation model, generative model, synthetic-real coupling, image restoration, skyglow, anisotropic light spread

185. ❌ Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

作者: Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09142v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉中的立体匹配任务，提出了一种结合表面法线和稀疏注意力设计的框架（GREATEN），以解决合成到真实场景的零样本泛化问题。论文的核心技术涉及几何特征融合、数据增强策略和计算效率优化，但完全不涉及大语言模型（LLM）、深度学习技术原理创新或大模型在不同领域的应用。所有评分关键词均与大语言模型、对齐、推理、代理、科学AI应用等相关，与该论文的计算机视觉立体匹配研究无任何关联，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对合成到真实场景的零样本立体匹配泛化挑战，提出了GREATEN框架，通过融合表面法线几何特征和稀疏注意力设计，在多个基准数据集上显著降低了误差并提高了推理速度。

摘要翻译

尽管过去十年间图像驱动的立体匹配取得了显著进展，但合成数据到真实场景的零样本泛化仍是一个开放挑战。这种次优的泛化性能主要源于跨域偏移以及图像纹理中固有的不适定模糊性，尤其在遮挡、无纹理、重复和非朗伯（镜面反射/透明）区域。为提升合成到真实的泛化能力，我们提出GREATEN框架，该框架引入表面法线作为域不变、物体本质且具有判别性的几何线索，以弥补图像纹理的局限性。所提框架包含三个关键组成部分：首先，门控上下文-几何融合模块自适应地抑制图像特征中不可靠的上下文线索，并将过滤后的图像特征与法线驱动的几何特征融合，构建域不变且具有判别性的上下文-几何表征；其次，镜面-透明增强策略提升GCGF模块在非朗伯区域中对误导性视觉线索的鲁棒性；第三，稀疏注意力设计在显著降低计算开销的同时，保留了GREAT-Stereo处理遮挡和纹理相关模糊性的细粒度全局特征提取能力，包括稀疏空间注意力、稀疏双匹配注意力和简单体积注意力。仅在SceneFlow等合成数据上训练后，GREATEN-IGEV实现了卓越的合成到真实泛化性能：相较于FoundationStereo、Monster-Stereo和DEFOM-Stereo，其在ETH3D上误差降低30%，在非朗伯数据集Booster上降低8.5%，在KITTI-2015上降低14.1%。此外，GREATEN-IGEV运行速度比GREAT-IGEV提升19.2%，并支持Middlebury数据集高达3K分辨率、视差范围达768的高分辨率推理。

摘要 (Abstract)

Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.

关键词: stereo matching, synthetic-to-real generalization, surface normals, sparse attention, domain-invariant features, geometric cues, computational efficiency, zero-shot learning

186. ❌ Strips as Tokens: Artist Mesh Generation with Native UV Segmentation

作者: Rui Xu, Dafei Qin, Kaichun Qiao, Qiujie Dong, Huaijin Pi, Qixuan Zhang, Longwen Zhang, Lan Xu, Jingyi Yu, Wenping Wang, Taku Komura 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09132v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机图形学中的网格生成，使用自回归变换器（autoregressive transformers）作为技术基础，但研究内容与所有评分关键词（主要关于大语言模型、深度学习技术原理及其在科学领域的应用）完全无关。论文未涉及任何大模型技术、AI for Science应用或深度学习创新原理，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为SATO的新框架，通过受三角形条带启发的标记排序策略，解决了现有方法在生成艺术家质量网格时标记排序策略不满足专业标准的问题，实现了更好的几何质量、结构连贯性和UV分割效果。

摘要翻译

近期自回归变换器在生成艺术家级网格方面展现出显著潜力。然而，现有方法采用的标记排序策略通常难以达到专业艺术家标准：基于坐标的排序会产生低效的长序列，而基于面片启发式的方法会破坏高质量建模所必需的连续边流与结构规整性。为突破这些局限，我们提出“条带即标记”（SATO）这一新颖框架，其标记排序策略受三角形条带启发。通过将序列构建为显式编码UV边界的面连接链，我们的方法天然保留了艺术家创作网格特有的有序边流与语义布局特征。该框架的一个关键优势在于其统一表示能力——同一标记序列可解码为三角形或四边形网格。这种灵活性支持对两类数据进行联合训练：大规模三角形数据提供基础结构先验，而高质量四边形数据则增强输出的几何规整性。大量实验表明，SATO在几何质量、结构连贯性与UV分割效果上均持续优于现有方法。

摘要 (Abstract)

Recent advancements in autoregressive transformers have demonstrated remarkable potential for generating artist-quality meshes. However, the token ordering strategies employed by existing methods typically fail to meet professional artist standards, where coordinate-based sorting yields inefficiently long sequences, and patch-based heuristics disrupt the continuous edge flow and structural regularity essential for high-quality modeling. To address these limitations, we propose Strips as Tokens (SATO), a novel framework with a token ordering strategy inspired by triangle strips. By constructing the sequence as a connected chain of faces that explicitly encodes UV boundaries, our method naturally preserves the organized edge flow and semantic layout characteristic of artist-created meshes. A key advantage of this formulation is its unified representation, enabling the same token sequence to be decoded into either a triangle or quadrilateral mesh. This flexibility facilitates joint training on both data types: large-scale triangle data provides fundamental structural priors, while high-quality quad data enhances the geometric regularity of the outputs. Extensive experiments demonstrate that SATO consistently outperforms prior methods in terms of geometric quality, structural coherence, and UV segmentation.

关键词: autoregressive transformers, mesh generation, token ordering, triangle strips, UV segmentation, artist-quality meshes, SATO framework, geometric quality

187. ❌ Few-Shot Personalized Age Estimation

作者: Jakub Paplhám, Vojtěch Franc, Artem Moroz 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09125v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究个性化年龄估计，属于计算机视觉领域，主要涉及人脸分析、基准测试和个性化建模。所有评分关键词均与大模型、深度学习技术原理或AI for Science相关，而本文完全不涉及这些主题，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了首个开放基准OpenPAE用于少样本个性化年龄估计，通过建立层次化基线方法并实验验证了个人化能显著提升年龄估计性能。

摘要翻译

现有年龄估计方法将每张人脸视为独立样本，学习从外貌到年龄的全局映射关系。这种做法忽略了一个有充分记录的现象：由于遗传、生活方式和健康状况的差异，个体的衰老速率各不相同，使得从人脸到年龄的映射具有身份依赖性。当能够获取同一人物已知年龄的参考图像时，我们可以利用这种上下文信息对估计进行个性化调整。该任务目前唯一的基准测试（NIST FRVT）是闭源的，且仅限于单张参考图像。在本研究中，我们提出了OpenPAE——首个面向$N$-样本个性化年龄估计的开放基准测试平台，并制定了严格的评估协议。我们构建了由简至繁的基线方法体系：从算术偏移量法，到闭式贝叶斯线性回归，再到条件注意力神经过程。实验表明：个性化方法能持续提升性能，这种提升并非仅源于领域适应；非线性方法显著优于简单替代方案。我们公开了所有模型、代码、协议及评估划分数据。

摘要 (Abstract)

Existing age estimation methods treat each face as an independent sample, learning a global mapping from appearance to age. This ignores a well-documented phenomenon: individuals age at different rates due to genetics, lifestyle, and health, making the mapping from face to age identity-dependent. When reference images of the same person with known ages are available, we can exploit this context to personalize the estimate. The only existing benchmark for this task (NIST FRVT) is closed-source and limited to a single reference image. In this work, we introduce OpenPAE, the first open benchmark for $N$-shot personalized age estimation with strict evaluation protocols. We establish a hierarchy of increasingly sophisticated baselines: from arithmetic offset, through closed-form Bayesian linear regression, to a conditional attentive neural process. Our experiments show that personalization consistently improves performance, that the gains are not merely domain adaptation, and that nonlinear methods significantly outperform simpler alternatives. We release all models, code, protocols, and evaluation splits.

关键词: personalized age estimation, few-shot learning, face analysis, benchmark, OpenPAE, conditional attentive neural process, domain adaptation, Bayesian linear regression

188. ❌ FaceLiVTv2: An Improved Hybrid Architecture for Efficient Mobile Face Recognition

作者: Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09127v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于移动设备上的轻量级人脸识别架构（FaceLiVTv2），属于计算机视觉领域而非大语言模型（LLM）研究。与大多数关键词（如LLM、MoE、对齐、RAG等）完全无关。仅与三个关键词有间接关联：1）‘Small Language Models OR SLMs OR On-device AI’（5分）：论文涉及移动/边缘设备上的高效AI部署，属于设备端AI范畴，但针对视觉任务而非语言模型；2）‘Quantization OR Model Compression OR Low-bit Weights’（5分）：论文通过轻量级架构设计（如Lite MHLA、RepMix块）实现模型效率优化，属于广义的模型压缩/效率提升；3）‘Speculative Decoding OR Inference Acceleration’（5分）：论文明确优化推理延迟（如22-41%的延迟改进），属于推理加速范畴，但针对视觉模型而非LLM解码。其他关键词均无直接或间接关联。

!!! tip deepseek-chat TL;DR

该论文提出了FaceLiVTv2，一种改进的混合CNN-Transformer架构，通过轻量级全局令牌交互模块（Lite MHLA）和统一的RepMix块，在移动设备上实现了更优的人脸识别准确性与效率平衡，相比现有方法减少了22-41%的推理延迟并保持了更高精度。

摘要翻译

轻量化人脸识别对于部署在边缘和移动设备上日益重要，这些设备在要求可靠准确性的同时，还必须满足对延迟、内存和能耗的严格限制。尽管近期的混合CNN-Transformer架构在全局上下文建模方面取得了进展，但在识别性能与计算效率之间实现有效平衡仍是一个开放的挑战。本文提出FaceLiVTv2，这是我们面向移动端人脸识别的混合架构FaceLiVT的改进版本，旨在实现高效的全局-局部特征交互。其核心是Lite MHLA（轻量级多头线性注意力）模块，这是一个轻量化的全局令牌交互模块，它通过多头线性令牌投影和仿射重缩放变换取代了原始的多层注意力设计，在保持多头间表征多样性的同时减少了冗余。我们进一步将Lite MHLA集成到一个统一的RepMix块中，该块协调局部与全局特征交互，并在嵌入阶段采用全局深度卷积进行自适应空间聚合。在我们的实验设置下，在LFW、CA-LFW、CP-LFW、CFP-FP、AgeDB-30和IJB数据集上的结果表明，与现有轻量化方法相比，FaceLiVTv2持续改善了精度-效率权衡。值得注意的是，FaceLiVTv2在移动端的推理延迟相较于FaceLiVTv1降低了22%，在移动设备上相比GhostFaceNets实现了高达30.8%的加速，并且在不同平台上相比EdgeFace和KANFace带来了20-41%的延迟改善，同时保持了更高的识别准确率。这些结果表明，FaceLiVTv2为实时人脸识别提供了一个实用且可部署的解决方案。代码发布于https://github.com/novendrastywn/FaceLiVT。

摘要 (Abstract)

Lightweight face recognition is increasingly important for deployment on edge and mobile devices, where strict constraints on latency, memory, and energy consumption must be met alongside reliable accuracy. Although recent hybrid CNN-Transformer architectures have advanced global context modeling, striking an effective balance between recognition performance and computational efficiency remains an open challenge. In this work, we present FaceLiVTv2, an improved version of our FaceLiVT hybrid architecture designed for efficient global–local feature interaction in mobile face recognition. At its core is Lite MHLA, a lightweight global token interaction module that replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving representational diversity across heads. We further integrate Lite MHLA into a unified RepMix block that coordinates local and global feature interactions and adopts global depthwise convolution for adaptive spatial aggregation in the embedding stage. Under our experimental setup, results on LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB show that FaceLiVTv2 consistently improves the accuracy-efficiency trade-off over existing lightweight methods. Notably, FaceLiVTv2 reduces mobile inference latency by 22% relative to FaceLiVTv1, achieves speedups of up to 30.8% over GhostFaceNets on mobile devices, and delivers 20-41% latency improvements over EdgeFace and KANFace across platforms while maintaining higher recognition accuracy. These results demonstrate that FaceLiVTv2 offers a practical and deployable solution for real-time face recognition. Code is available at https://github.com/novendrastywn/FaceLiVT.

关键词: Face Recognition, Mobile Deployment, Hybrid CNN-Transformer, Lightweight Architecture, Inference Acceleration, Global-Local Feature Interaction, Efficiency-Accuracy Trade-off, Edge Computing

189. ❌ FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval

作者: François Gardères, Camille-Sovanneary Gauthier, Jean Ponce, Shizhe Chen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09114v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

评分理由: 论文FIRE-CIR专注于计算机视觉和自然语言处理的交叉领域，具体研究组合图像检索（CIR）中的细粒度推理问题，特别是时尚领域的应用。论文的核心是视觉-语言模型（VLM）和视觉问答（VQA）技术，用于提升检索的准确性和可解释性。然而，所有给定的评分关键词均围绕大语言模型（LLM）及其相关技术（如MoE、Scaling Laws、RLHF、RAG、Agent等）、大模型优化技术（如量化、推理加速）或特定科学领域AI应用（如生物信息学）。论文未涉及任何大语言模型技术、大模型训练/对齐方法、代理系统或科学AI应用，因此所有关键词相关度均为0分。论文的创新在于视觉推理和可解释性，但不在评分关键词定义的大模型技术范围内。

!!! tip deepseek-chat TL;DR

该论文针对时尚领域的组合图像检索任务，提出了一种基于问题驱动视觉推理的FIRE-CIR模型，通过自动生成属性聚焦的视觉问题并验证图像证据来提升检索准确性和可解释性，在Fashion IQ基准上超越了现有方法。

摘要翻译

组合图像检索（CIR）的目标是根据经过文本描述修改的参考图像，检索出符合条件的目标图像。尽管当前的视觉语言模型（VLMs）通过将图像和文本嵌入共享空间进行检索，在CIR任务中取得了良好效果，但这些模型往往难以推理应保留和应修改的内容。这一局限影响了模型的可解释性，并导致检索结果欠佳，在时尚等细粒度领域中尤为明显。本文提出FIRE-CIR模型，该模型为时尚领域的组合图像检索引入了组合推理与可解释性。区别于单纯依赖嵌入相似度的方法，FIRE-CIR执行问题驱动的视觉推理：它自动从修改文本中生成以属性为中心的视觉问题，并在参考图像与候选图像中验证对应的视觉证据。为训练此推理系统，我们自动构建了大规模时尚领域专用视觉问答数据集，其中包含需要单图或双图分析的问题。在检索过程中，本模型利用这种显式推理对候选结果进行重排序，筛除与预期修改不符的图像。在Fashion IQ基准测试上的实验结果表明，FIRE-CIR在检索准确率上超越了现有最优方法，同时能为检索决策提供可解释的属性级分析依据。

摘要 (Abstract)

Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion. In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images. To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications. Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.

关键词: Composed Image Retrieval, Fashion Retrieval, Visual Reasoning, Visual Question Answering, Interpretability, Vision-Language Models, Fine-grained Analysis, Attribute-focused Questions

190. ❌ Detecting Diffusion-generated Images via Dynamic Assembly ForestsDetecting Diffusion-generated Images via Dynamic Assembly Forests

作者: Mengxin Fu, Yuezun Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09106v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究扩散模型生成的图像检测，提出了一种基于深度森林的动态组装森林模型（DAF）。所有关键词均与大语言模型（LLM）相关，而本文专注于计算机视觉中的扩散模型检测，未涉及LLM的任何方面（如预训练、微调、推理、对齐、代理、科学应用等），因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于深度森林的动态组装森林模型（DAF），用于检测扩散模型生成的图像，在减少参数和计算成本的同时实现了与深度神经网络方法竞争的性能。

摘要翻译

扩散模型因能生成高质量图像而引发严重的安全隐忧。为应对此问题，现有研究大多依赖深度神经网络（如卷积神经网络与Transformer），却普遍忽视了传统机器学习模型的潜力。本文创新性地探索了此类替代方案，并提出一种新颖的动态集成森林模型（Dynamic Assembly Forest, DAF）用于检测扩散生成的图像。该模型基于深度森林框架构建，解决了特征学习与可扩展训练中的固有局限，使其成为高效的扩散生成图像检测器。与现有基于深度神经网络的方法相比，DAF参数量显著减少，计算成本大幅降低，且无需GPU即可部署，同时在标准评估协议下取得了具有竞争力的性能。这些结果表明，在资源受限的场景中，所提方法具备成为重型深度神经网络模型实用替代方案的巨大潜力。我们的代码与模型已公开于https://github.com/OUC-VAS/DAF。

摘要 (Abstract)

Diffusion models are known for generating high-quality images, causing serious security concerns. To combat this, most efforts rely on deep neural networks (e.g., CNNs and Transformers), while largely overlooking the potential of traditional machine learning models. In this paper, we freshly investigate such alternatives and proposes a novel Dynamic Assembly Forest model (DAF) to detect diffusion-generated images. Built upon the deep forest paradigm, DAF addresses the inherent limitations in feature learning and scalable training, making it an effective diffusion-generated image detector. Compared to existing DNN-based methods, DAF has significantly fewer parameters, much lower computational cost, and can be deployed without GPUs, while achieving competitive performance under standard evaluation protocols. These results highlight the strong potential of the proposed method as a practical substitute for heavyweight DNN models in resource-constrained scenarios. Our code and models are available at https://github.com/OUC-VAS/DAF.

关键词: Diffusion-generated images, Image detection, Dynamic Assembly Forest, Deep forest, Computational efficiency, Resource-constrained scenarios, Traditional machine learning

191. ❌ Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

作者: Gabriele Mario Caddeo, Pasquale Marra, Lorenzo Natale 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09100v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究多模态物理交互下的3D物体重建，主要涉及计算机视觉、机器人学和触觉感知，与大多数大模型技术关键词无关。仅与’Pre-training OR Continual Pre-training OR Domain Adaptation’和’Post-training OR Supervised Fine-tuning OR SFT’有中等关联（5分），因为论文提到在仅视觉数据上预训练扩散模型，并在遮挡场景上微调。与’AI for Science OR Bioinformatics OR Cheminformatics’有中等关联（5分），因其属于AI在机器人科学领域的应用。其他关键词均未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种利用本体感觉和多触点触觉的物理基础多模态方法，用于在严重手部遮挡下进行度量尺度的无模态物体重建和姿态估计，实验表明该方法比仅视觉基线显著提高了遮挡下的完成度并产生了物理上合理的重建。

摘要翻译

我们提出一种多模态、物理基础的方法，用于在严重手部遮挡下进行公制尺度的非模态物体重建与姿态估计。与以往仅依赖视觉的遮挡感知三维生成方法不同，我们利用物理交互信号：本体感觉提供已摆姿的手部几何信息，而多触点触摸则约束物体表面必须位于的位置，从而减少遮挡区域的歧义性。我们将物体结构表示为一种姿态感知、相机对齐的有符号距离场（SDF），并通过结构变分自编码器（Structure-VAE）学习一个紧凑的潜在空间。在此潜在空间中，我们训练一个条件流匹配扩散模型，先在仅视觉图像上进行预训练，再在遮挡操作场景上进行微调，同时以可见RGB证据、遮挡物/可见性掩码、手部潜在表示及触觉信息为条件。关键的是，我们在微调和推理过程中引入了基于物理的目标函数和可微分解码器引导，以减少手-物体间的相互穿透，并使重建表面与接触观测对齐。由于我们的方法能够生成公制化、物理一致的结构估计，它可以自然地集成到现有的两阶段重建流程中，由下游模块进一步优化几何并预测外观。仿真实验表明，与仅视觉基线相比，加入本体感觉和触觉信息能显著提升遮挡下的补全效果，并在正确的真实世界尺度上产生物理合理的重建结果；我们进一步通过将模型部署于一个具有与训练所用末端执行器不同的真实仿人机器人上，验证了其迁移能力。

摘要 (Abstract)

We propose a multimodal, physically grounded approach for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. Unlike prior occlusion-aware 3D generation methods that rely only on vision, we leverage physical interaction signals: proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface must lie, reducing ambiguity in occluded regions. We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand–object interpenetration and to align the reconstructed surface with contact observations. Because our method produces a metric, physically consistent structure estimate, it integrates naturally into existing two-stage reconstruction pipelines, where a downstream module refines geometry and predicts appearance. Experiments in simulation show that adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines; we further validate transfer by deploying the model on a real humanoid robot with an end-effector different from those used during training.

关键词: 3D generative reconstruction, hand occlusion, proprioception, multi-contact touch, physically grounded, diffusion model, metric-scale, object pose estimation

192. ❌ Off-the-shelf Vision Models Benefit Image Manipulation Localization

作者: Zhengxuan Zhang, Keji Song, Junmin Hu, Ao Luo, Yuezun Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09096v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的图像篡改定位任务，提出了一种名为ReVi的可训练适配器，用于重新利用现成的通用视觉模型（如图像生成和分割网络）。论文的核心内容涉及视觉模型适配、特征解缠和参数高效微调，但所有给定的关键词都专门针对大语言模型（LLM）及其相关技术（如MoE、RLHF、RAG、量化等），或特定于LLM的应用（如AI for Science）。论文未涉及任何语言模型、大模型技术原理或LLM在不同领域的应用，因此与所有关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为ReVi的可训练适配器，通过解缠语义冗余和增强篡改特定信息，使现成的通用视觉模型能够有效应用于图像篡改定位任务，实现了优于现有方法的性能。

摘要翻译

图像篡改定位与通用视觉任务通常被视为两个独立的研究方向，这源于篡改特定特征与语义特征之间的本质差异。然而，本文通过引入一种新颖视角弥合了这一鸿沟：这两个方向本质上是相互关联的，且通用语义先验能够助力图像篡改定位。基于这一洞见，我们提出了一种可训练适配器（命名为ReVi），其能够将现有的通用视觉模型（如图像生成与分割网络）重新应用于图像篡改定位任务。受鲁棒主成分分析的启发，该适配器从这些模型所嵌入的信息中分离出语义冗余与篡改特定信息，并选择性增强后者。与现有需要大量模型重构和完整重训练的篡改定位方法不同，我们的方法基于参数冻结的现成视觉模型，仅需微调所提出的适配器。实验结果证明了本方法的优越性，展现了可扩展图像篡改定位框架的潜力。

摘要 (Abstract)

Image manipulation localization (IML) and general vision tasks are typically treated as two separate research directions due to the fundamental differences between manipulation-specific and semantic features. In this paper, however, we bridge this gap by introducing a fresh perspective: these two directions are intrinsically connected, and general semantic priors can benefit IML. Building on this insight, we propose a novel trainable adapter (named ReVi) that repurposes existing off-the-shelf general-purpose vision models (e.g., image generation and segmentation networks) for IML. Inspired by robust principal component analysis, the adapter disentangles semantic redundancy from manipulation-specific information embedded in these models and selectively enhances the latter. Unlike existing IML methods that require extensive model redesign and full retraining, our method relies on the off-the-shelf vision models with frozen parameters and only fine-tunes the proposed adapter. The experimental results demonstrate the superiority of our method, showing the potential for scalable IML frameworks.

关键词: Image manipulation localization, Off-the-shelf vision models, Trainable adapter, Semantic priors, Feature disentanglement, Parameter-efficient fine-tuning, Robust principal component analysis, Scalable IML frameworks

193. ❌ Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

作者: Yutong Zhang, Jiaxin Chen, Honglin Chen, Kaiqi Zheng, Shengcai Liao, Hanwen Zhong, Weixin Li, Yunhong Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09088v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出了一种名为Masked Dual Path Distillation (MDPD)的方法，专注于内存高效的迁移学习，通过蒸馏技术优化微调过程，并在推理时丢弃侧网络以加速推理。核心相关关键词包括：1) ‘Post-training OR Supervised Fine-tuning OR SFT’ (10分)：论文直接研究微调方法，是核心内容；2) ‘PEFT OR LoRA OR Parameter-efficient Fine-tuning’ (10分)：论文属于参数高效微调范畴，是核心创新；3) ‘Pre-training OR Continual Pre-training OR Domain Adaptation’ (5分)：论文基于预训练模型进行迁移学习，有一定关联；4) ‘Large Language Models OR LLMs OR Foundation Models’ (5分)：论文在语言任务中应用，但未明确限定于LLMs；5) ‘Speculative Decoding OR Inference Acceleration’ (5分)：论文加速推理，但非解码优化。其他关键词与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Masked Dual Path Distillation (MDPD)的内存高效迁移学习方法，通过蒸馏技术优化微调过程，在推理时丢弃侧网络，实现了至少25.2%的推理加速，同时保持参数和内存效率，并在多个任务上提升了准确性。

摘要翻译

内存高效迁移学习（Memory-efficient Transfer Learning，METL）方法近期在将预训练模型适配至下游任务方面展现出优异性能。这类方法避免了在大型骨干网络中进行梯度反向传播，从而显著减少了微调过程中的可训练参数量与高内存消耗。然而，由于它们通常采用轻量级可学习的旁路网络，这些方法在推理时不可避免地引入了额外的内存与时间开销，这与高效迁移学习的最终目标相悖。为解决上述问题，我们提出了一种名为掩码双路蒸馏（Masked Dual Path Distillation，MDPD）的新方法，旨在加速推理过程，同时在采用渐消式旁路网络的微调中保持参数与内存效率。具体而言，MDPD构建了一个通过双向蒸馏冻结骨干网络与可学习旁路网络以提升微调性能的框架，并能在推理阶段舍弃旁路网络而不损失精度。此外，我们针对多层编码器结构设计了一种新颖的基于特征的知识蒸馏方法。在视觉/纯语言及视觉-语言任务上的多种骨干网络实验表明，我们的方法不仅能在保持参数与内存消耗相当的同时将推理速度提升至少25.2%，而且相较于当前最优方法显著提高了准确率。源代码已发布于https://github.com/Zhang-VKk/MDPD。

摘要 (Abstract)

Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing the number of trainable parameters and high memory consumption during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances the performance by mutually distilling the frozen backbones and learnable side networks in fine-tuning, and discard the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2% while keeping parameter and memory consumption comparable, but also remarkably promotes the accuracy compared to SOTA approaches. The source code is available at https://github.com/Zhang-VKk/MDPD.

关键词: Memory-efficient transfer learning, Masked Dual Path Distillation, fine-tuning, parameter-efficient, knowledge distillation, inference acceleration, side networks, vision-and-language tasks

作者: Arbel Hizmi, Artemii Bakulin, Shai Bagon, Nir Yosef 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09076v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 该论文研究跨模态知识蒸馏，将空间转录组学知识迁移到组织学图像分析中，属于生物信息学领域的AI应用。与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、智能体等）完全无关，仅与’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联，因为论文属于生物信息学中的AI应用，但并非核心大模型技术，故给8分。

!!! tip deepseek-chat TL;DR

该论文提出了一种跨模态知识蒸馏方法，利用配对的空间转录组学和H&E组织学数据，将转录组学衍生的组织生态结构迁移到仅使用组织学图像的模型中，从而在推理时仅需组织学图像即可恢复有生物学意义的组织区域结构。

摘要翻译

空间转录组学为组织架构提供了分子层面的丰富描述，使其能够无监督地发现组织微环境——即具有独特细胞类型组成与功能、且与生物学研究和临床解读相关的空间连贯区域。然而，空间转录组学技术仍成本高昂且数据稀缺，而苏木精-伊红（H&E）染色组织学图像虽数量庞大，却携带的信号粒度较粗。我们提出利用配对的空间转录组学与H&E数据，通过跨模态蒸馏将转录组学衍生的微环境结构迁移至仅需组织学图像的模型中。在多种组织类型及疾病背景下，相较于使用相同图像特征训练的无监督形态学基线模型，该蒸馏模型与转录组学衍生的微环境结构达成显著更高的一致性，并通过细胞类型分析证实其恢复了具有生物学意义的邻域组成。所提出的框架在训练阶段利用配对的空间转录组学与H&E数据，而后可在推理阶段仅使用组织学图像应用于未参与训练的组织区域，无需任何转录组学输入。

摘要 (Abstract)

Spatial transcriptomics provides a molecularly rich description of tissue organization, enabling unsupervised discovery of tissue niches – spatially coherent regions of distinct cell-type composition and function that are relevant to both biological research and clinical interpretation. However, spatial transcriptomics remains costly and scarce, while H&E histology is abundant but carries a less granular signal. We propose to leverage paired spatial transcriptomics and H&E data to transfer transcriptomics-derived niche structure to a histology-only model via cross-modal distillation. Across multiple tissue types and disease contexts, the distilled model achieves substantially higher agreement with transcriptomics-derived niche structure than unsupervised morphology-based baselines trained on identical image features, and recovers biologically meaningful neighborhood composition as confirmed by cell-type analysis. The resulting framework leverages paired spatial transcriptomic and H&E data during training, and can then be applied to held-out tissue regions using histology alone, without any transcriptomic input at inference time.

关键词: cross-modal knowledge distillation, spatial transcriptomics, histology, tissue niches, H&E, bioinformatics, AI for science, unsupervised learning

195. ❌ Nested Radially Monotone Polar Occupancy Estimation: Clinically-Grounded Optic Disc and Cup Segmentation for Glaucoma Screening

作者: Rimsa Goperma, Rojan Basnet, Liang Zhao 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09062v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于医学图像分割（眼底照片中的视盘和视杯分割），属于计算机视觉和医学影像分析领域。论文提出的NPS-Net框架是一种深度学习架构，用于解决特定临床约束下的分割问题，并强调零样本泛化能力。所有关键词均围绕大语言模型（LLM）及其相关技术（如训练方法、推理优化、代理系统等），而本文完全不涉及LLM、自然语言处理或通用AI基础模型。唯一略有相关的是’AI for Science OR Bioinformatics OR Cheminformatics’，因为医学影像分析可视为AI在科学（生物医学）领域的应用，但论文核心是视觉分割而非典型的生物信息学或化学信息学，因此给予5分（有一定关联）。其他关键词与论文内容完全无关，均评0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为NPS-Net的深度学习框架，用于从眼底照片中分割视盘和视杯，以解决现有方法无法保证临床有效性（如星凸性和嵌套结构）的问题，并在多个数据集上实现了高精度和强大的零样本泛化性能。

摘要翻译

从眼底照片中准确分割视盘（OD）与视杯（OC）是青光眼筛查的关键环节。然而，现有深度学习方法无法保证临床有效性约束——包括OD与OC的星形凸性及嵌套结构，这导致诊断指标（尤其在跨数据集域偏移情况下）出现偏差。为解决此问题，本文提出NPS-Net（嵌套极坐标形状网络），这是首个将OD/OC分割建模为嵌套径向单调极坐标占据估计的框架。该输出表征能严格保障上述临床有效性约束，同时实现高精度分割。在七个公开数据集上的评估表明，NPS-Net展现出强大的零样本泛化能力：在RIM-ONE数据集上保持100%解剖结构有效性，杯区Dice系数较最佳基线绝对提升12.8%，垂直杯盘比（vCDR）平均绝对误差降低超56%；在PAPILA数据集上取得视盘Dice系数0.9438、视盘豪斯多夫距离（HD95）2.78像素的指标，后者较最优对比方法降低83%。

摘要 (Abstract)

Valid segmentation of the optic disc (OD) and optic cup (OC) from fundus photographs is essential for glaucoma screening. Unfortunately, existing deep learning methods do not guarantee clinical validness including star-convexity and nested structure of OD and OC, resulting corruption in diagnostic metric, especially under cross-dataset domain shift. To adress this issue, this paper proposed NPS-Net (Nested Polar Shape Network), the first framework that formulates the OD/OC segmentation as nested radially monotone polar occupancy estimation.This output representation can guarantee the aforementioned clinical validness and achieve high accuracy. Evaluated across seven public datasets, NPS-Net shows strong zero-shot generalization. On RIM-ONE, it maintains 100% anatomical validity and improves Cup Dice by 12.8% absolute over the best baseline, reducing vCDR MAE by over 56%. On PAPILA, it achieves Disc Dice of 0.9438 and Disc HD95 of 2.78 px, an 83% reduction over the best competing method.

关键词: optic disc segmentation, optic cup segmentation, glaucoma screening, fundus photographs, nested polar shape network, clinical validity, zero-shot generalization, deep learning

196. ❌ Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

作者: Junchao Liao, Zhenghao Zhang, Xiangyu Meng, Litao Li, Ziying Zhang, Siyu Zhu, Long Qin, Weizhi Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09057v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文Tora3专注于音频-视频生成任务，通过轨迹引导的框架提升物理一致性，属于多模态生成领域。所有评分关键词均针对大模型/深度学习技术原理或特定应用（如AI for Science），而本文不涉及语言模型、模型训练/优化技术、推理方法、代理系统、模型压缩等主题，也未应用于科学领域（如生物信息学）。因此，所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

本文提出了Tora3轨迹引导的音频-视频生成框架，通过共享对象轨迹作为运动先验，解决了现有方法中运动-声音关系不协调的问题，显著提升了生成内容的物理一致性和质量。

摘要翻译

视听（Audio-Video, AV）生成技术近期在感知质量与多模态一致性方面取得显著进展，但生成具有合理运动-声音关联的内容仍具挑战。现有方法常产生视觉上不稳定的物体运动，且生成的声音仅与显著运动或接触事件松散对齐，这主要源于缺乏视频与音频生成共享的显式运动感知结构。本文提出Tora3，一种轨迹引导的视听生成框架，通过将物体轨迹作为共享的运动学先验来提升物理一致性。Tora3并非将轨迹仅视为视频控制信号，而是以其共同引导视觉运动与声学事件。具体而言，我们设计了面向视频的轨迹对齐运动表征、由轨迹导出的二阶运动学状态驱动的运动-音频对齐模块，以及一种混合流匹配方案——该方案在轨迹条件区域内保持轨迹保真度，同时在其余区域维持局部一致性。我们还构建了PAV数据集，这是一个强调运动相关模式的大规模视听数据集，附带自动提取的运动标注。大量实验表明，相较于现有强开源基线模型，Tora3在运动真实感、运动-声音同步性及整体视听生成质量上均有提升。

摘要 (Abstract)

Audio-video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion-sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation. We present Tora3, a trajectory-guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video-only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory-aligned motion representation for video, a kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and a hybrid flow matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. We further curate PAV, a large-scale AV dataset emphasizing motion-relevant patterns with automatically extracted motion annotations. Extensive experiments show that Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.

关键词: audio-video generation, trajectory-guided, physical coherence, motion-sound synchronization, kinematic prior, flow matching, multimodal generation, motion realism

197. ❌ Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting

作者: Tsuheng Hsu, Guiyu Liu, Juho Kannala, Janne Heikkilä 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09045v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于3D场景理解和对象中心表示学习，使用3D高斯泼溅和槽注意力等技术，属于计算机视觉和3D重建领域。所有评分关键词均与大语言模型、深度学习技术原理或AI for Science应用直接相关，而本文研究内容完全不涉及这些主题，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文解决了3D高斯泼溅中对象表示学习依赖场景特定监督的问题，提出了一种基于场景无关对象码本的监督方案，实现了更好的跨场景泛化和下游任务性能。

摘要翻译

近期关于三维场景理解的研究利用视觉基础模型（VFMs）生成的二维掩码来监督辐射场，实现了实例级的三维分割。然而，基础模型提供的监督信号本质上并非以物体为中心，通常需要额外的掩码前/后处理或专门设计的训练与损失函数来解决多视角间的掩码身份冲突。所学习的三维场景身份具有场景依赖性，限制了跨场景的泛化能力。为此，我们提出一种数据集级别的、以物体为中心的监督方案，用于在三维高斯泼溅（3DGS）中学习物体表征。基于预训练的基于槽注意力的全局物体中心学习（GOCL）模块，我们学习了一个场景无关的物体码本，该码本能够提供跨视角和跨场景的一致性、身份锚定的表征。通过将该码本与模块的无监督物体掩码相结合，我们可以直接监督三维高斯点的身份特征，而无需额外的掩码前/后处理或显式的多视角对齐。学习到的场景无关码本使得物体监督与识别无需进行逐场景微调或重新训练。因此，我们的方法将无监督的物体中心学习（OCL）引入三维高斯泼溅，从而产生更具结构化的表征，并为机器人交互、场景理解和跨场景泛化等下游任务带来更好的泛化性能。

摘要 (Abstract)

Recent works on 3D scene understanding leverage 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre/post-processing or specialized training and loss design to resolve mask identity conflicts across views. The learned identity of the 3D scene is scene-dependent, limiting generalizability across scenes. Therefore, we propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module’s unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better generalization for downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.

关键词: 3D Gaussian Splatting, object-centric representation learning, scene-agnostic codebook, unsupervised object masks, slot attention, 3D scene understanding, cross-scene generalization, instance-level 3D segmentation

198. ❌ Towards Lifelong Aerial Autonomy: Geometric Memory Management for Continual Visual Place Recognition in Dynamic Environments

作者: Xingyu Shao, Zhiqiang Yan, Liangzheng Sun, Mengfan He, Chao Chen, Jinhui Zhang, Chunyu Li, Ziyang Meng 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09038v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的视觉地点识别（VPR）和持续学习（CL）问题，提出了一种用于动态环境中连续视觉地点识别的几何记忆管理框架。论文内容涉及无人机自主导航、地理定位、记忆管理、灾难性遗忘等计算机视觉和机器学习概念，但完全不涉及大语言模型（LLM）、深度学习技术原理创新或大模型在不同领域的应用。所有评分关键词均与大语言模型、深度学习技术原理或大模型应用相关，而该论文的研究领域和方法论与这些关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文针对动态环境中长期空中自主导航的视觉地点识别问题，提出了一种解耦静态卫星锚点和动态经验回放缓冲区的异构记忆框架，通过空间约束分配策略优化缓冲区选择，显著提升了空间泛化能力并缓解了灾难性遗忘。

摘要翻译

在不断变化的环境条件下实现稳健的地理定位对于长期空中自主至关重要。虽然当机载视图与训练域匹配时，视觉位置识别模型表现良好，但在连续任务期间使其适应不断变化的分布会引发灾难性遗忘。现有的持续学习方法在此常告失败，因为地理特征表现出严重的类内差异。在本研究中，我们将空中视觉位置识别构建为一个基于任务的域增量学习问题，并提出了一种新颖的异构记忆框架。为满足严格的机载存储限制，我们的“学习与处置”流程将地理知识解耦为静态卫星锚点（保留全局几何先验）和动态经验回放缓冲区（保留特定域特征）。我们引入了一种空间约束的分配策略，该策略基于样本难度或特征空间多样性来优化缓冲区选择。为促进系统评估，我们提供了三个评估标准和一个源自21个不同任务序列的综合基准。大量实验表明，我们的架构显著提升了空间泛化能力；在知识保留方面，我们基于多样性的缓冲区选择策略比随机基线高出7.8%。与在非结构化环境中失效的类均值保留方法不同，最大化结构多样性实现了更优的可塑性-稳定性平衡，并确保了在随机序列中与顺序无关的鲁棒性。这些结果证明，对于解决终身空中自主中的灾难性遗忘问题，保持结构特征覆盖比样本难度更为关键。

摘要 (Abstract)

Robust geo-localization in changing environmental conditions is critical for long-term aerial autonomy. While visual place recognition (VPR) models perform well when airborne views match the training domain, adapting them to shifting distributions during sequential missions triggers catastrophic forgetting. Existing continual learning (CL) methods often fail here because geographic features exhibit severe intra-class variations. In this work, we formulate aerial VPR as a mission-based domain-incremental learning (DIL) problem and propose a novel heterogeneous memory framework. To respect strict onboard storage constraints, our “Learn-and-Dispose” pipeline decouples geographic knowledge into static satellite anchors (preserving global geometric priors) and a dynamic experience replay buffer (retaining domain-specific features). We introduce a spatially-constrained allocation strategy that optimizes buffer selection based on sample difficulty or feature space diversity. To facilitate systematic assessment, we provide three evaluation criteria and a comprehensive benchmark derived from 21 diverse mission sequences. Extensive experiments demonstrate that our architecture significantly boosts spatial generalization; our diversity-driven buffer selection outperforms the random baseline by 7.8% in knowledge retention. Unlike class-mean preservation methods that fail in unstructured environments, maximizing structural diversity achieves a superior plasticity-stability balance and ensures order-agnostic robustness across randomized sequences. These results prove that maintaining structural feature coverage is more critical than sample difficulty for resolving catastrophic forgetting in lifelong aerial autonomy.

关键词: visual place recognition, continual learning, catastrophic forgetting, aerial autonomy, memory management, domain-incremental learning, geometric memory, experience replay

作者: Jiahua Pang, Ying Li, Dongpu Cao, Jingcai Luo, Yanuo Zheng, Bao Yunfan, Yujie Lei, Rui Yuan, Yuxi Tian, Guojin Yuan, Hongchang Chen, Zhi Zheng, Yongchun Liu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09023v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于汽车相关的多任务视觉异常检测数据集创建和基准测试，属于计算机视觉领域，与所有提供的大模型和深度学习技术原理关键词完全无关，未涉及任何语言模型、训练技术、推理方法、对齐、压缩、代理系统或科学AI应用。

!!! tip deepseek-chat TL;DR

该论文提出了CAD 100K数据集，一个用于汽车相关多任务视觉异常检测的大规模基准，并通过实验表明多任务学习能促进任务交互和知识转移，但也揭示了任务间的冲突。

摘要翻译

多任务视觉异常检测对于汽车相关制造质量评估至关重要。然而，现有方法仍局限于特定任务，缺乏统一的多任务评估基准阻碍了该领域发展。为填补这一空白，我们提出了CAD数据集——一个专为汽车相关多任务视觉异常检测设计的大规模综合性基准。该数据集包含跨越7种车辆领域和3类检测任务的超过100张图像，为模型提供了汽车异常检测的全面视角。这是首个专注于多任务学习的汽车异常检测数据集，同时结合合成数据增强技术以应对少样本异常图像问题。我们实现了多任务基线模型并进行了广泛的实证研究。结果表明，多任务学习能促进任务交互与知识迁移，同时也揭示了任务间存在的挑战性冲突。CAD数据集可作为标准化平台，推动汽车相关多任务视觉异常检测领域的未来发展。

摘要 (Abstract)

Multi-task visual anomaly detection is critical for car-related manufacturing quality assessment. However, existing methods remain task-specific, hindered by the absence of a unified benchmark for multi-task evaluation. To fill in this gap, We present the CAD Dataset, a large-scale and comprehensive benchmark designed for car-related multi-task visual anomaly detection. The dataset contains over 100 images crossing 7 vehicle domains and 3 tasks, providing models a comprehensive view for car-related anomaly detection. It is the first car-related anomaly dataset specialized for multi-task learning(MTL), while combining synthesis data augmentation for few-shot anomaly images. We implement a multi-task baseline and conduct extensive empirical studies. Results show MTL promotes task interaction and knowledge transfer, while also exposing challenging conflicts between tasks. The CAD dataset serves as a standardized platform to drive future advances in car-related multi-task visual anomaly detection.

关键词: visual anomaly detection, multi-task learning, car-related, dataset, benchmark, synthesis data augmentation, knowledge transfer, task conflicts

200. ❌ NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Multi-Exposure Image Fusion in Dynamic Scenes (Track 2)

作者: Lishen Qu, Yao Liu, Jie Liang, Hui Zeng, Wen Dai, Guanyi Qin, Ya-nan Guan, Shihao Zhou, Jufeng Yang, Lei Zhang, Radu Timofte, Xiyuan Yuan, Wanjie Sun, Shihang Li, Bo Zhang, Bin Chen, Jiannan Lin, Yuxu Chen, Qinquan Gao, Tong Tong, Song Gao, Jiacong Tang, Tao Hu, Xiaowen Ma, Qingsen Yan, Sunhan Xu, Juan Wang, Xinyu Sun, Lei Qi, He Xu, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09030v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的多曝光图像融合挑战，涉及HDR成像、图像对齐、去伪影等技术，与所有提供的大模型和深度学习技术原理关键词（如LLM、MoE、RLHF、RAG等）以及AI for Science子领域均无直接关联。论文内容完全属于传统计算机视觉/图像处理范畴，未涉及任何大语言模型或相关技术。

!!! tip deepseek-chat TL;DR

该论文提出了NTIRE 2026多曝光图像融合挑战，针对动态场景中因运动、光照变化和相机抖动导致的图像错位和伪影问题，通过建立包含200个序列的基准数据集并评估了114个团队的987个提交，显著提升了多曝光融合的伪影去除和细节恢复能力。

摘要翻译

本文介绍了NTIRE 2026第三届“任意图像复原模型”（RAIM）挑战赛，聚焦于动态场景下的多曝光图像融合任务。我们提出了一个针对实用且高难度高动态范围成像场景的基准测试，该场景要求在存在物体运动、光照变化及手持相机抖动的情况下融合包围曝光序列。挑战数据包含100组训练序列（每序列7档曝光）和100组测试序列（每序列5档曝光），这些数据反映了真实世界中常导致图像错位与重影伪影的复杂情况。我们通过融合峰值信噪比、结构相似性指数和感知相似性指标得出的排行榜分数评估参赛方案，并在最终评审中综合考虑感知质量、计算效率与可复现性。本届赛道吸引了114支团队参与，共收到987份提交结果。优胜方法显著提升了消除多曝光融合伪影及恢复精细细节的能力。数据集及各团队代码已公开于代码库：https://github.com/qulishen/RAIM-HDR。

摘要 (Abstract)

This paper presents NTIRE 2026, the 3rd Restore Any Image Model (RAIM) challenge on multi-exposure image fusion in dynamic scenes. We introduce a benchmark that targets a practical yet difficult HDR imaging setting, where exposure bracketing must be fused under scene motion, illumination variation, and handheld camera jitter. The challenge data contains 100 training sequences with 7 exposure levels and 100 test sequences with 5 exposure levels, reflecting real-world scenarios that frequently cause misalignment and ghosting artefacts. We evaluate submissions with a leaderboard score derived from PSNR, SSIM, and LPIPS, while also considering perceptual quality, efficiency, and reproducibility during the final review. This track attracted 114 participating teams and received 987 submissions. The winning methods significantly improved the ability to remove artifacts from multi-exposure fusion and recover fine details. The dataset and the code of each team can be found at the repository: https://github.com/qulishen/RAIM-HDR.

关键词: multi-exposure image fusion, HDR imaging, dynamic scenes, ghosting artefacts, image alignment, PSNR, SSIM, LPIPS

201. ❌ BlendFusion – Scalable Synthetic Data Generation for Diffusion Model Training

作者: Thejas Venkatesh, Suguna Varshini Velury 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09022v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于扩散模型的合成数据生成，提出BlendFusion框架和FineBLEND数据集，涉及3D场景、路径追踪、图像-文本对生成等技术。所有评分关键词均与大语言模型（LLM）及其相关技术（如训练、推理、对齐、应用等）直接相关，而本文研究的是计算机视觉领域的扩散模型，两者在模型类型、技术方法和应用领域上完全不同，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对扩散模型训练中合成数据存在的视觉不一致和模型自噬问题，提出了一个基于3D场景路径追踪的可扩展合成数据生成框架BlendFusion，并构建了高质量的图像-文本对数据集FineBLEND。

摘要翻译

随着扩散模型的快速普及，合成数据生成已成为满足大规模图像数据集日益增长需求的一种前景广阔的方法。然而，完全由扩散模型生成的图像常存在视觉不一致性，在此类数据上训练模型可能形成自吞噬反馈循环，导致模型崩溃，这一现象通常被称为模型自噬障碍。为应对这些挑战，我们提出了BlendFusion——一个基于路径追踪技术、可从三维场景生成合成数据的可扩展框架。我们的流程整合了以物体为中心的相机布局策略、鲁棒过滤机制以及自动标注功能，以生成高质量的图像-文本对。利用该流程，我们构建了FineBLEND数据集，这是一个从多样化三维场景中创建的图像-文本数据集。我们通过实证分析了FineBLEND的质量，并将其与多个广泛使用的图像-文本数据集进行比较。同时，相较于物体无关的采样方法，我们验证了以物体为中心的相机布局策略的有效性。本开源框架设计具备高度可配置性，使研究社区能够基于三维场景创建自定义数据集。

摘要 (Abstract)

With the rapid adoption of diffusion models, synthetic data generation has emerged as a promising approach for addressing the growing demand for large-scale image datasets. However, images generated purely by diffusion models often exhibit visual inconsistencies, and training models on such data can create an autophagous feedback loop that leads to model collapse, commonly referred to as Model Autophagy Disorder (MAD). To address these challenges, we propose BlendFusion, a scalable framework for synthetic data generation from 3D scenes using path tracing. Our pipeline incorporates an object-centric camera placement strategy, robust filtering mechanisms, and automatic captioning to produce high-quality image-caption pairs. Using this pipeline, we curate FineBLEND, an image-caption dataset constructed from a diverse set of 3D scenes. We empirically analyze the quality of FineBLEND and compare it to several widely used image-caption datasets. We also demonstrate the effectiveness of our object-centric camera placement strategy relative to object-agnostic sampling approaches. Our open-source framework is designed for high configurability, enabling the community to create their own datasets from 3D scenes.

关键词: diffusion models, synthetic data generation, 3D scenes, path tracing, image-caption pairs, Model Autophagy Disorder, FineBLEND dataset, object-centric camera placement

202. ❌ Domain-generalizable Face Anti-Spoofing with Patch-based Multi-tasking and Artifact Pattern Conversion

作者: Seungjin Jung, Yonghyun Jeong, Minha Kim, Jimin Min, Youngjoon Yoo, Jongwon Choi 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09018v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究人脸反欺骗（Face Anti-Spoofing）的领域泛化问题，提出了一种基于生成对抗网络（PCGAN）和补丁多任务学习的方法。论文核心是计算机视觉中的图像生成、特征解耦和领域泛化技术，属于特定应用领域的深度学习研究。所有评分关键词均与大语言模型、大模型技术原理、AI for Science等主题相关，而本论文完全不涉及这些内容，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对人脸反欺骗算法在未知视觉域和欺骗方法上泛化能力不足的问题，提出了一种基于模式转换生成对抗网络（PCGAN）和补丁多任务学习的方法，有效提升了人脸识别系统的安全性。

摘要翻译

人脸防伪（Face Anti-Spoofing，FAS）算法旨在保护人脸识别系统免受伪造攻击，但受限于数据集多样性不足，其处理未知视觉域和伪造方法的能力受到影响。本文提出模式转换生成对抗网络（Pattern Conversion Generative Adversarial Network，PCGAN）以增强FAS中的域泛化能力。PCGAN能有效解耦伪造伪影与面部特征的潜在向量，从而生成具有多样化伪影的图像。我们进一步引入基于分块的学习与多任务学习策略，以应对局部攻击及对面部特征的过拟合问题。大量实验验证了PCGAN在域泛化和局部攻击检测方面的有效性，为人脸识别安全性带来了显著提升。

摘要 (Abstract)

Face Anti-Spoofing (FAS) algorithms, designed to secure face recognition systems against spoofing, struggle with limited dataset diversity, impairing their ability to handle unseen visual domains and spoofing methods. We introduce the Pattern Conversion Generative Adversarial Network (PCGAN) to enhance domain generalization in FAS. PCGAN effectively disentangles latent vectors for spoof artifacts and facial features, allowing to generate images with diverse artifacts. We further incorporate patch-based and multi-task learning to tackle partial attacks and overfitting issues to facial features. Our extensive experiments validate PCGAN’s effectiveness in domain generalization and detecting partial attacks, giving a substantial improvement in facial recognition security.

关键词: Face Anti-Spoofing, Domain Generalization, Generative Adversarial Network, Pattern Conversion, Patch-based Learning, Multi-task Learning, Spoof Artifacts, Facial Recognition Security

203. ❌ Robust by Design: A Continuous Monitoring and Data Integration Framework for Medical AI

作者: Mohammad Daouk, Jan Ulrich Becker, Neeraja Kambham, Anthony Chang, Chandra Mohan, Hien Van Nguyen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09009v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文专注于医学影像AI中的持续监控和数据集成框架，用于解决数据漂移问题。仅与关键词’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分），因为其直接应用于生物医学领域（肾小球病理图像分类）。其他关键词均涉及大模型技术原理或应用，而本文使用ResNet18（传统CNN模型），未涉及LLMs、MoE、推理加速、对齐等大模型相关技术，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种用于医学影像AI的自主持续监控和数据集成框架，通过多指标特征分析和不确定性门控机制，在动态临床环境中防止性能退化，并在肾小球病理图像分类实验中保持了稳定的AUC（约0.92）和准确率（约89%）。

摘要翻译

自适应医疗人工智能模型在动态临床环境中常因数据漂移而面临性能下降。我们提出了一种自主持续监测与数据集成框架，以维持模型随时间推移的稳健性能。针对肾小球病理图像分类（增殖性与非增殖性狼疮性肾炎），我们的三阶段方法采用多指标特征分析和基于蒙特卡洛丢弃法的不确定性门控机制，以决定何时基于新数据进行模型重训练。仅统计特征与训练分布相似（通过欧几里得距离、余弦相似度、马氏距离度量）且预测熵较低的图像会被纳入集成。随后，在严格的性能保障下（各项指标下降不超过5%），模型使用这些图像进行增量式重训练。在基于多中心数据集、采用ResNet18集成模型的实验中，该框架有效防止了性能退化：新增图像并未导致AUC（约0.92）或准确率（约89%）发生显著变化。该方法解决了数据偏移问题，避免了灾难性遗忘，为医学影像人工智能的持续学习提供了可行路径。

摘要 (Abstract)

Adaptive medical AI models often face performance drops in dynamic clinical environments due to data drift. We propose an autonomous continuous monitoring and data integration framework that maintains robust performance over time. Focusing on glomerular pathology image classification (proliferative vs. non-proliferative lupus nephritis), our three-stage method uses multi-metric feature analysis and Monte Carlo dropout-based uncertainty gating to decide when to retrain on new data. Only images statistically similar to the training distribution (via Euclidean, cosine, Mahalanobis metrics) and with low predictive entropy are integrated. The model is then incrementally retrained with these images under strict performance safeguards (no metric degradation >5%). In experiments with a ResNet18 ensemble on a multi-center dataset, the framework prevents performance degradation: new images were added without significant change in AUC (~0.92) or accuracy (~89%). This approach addresses data shift and avoids catastrophic forgetting, enabling sustained learning in medical imaging AI.

关键词: medical AI, continuous monitoring, data integration, data drift, glomerular pathology, Monte Carlo dropout, incremental retraining, performance robustness

204. ❌ StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding

作者: Junxi Wang, Te Sun, Jiayi Zhu, Junxian Li, Haowen Xu, Zichen Wen, Xuming Hu, Zhiyu Li, Linfeng Zhang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09000v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文StreamMeCo专注于流媒体视频理解中的视觉代理内存压缩，提出了一种基于内存图连接性的压缩框架（边缘自由最小最大采样和边缘感知权重剪枝）和时间衰减内存检索机制。所有评分关键词均与大语言模型（LLMs）、深度学习技术原理或AI在科学领域的应用直接相关，而本文的核心是计算机视觉中的视频理解和内存管理优化，未涉及任何大语言模型技术、深度学习创新方法或AI for Science的具体应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对流媒体视频理解中视觉代理内存存储开销大的问题，提出了StreamMeCo压缩框架，在70%内存压缩下实现了1.87倍检索加速和1.0%的平均准确率提升。

摘要翻译

视觉智能体记忆在流式视频理解中展现出显著效能。然而，为视频存储此类记忆会产生巨大的内存开销，导致存储与计算成本高昂。为解决这一问题，我们提出了StreamMeCo——一种高效的流式智能体记忆压缩框架。具体而言，基于记忆图（memory graph）的连通性，StreamMeCo针对孤立节点引入无边缘最小最大采样（edge-free minmax sampling），并对连通节点采用边缘感知权重剪枝（edge-aware weight pruning），在保持准确性的同时剔除冗余记忆节点。此外，我们引入了一种时间衰减记忆检索机制，以进一步消除由记忆压缩引起的性能下降。在三个具有挑战性的基准数据集（M3-Bench-robot、M3-Bench-web和Video-MME-Long）上进行的大量实验表明，在70%的记忆图压缩率下，StreamMeCo实现了1.87倍的内存检索加速，同时平均准确率提升了1.0%。我们的代码公开于https://github.com/Celina-love-sweet/StreamMeCo。

摘要 (Abstract)

Vision agent memory has shown remarkable effectiveness in streaming video understanding. However, storing such memory for videos incurs substantial memory overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introduces edge-free minmax sampling for the isolated nodes and an edge-aware weight pruning for connected nodes, evicting the redundant memory nodes while maintaining the accuracy. In addition, we introduce a time-decay memory retrieval mechanism to further eliminate the performance degradation caused by memory compression. Extensive experiments on three challenging benchmark datasets (M3-Bench-robot, M3-Bench-web and Video-MME-Long) demonstrate that under 70% memory graph compression, StreamMeCo achieves a 1.87* speedup in memory retrieval while delivering an average accuracy improvement of 1.0%. Our code is available at https://github.com/Celina-love-sweet/StreamMeCo.

关键词: Streaming Video Understanding, Agent Memory Compression, Memory Graph, Edge-free Minmax Sampling, Edge-aware Weight Pruning, Time-decay Memory Retrieval, Memory Retrieval Speedup

205. ❌ Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

作者: Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, Yahui Zhou 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08995v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文Matrix-Game 3.0专注于交互式世界模型（World Models）在视频生成中的应用，核心是扩散模型（diffusion models）而非大语言模型（LLMs）。因此，绝大多数关键词（如LLMs、MoE、Scaling Laws、Instruction Tuning、RLHF、RAG、CoT等）完全不相关，评分为0。仅三个关键词有弱相关：1. “World Models AND General World Models"高度相关（10分），因为论文明确研究世界模型；2. “Self-Correction OR Self-Improvement OR Self-Reflection"有一定关联（5分），论文提到模型学习自我校正（self-correction）；3. “Quantization OR Model Compression OR Low-bit Weights"有一定关联（5分），论文提及模型量化（model quantization）以实现高效推理。其他关键词如AI for Science等不匹配论文的计算机视觉/视频生成领域。

!!! tip deepseek-chat TL;DR

论文解决了交互式世界模型在实现长时记忆一致性和高分辨率实时视频生成方面的挑战，提出了Matrix-Game 3.0，通过数据引擎升级、长时一致性训练框架和高效推理优化，实现了720p分辨率下40 FPS的实时生成，并在分钟级序列中保持稳定的记忆一致性。

摘要翻译

随着交互式视频生成技术的发展，扩散模型日益展现出作为世界模型的潜力。然而，现有方法仍难以同时实现具备记忆能力的长时序一致性以及高分辨率实时生成，这限制了其在真实场景中的应用。为此，我们提出Matrix-Game 3.0，这是一个专为720p实时长视频生成而设计的记忆增强交互式世界模型。在Matrix-Game 2.0的基础上，我们从数据、模型和推理三个层面进行了系统性改进。首先，我们开发了升级的工业级无限数据引擎，该引擎整合了基于Unreal Engine的合成数据、从AAA游戏中大规模自动采集的数据以及真实世界视频增强数据，从而大规模生成高质量的视频-姿态-动作-提示词（Video-Pose-Action-Prompt）四元组数据。其次，我们提出了一个面向长时序一致性的训练框架：通过建模预测残差并在训练过程中重新注入不完美的生成帧，基础模型学会了自我校正；同时，相机感知的记忆检索与注入机制使基础模型能够实现长时程的时空一致性。第三，我们设计了一种基于分布匹配蒸馏（Distribution Matching Distillation, DMD）的多段自回归蒸馏策略，并结合模型量化与VAE解码器剪枝，以实现高效的实时推理。实验结果表明，Matrix-Game 3.0在5B参数规模下，能以高达40 FPS的速度实现720p分辨率的实时生成，并在长达数分钟的序列中保持稳定的记忆一致性。将模型规模扩展至2x14B进一步提升了生成质量、动态表现和泛化能力。我们的方法为构建可工业级部署的世界模型提供了一条切实可行的路径。

摘要 (Abstract)

With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.

关键词: interactive world model, real-time video generation, long-horizon memory, diffusion models, memory-augmented model, autoregressive distillation, model quantization, spatiotemporal consistency

206. ❌ ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

作者: Shifeng Liu, Zhengye Zhang, Sirui Zhao, Xinglong Mao, Zhehan Kan, Zhixiang Wei, Shiwei Wu, Chaoyou Fu, Tong Xu, Enhong Chen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08990v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文ActFER提出了一种基于多模态大语言模型（MLLMs）的代理式面部表情识别框架，核心创新在于将FER重新定义为主动视觉证据获取和多模态推理。论文高度相关于’Large Language Models OR LLMs OR Foundation Models’（10分），因为它基于MLLMs；高度相关于’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’（10分），因为它采用了视觉思维链进行推理；高度相关于’LLM Agents OR Autonomous Agents OR Agentic Workflow’（10分），因为它提出了一个代理式框架；高度相关于’Tool Use OR Function Calling OR API Tool Use’（10分），因为它动态调用工具进行人脸检测、对齐和局部区域缩放。与’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’有一定关联（5分），因为其推理过程涉及深入分析。其他关键词如MoE、SLMs、Scaling Laws、Pre-training、RLHF、RAG、Quantization等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文解决了现有基于多模态大语言模型的面部表情识别方法被动、缺乏主动感知能力的问题，提出了ActFER代理式框架，通过动态调用工具进行主动视觉证据获取和视觉思维链推理，并结合新开发的UC-GRPO强化学习算法，显著提升了面部表情识别和动作单元预测的准确性。

摘要翻译

多模态大语言模型（MLLMs）的最新进展为面部表情识别（FER）创造了新的机遇，使其从单纯的标签预测转向基于推理的情感理解。然而，现有的基于MLLM的FER方法仍遵循被动范式：它们依赖外部准备的面部输入，对固定的视觉证据进行单次推理，缺乏主动的面部感知能力。为应对这一局限，我们提出了ActFER，一个能动性框架，将FER重新定义为主动的视觉证据获取与多模态推理相结合的过程。具体而言，ActFER动态调用工具进行人脸检测与对齐，有选择性地放大信息丰富的局部区域，并通过视觉思维链（Chain-of-Thought）对面部动作单元（Action Units, AUs）和情绪进行推理。为实现此类行为，我们进一步开发了效用校准的GRPO（UC-GRPO），一种专为能动性FER设计的强化学习算法。UC-GRPO采用基于AU的多层次可验证奖励来密集化监督，通过查询条件对比效用估计实现局部审视的样本感知动态信用分配，并利用情绪感知的指数移动平均（EMA）校准来减少噪声效用估计，同时捕捉不同情绪下的审视倾向。该算法使ActFER能够学习何时进行局部审视有益以及如何对获取的证据进行推理。综合实验表明，通过UC-GRPO训练的ActFER持续优于被动的基于MLLM的FER基线方法，并显著提升了AU预测准确率。

摘要 (Abstract)

Recent advances in Multimodal Large Language Models (MLLMs) have created new opportunities for facial expression recognition (FER), moving it beyond pure label prediction toward reasoning-based affect understanding. However, existing MLLM-based FER methods still follow a passive paradigm: they rely on externally prepared facial inputs and perform single-pass reasoning over fixed visual evidence, without the capability for active facial perception. To address this limitation, we propose ActFER, an agentic framework that reformulates FER as active visual evidence acquisition followed by multimodal reasoning. Specifically, ActFER dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and reasons over facial Action Units (AUs) and emotions through a visual Chain-of-Thought. To realize such behavior, we further develop Utility-Calibrated GRPO (UC-GRPO), a reinforcement learning algorithm tailored to agentic FER. UC-GRPO uses AU-grounded multi-level verifiable rewards to densify supervision, query-conditional contrastive utility estimation to enable sample-aware dynamic credit assignment for local inspection, and emotion-aware EMA calibration to reduce noisy utility estimates while capturing emotion-wise inspection tendencies. This algorithm enables ActFER to learn both when local inspection is beneficial and how to reason over the acquired evidence. Comprehensive experiments show that ActFER trained with UC-GRPO consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy.

关键词: Agentic Facial Expression Recognition, Multimodal Large Language Models, Active Visual Evidence Acquisition, Tool-Augmented Visual Reasoning, Visual Chain-of-Thought, Utility-Calibrated GRPO, Reinforcement Learning, Action Units Prediction

207. ❌ How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms

作者: Shengji Jin, Yuanhao Zou, Victor Zhu, Zhengping Ji, Chen Chen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08966v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究视频多模态大语言模型（MLLMs）在视频时序定位（VTG）任务中的输出范式，核心涉及大模型技术（LLMs）和参数高效微调（LoRA），同时关注资源受限的边缘部署（与SLMs/On-device AI相关）和系统效率（与推理加速相关）。其他关键词如MoE、Scaling Laws、RAG、CoT等未在论文中涉及。

!!! tip deepseek-chat TL;DR

该论文通过对比三种视频时序定位输出范式（文本数字生成、时序标记生成、连续时序解码），发现连续分布范式在紧凑视觉语言模型上能实现最佳的效率-准确性权衡，为高效部署的VTG系统提供了实证指导。

摘要翻译

尽管多模态大语言模型（MLLMs）推动了视频时序定位（VTG）的发展，但现有方法通常将输出范式与不同的骨干网络、数据集和训练协议相耦合，这使得难以分离输出设计的具体影响。此外，随着VTG系统日益考虑部署于资源受限的边缘设备，输出形式与系统级效率之间的权衡需要系统性研究。本文通过一项受控实证研究，比较了三种主流的VTG输出范式：文本数字生成、时序标记生成和连续时序解码。我们在相同的紧凑视觉语言模型（包括SmolVLM2、FastVLM和Molmo2）上，使用一致的数据集和LoRA微调协议对这些范式进行评估。在Charades-STA、QVHighlights和YouCook2数据集上的评测同时衡量了定位精度和系统效率，包括推理延迟、训练吞吐量和参数量开销。我们的结果表明，输出形式的选择对定位精度和计算成本均有显著影响，且独立于模型规模。具体而言，连续分布范式在帕累托前沿上始终实现了最优的效率-精度权衡，以最小的延迟开销提供了稳健的定位性能。这些发现为设计高效、可部署的VTG系统提供了客观的实证指导。

摘要 (Abstract)

While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. This makes it challenging to isolate the specific impact of the output design. Additionally, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency requires systematic investigation. In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate these paradigms across identical compact VLMs (SmolVLM2, FastVLM, and Molmo2) using consistent datasets and LoRA fine-tuning protocols. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization accuracy and system efficiency, including inference latency, training throughput, and parameter overhead. Our results demonstrate that the choice of output formulation significantly affects both grounding accuracy and computational cost, independent of model scale. Specifically, the continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead. These findings provide objective empirical guidelines for designing efficient, deployment-ready VTG systems.

关键词: Video LLMs, Multimodal Large Language Models, Temporal Grounding, Output Paradigms, Efficiency-Accuracy Trade-off, LoRA Fine-tuning, Edge Deployment, Inference Latency

208. ❌ Dynamic Class-Aware Active Learning for Unbiased Satellite Image Segmentation

作者: Gadi Hemanth Kumar, Athira Nambiar, Pankaj Bodani 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08965v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于卫星图像语义分割中的主动学习方法，提出了一种动态类感知不确定性采样策略来解决类别不平衡问题。论文的核心是计算机视觉和机器学习中的主动学习技术，与大多数关键词（涉及大模型架构、训练方法、推理优化、对齐技术等）完全无关。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为卫星图像分析属于地球科学和环境监测的应用范畴，属于AI for Science的广义应用领域，但并非论文的核心技术焦点，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文针对卫星图像语义分割中主动学习存在的类别偏差问题，提出了一种动态类感知不确定性采样方法（DCAU-AL），通过在OpenEarth数据集上的实验证明，该方法能显著提升类别不平衡情况下的分割性能和标注效率。

摘要翻译

卫星影像语义分割在土地覆盖制图与环境监测中扮演着关键角色。然而，对大规模高分辨率卫星数据集进行标注成本高昂且耗时，尤其是在覆盖广阔地理区域时。相较于随机标注数据或穷尽式标注整个数据集，主动学习（Active Learning, AL）通过人机协同（Human-in-the-loop, HITL）智能选择信息量最大的样本进行标注，提供了一种高效替代方案，从而在保持模型高性能的同时降低标注成本。对于大规模或资源受限的卫星应用，主动学习尤为有益，因为它能够以显著更少的标注样本实现高分割精度。尽管具备这些优势，标准主动学习策略通常依赖于全局不确定性或多样性度量，且缺乏在训练过程中针对表现不佳或稀有类别进行自适应调整的能力，导致系统产生偏差。为克服这些局限，本文提出一种新颖的自适应获取函数——基于动态类别感知不确定性的主动学习（Dynamic Class-Aware Uncertainty based Active learning, DCAU-AL），该函数根据实时类别性能差距优先选择样本，从而缓解类别不平衡问题。所提出的DCAU-AL机制持续追踪每个类别的分割性能，并在整个主动学习过程中动态调整采样权重，以聚焦于表现较差或代表性不足的类别。在OpenEarth土地覆盖数据集上的大量实验表明，DCAU-AL显著优于现有主动学习方法，尤其在严重类别不平衡条件下，实现了更优的各类别交并比（IoU）并提升了标注效率。

摘要 (Abstract)

Semantic segmentation of satellite imagery plays a vital role in land cover mapping and environmental monitoring. However, annotating large-scale, high-resolution satellite datasets is costly and time consuming, especially when covering vast geographic regions. Instead of randomly labeling data or exhaustively annotating entire datasets, Active Learning (AL) offers an efficient alternative by intelligently selecting the most informative samples for annotation with the help of Human-in-the-loop (HITL), thereby reducing labeling costs while maintaining high model performance. AL is particularly beneficial for large-scale or resource-constrained satellite applications, as it enables high segmentation accuracy with significantly fewer labeled samples. Despite these advantages, standard AL strategies typically rely on global uncertainty or diversity measures and lack the adaptability to target underperforming or rare classes as training progresses, leading to bias in the system. To overcome these limitations, we propose a novel adaptive acquisition function, Dynamic Class-Aware Uncertainty based Active learning (DCAU-AL) that prioritizes sample selection based on real-time class-wise performance gaps, thereby overcoming class-imbalance issue. The proposed DCAU-AL mechanism continuously tracks the performance of the segmentation per class and dynamically adjusts the sampling weights to focus on poorly performing or underrepresented classes throughout the active learning process. Extensive experiments on the OpenEarth land cover dataset show that DCAU-AL significantly outperforms existing AL methods, especially under severe class imbalance, delivering superior per-class IoU and improved annotation efficiency.

关键词: Active Learning, Semantic Segmentation, Satellite Imagery, Class Imbalance, Dynamic Class-Aware, Uncertainty Sampling, Land Cover Mapping, Annotation Efficiency

209. ❌ Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift

作者: Harshith Kethavath, Weiming Hu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08956v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文研究视觉语言模型（CLIPSeg）在遥感卫星图像云分割任务中的适应方法，比较了提示工程与监督微调的效果。核心相关关键词：1）‘Post-training OR Supervised Fine-tuning OR SFT’（10分）- 论文核心比较了监督微调与提示方法，证明微调显著优于提示；2）‘Pre-training OR Continual Pre-training OR Domain Adaptation’（8分）- 涉及领域适应问题，从自然图像预训练到卫星图像的领域偏移；3）‘PEFT OR LoRA OR Parameter-efficient Fine-tuning’（8分）- 明确比较了全微调与低秩适应（LoRA）方法；4）‘AI for Science OR Bioinformatics OR Cheminformatics’（8分）- 应用于遥感科学（卫星图像分析），属于AI for Science范畴。其他关键词主要涉及大语言模型（LLM）的特定技术（如RLHF、RAG、推理方法等），而本文聚焦视觉语言模型（VLM）在特定领域的应用，与这些LLM技术无直接关联，故评0分。

!!! tip deepseek-chat TL;DR

该论文研究发现，在卫星图像云分割任务中，即使使用少量标注数据（如0.1%）进行监督微调，其性能也显著优于各种提示工程方法，挑战了提示作为领域适应主导范式的假设。

摘要翻译

将视觉语言模型应用于遥感影像面临一个根本性挑战：卫星数据的视觉与语言分布均远超出自然图像预训练语料库的范围。尽管如此，提示方法仍是当前主流的部署范式，其背后的假设是领域特定的语言能够引导冻结的模型表征适应专业任务。我们直接在一个分布失配尤为显著的领域——卫星影像云层分割——中检验这一假设。通过在CloudSEN12+云分割基准上使用CLIPSeg模型，我们评估了涵盖简单标签、领域术语、外观描述符和上下文线索的60种提示变体，发现所有变体的表现均低于零样本基线（0.255 mIoU），其中经过设计的提示词得分可低至0.07 mIoU。无论语言描述如何精炼，都无法弥合CLIP的自然图像表征与卫星光谱影像之间的差距。相比之下，仅使用0.1%的标注数据（约8张图像）进行监督微调即可整体超越零样本性能，而使用5-10%的数据可恢复约85%的最大可达到mIoU。全参数微调始终优于低秩适应方法0.03-0.09 mIoU，在光谱模糊类别上差距最为明显；当使用0.5%至1%的标注数据时，微调会暂时降低这些类别的性能，随后才逐步恢复，这一监督性能低谷可能被整体mIoU指标所掩盖。对于致力于将视觉语言模型适配至专业影像的研究者与实践者，我们的结果传递了一个明确信息：标注数据并非提示方法的高成本替代方案，而是值得投入的有效路径。

摘要 (Abstract)

Adapting vision-language models to remote sensing imagery presents a fundamental challenge: both the visual and linguistic distributions of satellite data lie far outside natural image pretraining corpora. Despite this, prompting remains the dominant deployment paradigm, driven by the assumption that domain-specific language can guide frozen model representations toward specialized tasks. We test this assumption directly on a domain where the mismatch is prominent: cloud segmentation for satellite imagery. Using CLIPSeg on the CloudSEN12+ cloud segmentation benchmark, we evaluate 60 prompt variants spanning simple labels, domain terminology, appearance descriptors, and contextual cues, finding that every variant underperforms the zero-shot baseline (0.255 mIoU), with engineered prompts scoring as low as 0.07 mIoU. No amount of linguistic refinement bridges the gap between CLIP’s natural image representations and satellite spectral imagery. In contrast, supervised fine-tuning with just 0.1% labeled data (~8 images) surpasses zero-shot performance overall, and 5-10% data recovers ~85% of maximum achievable mIoU. Full fine-tuning consistently outperforms low-rank adaptation by 0.03-0.09 mIoU, with the largest gaps for spectrally ambiguous classes, and at 0.5 to 1% labeled data, fine-tuning temporarily degrades performance on these classes before recovering, a supervision dip that aggregate mIoU can mask. For practitioners adapting vision-language models to specialized imagery, our results deliver a clear message: labeled data is not the expensive alternative to prompting; it is the worthwhile path.

关键词: vision-language models, domain adaptation, supervised fine-tuning, prompting, cloud segmentation, satellite imagery, low-data adaptation, CLIPSeg

210. ❌ TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches

作者: Langzhe Gu, Hung-Jui Huang, Mohamad Qadri, Michael Kaess, Wenzhen Yuan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08945v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究利用预训练的大规模视觉扩散模型作为先验，从稀疏触觉测量中进行3D重建。与关键词的相关性分析如下：1. “Pre-training OR Continual Pre-training OR Domain Adaptation” 得8分，因为论文明确使用了预训练的视觉扩散模型，并涉及知识迁移到触觉领域，属于预训练模型的应用和领域适应。2. “AI for Science OR Bioinformatics OR Cheminformatics” 得5分，因为论文属于机器人学和计算机视觉领域，虽非生物信息学或化学信息学，但涉及AI在科学应用（机器人感知），有一定关联。其他关键词（如LLMs、MoE、SFT等）均与论文内容无关，论文未涉及语言模型、专家混合、微调、推理加速、对齐等技术，因此得0分。

!!! tip deepseek-chat TL;DR

该论文提出TouchAnything框架，利用预训练的视觉扩散模型作为几何先验，从稀疏机器人触觉测量中重建准确的3D几何形状，解决了在遮挡或光照不佳条件下仅靠视觉不可靠时的物体几何估计问题。

摘要翻译

精确的物体几何估计对于机器人操作和物理交互等下游任务至关重要。尽管视觉是形状感知的主要模态，但在遮挡或复杂光照条件下其可靠性会降低。在此类场景中，触觉传感可通过物理接触提供直接的几何信息。然而，仅从稀疏的局部触觉数据重建全局三维几何本质上是一个约束不足的问题。本文提出TouchAnything框架，该框架利用预训练的大规模二维视觉扩散模型作为语义和几何先验，从稀疏触觉测量中实现三维重建。与以往训练特定类别重建网络或直接从触觉数据学习扩散模型的方法不同，我们将预训练视觉扩散模型中编码的几何知识迁移到触觉领域。给定稀疏的接触约束和物体粗略的类别级描述，我们将重建构建为一个优化问题，在强制触觉一致性的同时，引导解向符合扩散先验的形状收敛。我们的方法仅需少量触觉点即可重建精确几何，性能优于现有基线，并能实现对未见过的物体实例进行开放世界三维重建。项目页面见 https://grange007.github.io/touchanything 。

摘要 (Abstract)

Accurate object geometry estimation is essential for many downstream tasks, including robotic manipulation and physical interaction. Although vision is the dominant modality for shape perception, it becomes unreliable under occlusions or challenging lighting conditions. In such scenarios, tactile sensing provides direct geometric information through physical contact. However, reconstructing global 3D geometry from sparse local touches alone is fundamentally underconstrained. We present TouchAnything, a framework that leverages a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior for 3D reconstruction from sparse tactile measurements. Unlike prior work that trains category-specific reconstruction networks or learns diffusion models directly from tactile data, we transfer the geometric knowledge encoded in pretrained visual diffusion models to the tactile domain. Given sparse contact constraints and a coarse class-level description of the object, we formulate reconstruction as an optimization problem that enforces tactile consistency while guiding solutions toward shapes consistent with the diffusion prior. Our method reconstructs accurate geometries from only a few touches, outperforms existing baselines, and enables open-world 3D reconstruction of previously unseen object instances. Our project page is https://grange007.github.io/touchanything .

关键词: 3D reconstruction, tactile sensing, diffusion model, robotic manipulation, sparse contacts, geometric prior, optimization, open-world reconstruction

211. ❌ MASS: Mesh-inellipse Aligned Deformable Surfel Splatting for Hand Reconstruction and Rendering from Egocentric Monocular Video

作者: Haoyu Zhu, Yi Zhang, Lei Yao, Lap-pui Chau, Yi Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08943v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉和图形学领域，研究从单目视频进行3D手部重建和渲染的具体技术方法（MASS方法），涉及可变形2D高斯面元表示、网格对齐、训练策略等。所有评分关键词均与大模型、深度学习技术原理、AI科学应用等主题相关，但论文内容完全不涉及这些关键词所描述的技术、方法或应用领域。论文的核心是3D重建、渲染优化和特定数据集评估，与评分关键词列表中的任何主题都没有直接或间接关联。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为MASS的新方法，用于从单目视频中重建高保真3D手部模型并实现逼真渲染，通过可变形2D高斯面元表示和改进的训练策略，在多个数据集上取得了优于现有方法的重建性能。

摘要翻译

从单目第一人称视频中重建高保真度的3D手部模型仍面临挑战，这主要源于高分辨率几何细节捕捉、手-物体交互以及手部复杂物体处理的局限性。此外，现有方法通常计算成本高昂，难以应用于实时场景。本研究提出网格内切椭圆对齐可变形曲面元点渲染方法（Mesh-inellipse Aligned deformable Surfel Splatting，简称MASS），通过利用可变形二维高斯曲面元表示来解决上述问题。我们首先引入网格对齐的施泰纳内切椭圆与分形致密化技术，实现从粗糙参数化手部网格到曲面元的转换，从而从基础网格生成高分辨率二维高斯曲面元，为表面表示提供具备照片级真实感渲染潜力的基础。其次，我们提出高斯曲面元变形方法，通过预测曲面元属性的残差更新并引入不透明度掩码来优化几何与纹理，无需依赖自适应密度控制，即可高效建模手部形变与个性化特征。此外，我们设计了两阶段训练策略与新型绑定损失函数，以提升优化鲁棒性与重建质量。在ARCTIC数据集、手部外观数据集及Interhand2.6M数据集上的大量实验表明，相较于现有先进方法，我们的模型实现了更优的重建性能。

摘要 (Abstract)

Reconstructing high-fidelity 3D hands from egocentric monocular videos remains a challenge due to the limitations in capturing high-resolution geometry, hand-object interactions, and complex objects on hands. Additionally, existing methods often incur high computational costs, making them impractical for real-time applications. In this work, we propose Mesh-inellipse Aligned deformable Surfel Splatting (MASS) to address these challenges by leveraging a deformable 2D Gaussian Surfel representation. We introduce the mesh-aligned Steiner Inellipse and fractal densification for mesh-to-surfel conversion that initiates high-resolution 2D Gaussian surfels from coarse parametric hand meshes, providing surface representation with photorealistic rendering potential. Second, we propose Gaussian Surfel Deformation, which enables efficient modeling of hand deformations and personalized features by predicting residual updates to surfel attributes and introducing an opacity mask to refine geometry and texture without adaptive density control. In addition, we propose a two-stage training strategy and a novel binding loss to improve the optimization robustness and reconstruction quality. Extensive experiments on the ARCTIC dataset, the Hand Appearance dataset, and the Interhand2.6M dataset demonstrate that our model achieves superior reconstruction performance compared to state-of-the-art methods.

关键词: hand reconstruction, egocentric monocular video, deformable surfel splatting, 2D Gaussian surfels, mesh-to-surfel conversion, photorealistic rendering, ARCTIC dataset, Interhand2.6M dataset

212. ❌ Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion

作者: Zengyi Yang, Yu Liu, Juan Cheng, Zhiqin Zhu, Yafei Zhang, Huafeng Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08924v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于红外-可见光图像融合的计算机视觉任务，提出了一种闭环动态网络（CLDyN）用于多任务自适应融合。论文内容涉及深度学习网络架构设计、自适应机制和任务性能优化，但完全不涉及大语言模型（LLM）、大模型技术原理、AI for Science应用或任何评分关键词中的具体技术（如MoE、RLHF、RAG等）。所有关键词均与大模型、语言模型或AI科学应用相关，而本文是纯粹的计算机视觉图像处理研究，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对红外-可见光图像融合方法难以同时适应多个下游任务的问题，提出了一种闭环动态网络（CLDyN），通过需求驱动的语义补偿模块实现任务定制化融合，实验表明该方法在保持高融合质量的同时具有强大的多任务适应性。

摘要翻译

红外-可见光图像融合旨在整合互补信息以实现鲁棒的视觉理解，但现有融合方法难以同时适应多种下游任务。为解决此问题，我们提出了一种闭环动态网络（Closed-Loop Dynamic Network, CLDyN），能够自适应响应不同下游任务的语义需求，实现任务定制化的图像融合。具体而言，CLDyN引入闭环优化机制，通过建立语义传输链，借助需求驱动的语义补偿（Requirement-driven Semantic Compensation, RSC）模块实现从下游任务到融合网络的显式反馈。RSC模块利用基向量库（Basis Vector Bank, BVB）和架构自适应的语义注入（Architecture-Adaptive Semantic Injection, A2SI）块，根据任务需求定制网络架构，从而实现任务特定的语义补偿，使融合网络无需重新训练即可主动适应多样化任务。为促进语义补偿，我们引入奖惩策略，根据任务性能变化对RSC模块进行奖励或惩罚。在M3FD、FMB和VT5000数据集上的实验表明，CLDyN不仅能保持高融合质量，还展现出强大的多任务适应能力。代码发布于https://github.com/YR0211/CLDyN。

摘要 (Abstract)

Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability. The code is available at https://github.com/YR0211/CLDyN.

关键词: Infrared-visible image fusion, Multi-task adaptation, Closed-loop dynamic network, Task-customized fusion, Semantic compensation, Adaptive architecture, Reward-penalty strategy, Visual understanding

213. ❌ M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model

作者: Yihang Liu, Ying Wen, Jiaxiong Yang, Longzhen Yang, Lianghua He, Heng Tao Shen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08936v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文提出M-IDoL，一种医学基础模型，核心创新在于通过信息分解和MoE架构学习模态特定和多样化的表示。高度相关的关键词包括：1) ‘Foundation Models’（论文研究医学基础模型）；2) ‘Mixture of Experts’（使用MoE实现模态分离）；3) ‘Pre-training’（在115万医学图像上预训练）；4) ‘AI for Science’（应用于医学成像的AI）。其他关键词如LLM推理、对齐、压缩等与论文的医学图像表示学习主题无关。

!!! tip deepseek-chat TL;DR

论文解决了医学基础模型中多模态表示模糊导致模态特异性和多样性退化的问题，提出M-IDoL模型通过信息分解和MoE架构学习模态特定和多样化的表示，在21个下游临床任务上优于20个现有基础模型。

摘要翻译

医学基础模型旨在从多模态医学图像中学习通用表征，使其能够有效泛化至多样化的下游临床任务。然而，现有大多数医学基础模型存在信息模糊性问题，即将多模态表征混合于单一嵌入空间中，导致模态特异性和多样性的退化。本文提出M-IDoL，一种自监督医学基础模型，其通过信息分解进行多模态表征学习，具体实现基于两个目标：其一，通过将多模态表征分散至可分离的专家混合模型子空间中，最大化模态间熵，以实现跨模态的表征特异性；其二，通过在每个专家混合模型子空间内执行细粒度语义判别，最小化模态内不确定性，以丰富单模态的表征多样性。基于115万张医学图像的预训练，M-IDoL展现出两方面优势：其一，在涵盖五种成像模态（如X射线、眼底、光学相干断层扫描、皮肤镜及病理学）的21项下游临床任务中表现出卓越的泛化能力，其性能超越20个现有基础模型；其二，能够学习具有模态特异性且多样化的表征，在跨模态特征簇间呈现更清晰的分离性，并在各模态内部实现更细粒度的特征判别。

摘要 (Abstract)

Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blend multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised \underline{\textit{M}}FM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximize inter-modality entropy by dispersing multimodal representation into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (e.g., X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature cluster across modalities and finer-grained feature discrimination within each modality.

关键词: Medical Foundation Models, Information Decomposition, Mixture of Experts, Multimodal Representation Learning, Self-supervised Learning, Modality Specificity, Medical Imaging, Clinical Tasks

214. ❌ Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios

作者: Yu Shi, Yu Liu, Zhong-Cheng Wu, Juan Cheng, Huafeng Li, Xun Chen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08922v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的多模态图像融合任务，提出了一种基于扩散模型的图像融合框架。论文内容涉及图像处理、扩散模型、去噪、图像融合等技术，但完全不涉及大语言模型（LLMs）、深度学习技术原理创新、或大模型在不同领域的应用。所有评分关键词均与大语言模型、模型训练、推理优化、对齐、代理系统等大模型相关技术相关，而本文研究的是纯计算机视觉问题，与这些关键词无任何关联。

!!! tip deepseek-chat TL;DR

该论文提出了一种高效的退化感知扩散框架，用于解决复杂退化场景下的多模态图像融合问题，通过隐式去噪和联合观测模型校正机制实现了在有限步骤内的高精度融合。

摘要翻译

噪声、模糊与低分辨率等复杂退化现象是现实世界图像融合任务中的典型挑战，限制了现有方法的性能与实用性。基于端到端神经网络的方法通常设计简单且推理高效，但其黑箱特性导致可解释性有限。基于扩散模型的方法通过提供强大的生成先验和更具结构化的推理过程，在一定程度上缓解了这一问题。然而，这类方法被训练用于学习单一域的目标分布，而图像融合任务缺乏天然的融合数据，且依赖于对多源互补信息的建模，这使得扩散模型难以直接应用于实践。为应对这些挑战，本文提出一种高效的、感知退化的扩散框架，用于任意退化场景下的图像融合。具体而言，与传统扩散模型显式预测噪声不同，我们的方法通过直接回归融合图像进行隐式去噪，从而能以有限步骤灵活适应复杂退化下的多种融合任务。此外，我们设计了一种联合观测模型校正机制，在采样过程中同时施加退化约束与融合约束，以确保高重建精度。在多种融合任务与退化配置上的实验表明，所提方法在复杂退化场景下具有优越性。

摘要 (Abstract)

Complex degradations like noise, blur, and low resolution are typical challenges in real world image fusion tasks, limiting the performance and practicality of existing methods. End to end neural network based approaches are generally simple to design and highly efficient in inference, but their black-box nature leads to limited interpretability. Diffusion based methods alleviate this to some extent by providing powerful generative priors and a more structured inference process. However, they are trained to learn a single domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation aware diffusion framework for image fusion under arbitrary degradation scenarios. Specifically, instead of explicitly predicting noise as in conventional diffusion models, our method performs implicit denoising by directly regressing the fused image, enabling flexible adaptation to diverse fusion tasks under complex degradations with limited steps. Moreover, we design a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy. Experiments on diverse fusion tasks and degradation configurations demonstrate the superiority of the proposed method under complex degradation scenarios.

关键词: image fusion, diffusion models, degradation-aware, multimodal, denoising, reconstruction accuracy, joint observation model, arbitrary degradation scenarios

215. ❌ ANTIC: Adaptive Neural Temporal In-situ Compressor

作者: Sandeep S. Cranganore, Andrei Bodnar, Gianluca Galleti, Fabian Paischer, Johannes Brandstetter 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09543v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文ANTIC专注于使用神经网络压缩高维物理模拟数据，属于AI在科学计算领域的应用，与"AI for Science"关键词有一定关联（5分），但未涉及大语言模型、深度学习技术原理创新或其他关键词，因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

论文提出ANTIC自适应神经时序原位压缩器，通过自适应时序选择和空间神经压缩模块，解决了高维物理模拟数据存储需求过大的问题，实现了数个数量级的存储减少同时保持物理精度。

摘要翻译

受大规模高维偏微分方程支配的高分辨率时空演化场的持久存储需求已达到拍字节至艾字节量级。模拟纳维-斯托克斯方程、磁流体动力学、等离子体物理或双黑洞合并的瞬态仿真所产生的数据量，已对现代高性能计算基础设施构成严峻挑战。为应对这一瓶颈，我们提出ANTIC（自适应神经原位时序压缩器）——一种端到端的原位压缩流程。ANTIC包含两个核心组件：专为高维物理系统设计的自适应时序选择器，可在仿真运行时识别并筛选信息丰富的快照；以及基于持续微调的空间神经压缩模块，该模块通过神经场学习相邻快照间的残差更新。通过单次流式处理，ANTIC实现了时空维度的联合压缩，有效避免了对完整时间演化轨迹的显式磁盘存储需求。实验结果表明，该方法在实现数个数量级存储缩减的同时，仍能保持与物理精度的高度关联。

摘要 (Abstract)

The persistent storage requirements for high-resolution, spatiotemporally evolving fields governed by large-scale and high-dimensional partial differential equations (PDEs) have reached the petabyte-to-exabyte scale. Transient simulations modeling Navier-Stokes equations, magnetohydrodynamics, plasma physics, or binary black hole mergers generate data volumes that are prohibitive for modern high-performance computing (HPC) infrastructures. To address this bottleneck, we introduce ANTIC (Adaptive Neural Temporal in situ Compressor), an end-to-end in situ compression pipeline. ANTIC consists of an adaptive temporal selector tailored to high-dimensional physics that identifies and filters informative snapshots at simulation time, combined with a spatial neural compression module based on continual fine-tuning that learns residual updates between adjacent snapshots using neural fields. By operating in a single streaming pass, ANTIC enables a combined compression of temporal and spatial components and effectively alleviates the need for explicit on-disk storage of entire time-evolved trajectories. Experimental results demonstrate how storage reductions of several orders of magnitude relate to physics accuracy.

关键词: neural compression, in situ compression, adaptive temporal selector, physics simulation, high-dimensional PDEs, storage reduction, neural fields, continual fine-tuning

216. ❌ TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction

作者: Ao Li, Yonggen Ling, Yiyang Lin, Yuji Wang, Yong Deng, Yansong Tang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08921v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文TAIHRI提出了一种用于近距离人机交互的视觉语言模型，专注于3D人体关键点定位和任务感知感知。虽然论文涉及视觉语言模型（VLM），但所有评分关键词均针对大语言模型（LLM）的技术原理、训练方法、推理优化、对齐、应用范式等具体方面，而论文的核心是计算机视觉和机器人感知任务，未涉及任何LLM相关技术、方法或应用。因此，所有关键词均得0分。

!!! tip deepseek-chat TL;DR

论文提出了一种任务感知的视觉语言模型TAIHRI，用于在近距离人机交互中精确局部化任务相关的3D人体关键点，并通过实验验证了其在关键身体部位估计精度上的优越性。

摘要翻译

精确的三维人体关键点定位是实现机器人与用户自然、安全物理交互的关键技术。传统的三维人体关键点估计方法主要关注相对于根关节的整体重建质量。然而，在实际的人机交互（HRI）场景中，机器人更关心在自我中心相机三维坐标系下，与任务相关的身体部位在精确度量尺度上的空间定位。我们提出了TAIHRI，这是首个为近距离HRI感知定制的视觉-语言模型（Vision-Language Model, VLM），它能够理解用户的动作指令，并将机器人的注意力引导至与任务最相关的关键点。通过将三维关键点量化到一个有限的交互空间中，TAIHRI通过下一词元预测进行二维关键点推理，从而精确定位关键身体部位的三维空间坐标，并能无缝适应下游任务，如自然语言控制或全局空间人体网格恢复。在自我中心交互基准测试上的实验表明，TAIHRI在对任务至关重要的身体部位上实现了卓越的估计精度。我们相信TAIHRI为具身人机交互领域开辟了新的研究途径。代码发布于：https://github.com/Tencent/TAIHRI。

摘要 (Abstract)

Accurate 3D human keypoints localization is a critical technology enabling robots to achieve natural and safe physical interaction with users. Conventional 3D human keypoints estimation methods primarily focus on the whole-body reconstruction quality relative to the root joint. However, in practical human-robot interaction (HRI) scenarios, robots are more concerned with the precise metric-scale spatial localization of task-relevant body parts under the egocentric camera 3D coordinate. We propose TAIHRI, the first Vision-Language Model (VLM) tailored for close-range HRI perception, capable of understanding users’ motion commands and directing the robot’s attention to the most task-relevant keypoints. By quantizing 3D keypoints into a finite interaction space, TAIHRI precisely localize the 3D spatial coordinates of critical body parts by 2D keypoint reasoning via next token prediction, and seamlessly adapt to downstream tasks such as natural language control or global space human mesh recovery. Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts. We believe TAIHRI opens new research avenues in the field of embodied human-robot interaction. Code is available at: https://github.com/Tencent/TAIHRI.

关键词: 3D human keypoints localization, Vision-Language Model, human-robot interaction, task-aware perception, egocentric camera, metric-scale spatial localization, next token prediction, embodied interaction

217. ❌ Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL

作者: Igor Jankowski 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09523v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于多智能体强化学习（MARL）在网络安全防御中的应用，提出了Continuous-Time Graph MARL（CT-GMARL）框架来解决异步、连续时间的部分可观测半马尔可夫决策过程（POSMDP）。论文的核心是MARL和网络防御模拟，与大多数关键词（如LLMs、MoE、Scaling Laws、Pre-training等）完全无关。唯一相关的关键词是’Multi-agent Systems OR Agent Coordination’，因为论文明确研究多智能体系统在网络安全中的协调防御，评分为10分（高度相关，核心内容）。其他关键词如’AI for Science’可能间接相关（网络安全可视为应用领域），但论文未明确提及科学领域应用，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文解决了多智能体强化学习（MARL）在模拟网络防御中因Sim2Real差距导致的策略迁移瓶颈问题，通过提出NetForge_RL模拟器和Continuous-Time Graph MARL（CT-GMARL）框架，在零样本转移中实现了比基线方法更高的奖励和更有效的服务恢复。

摘要翻译

多智能体强化学习（MARL）策略从模拟网络兵棋推演向实战安全运营中心（SOC）的迁移，根本上受限于仿真到现实的差距。传统模拟器抽象了网络协议物理细节，依赖同步时钟周期，并提供洁净的状态向量而非真实、含噪声的遥测数据。为突破这些局限，我们提出NetForge_RL：一个高保真网络攻防模拟器，将网络防御重构为异步连续时间部分可观测半马尔可夫决策过程（POSMDP）。NetForge强制实施零信任网络访问（ZTNA）约束，并要求防御方处理自然语言编码的安全信息与事件管理（SIEM）遥测数据。其核心创新在于通过双模式引擎原生弥合仿真到现实差距，既支持在模拟管理程序中实现高吞吐量的MARL训练，又可在Docker管理程序中对实时攻击进行零样本评估。为应对这一连续时间POSMDP，我们提出连续时间图多智能体强化学习（CT-GMARL），利用固定步长神经常微分方程（ODE）处理非均匀采样的告警数据。我们在离散基线方法（R-MAPPO、QMIX）上评估本框架。实验结果表明，CT-GMARL实现收敛后的蓝队中位奖励达57,135分——较R-MAPPO提升2.0倍，较QMIX提升2.1倍。关键的是，CT-GMARL通过避免“焦土策略”失效模式（即通过摧毁网络效用简单化降低风险），恢复的受损服务数量比最强基线多12倍。在向实时Docker环境的零样本迁移中，CT-GMARL策略获得中位奖励98,026分，验证了仿真到现实桥梁的有效性。

摘要 (Abstract)

The transition of Multi-Agent Reinforcement Learning (MARL) policies from simulated cyber wargames to operational Security Operations Centers (SOCs) is fundamentally bottlenecked by the Sim2Real gap. Legacy simulators abstract away network protocol physics, rely on synchronous ticks, and provide clean state vectors rather than authentic, noisy telemetry. To resolve these limitations, we introduce NetForge_RL: a high-fidelity cyber operations simulator that reformulates network defense as an asynchronous, continuous-time Partially Observable Semi-Markov Decision Process (POSMDP). NetForge enforces Zero-Trust Network Access (ZTNA) constraints and requires defenders to process NLP-encoded SIEM telemetry. Crucially, NetForge bridges the Sim2Real gap natively via a dual-mode engine, allowing high-throughput MARL training in a mock hypervisor and zero-shot evaluation against live exploits in a Docker hypervisor. To navigate this continuous-time POSMDP, we propose Continuous-Time Graph MARL (CT-GMARL), utilizing fixed-step Neural Ordinary Differential Equations (ODEs) to process irregularly sampled alerts. We evaluate our framework against discrete baselines (R-MAPPO, QMIX). Empirical results demonstrate that CT-GMARL achieves a converged median Blue reward of 57,135 - a 2.0x improvement over R-MAPPO and 2.1x over QMIX. Critically, CT-GMARL restores 12x more compromised services than the strongest baseline by avoiding the “scorched earth” failure mode of trivially minimizing risk by destroying network utility. On zero-shot transfer to the live Docker environment, CT-GMARL policies achieve a median reward of 98,026, validating the Sim2Real bridge.

关键词: Multi-Agent Reinforcement Learning, Cyber Defense, Sim2Real Gap, Continuous-Time Graph MARL, NetForge_RL, Asynchronous POSMDP, Zero-Trust Network Access, Neural ODEs

218. ❌ Toward World Models for Epidemiology

作者: Zeeshan Memon, Yiqi Su, Christo Kurisummoottil Thomas, Walid Saad, Liang Zhao, Naren Ramakrishnan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09519v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文聚焦于流行病学中的世界模型（World Models）应用，与关键词’World Models AND General World Models’高度相关（10分），因为这是论文的核心概念框架。同时，论文属于AI在科学领域的应用，与’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（8分），但未涉及生物信息学或化学信息学的具体技术。其他关键词主要涉及大模型技术原理、训练方法、推理优化等，论文未讨论这些具体技术，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出将流行病学建模为受控、部分可观测的动态系统，并引入流行病学世界模型的概念框架，以解决政策相关推理中的潜在疾病负担、不完美监测信号和干预效果等问题。

摘要翻译

世界模型已成为学习潜在动态、模拟反事实未来以及在不确定性下支持规划的统一范式。本文认为，计算流行病学是世界模型一个天然且尚未充分开发的适用领域。这是因为流行病决策需要对潜在疾病负担、不完善且依赖于政策的监测信号进行推理，且干预效果通过适应性人类行为传递。我们提出了一个流行病学世界模型的概念框架，将流行病建模为受控、部分可观测的动态系统，其中（i）真实的流行病状态是潜在的，（ii）观测数据存在噪声且内生于政策，（iii）干预措施作为序列行动，其效果通过行为与社会反馈传播。我们通过三个案例研究阐明为何显式的世界建模对于政策相关推理不可或缺：行为监测中的策略性误报、住院与死亡等滞后信号中的系统性延迟，以及在相同历史条件下不同行动序列会导致结果分化的反事实干预分析。

摘要 (Abstract)

World models have emerged as a unifying paradigm for learning latent dynamics, simulating counterfactual futures, and supporting planning under uncertainty. In this paper, we argue that computational epidemiology is a natural and underdeveloped setting for world models. This is because epidemic decision-making requires reasoning about latent disease burden, imperfect and policy-dependent surveillance signals, and intervention effects are mediated by adaptive human behavior. We introduce a conceptual framework for epidemiological world models, formulating epidemics as controlled, partially observed dynamical systems in which (i) the true epidemic state is latent, (ii) observations are noisy and endogenous to policy, and (iii) interventions act as sequential actions whose effects propagate through behavioral and social feedback. We present three case studies that illustrate why explicit world modeling is necessary for policy-relevant reasoning: strategic misreporting in behavioral surveillance, systematic delays in time-lagged signals such as hospitalizations and deaths, and counterfactual intervention analysis where identical histories diverge under alternative action sequences.

关键词: World Models, Epidemiology, Computational Epidemiology, Latent Dynamics, Policy-relevant Reasoning, Counterfactual Intervention, Partially Observed Dynamical Systems, Behavioral Feedback

219. ❌ Integrated electro-optic attention nonlinearities for transformers

作者: Luis Mickeler, Kai Lion, Alfonso Nardi, Jost Kellner, Pierre Didier, Bhavin J. Shastri, Niao He, Rachel Grange 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09512v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出使用薄膜铌酸锂马赫-曾德尔调制器作为模拟非线性计算单元，替代数字Softmax和Sigmoid函数，以显著降低Transformer模型（包括视觉Transformer和大语言模型）中非线性计算的延迟。这直接与’Speculative Decoding OR Inference Acceleration’高度相关（10分），因为核心目标是加速推理。同时，论文评估了在模拟单元上进行4位输入输出量化时的性能，与’Quantization OR Model Compression OR Low-bit Weights’相关（8分）。论文明确提到在大型语言模型上评估其系统，因此与’Large Language Models OR LLMs OR Foundation Models’有一定关联（8分）。其他关键词如MoE、训练方法、对齐、RAG、思维链、智能体等均未在论文中涉及，故得0分。

!!! tip deepseek-chat TL;DR

该研究提出使用薄膜铌酸锂调制器作为模拟非线性计算单元来替代数字Softmax和Sigmoid，以大幅降低Transformer模型（包括视觉Transformer和大语言模型）的推理延迟，同时在4位量化下保持高精度。

摘要翻译

Transformer已成为主流的神经网络架构，在语言处理和计算机视觉领域实现了最先进的性能。这些模型的核心在于注意力机制，该机制需要使用Softmax函数进行非线性、非负映射。然而，尽管Softmax运算仅占总运算量的不到1%，却可能不成比例地成为整体推理延迟的瓶颈。本文采用薄膜铌酸锂（TFLN）马赫-曾德尔调制器（MZMs）作为模拟非线性计算单元，以大幅降低非线性计算的延迟。我们实现了数字Softmax和Sigmoid的电光替代方案，并在视觉Transformer和大型语言模型中评估其性能。即使在模拟单元采用激进的4比特输入-输出量化条件下，我们的系统仍保持极具竞争力的准确度。我们进一步在高达10 GBaud的编码速度下表征系统噪声，并评估了不同噪声条件下的模型鲁棒性。研究结果表明，TFLN调制器可作为混合共封装硬件中的非线性功能单元，为实现高速、高能效的非线性计算提供了可能。

摘要 (Abstract)

Transformers have emerged as the dominant neural-network architecture, achieving state-of-the-art performance in language processing and computer vision. At the core of these models lies the attention mechanism, which requires a nonlinear, non-negative mapping using the Softmax function. However, although Softmax operations account for less than 1% of the total operation count, they can disproportionately bottleneck overall inference latency. Here, we use thin-film lithium niobate (TFLN) Mach-Zehnder modulators (MZMs) as analog nonlinear computational elements to drastically reduce the latency of nonlinear computations. We implement electro-optic alternatives to digital Softmax and Sigmoid, and evaluate their performance in Vision Transformers and Large Language Models. Our system maintains highly competitive accuracy, even under aggressive 4-bit input-output quantization of the analog units. We further characterize system noise at encoding speeds up to 10 GBaud and assess model robustness under various noise conditions. Our findings suggest that TFLN modulators can serve as nonlinear function units within hybrid co-packaged hardware, enabling high-speed and energy-efficient nonlinear computation.

关键词: Transformers, attention mechanism, Softmax, thin-film lithium niobate, Mach-Zehnder modulators, inference latency, analog nonlinear computation, quantization

220. ❌ Sim-to-Real Transfer for Muscle-Actuated Robots via Generalized Actuator Networks

作者: Jan Schneider, Mridul Mahajan, Le Chen, Simon Guist, Bernhard Schölkopf, Ingmar Posner, Dieter Büchler 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09487v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于机器人控制领域，研究肌腱驱动、肌肉驱动机器人的仿真到现实迁移问题，使用神经网络建模复杂驱动系统。所有评分关键词均涉及大语言模型、深度学习技术原理、AI科学应用等特定领域，而本论文完全不涉及这些内容，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文解决了肌腱驱动、肌肉驱动机器人因非线性、摩擦和迟滞特性导致的仿真到现实迁移难题，提出了广义执行器网络（GeAN）方法，首次成功实现了四自由度肌肉驱动机器人手臂的仿真到现实策略迁移。

摘要翻译

肌腱驱动与软体肌肉致动相结合，能够实现更快速、更安全的机器人系统，并可能加速技能习得。然而，由于固有的非线性、摩擦和迟滞特性，这些系统的建模与控制变得复杂，导致其在实际应用中仍较少使用。迄今为止，这些挑战阻碍了从仿真到真实系统的策略迁移。为弥合这一差距，我们提出一种仿真到现实的流程，通过学习该复杂致动系统的神经网络模型，并利用成熟的刚体仿真技术处理手臂动力学及与环境的交互。我们提出的方法称为广义致动器网络（Generalized Actuator Network, GeAN），它能够直接从关节位置轨迹（而非依赖扭矩传感器）进行学习，从而实现对多种机器人致动模型的识别。在由气动人工肌肉驱动的肌腱驱动机器人PAMY2上应用GeAN，我们成功部署了完全在仿真环境中训练的高精度目标抵达及动态“杯中球”任务策略。据我们所知，该成果首次实现了四自由度肌肉致动机器人手臂从仿真到现实的成功迁移。

摘要 (Abstract)

Tendon drives paired with soft muscle actuation enable faster and safer robots while potentially accelerating skill acquisition. Still, these systems are rarely used in practice due to inherent nonlinearities, friction, and hysteresis, which complicate modeling and control. So far, these challenges have hindered policy transfer from simulation to real systems. To bridge this gap, we propose a sim-to-real pipeline that learns a neural network model of this complex actuation and leverages established rigid body simulation for the arm dynamics and interactions with the environment. Our method, called Generalized Actuator Network (GeAN), enables actuation model identification across a wide range of robots by learning directly from joint position trajectories rather than requiring torque sensors. Using GeAN on PAMY2, a tendon-driven robot powered by pneumatic artificial muscles, we successfully deploy precise goal-reaching and dynamic ball-in-a-cup policies trained entirely in simulation. To the best of our knowledge, this result constitutes the first successful sim-to-real transfer for a four-degrees-of-freedom muscle-actuated robot arm.

关键词: sim-to-real transfer, muscle-actuated robots, tendon-driven robots, pneumatic artificial muscles, Generalized Actuator Network, neural network model, policy transfer, robot control

221. ❌ An Open-Source, Open Data Approach to Activity Classification from Triaxial Accelerometry in an Ambulatory Setting

作者: Sepideh Nikookar, Edward Tian, Harrison Hoffman, Matthew Parks, J. Lucas McKay, Yashar Kiarashi, Tommy T. Thomas, Alex Hall, David W. Wright, Gari D. Clifford 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09451v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究基于三轴加速度计的活动分类，使用传统信号处理和卷积神经网络（CNN）方法，未涉及任何大语言模型（LLM）、深度学习技术原理创新或大模型在不同领域的应用。所有关键词均与大模型、深度学习技术或相关创新无关，仅“AI for Science OR Bioinformatics OR Cheminformatics”因涉及医疗健康监测（可视为科学应用的一个子领域）而获得5分（有一定关联），但论文未使用大模型或深度学习创新技术，核心是传统CNN和信号处理。

!!! tip deepseek-chat TL;DR

该研究开发了一个开源数据集和代码，使用三轴加速度计和卷积神经网络对五种日常活动（躺、坐、站、走、跑）进行分类，实现了二元活动分类F1分数0.79和多类分类F1分数0.83。

摘要翻译

加速度计已成为一种几乎无处不在的设备，其在健康监测领域的应用潜力远超步数统计或基于15-60秒时段的平均能量估算。目的：开发一个开源数据集及配套开源代码，用于处理50 Hz三轴加速度计数据，以对患者活动水平及自然运动类型进行分类。方法：使用一款包含三轴加速度计和同步II导联等效心电图（ECG）的移动设备，采集了23名年龄在23至62岁之间的健康受试者（16名男性，7名女性）的数据，每人平均采集26分钟。参与者遵循一套标准化的活动流程，包含五种不同活动：躺、坐、站、走和慢跑。构建了两种分类器：一种基于信号处理技术以区分高、低活动水平；另一种基于卷积神经网络（Convolutional Neural Network, CNN）对五种活动进行分类。主要结果：二分类（高/低）活动分类器的F1分数为0.79。基于CNN的多分类器的F1分数为0.83。本分析所使用的代码以及用于分类器训练和测试的数据均已依据开源许可公开提供。意义：如本研究所示，对行为活动的分类为解读传统健康指标提供了有价值的背景信息，并可能为未来开发用于患者监测、预测分析和个性化健康干预的临床决策支持工具提供情境信息。

摘要 (Abstract)

The accelerometer has become an almost ubiquitous device, providing enormous opportunities in healthcare monitoring beyond step counting or other average energy estimates in 15-60 second epochs. Objective: To develop an open data set with associated open-source code for processing 50 Hz tri-axial accelerometry-based to classify patient activity levels and natural types of movement. Approach: Data were collected from 23 healthy subjects (16 males and seven females) aged between 23 and 62 years using an ambulatory device, which included a triaxial accelerometer and synchronous lead II equivalent ECG for an average of 26 minutes each. Participants followed a standardized activity routine involving five distinct activities: lying, sitting, standing, walking, and jogging. Two classifiers were constructed: a signal processing technique to distinguish between high and low activity levels and a convolutional neural network (CNN)-based approach to classify each of the five activities. Main results: The binary (high/low) activity classifier exhibited an F1 score of 0.79. The multi-class CNN-based classifier provided an F1 score of 0.83. The code for this analysis has been made available under an open-source license together with the data on which the classifiers were trained and tested. Significance: The classification of behavioral activity, as demonstrated in this study, offers valuable context for interpreting traditional health metrics and may provide contextual information to support the future development of clinical decision-making tools for patient monitoring, predictive analytics, and personalized health interventions.

关键词: accelerometry, activity classification, convolutional neural network, open-source, healthcare monitoring, triaxial accelerometer, signal processing, ambulatory setting

222. ❌ Continuous Orthogonal Mode Decomposition: Haptic Signal Prediction in Tactile Internet

作者: Mohammad Ali Vahedifar, Mojtaba Nazari, Qi Zhang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09446v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究触觉互联网中的触觉信号预测，提出了一种基于连续正交模态分解的神经网络架构。虽然论文涉及神经网络和深度学习技术，但所有关键词都专门针对大语言模型（LLMs）及其相关技术（如微调、对齐、推理优化、代理系统等），而本文专注于触觉信号处理的特定神经网络架构，未涉及任何大语言模型、基础模型或相关技术。关键词中的’AI for Science’虽然范围较广，但论文的触觉互联网应用不属于生物信息学或化学信息学等典型科学AI领域，因此相关性为0。

!!! tip deepseek-chat TL;DR

该论文针对触觉互联网中高延迟和丢包导致的触觉控制不稳定问题，提出了一种基于连续正交模态分解的双向预测神经网络架构，实现了98.6%和97.3%的高预测精度以及0.065ms的超低推理延迟。

摘要翻译

触觉互联网要求亚毫秒级延迟与超高可靠性，因为高延迟或数据包丢失可能导致触觉控制失稳。为此，我们提出模式域架构——一种双向预测神经网络架构，旨在恢复人类操作端与机器人端的缺失信号。与传统模型从原始数据中隐式提取特征不同，MDA采用了一种新颖的连续正交模式分解框架。通过引入正交性约束，我们克服了现有先进分解方法中普遍存在的“模式混叠”问题。实验结果表明，这种结构化特征提取实现了高达98.6%（人类端）和97.3%（机器人端）的预测精度。此外，该模型实现了0.065毫秒的超低推理延迟，显著优于现有基准模型，满足触觉遥操作严格的实时性要求。

摘要 (Abstract)

The Tactile Internet demands sub-millisecond latency and ultra-high reliability, as high latency or packet loss could lead to haptic control instability. To address this, we propose the Mode-Domain Architecture (MDA), a bilateral predictive neural network architecture designed to restore missing signals on both the human and robot sides. Unlike conventional models that extract features implicitly from raw data, MDA utilizes a novel Continuous-Orthogonal Mode Decomposition framework. By integrating an orthogonality constraint, we overcome the pervasive issue of “mode overlapping” found in state-of-the-art decomposition methods. Experimental results demonstrate that this structured feature extraction achieves high prediction accuracies of 98.6% (human) and 97.3% (robot). Furthermore, the model achieves ultra-low inference latency of 0.065 ms, significantly outperforming existing benchmarks and meeting the stringent real-time requirements of haptic teleoperation.

关键词: Tactile Internet, haptic signal prediction, Continuous Orthogonal Mode Decomposition, Mode-Domain Architecture, bilateral predictive neural network, ultra-low latency, real-time haptic teleoperation, mode overlapping

223. ❌ AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning

作者: Ioannis Tsingalis, Constantine Kotropoulos, Corentin Briat 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09437v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出了一种名为AdaCubic的自适应立方正则化优化器，属于深度学习优化算法领域。虽然论文在NLP任务上进行了实验，但其核心贡献是通用的优化方法（基于牛顿立方正则化方法），而非大模型技术本身。所有评分关键词都直接针对大模型技术、架构、训练方法、推理优化、应用等具体方面，而本论文研究的是底层优化算法，与这些关键词没有直接关联。因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

论文提出了一种自适应立方正则化优化器AdaCubic，通过动态调整立方项权重和Hessian矩阵近似，在计算机视觉、自然语言处理和信号处理任务中优于或媲美现有优化器，且无需超参数调优。

摘要翻译

本文提出了一种新颖的正则化技术AdaCubic，其能够自适应调整三次项的权重。AdaCubic的核心在于一个带三次约束的辅助优化问题，该问题可动态调节牛顿三次正则化方法中三次项的权重。我们采用Hutchinson方法来近似海森矩阵（Hessian matrix），从而降低计算成本。我们证明了AdaCubic继承了三次正则化牛顿方法的局部收敛性保证。在计算机视觉、自然语言处理和信号处理任务中的实验表明，AdaCubic的性能优于或可与多种广泛使用的优化器相竞争。与其他需要超参数微调的自适应算法不同，AdaCubic在固定超参数集下进行评估，这使其在无法进行微调的场景中成为极具吸引力的优化器。因此，AdaCubic对研究人员和实践者而言都是一个有吸引力的选择。据我们所知，AdaCubic是首个在可扩展深度学习应用中利用三次正则化的优化器。

摘要 (Abstract)

A novel regularization technique, AdaCubic, is proposed that adapts the weight of the cubic term. The heart of AdaCubic is an auxiliary optimization problem with cubic constraints that dynamically adjusts the weight of the cubic term in Newton’s cubic regularized method. We use Hutchinson’s method to approximate the Hessian matrix, thereby reducing computational cost. We demonstrate that AdaCubic inherits the cubically regularized Newton method’s local convergence guarantees. Our experiments in Computer Vision, Natural Language Processing, and Signal Processing tasks demonstrate that AdaCubic outperforms or competes with several widely used optimizers. Unlike other adaptive algorithms that require hyperparameter fine-tuning, AdaCubic is evaluated with a fixed set of hyperparameters, rendering it a highly attractive optimizer in settings where fine-tuning is infeasible. This makes AdaCubic an attractive option for researchers and practitioners alike. To our knowledge, AdaCubic is the first optimizer to leverage cubic regularization in scalable deep learning applications.

关键词: AdaCubic, adaptive cubic regularization, optimizer, deep learning, Newton’s cubic regularized method, Hessian approximation, Hutchinson’s method, hyperparameter-free

224. ❌ Offline Local Search for Online Stochastic Bandits

作者: Gerdus Benadè, Rathish Das, Thomas Lavastida 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09423v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究在线随机组合多臂老虎机问题，提出了一种将离线局部搜索算法转换为在线算法的通用框架，并应用于调度、拟阵基和不确定聚类等组合优化问题。论文内容完全属于经典在线学习、组合优化和算法设计领域，不涉及任何大模型、深度学习、AI for Science或相关技术原理。所有关键词均与大模型、深度学习及其应用、优化技术相关，而本文研究的是传统随机优化算法，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文研究了在线随机组合多臂老虎机问题，提出了一种将离线局部搜索算法转换为在线算法的通用框架，实现了O(log^3 T)的近似遗憾界，并应用于调度、拟阵基和不确定聚类等组合优化问题。

摘要翻译

组合多臂赌博机提供了一个基础的在线决策环境，决策者在 $T$ 个时间步中与环境交互，每次选择一个动作并学习该动作的代价。其目标是最小化遗憾，即与全信息下事后最优固定动作相比的损失。如何利用离线算法设计知识来改进这一在线设定已引起广泛关注。研究表明，离线的贪心算法和线性优化算法（包括精确与近似版本）在在线部署时能提供有效的性能保证。本文研究局部搜索方法——一类在理论与实践领域广泛使用但在此背景下尚未充分探索的算法。我们关注离线局部搜索能在近似最优解终止的问题，并提出一种通用方法，将此类离线算法转化为在线随机组合赌博机算法，其（近似）遗憾为 $O(\log^3 T)$。相比之下，现有的离线-在线转换框架产生的遗憾（及近似遗憾）虽为 $T$ 的亚线性函数，但仍具有多项式依赖。我们通过将框架应用于三个在线随机组合优化问题，展示了其灵活性：最小化总完成时间的调度问题、拟阵的最小代价基求解问题以及不确定聚类问题。

摘要 (Abstract)

Combinatorial multi-armed bandits provide a fundamental online decision-making environment where a decision-maker interacts with an environment across $T$ time steps, each time selecting an action and learning the cost of that action. The goal is to minimize regret, defined as the loss compared to the optimal fixed action in hindsight under full-information. There has been substantial interest in leveraging what is known about offline algorithm design in this online setting. Offline greedy and linear optimization algorithms (both exact and approximate) have been shown to provide useful guarantees when deployed online. We investigate local search methods, a broad class of algorithms used widely in both theory and practice, which have thus far been under-explored in this context. We focus on problems where offline local search terminates in an approximately optimal solution and give a generic method for converting such an offline algorithm into an online stochastic combinatorial bandit algorithm with $O(\log^3 T)$ (approximate) regret. In contrast, existing offline-to-online frameworks yield regret (and approximate regret) which depend sub-linearly, but polynomially on $T$. We demonstrate the flexibility of our framework by applying it to three online stochastic combinatorial optimization problems: scheduling to minimize total completion time, finding a minimum cost base of a matroid and uncertain clustering.

关键词: online stochastic bandits, combinatorial optimization, local search, regret minimization, offline-to-online conversion, scheduling, matroid base, uncertain clustering

225. ❌ NOMAD: Generating Embeddings for Massive Distributed Graphs

作者: Aishwarya Sarkar, Sayan Ghosh, Nathan R. Tallent, Ali Jannesari 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09419v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文NOMAD专注于分布式图嵌入框架，用于大规模图（如科学领域中的图）的机器学习。它不涉及大语言模型（LLMs）、深度学习技术原理或任何列出的具体大模型技术（如MoE、SFT、RLHF、RAG等）。唯一的相关性是关键词’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文提到应用于科学领域的大规模图，但这只是应用背景，并非核心创新。因此，该关键词得5分（有一定关联），其他所有关键词得0分（完全无关）。论文主题是图嵌入和分布式计算，而非大模型或深度学习创新。

!!! tip deepseek-chat TL;DR

论文提出了NOMAD，一个分布式内存图嵌入框架，解决了大规模图（如科学和网络领域）中嵌入生成的可扩展性问题，通过MPI实现，相比现有方法获得了显著的加速（如10-370倍速度提升）并保持竞争性的嵌入质量。

摘要翻译

在图形或网络上成功进行机器学习需要不仅能将节点和边表示为低维向量，同时能保持图结构的嵌入方法。现有的嵌入生成方法需通过重复使用随机游走来灵活探索整个图，这些随机游走通过节点和边的采样捕获图结构。对于拥有数百万至数十亿条边的大规模图，这些方法带来了可扩展性挑战，因为单节点解决方案的内存和处理能力不足。
我们提出了NOMAD，一个基于分布式内存的图嵌入框架，它使用消息传递接口（Message Passing Interface，MPI）来处理分布式图。NOMAD实现了广受欢迎的LINE（大规模信息网络嵌入）算法中提出的基于邻近度的模型。我们提出了若干实用的权衡策略，以改善不规则和分布式图嵌入方法所面临的可扩展性和通信开销问题，从而适应网络和科学领域中出现的超大规模图。在基于CPU的NERSC Perlmutter集群上，NOMAD相较于流行的多线程LINE和node2vec参考实现，实现了中位数10至100倍的加速；相较于分布式PBG，实现了35至76倍的加速；同时在嵌入质量上与LINE、node2vec及GraphVite具有竞争力，并在真实世界图上实现了12至370倍的端到端加速。

摘要 (Abstract)

Successful machine learning on graphs or networks requires embeddings that not only represent nodes and edges as low-dimensional vectors but also preserve the graph structure. Established methods for generating embeddings require flexible exploration of the entire graph through repeated use of random walks that capture graph structure with samples of nodes and edges. These methods create scalability challenges for massive graphs with millions-to-billions of edges because single-node solutions have inadequate memory and processing capabilities. We present NOMAD, a distributed-memory graph embedding framework using the Message Passing Interface (MPI) for distributed graphs. NOMAD implements proximity-based models proposed in the widely popular LINE (Large-scale Information Network Embedding) algorithm. We propose several practical trade-offs to improve the scalability and communication overheads confronted by irregular and distributed graph embedding methods, catering to massive-scale graphs arising in web and science domains. NOMAD demonstrates median speedups of 10/100x on CPU-based NERSC Perlmutter cluster relative to the popular reference implementations of multi-threaded LINE and node2vec, 35-76x over distributed PBG, and competitive embedding quality relative to LINE, node2vec, and GraphVite, while yielding 12-370x end-to-end speedups on real-world graphs.

关键词: graph embeddings, distributed-memory framework, MPI, scalability, massive graphs, LINE algorithm, speedup, embedding quality

226. ❌ Beyond Augmented-Action Surrogates for Multi-Expert Learning-to-Defer

作者: Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09414v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是Learning-to-Defer with advice，这是一个机器学习中的决策路由问题，关注如何将输入路由到专家并选择额外信息（如检索文档、工具输出）以提高决策质量。论文的核心是提出一种增强代理方法来解决传统分离代理的不一致性问题，并在表格、语言和多模态任务上进行了实验。虽然论文提到了语言任务，但主要关注的是通用的决策路由框架，而非特定的大模型技术、深度学习原理或科学AI应用。所有关键词都涉及大模型、深度学习技术或特定AI应用领域，与论文的通用机器学习决策路由主题没有直接关联，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

论文研究了Learning-to-Defer with advice问题，提出了一种增强代理方法来解决传统分离代理的不一致性，并在实验中证明了该方法能改进标准Learning-to-Defer并适应成本机制下的建议获取行为。

摘要翻译

学习递延（Learning-to-Defer）方法将每个输入路由至能使期望成本最小化的专家，但其假设所有专家在决策时可获取的信息是固定的。许多现代系统违背了这一假设：在选择专家后，系统还可决定该专家应接收哪些额外信息，例如检索到的文档、工具输出或升级上下文。我们研究此问题并将其称为“带建议的学习递延”（Learning-to-Defer with advice）。我们证明，在最小的非平凡设定中，一大类自然的分离代理函数——即通过不同头部学习路由与建议的方法——是不一致的。随后，我们提出一种增强型代理函数，该函数在复合的专家-建议行动空间上操作，并证明了其具有$\mathcal{H}$一致性保证及超额风险转移界，从而能在极限情况下恢复贝叶斯最优策略。在表格数据、语言和多模态任务上的实验表明，所提出的方法优于标准学习递延方法，同时能根据成本机制调整其建议获取行为；一个合成基准测试验证了分离代理函数所预测的失效模式。

摘要 (Abstract)

Learning-to-Defer routes each input to the expert that minimizes expected cost, but it assumes that the information available to every expert is fixed at decision time. Many modern systems violate this assumption: after selecting an expert, one may also choose what additional information that expert should receive, such as retrieved documents, tool outputs, or escalation context. We study this problem and call it Learning-to-Defer with advice. We show that a broad family of natural separated surrogates, which learn routing and advice with distinct heads, is inconsistent even in the smallest non-trivial setting. We then introduce an augmented surrogate that operates on the composite expert–advice action space and prove an $\mathcal{H}$-consistency guarantee together with an excess-risk transfer bound, yielding recovery of the Bayes-optimal policy in the limit. Experiments on tabular, language, and multi-modal tasks show that the resulting method improves over standard Learning-to-Defer while adapting its advice-acquisition behavior to the cost regime; a synthetic benchmark confirms the failure mode predicted for separated surrogates.

关键词: Learning-to-Defer, advice, surrogates, routing, expert systems, decision-making, Bayes-optimal policy, multi-modal tasks

227. ❌ Sharp description of local minima in the loss landscape of high-dimensional two-layer ReLU neural networks

作者: Jie Huang, Bruno Loureiro, Stefano Sarao Mannelli 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09412v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究两层ReLU神经网络的损失函数景观，属于深度学习理论分析范畴，但所有评分关键词均针对大语言模型（LLM）及其相关技术（如训练方法、推理优化、应用等）。论文未涉及任何大模型、语言模型、特定训练技术（如RLHF、PEFT）、推理方法（如RAG、CoT）或科学AI应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文研究高维两层ReLU神经网络在教师-学生设置下的损失函数景观，发现局部最小值具有精确的低维表示，并与SGD动态的吸引固定点直接相关，揭示了过参数化下全局最小值可及性增加而虚假解减少的机制。

摘要翻译

我们在高斯协变量的可实现的师生设定下，研究形式为 $\sum_{k=1}^K \mathrm{ReLU}(w_k^\top x)$ 的两层 ReLU 网络的总体损失函数景观。我们证明，局部极小值可以通过一组汇总统计量进行精确的低维表示，从而对损失景观给出清晰且可解释的描述。我们进一步建立了与单次随机梯度下降（SGD）的直接联系：局部极小值对应于汇总统计量空间中动力学的吸引不动点。这一视角揭示了极小值的层级结构：在设定正确的机制下，它们通常是孤立的；但随着网络宽度增加，它们会通过平坦方向相互连接。在这种过参数化机制下，全局极小值变得更容易接近，从而吸引动力学轨迹并减少收敛到伪解的可能性。总体而言，我们的结果揭示了常见简化假设的内在局限性，这些假设即使在最简单的神经网络模型中也可能忽略损失景观的关键特征。

摘要 (Abstract)

We study the population loss landscape of two-layer ReLU networks of the form $\sum_{k=1}^K \mathrm{ReLU}(w_k^\top x)$ in a realisable teacher-student setting with Gaussian covariates. We show that local minima admit an exact low-dimensional representation in terms of summary statistics, yielding a sharp and interpretable characterisation of the landscape. We further establish a direct link with one-pass SGD: local minima correspond to attractive fixed points of the dynamics in summary statistics space. This perspective reveals a hierarchical structure of minima: they are typically isolated in the well-specified regime, but become connected by flat directions as network width increases. In this overparameterised regime, global minima become increasingly accessible, attracting the dynamics and reducing convergence to spurious solutions. Overall, our results reveal intrinsic limitations of common simplifying assumptions, which may miss essential features of the loss landscape even in minimal neural network models.

关键词: two-layer ReLU networks, loss landscape, local minima, teacher-student setting, SGD dynamics, overparameterization, global minima, spurious solutions

228. ❌ Variational Quantum Physics-Informed Neural Networks for Hydrological PDE-Constrained Learning with Inherent Uncertainty Quantification

作者: Prasad Nimantha Madusanka Ukwatta Hewage, Midhun Chakkravarthy, Ruvan Kumara Abeysekara 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09374v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文主要研究量子增强的物理信息神经网络在洪水预测中的应用，属于AI for Science领域，与’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分）。论文提到了’pre-trains on multi-hazard disaster data before fine-tuning on flood-specific events’，这涉及预训练和微调的概念，与’Pre-training OR Continual Pre-training OR Domain Adaptation’和’Post-training OR Supervised Fine-tuning OR SFT’有一定关联（各5分）。其他关键词主要涉及大语言模型（LLMs）的技术细节（如MoE、RLHF、RAG等）、推理方法（如CoT）、代理系统、模型优化（如量化）等，而本文专注于量子神经网络和物理约束学习，未涉及这些大语言模型相关技术，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种混合量子-经典物理信息神经网络（HQC-PINN），用于水文PDE约束的洪水预测，通过量子电路实现不确定性量化，并在训练效率和参数数量上优于经典方法。

摘要翻译

我们提出一种混合量子-经典物理信息神经网络（Hybrid Quantum-Classical Physics-Informed Neural Network, HQC-PINN），它将参数化变分量子电路集成到物理信息神经网络框架中，用于水文偏微分方程约束学习。该架构通过可训练的角度编码将多源遥感特征编码为量子态，经由包含纠缠层的硬件高效变分拟设进行处理，并利用圣维南浅水方程和曼宁水流方程作为可微分物理损失项对输出进行约束。量子测量固有的随机性为不确定性量化提供了自然机制，无需显式的贝叶斯推断框架。我们进一步引入一种量子迁移学习协议，先在多灾种灾害数据上进行预训练，再针对洪水事件进行微调。基于斯里兰卡卡鲁河流域多模态卫星与气象数据的数值模拟表明，相较于等效的经典物理信息神经网络，HQC-PINN 仅需约三分之一的训练轮次即可收敛，且可训练参数量减少约44%，同时保持具有竞争力的分类精度。理论分析指出，水文物理约束缩小了有效优化空间，为缓解变分量子电路中的贫瘠高原问题提供了自然途径。本研究首次将量子增强的物理信息学习应用于水文预测，并展示了在环境科学领域实现量子优势的可行路径。

摘要 (Abstract)

We propose a Hybrid Quantum-Classical Physics-Informed Neural Network (HQC-PINN) that integrates parameterized variational quantum circuits into the PINN framework for hydrological PDE-constrained learning. Our architecture encodes multi-source remote sensing features into quantum states via trainable angle encoding, processes them through a hardware-efficient variational ansatz with entangling layers, and constrains the output using the Saint-Venant shallow water equations and Manning’s flow equation as differentiable physics loss terms. The inherent stochasticity of quantum measurement provides a natural mechanism for uncertainty quantification without requiring explicit Bayesian inference machinery. We further introduce a quantum transfer learning protocol that pre-trains on multi-hazard disaster data before fine-tuning on flood-specific events. Numerical simulations on multi-modal satellite and meteorological data from the Kalu River basin, Sri Lanka, show that the HQC-PINN achieves convergence in ~3x fewer training epochs and uses ~44% fewer trainable parameters compared to an equivalent classical PINN, while maintaining competitive classification accuracy. Theoretical analysis indicates that hydrological physics constraints narrow the effective optimization landscape, providing a natural mitigation against barren plateaus in variational quantum circuits. This work establishes the first application of quantum-enhanced physics-informed learning to hydrological prediction and demonstrates a viable path toward quantum advantage in environmental science.

关键词: Quantum-Classical Physics-Informed Neural Network, Hydrological PDE-constrained learning, Variational quantum circuits, Uncertainty quantification, Quantum transfer learning, Flood prediction, Saint-Venant equations, Multi-modal satellite data

229. ❌ Biologically-Grounded Multi-Encoder Architectures as Developability Oracles for Antibody Design

作者: Simon J. Crouzet 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09369v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心是使用预训练的蛋白质语言模型（6B编码器）作为基础，通过监督微调（SFT）构建抗体可开发性预测框架CrossAbSense，属于AI for Science（生物信息学）领域。论文与’Large Language Models’相关（使用6B蛋白质语言模型），与’Pre-training’相关（使用预训练编码器），与’Post-training/SFT’高度相关（通过超参数搜索进行监督微调），与’AI for Science/Bioinformatics’高度相关（抗体设计应用）。其他关键词如MoE、SLMs、Scaling Laws、RLHF、RAG、推理加速等均未涉及。

!!! tip deepseek-chat TL;DR

该研究开发了CrossAbSense框架，通过结合预训练的蛋白质语言模型编码器和可配置的注意力解码器，构建了抗体可开发性预测模型，在GDPa1基准测试中相比基线提升了12-20%，并展示了在抗体设计筛选中的实用价值。

摘要翻译

生成模型现已能够设计数千种全新抗体序列，但将这些设计转化为可行疗法仍受限于生物物理表征的成本。本文提出CrossAbSense框架，该框架包含一系列针对特定属性的神经预测模型，这些模型将冻结的蛋白质语言模型编码器与可配置的注意力解码器相结合，其结构通过针对每个属性进行超过200次实验的系统化超参数筛选确定。在包含242个治疗性IgG抗体的GDPa1基准测试中，我们的模型在五项可开发性检测中的三项上相比现有基线实现了12-20%的显著提升，在其余两项检测中也表现出可比性能。核心发现在于：最优解码器架构与我们最初的生物学假设相反——对于聚集相关属性（疏水相互作用色谱、多反应性），仅需自注意力机制即可实现最佳预测，因为相关序列特征（如CDR-H3疏水斑块）已通过高性能的60亿参数编码器在单链嵌入中得到充分解析。相比之下，表达产量和热稳定性这两个本质上依赖于重链与轻链兼容性的属性，则需要双向交叉注意力机制。学习到的链融合权重独立证实了重链在聚集性中的主导作用（w_H = 0.62），而在稳定性中则呈现双链平衡贡献（w_H = 0.51）。我们通过将CrossAbSense应用于100个由IgLM生成的抗体设计，展示了其实际应用价值，这为大幅降低实验筛选成本提供了可行路径。

摘要 (Abstract)

Generative models can now propose thousands of \emph{de novo} antibody sequences, yet translating these designs into viable therapeutics remains constrained by the cost of biophysical characterization. Here we present CrossAbSense, a framework of property-specific neural oracles that combine frozen protein language model encoders with configurable attention decoders, identified through a systematic hyperparameter campaign totaling over 200 runs per property. On the GDPa1 benchmark of 242 therapeutic IgGs, our oracles achieve notable improvements of 12–20% over established baselines on three of five developability assays and competitive performance on the remaining two. The central finding is that optimal decoder architectures \emph{invert} our initial biological hypotheses: self-attention alone suffices for aggregation-related properties (hydrophobic interaction chromatography, polyreactivity), where the relevant sequence signatures – such as CDR-H3 hydrophobic patches – are already fully resolved within single-chain embeddings by the high-capacity 6B encoder. Bidirectional cross-attention, by contrast, is required for expression yield and thermal stability – properties that inherently depend on the compatibility between heavy and light chains. Learned chain fusion weights independently confirm heavy-chain dominance in aggregation ($w_H = 0.62$) versus balanced contributions for stability ($w_H = 0.51$). We demonstrate practical utility by deploying CrossAbSense on 100 IgLM-generated antibody designs, illustrating a path toward substantial reduction in experimental screening costs.

关键词: antibody design, protein language model, developability prediction, supervised fine-tuning, cross-attention, biophysical characterization, therapeutic antibodies, neural oracles

230. ❌ Stochastic-Dimension Frozen Sampled Neural Network for High-Dimensional Gross-Pitaevskii Equations on Unbounded Domains

作者: Zhangyong Liang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09361v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文提出了一种随机维度冻结采样神经网络（SD-FSNN）用于求解无界域上的高维Gross-Pitaevskii方程（GPEs），属于计算物理和科学计算领域。论文的核心是开发一种高效的数值方法来解决特定类型的偏微分方程，而不是研究大模型或深度学习技术本身。所有关键词都聚焦于大模型技术（如LLMs、MoE、训练方法、推理优化、对齐、代理等）或特定科学AI应用（如生物信息学、化学信息学）。论文虽然使用了神经网络，但其应用是解决物理方程，与关键词列表中的大模型技术无关。唯一可能相关的关键词是“AI for Science OR Bioinformatics OR Cheminformatics”，因为论文属于AI在科学计算中的应用，但并非生物或化学信息学领域，因此给予5分（有一定关联）。其他关键词均完全无关，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种随机维度冻结采样神经网络（SD-FSNN），用于高效求解无界域上的高维Gross-Pitaevskii方程，在计算成本和精度上优于现有方法。

摘要翻译

本文提出了一种随机维度冻结采样神经网络（SD-FSNN），用于求解无界域上的一类高维Gross-Pitaevskii方程（GPEs）。SD-FSNN在所有维度上均保持无偏性，且其计算成本与维度无关，避免了基于Hermite基离散化方法中计算与内存成本的指数级增长。此外，我们随机采样神经网络的隐藏层权重与偏置，在训练时间和精度上显著优于基于梯度的迭代优化方法。进一步，我们采用时空分离策略，利用自适应常微分方程（ODE）求解器更新演化系数并融入时间因果性。为保持GPEs的结构特性，我们在神经网络中引入高斯加权试探函数以强制解在无穷远处的指数衰减，嵌入归一化投影层以实现质量归一化，并添加能量守恒约束以缓解长时间数值耗散。与现有方法的对比实验表明，SD-FSNN在不同空间维度和相互作用参数下均表现出优越性能。相较于现有随机特征方法，SD-FSNN将复杂度从线性降低至与维度无关。同时，相比于通用高维求解器，SD-FSNN在专门针对无界域高维GPEs的求解中实现了更高精度与更快训练速度。

摘要 (Abstract)

In this paper, we propose a stochastic-dimension frozen sampled neural network (SD-FSNN) for solving a class of high-dimensional Gross-Pitaevskii equations (GPEs) on unbounded domains. SD-FSNN is unbiased across all dimensions, and its computational cost is independent of the dimension, avoiding the exponential growth in computational and memory costs associated with Hermite-basis discretizations. Additionally, we randomly sample the hidden weights and biases of the neural network, significantly outperforming iterative, gradient-based optimization methods in terms of training time and accuracy. Furthermore, we employ a space-time separation strategy, using adaptive ordinary differential equation (ODE) solvers to update the evolution coefficients and incorporate temporal causality. To preserve the structure of the GPEs, we integrate a Gaussian-weighted ansatz into the neural network to enforce exponential decay at infinity, embed a normalization projection layer for mass normalization, and add an energy conservation constraint to mitigate long-time numerical dissipation. Comparative experiments with existing methods demonstrate the superior performance of SD-FSNN across a range of spatial dimensions and interaction parameters. Compared to existing random-feature methods, SD-FSNN reduces the complexity from linear to dimension-independent. Additionally, SD-FSNN achieves better accuracy and faster training compared to general high-dimensional solvers, while focusing specifically on high-dimensional GPEs on unbounded domains.

关键词: Stochastic-dimension frozen sampled neural network, High-dimensional Gross-Pitaevskii equations, Unbounded domains, Dimension-independent computational cost, Random sampling of weights, Space-time separation strategy, Gaussian-weighted ansatz, Energy conservation constraint

231. ❌ Bringing Clustering to MLL: Weakly-Supervised Clustering for Partial Multi-Label Learning

作者: Yu Chen, Weijun Lv, Yue Huang, Xuhuan Zhu, Fang Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09359v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于多标签学习（MLL）和部分多标签学习（PML）中的标签噪声问题，提出了一种弱监督聚类方法。论文内容涉及传统机器学习、聚类算法和标签噪声处理，但完全不涉及大语言模型（LLMs）、深度学习技术原理创新或AI在科学领域的应用。所有关键词均与大模型、深度学习技术或AI科学应用相关，而本文研究的是传统多标签学习问题，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对部分多标签学习中的标签噪声问题，提出了一种通过成员矩阵分解将聚类与多标签学习相结合的弱监督聚类方法WSC-PML，在24个数据集上超越了六种先进方法。

摘要翻译

多标签学习中的标签噪声对模型训练构成显著挑战，尤其在部分多标签学习场景中，候选标签同时包含相关与无关标签。尽管聚类为利用数据结构进行噪声识别提供了一种自然思路，但传统聚类方法无法直接应用于多标签场景，其根本矛盾在于：聚类产生的隶属度值在每个实例上总和为一，而多标签分配需要可求和为任意数的二元值。本文提出一种面向部分多标签学习的新型弱监督聚类方法，通过隶属矩阵分解在聚类与多标签学习间建立桥梁。我们的核心创新在于将聚类隶属矩阵 $\mathbf{A}$ 分解为两个分量：$\mathbf{A} = \mathbfΠ \odot \mathbf{F}$，其中 $\mathbfΠ$ 维持聚类约束，$\mathbf{F}$ 则保留多标签特性。该分解实现了无监督聚类与多标签监督的无缝融合，从而有效处理标签噪声。该方法采用三阶段流程：基于噪声标签的初始原型学习、自适应置信度弱监督构建，以及通过迭代聚类优化的联合学习。在24个数据集上的大量实验表明，本方法在所有评估指标上均优于六种前沿方法。

摘要 (Abstract)

Label noise in multi-label learning (MLL) poses significant challenges for model training, particularly in partial multi-label learning (PML) where candidate labels contain both relevant and irrelevant labels. While clustering offers a natural approach to exploit data structure for noise identification, traditional clustering methods cannot be directly applied to multi-label scenarios due to a fundamental incompatibility: clustering produces membership values that sum to one per instance, whereas multi-label assignments require binary values that can sum to any number. We propose a novel weakly-supervised clustering approach for PML (WSC-PML) that bridges clustering and multi-label learning through membership matrix decomposition. Our key innovation decomposes the clustering membership matrix $\mathbf{A}$ into two components: $\mathbf{A} = \mathbfΠ \odot \mathbf{F}$, where $\mathbfΠ$ maintains clustering constraints while $\mathbf{F}$ preserves multi-label characteristics. This decomposition enables seamless integration of unsupervised clustering with multi-label supervision for effective label noise handling. WSC-PML employs a three-stage process: initial prototype learning from noisy labels, adaptive confidence-based weak supervision construction, and joint optimization via iterative clustering refinement. Extensive experiments on 24 datasets demonstrate that our approach outperforms six state-of-the-art methods across all evaluation metrics.

关键词: multi-label learning, partial multi-label learning, label noise, weakly-supervised clustering, membership matrix decomposition, noise identification, clustering refinement

232. ❌ Drift-Aware Online Dynamic Learning for Nonstationary Multivariate Time Series: Application to Sintering Quality Prediction

作者: Yumeng Zhao, Shengxiang Yang, Xianpeng Wang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09358v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于非平稳多元时间序列预测的在线动态学习框架（DA-MSDL），应用于工业烧结过程的质量预测。其核心是处理概念漂移、多尺度时空特征提取和在线自适应机制，使用卷积网络、MMD漂移检测和分层微调等技术。所有关键词均与大模型（LLMs）或深度学习技术原理直接相关，而本文研究的是传统时间序列预测和在线学习，未涉及大模型、Transformer架构、提示工程、对齐、推理、代理、高效微调等主题。唯一相关的是“AI for Science OR Bioinformatics OR Cheminformatics”，因为论文将AI应用于工业科学（烧结过程），属于AI for Science的广义范畴，但非核心生物信息学或化学信息学，故给5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文针对非平稳多元时间序列（如工业烧结过程）中概念漂移和标签延迟导致的预测性能下降问题，提出了一种漂移感知多尺度动态学习框架（DA-MSDL），通过在线自适应机制和多尺度特征提取，在真实工业数据上实现了优于基线的鲁棒预测性能。

摘要翻译

准确预测非平稳多元时间序列在铁矿烧结等复杂工业系统中仍是一项关键挑战。实践中，显著的概念漂变与严重的标签验证延迟相互叠加，导致离线训练模型的性能迅速退化。现有基于静态架构或被动更新策略的方法难以在缺乏即时监督的情况下，同时提取多尺度时空特征并克服稳定性-可塑性困境。为应对这些局限，本文提出一种漂变感知多尺度动态学习（Drift-Aware Multi-Scale Dynamic Learning, DA-MSDL）框架，通过对非平稳数据流实施在线自适应机制，维持稳健的多输出预测性能。该框架采用多尺度双分支卷积网络作为主干，以分离局部波动与长期趋势，从而增强对复杂动态模式的表征能力。为规避标签延迟瓶颈，DA-MSDL利用最大均值差异（Maximum Mean Discrepancy, MMD）进行无监督漂变检测。通过量化特征分布的在线统计偏差，DA-MSDL在推理前主动触发模型自适应。此外，本文开发了一种漂变严重程度引导的分层微调策略。该方法依托动态记忆队列的优先级经验回放，在实现快速分布对齐的同时，有效缓解灾难性遗忘。基于真实工业烧结数据和公共基准数据集的长期实验表明，在严重概念漂变下，DA-MSDL始终优于代表性基线方法。该框架展现出强大的跨领域泛化能力和预测稳定性，为非平稳环境下的质量监控提供了一种有效的在线动态学习范式。

摘要 (Abstract)

Accurate prediction of nonstationary multivariate time series remains a critical challenge in complex industrial systems such as iron ore sintering. In practice, pronounced concept drift compounded by significant label verification latency rapidly degrades the performance of offline-trained models. Existing methods based on static architectures or passive update strategies struggle to simultaneously extract multi-scale spatiotemporal features and overcome the stability-plasticity dilemma without immediate supervision. To address these limitations, a Drift-Aware Multi-Scale Dynamic Learning (DA-MSDL) framework is proposed to maintain robust multi-output predictive performance via online adaptive mechanisms on nonstationary data streams. The framework employs a multi-scale bi-branch convolutional network as its backbone to disentangle local fluctuations from long-term trends, thereby enhancing representational capacity for complex dynamic patterns. To circumvent the label latency bottleneck, DA-MSDL leverages Maximum Mean Discrepancy (MMD) for unsupervised drift detection. By quantifying online statistical deviations in feature distributions, DA-MSDL proactively triggers model adaptation prior to inference. Furthermore, a drift-severity-guided hierarchical fine-tuning strategy is developed. Supported by prioritized experience replay from a dynamic memory queue, this approach achieves rapid distribution alignment while effectively mitigating catastrophic forgetting. Long-horizon experiments on real-world industrial sintering data and a public benchmark dataset demonstrate that DA-MSDL consistently outperforms representative baselines under severe concept drift. Exhibiting strong cross-domain generalization and predictive stability, the proposed framework provides an effective online dynamic learning paradigm for quality monitoring in nonstationary environments.

关键词: nonstationary multivariate time series, concept drift, online dynamic learning, multi-scale spatiotemporal features, drift detection, hierarchical fine-tuning, sintering quality prediction, industrial systems

233. ❌ Hierarchical Flow Decomposition for Turning Movement Prediction at Signalized Intersections

作者: Md Atiqur Rahman Mallick, Kamrul Hasan, Pulock Das, Liang Hong, S M Shazzad Rassel 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09336v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究交通信号交叉口转向运动预测，使用深度学习框架（HFD-TM），属于交通工程和深度学习应用领域。所有评分关键词均与大模型（LLM）技术、大模型训练/对齐方法、大模型推理优化、大模型应用范式（如Agent、RAG）或特定科学领域AI（如生物信息学）相关。论文内容完全不涉及大模型或相关技术，也未应用于生物/化学信息学等科学领域，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于分层流分解的深度学习框架（HFD-TM），用于预测信号交叉口的转向运动，通过先预测走廊直行流量再分解为转向流，在真实LiDAR数据上实现了比Transformer和GRU更低的预测误差，并显著减少了训练时间。

摘要翻译

交叉口转向流量的精准预测对于自适应信号控制至关重要，但由于方向性流量的高波动性，这仍是一项难题。本研究提出HFD-TM（面向转向预测的分层流量分解框架），这是一种分层深度学习框架，其通过首先预测走廊直行流量，再将预测结果扩展至各转向流，从而实现转向流量预测。该设计基于实际交通结构特征：走廊直行流量占总流量的65.1%，其波动性低于转向流量，并能解释35.5%的转向流量方差。框架采用物理信息损失函数来强制流量守恒，以保持结构一致性。基于美国田纳西州纳什维尔一条包含六个交叉口的走廊所采集的六个月、以15分钟为间隔的LiDAR（激光探测与测距）数据进行评估，HFD-TM实现了每间隔2.49辆车的平均绝对误差，相较于Transformer模型降低了5.7%的MAE，相较于门控循环单元（GRU）降低了27.0%。消融实验结果表明，分层分解带来了最大的性能提升，同时其训练时间比扩散卷积循环神经网络（DCRNN）降低了12.8倍，证明了该框架适用于实时交通应用场景。

摘要 (Abstract)

Accurate prediction of intersection turning movements is essential for adaptive signal control but remains difficult due to the high volatility of directional flows. This study proposes HFD-TM (Hierarchical Flow-Decomposition for Turning Movement Prediction), a hierarchical deep learning framework that predicts turning movements by first forecasting corridor through-movements and then expanding these predictions to individual turning streams. This design is motivated by empirical traffic structure, where corridor flows account for 65.1% of total volume, exhibit lower volatility than turning movements, and explain 35.5% of turning-movement variance. A physics-informed loss function enforces flow conservation to maintain structural consistency. Evaluated on six months of 15-minute interval LiDAR (Light Detection and Ranging) data from a six-intersection corridor in Nashville, Tennessee, HFD-TM achieves a mean absolute error of 2.49 vehicles per interval, reducing MAE by 5.7% compared to a Transformer and by 27.0% compared to a GRU (Gated Recurrent Unit). Ablation results show that hierarchical decomposition provides the largest performance gain, while training time is 12.8 times lower than DCRNN (Diffusion Convolutional Recurrent Neural Network), demonstrating suitability for real-time traffic applications.

关键词: turning movement prediction, hierarchical flow decomposition, deep learning, traffic signal control, LiDAR data, physics-informed loss, real-time traffic applications, HFD-TM

234. ❌ Stability Enhanced Gaussian Process Variational Autoencoders

作者: Carl R. Richardson, Jichen Zhang, Ethan King, Ján Drgoňa 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09331v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文提出了一种稳定性增强的高斯过程变分自编码器（SEGP-VAE），用于从高维视频数据间接训练低维线性时不变（LTI）系统。研究聚焦于概率模型、物理建模、系统稳定性、参数化方法和视频分析，属于机器学习与物理建模交叉领域。所有评分关键词均涉及大模型、深度学习技术原理或特定AI应用（如生物信息学），而本文的核心内容（高斯过程、变分自编码器、LTI系统、稳定性分析、视频数据）与这些关键词无直接关联，未涉及大模型、语言模型、训练技术、推理方法、代理系统、模型压缩等主题，也未应用于生物信息学或化学信息学。因此，所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种稳定性增强的高斯过程变分自编码器（SEGP-VAE），用于从高维视频数据间接训练低维线性时不变系统，并通过参数化确保系统稳定性，从而实现了准确的潜在状态预测。

摘要翻译

本文提出了一种新型稳定性增强高斯过程变分自编码器（SEGP-VAE），用于利用高维视频数据间接训练低维线性时不变（LTI）系统。该新型SEGP先验的均值函数与协方差函数均从LTI系统的定义推导得出，使得SEGP能够通过结合概率模型与可解释的物理模型，捕捉间接观测到的潜在过程。通过一种完备且无约束的参数化方法，将LTI参数的搜索空间限制在半收缩系统集合内。因此，SEGP-VAE可采用无约束优化算法进行训练。此外，这种参数化方式避免了因非赫尔维茨状态矩阵存在而引起的数值问题。案例研究将SEGP-VAE应用于包含粒子螺旋运动视频的数据集，结果凸显了该方法的优势以及为实现精确潜在状态预测所采用的特定应用设计策略。

摘要 (Abstract)

A novel stability-enhanced Gaussian process variational autoencoder (SEGP-VAE) is proposed for indirectly training a low-dimensional linear time invariant (LTI) system, using high-dimensional video data. The mean and covariance function of the novel SEGP prior are derived from the definition of an LTI system, enabling the SEGP to capture the indirectly observed latent process using a combined probabilistic and interpretable physical model. The search space of LTI parameters is restricted to the set of semi-contracting systems via a complete and unconstrained parametrisation. As a result, the SEGP-VAE can be trained using unconstrained optimisation algorithms. Furthermore, this parametrisation prevents numerical issues caused by the presence of a non-Hurwitz state matrix. A case study applies SEGP-VAE to a dataset containing videos of spiralling particles. This highlights the benefits of the approach and the application-specific design choices that enabled accurate latent state predictions.

关键词: Gaussian process variational autoencoder, linear time invariant system, stability enhancement, latent state prediction, video data, probabilistic modeling, semi-contracting systems, unconstrained optimization

235. ❌ Transferable FB-GNN-MBE Framework for Potential Energy Surfaces: Data-Adaptive Transfer Learning in Deep Learned Many-Body Expansion Theory

作者: Siqi Chen, Zhiqiang Wang, Yili Shen, Xianqi Deng, Xi Cheng, Cheng-Wei Ju, Jun Yi, Guo Ling, Dieaa Alhmoud, Hui Guan, Zhou Lin 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09320v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于开发FB-GNN-MBE框架，用于预测复杂化学系统的势能面，属于AI for Science（科学人工智能）领域，具体涉及化学信息学（Cheminformatics）。论文核心是图神经网络（GNN）与多体展开（MBE）理论的结合，以及教师-学生迁移学习策略，用于大规模分子模拟。所有其他关键词（如LLMs、MoE、RLHF、RAG等）均与论文内容无关，因为这些关键词主要涉及大语言模型及其相关技术（如训练、对齐、推理、代理等），而本文研究的是特定领域的科学计算模型，未涉及任何大语言模型技术或通用人工智能方法。

!!! tip deepseek-chat TL;DR

该论文开发了可迁移的FB-GNN-MBE框架，通过结合图神经网络和多体展开理论，实现了对复杂化学系统势能面的高效准确预测，并利用教师-学生迁移学习策略提升了模型在不同水簇系统中的泛化能力。

摘要翻译

对复杂化学体系的机理理解和理性设计，依赖于超越单一结构单元、快速且精确的电子结构预测。然而，当体系原子数超过数百时，第一性原理量子力学（QM）建模变得不切实际。本研究通过将基于片段的图神经网络（FB-GNN）整合到多体展开（MBE）理论中，开发了FB-GNN-MBE方法，并证明了其能以可控的精度、复杂度和可解释性，为具有层次结构的体系复现第一性原理势能面（PES）。具体而言，我们将整个体系划分为基本结构单元（片段），使用QM模型评估其单片段能量，并利用通过FB-GNN训练得到的结构-性质关系来处理多片段相互作用。我们的研究表明，FB-GNN-MBE在水、苯酚及其混合物的基准测试中，以及在水分二聚体和苯酚二聚体的一维解离曲线预测上，对于二体（2B）和三体（3B）能量的预测均达到了化学精度。为了以最小的计算成本和数据需求，将FB-GNN-MBE的成功推广到不同体系，我们开发并验证了一种师生学习策略。一个在混合密度水团簇集合上训练的重型FB-GNN（教师）提炼其习得的知识，并将其传递给一个轻量级GNN（学生），该学生随后在均匀密度(H2O)21团簇集合上进行微调。这种迁移学习策略实现了对不同尺寸水团簇2B和3B能量的高效、准确预测，而无需重新训练。我们可迁移的FB-GNN-MBE框架优于传统的非FB-GNN模型，并显示出在大规模分子模拟中的高度实用性。

摘要 (Abstract)

Mechanistic understanding and rational design of complex chemical systems depend on fast and accurate predictions of electronic structures beyond individual building blocks. However, if the system exceeds hundreds of atoms, first-principles quantum mechanical (QM) modeling becomes impractical. In this study, we developed FB-GNN-MBE by integrating a fragment-based graph neural network (FB-GNN) into the many-body expansion (MBE) theory and demonstrated its capacity to reproduce first-principles potential energy surfaces (PES) for hierarchically structured systems with manageable accuracy, complexity, and interpretability. Specifically, we divided the entire system into basic building blocks (fragments), evaluated their one-fragment energies using a QM model, and addressed many-fragment interactions using the structure-property relationships trained by FB-GNNs. Our investigation shows that FB-GNN-MBE achieves chemical accuracy in predicting two-body (2B) and three-body (3B) energies across water, phenol, and mixture benchmarks, as well as the one-dimensional dissociation curves of water and phenol dimers. To transfer the success of FB-GNN-MBE across various systems with minimal computational costs and data demands, we developed and validated a teacher-student learning protocol. A heavy-weight FB-GNN trained on a mixed-density water cluster ensemble (teacher) distills its learned knowledge and passes it to a light-weight GNN (student), which is later fine-tuned on a uniform-density (H2O)21 cluster ensemble. This transfer learning strategy resulted in efficient and accurate prediction of 2B and 3B energies for variously sized water clusters without retraining. Our transferable FB-GNN-MBE framework outperformed conventional non-FB-GNN-based models and showed high practicality for large-scale molecular simulations.

关键词: FB-GNN-MBE, graph neural network, many-body expansion, potential energy surfaces, transfer learning, teacher-student learning, molecular simulations, chemical accuracy

236. ❌ Iterative Identification Closure: Amplifying Causal Identifiability in Linear SEMs

作者: Ziyi Ding, Xiao-Ping Zhang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09309v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究线性结构方程模型（SEMs）中的因果可识别性问题，提出了一种名为迭代识别闭合（IIC）的通用框架，用于改进半游走准则（HTC）在存在潜在混杂因素时的识别能力。论文的核心贡献在于因果推断和图模型理论，涉及算法设计、理论证明和实验验证。所有给定的关键词均与大语言模型、深度学习技术原理或其在科学领域的应用直接相关，而本文专注于传统的因果推断和图模型方法，未涉及任何大模型、深度学习、AI for Science或其他相关技术。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对线性结构方程模型中因果效应系数的可识别性问题，提出了迭代识别闭合（IIC）框架，通过迭代传播已知系数来显著减少标准半游走准则（HTC）无法识别的边，在理论上被证明是可靠且高效的，并在小规模图上验证了其100%的精确性和超过80%的HTC间隙减少。

摘要翻译

半迹准则（Half-Trek Criterion, HTC）是判定含潜在混杂变量的线性结构方程模型（Structural Equation Models, SEM）中因果效应系数在一般意义下可识别性的主要图方法工具。然而，HTC本质上是节点导向的：它同时解析一个节点的所有入边，导致存在“无法判定”的因果效应缺口（在中等规模图中占15-23%）。本文提出迭代识别闭包（Iterative Identification Closure, IIC），这是一个通用框架，将因果识别解耦为两个阶段：（1）种子函数 S_0，它利用任何外部信息源（工具变量、干预、非高斯性、先验知识等）识别出一组初始边；（2）约化HTC传播，通过迭代代入已知系数以降低系统维度，从而识别出标准HTC无法判定的边。其核心创新在于迭代识别传播：新识别出的边反馈回系统以解锁进一步的识别——这一机制在所有现有图准则中均不存在，这些准则均孤立地处理每条边（或每个节点）。这种传播并非平凡：系数代入会改变协方差结构，而其可靠性需要证明修改后的雅可比矩阵在一般意义下保持满秩——这是一个新的理论结果（约化HTC定理）。我们证明IIC是可靠的、单调的，在O(|E|)次迭代内收敛（实证中≤2次），并且严格包含了HTC与祖先分解。在节点数n≤5的所有图（涉及134,144条边）上进行的穷举验证证实了其100%的精确度（零误报）；结合种子信息，IIC将HTC的判定缺口减少了80%以上。其传播增益高达γ~4倍（从2个种子识别约3%的边，到最终识别97.5%的边），远超先前那些纳入辅助信息但无迭代反馈的方法（其γ≤1.2倍）。

摘要 (Abstract)

The Half-Trek Criterion (HTC) is the primary graphical tool for determining generic identifiability of causal effect coefficients in linear structural equation models (SEMs) with latent confounders. However, HTC is inherently node-wise: it simultaneously resolves all incoming edges of a node, leaving a gap of “inconclusive” causal effects (15-23% in moderate graphs). We introduce Iterative Identification Closure (IIC), a general framework that decouples causal identification into two phases: (1) a seed function S_0 that identifies an initial set of edges from any external source of information (instrumental variables, interventions, non-Gaussianity, prior knowledge, etc.); and (2) Reduced HTC propagation that iteratively substitutes known coefficients to reduce system dimension, enabling identification of edges that standard HTC cannot resolve. The core novelty is iterative identification propagation: newly identified edges feed back to unlock further identification – a mechanism absent from all existing graphical criteria, which treat each edge (or node) in isolation. This propagation is non-trivial: coefficient substitution alters the covariance structure, and soundness requires proving that the modified Jacobian retains generic full rank – a new theoretical result (Reduced HTC Theorem). We prove that IIC is sound, monotone, converges in O(|E|) iterations (empirically <=2), and strictly subsumes both HTC and ancestor decomposition. Exhaustive verification on all graphs with n<=5 (134,144 edges) confirms 100% precision (zero false positives); with combined seeds, IIC reduces the HTC gap by over 80%. The propagation gain is gamma~4x (2 seeds identifying ~3% of edges to 97.5% total identification), far exceeding gamma<=1.2x of prior methods that incorporate side information without iterative feedback.

关键词: Causal Identifiability, Linear Structural Equation Models, Half-Trek Criterion, Iterative Identification Closure, Latent Confounders, Graphical Criteria, Coefficient Substitution, Reduced HTC Theorem

237. ❌ Online Intention Prediction via Control-Informed Learning

作者: Tianyu Zhou, Zihao Liang, Zehui Lu, Shaoshuai Mou 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09303v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

评分理由: 论文研究在线意图预测框架，属于控制理论与强化学习领域，核心是逆最优控制/逆强化学习，用于估计自主系统的时变目标状态。所有关键词均与大模型、深度学习技术原理或AI科学应用直接相关，而本文未涉及任何大模型、深度学习或AI科学应用内容，因此所有关键词相关度为0。

!!! tip deepseek-chat TL;DR

本文提出了一种在线控制信息学习框架，用于在系统动态或目标包含未知参数时实时预测自主系统的时变意图，并通过仿真和无人机实验验证了该方法的准确性和适应性。

摘要翻译

本文提出一种在线意图预测框架，用于实时估计自主系统的目标状态，即使当意图随时间变化、且系统动力学或目标函数包含未知参数时仍能有效工作。该问题被构建为逆最优控制/逆强化学习任务，其中意图被视为目标函数中的待估参数。通过采用滑动时域策略对过时信息进行衰减处理，并结合在线控制感知学习机制实现高效梯度计算与实时参数更新。在不同噪声水平下的仿真实验及四旋翼无人机硬件测试表明，所提方法能够在复杂环境中实现精确、自适应的意图预测。

摘要 (Abstract)

This paper presents an online intention prediction framework for estimating the goal state of autonomous systems in real time, even when intention is time-varying, and system dynamics or objectives include unknown parameters. The problem is formulated as an inverse optimal control / inverse reinforcement learning task, with the intention treated as a parameter in the objective. A shifting horizon strategy discounts outdated information, while online control-informed learning enables efficient gradient computation and online parameter updates. Simulations under varying noise levels and hardware experiments on a quadrotor drone demonstrate that the proposed approach achieves accurate, adaptive intention prediction in complex environments.

关键词: online intention prediction, inverse optimal control, inverse reinforcement learning, control-informed learning, autonomous systems, quadrotor drone, real-time estimation, time-varying intention

238. ❌ Are Independently Estimated View Uncertainties Comparable? Unified Routing for Trusted Multi-View Classification

作者: Yilin Zhang, Cai Xu, Haishun Chen, Ziyu Guan, Wei Zhao 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09288v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究可信多视图分类中的视图不确定性可比性问题，提出TMUR方法，使用视图私有专家和协作专家，通过统一路由器进行全局路由。所有关键词均与大模型、深度学习技术原理或AI科学应用直接相关，而本文专注于多视图机器学习、不确定性估计和专家路由，未涉及大模型、LLM、MoE、缩放定律、训练技术、对齐、推理加速、可解释性、AI for Science等主题，因此所有关键词相关度为0。

!!! tip deepseek-chat TL;DR

本文针对可信多视图分类中独立估计的视图不确定性不可比的问题，提出了TMUR方法，通过统一路由器进行全局路由来解耦证据提取与融合仲裁，实现了更可靠的样本级专家权重分配。

摘要翻译

可信多视图分类通常依赖于视图级证据融合过程：每个视图独立生成类别证据与不确定性，最终预测通过聚合这些独立意见获得。尽管这种设计具有模块化与不确定性感知的优点，但其隐含假设不同视图产生的证据在数值上具有可比性。然而在实际应用中，这一假设往往难以成立。不同视图通常在特征空间、噪声水平和语义粒度上存在差异，而独立训练的视图分支仅针对预测准确性进行优化，缺乏对证据强度跨视图一致性的约束。因此，用于融合的不确定性可能被分支特定的尺度偏差主导，而非反映样本层面的真实可靠性。为解决这一问题，我们提出基于统一路由的可信多视图学习框架，将视图特定证据提取与融合仲裁机制解耦。该框架包含视图私有专家与协作专家，并采用能感知全局多视图上下文信息的统一路由器来生成样本级专家权重。通过软负载均衡与多样性正则化策略，进一步促进专家利用的平衡性及专家特化的判别力提升。理论分析表明，独立的证据监督无法确定跨视图的统一证据尺度；当可靠性取决于具体样本时，基于全局的统一路由机制优于分支局部仲裁策略。

摘要 (Abstract)

Trusted multi-view classification typically relies on a view-wise evidential fusion process: each view independently produces class evidence and uncertainty, and the final prediction is obtained by aggregating these independent opinions. While this design is modular and uncertainty-aware, it implicitly assumes that evidence from different views is numerically comparable. In practice, however, this assumption is fragile. Different views often differ in feature space, noise level, and semantic granularity, while independently trained branches are optimized only for prediction correctness, without any constraint enforcing cross-view consistency in evidence strength. As a result, the uncertainty used for fusion can be dominated by branch-specific scale bias rather than true sample-level reliability. To address this issue, we propose Trusted Multi-view learning with Unified Routing (TMUR), which decouples view-specific evidence extraction from fusion arbitration. TMUR uses view-private experts and one collaborative expert, and employs a unified router that observes the global multi-view context to generate sample-level expert weights. Soft load-balancing and diversity regularization further encourage balanced expert utilization and more discriminative expert specialization. We also provide theoretical analysis showing why independent evidential supervision does not identify a common cross-view evidence scale, and why unified global routing is preferable to branch-local arbitration when reliability is sample-dependent.

关键词: Trusted Multi-view Classification, View Uncertainty, Evidence Fusion, Unified Routing, Expert Weights, Sample-level Reliability, Cross-view Consistency, TMUR

239. ❌ Meta-Learned Basis Adaptation for Parametric Linear PDEs

作者: Vikas Dwivedi, Monica Sigovan, Bruno Sixou 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09289v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文《Meta-Learned Basis Adaptation for Parametric Linear PDEs》提出了一种混合物理信息框架，用于求解参数化线性偏微分方程（PDEs）。该方法结合了元学习预测器和最小二乘校正器，核心是KAPI（Kernel-Adaptive Physics-Informed meta-learner）预测器，它通过任务条件模型映射查询坐标和PDE参数到解值，并生成可解释的、任务自适应的高斯基几何。论文在扩散、输运、混合对流-扩散和变速输运等线性PDE族上进行了评估。所有关键词中，只有“AI for Science OR Bioinformatics OR Cheminformatics”与论文有一定关联，因为论文涉及AI在科学计算（PDE求解）中的应用，属于AI for Science的范畴，但并非核心内容（如生物信息学或化学信息学）。其他关键词均与大型语言模型、深度学习技术原理（如MoE、Scaling Laws、训练方法、推理优化、代理系统等）完全无关，论文未提及任何大模型或深度学习技术，专注于物理信息机器学习和元学习在PDE求解中的应用。因此，除AI for Science得5分外，其余关键词均得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种混合物理信息框架KAPI，通过元学习预测器自适应生成高斯基几何并结合最小二乘校正器，有效求解参数化线性偏微分方程，在多个PDE案例中实现精度提升。

摘要翻译

我们提出一种混合物理信息框架，用于求解参数化线性偏微分方程族，该框架通过将元学习预测器与最小二乘校正器相结合实现。预测器名为 KAPI（核自适应物理信息元学习器），是一种浅层任务条件化模型，可将查询坐标和偏微分方程参数映射为解值，同时在内部生成可解释的、任务自适应的高斯基几何结构。一个轻量级元网络将偏微分方程参数映射到基函数的中心、宽度及活动模式，从而学习近似空间应如何随参数族自适应调整。预测器生成的几何结构被传递至第二阶段校正器，校正器通过引入背景基函数对其进行增强，并采用一次性物理信息极限学习机风格的最小二乘求解器计算最终解。我们在涵盖扩散、输运、混合对流-扩散及变速度输运的四类线性偏微分方程族上评估该方法。在所有案例中，预测器通过局部化且与输运方向对齐的基函数布局捕捉有意义的物理特征，而校正器则进一步将精度提升一个或多个数量级。与参数化物理信息神经网络、物理信息深度算子网络及均匀网格物理信息极限学习机校正器的对比表明，预测器引导的基函数自适应作为一种可解释且高效的参数化偏微分方程求解策略具有重要价值。

摘要 (Abstract)

We propose a hybrid physics-informed framework for solving families of parametric linear partial differential equations (PDEs) by combining a meta-learned predictor with a least-squares corrector. The predictor, termed \textbf{KAPI} (Kernel-Adaptive Physics-Informed meta-learner), is a shallow task-conditioned model that maps query coordinates and PDE parameters to solution values while internally generating an interpretable, task-adaptive Gaussian basis geometry. A lightweight meta-network maps PDE parameters to basis centers, widths, and activity patterns, thereby learning how the approximation space should adapt across the parametric family. This predictor-generated geometry is transferred to a second-stage corrector, which augments it with a background basis and computes the final solution through a one-shot physics-informed Extreme Learning Machine (PIELM)-style least-squares solve. We evaluate the method on four linear PDE families spanning diffusion, transport, mixed advection–diffusion, and variable-speed transport. Across these cases, the predictor captures meaningful physics through localized and transport-aligned basis placement, while the corrector further improves accuracy, often by one or more orders of magnitude. Comparisons with parametric PINNs, physics-informed DeepONet, and uniform-grid PIELM correctors highlight the value of predictor-guided basis adaptation as an interpretable and efficient strategy for parametric PDE solving.

关键词: parametric linear PDEs, meta-learned predictor, physics-informed framework, KAPI, Gaussian basis adaptation, least-squares corrector, PIELM, interpretable basis geometry

240. ❌ Distributed Online Convex Optimization with Compressed Communication: Optimal Regret and Applications

作者: Sifan Yang, Dan-Yue Li, Lijun Zhang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09276v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究分布式在线凸优化（D-OCO）中的压缩通信问题，属于分布式优化和通信效率领域，与所有关键词（均围绕大模型、深度学习技术原理及其应用）无直接关联。论文未涉及任何大模型、深度学习、AI for Science等内容，也未提及关键词中的任何技术或概念，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文研究了分布式在线凸优化中压缩通信的通信成本问题，提出了最优算法并建立了理论下界和收敛率保证。

摘要翻译

分布式在线凸优化（Distributed Online Convex Optimization, D-OCO）是处理流式数据分布式场景的强大范式。然而，在大规模应用中，本地学习器与中央服务器之间的通信成本十分高昂。为缓解这一瓶颈，我们首次研究了压缩通信下的D-OCO问题。首先，为量化压缩影响，我们分别针对凸损失函数和强凸损失函数建立了$Ω(δ^{-1/2}\sqrt{T})$与$Ω(δ^{-1}\log{T})$的下界，其中$δ\in (0,1]$为压缩比。其次，我们提出了一种最优算法，该算法在凸损失函数和强凸损失函数下分别实现了$O(δ^{-1/2}\sqrt{T})$和$O(δ^{-1} \log T)$的遗憾界。我们的方法将误差反馈机制整合到跟随正则化领导者（Follow-the-Regularized-Leader）框架中，以解决压缩误差与投影误差之间的耦合问题。此外，我们采用在线压缩策略来缓解双向压缩带来的累积误差。所提出的在线方法具有广泛通用性，可通过在线到批处理转换扩展至离线随机优化场景。我们为凸损失函数和强凸损失函数分别建立了$O(δ^{-1/2}T^{-1/2})$和$O(δ^{-1} T^{-1})$的收敛率，首次为具有压缩通信和域约束的分布式非光滑优化问题提供了理论保证。

摘要 (Abstract)

Distributed online convex optimization (D-OCO) is a powerful paradigm for modeling distributed scenarios with streaming data. However, the communication cost between local learners and the central server is substantial in large-scale applications. To alleviate this bottleneck, we initiate the study of D-OCO with compressed communication. Firstly, to quantify the compression impact, we establish the $Ω(δ^{-1/2}\sqrt{T})$ and $Ω(δ^{-1}\log{T})$ lower bounds for convex and strongly convex loss functions, respectively, where $δ\in (0,1]$ is the compression ratio. Secondly, we propose an optimal algorithm, which enjoys regret bounds of $O(δ^{-1/2}\sqrt{T})$ and $O(δ^{-1} \log T)$ for convex and strongly convex loss functions, respectively. Our method incorporates the error feedback mechanism into the Follow-the-Regularized-Leader framework to address the coupling between the compression error and the projection error. Furthermore, we employ the online compression strategy to mitigate the accumulated error arising from the bidirectional compression. Our online method has great generality, and can be extended to the offline stochastic setting via online-to-batch conversion. We establish convergence rates of $O(δ^{-1/2}T^{-1/2})$ and $O(δ^{-1} T^{-1})$ for convex and strongly convex loss functions, respectively, providing the first guarantees for distributed non-smooth optimization with compressed communication and domain constraints.

关键词: Distributed Online Convex Optimization, Compressed Communication, Regret Bounds, Error Feedback, Online Compression Strategy, Convergence Rates, Domain Constraints, Non-smooth Optimization

241. ❌ The causal relation between off-street parking and electric vehicle adoption in Scotland

作者: Bernardino D’Amico, Achille Fonzone, Emma Hart 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09271v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究苏格兰电动汽车采用与私人停车位之间的因果关系，属于交通政策、城市规划和环境经济学领域。论文使用概率因果框架分析数据集，不涉及任何大模型、深度学习、AI技术或相关技术原理。所有评分关键词均与大模型技术、AI应用或相关方法相关，而本文完全未提及这些内容，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

本研究通过因果分析发现，苏格兰家庭拥有私人停车位能提高电动汽车拥有概率70%，但主要影响已具备经济能力的家庭，而家庭收入才是决定是否参与电动汽车转型的主要因素，传统观测模型高估了停车位基础设施的独立效应。

摘要翻译

向电动出行的转型，既取决于总体采用规模的最大化，也依赖于公平获取机会的促进。本研究探讨了“有”与“没有”路外停车位的家庭之间的“充电鸿沟”，究竟是反映了真实的基础设施约束，还是社会经济差异的副产品。我们超越了传统的预测模型，对一个具有全国代表性的苏格兰家庭数据集应用了概率性因果推断框架，从而能够在明确消除其他因果因素混杂效应的同时，估计政策干预的效果。结果揭示了电动汽车采用过程中的结构性层级。私人路外停车位发挥着转化催化剂的作用：获得家用充电条件能将电动汽车拥有概率从3.3%提升至5.6%（相对增长70%，绝对增长2.3个百分点）。然而，这种效应主要加速了那些经济上已有能力购买电动汽车的家庭的转化，而非吸引新的参与者进入市场。相比之下，家庭收入构成了根本性的支付能力天花板。对低收入与高收入阶层进行的因果对比显示，市场非参与率降低了23.1个百分点，这确认了财务能力是进入电动汽车转型通道的主要守门人。关键的是，分析表明标准的观测模型高估了路外停车基础设施的独立效应。其表面效应源于选择偏差：高收入家庭同时拥有私人停车位和购买电动汽车的经济能力的可能性显著更高。这些发现支持一种双轨政策策略：一方面通过金融工具降低非参与者的支付能力门槛，另一方面在高密度城市环境中，为具有“潜在意向”的群体解决电动汽车家用充电的获取问题。

摘要 (Abstract)

The transition to electric mobility hinges on maximising aggregate adoption while also facilitating equitable access. This study examines whether the ‘charging divide’ between households with and without off-street parking reflects a genuine infrastructure constraint or a by-product of socio-economic disparity. Moving beyond conventional predictive models, we apply a probabilistic causal framework to a nationally representative dataset of Scottish households, enabling estimation of policy interventions while explicitly neutralising the confounding effect of other causal factors. The results reveal a structural hierarchy in the EV adoption process. Private off-street parking functions as a conversion catalyst: enabling access to home-charging increases the probability of EV ownership from 3.3% to 5.6% (a 70% relative, 2.3 percentage point absolute increase). However, this effect primarily accelerates households already economically positioned to purchase an EV rather than recruiting new entrants. By contrast, household income operates as the fundamental affordability ceiling. A causal contrast between lower- and higher-income strata, shows a reduction in market non-participation by 23.1 percentage points, identifying financial capacity as the principal gatekeeper to entering the EV transition funnel. Crucially, the analysis demonstrates that standard observational models overstate the isolated effect of off-street parking infrastructure. The apparent effect emerges from selection bias: higher-income households are disproportionately likely to possess both private parking and the means to purchase EVs. These findings support a dual-track policy strategy: lowering the affordability ceiling for non-participants through financial instruments, while addressing EV home-charging access for the ’latent intent’ cohort in high-density urban contexts.

关键词: electric vehicle adoption, off-street parking, causal analysis, home-charging access, socio-economic disparity, policy interventions, affordability ceiling, selection bias

242. ❌ Natural Riemannian gradient for learning functional tensor networks

作者: Nikolas Klug, Michael Ulbrich, André Uschmajew, Marius Willner 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09263v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是低秩函数树张量网络（functional tree tensor networks, TTN）的优化方法，提出了一种基于自然黎曼梯度下降的通用优化框架，适用于任意损失函数（如多项式逻辑回归）。论文内容聚焦于张量网络、黎曼优化、机器学习模型训练等传统机器学习/优化领域，与所有评分关键词（均围绕大模型、深度学习技术原理及其应用）完全无关。论文未涉及任何大模型、深度学习、AI for Science等主题，也未提及任何评分关键词中的技术或概念。

!!! tip deepseek-chat TL;DR

该论文针对低秩函数树张量网络在非最小二乘回归任务（如多项式逻辑回归）中优化困难的问题，提出了一种基于自然黎曼梯度下降的通用优化框架，并通过数值实验验证了该方法相比标准黎曼梯度方法能显著改善收敛性能。

摘要翻译

我们考虑以低秩函数树张量网络作为学习模型的机器学习任务。在最小二乘回归问题中，低秩函数树张量网络可通过交替优化高效求解，但对于其他问题（如多项逻辑回归），此方法则无法直接应用。我们提出一种适用于任意损失函数的自然黎曼梯度下降方法，该方法基于Amari提出的自然梯度。特别地，自然梯度所得的搜索方向独立于底层函数张量积空间的基选择。我们的框架同时适用于基于因子分解和基于流形表示的函数树张量网络。在实际应用中，我们提出了一种高效近似真实自然黎曼梯度的层次化方法，用于计算参数空间中的更新。数值实验在常见分类数据集上验证了我们的理论发现，并表明与标准黎曼梯度方法相比，采用自然黎曼梯度下降进行学习能显著改善收敛行为。

摘要 (Abstract)

We consider machine learning tasks with low-rank functional tree tensor networks (TTN) as the learning model. While in the case of least-squares regression, low-rank functional TTNs can be efficiently optimized using alternating optimization, this is not directly possible in other problems, such as multinomial logistic regression. We propose a natural Riemannian gradient descent type approach applicable to arbitrary losses which is based on the natural gradient by Amari. In particular, the search direction obtained by the natural gradient is independent of the choice of basis of the underlying functional tensor product space. Our framework applies to both the factorized and manifold-based approach for representing the functional TTN. For practical application, we propose a hierarchy of efficient approximations to the true natural Riemannian gradient for computing the updates in the parameter space. Numerical experiments confirm our theoretical findings on common classification datasets and show that using natural Riemannian gradient descent for learning considerably improves convergence behavior when compared to standard Riemannian gradient methods.

关键词: functional tree tensor networks, low-rank TTN, natural Riemannian gradient, alternating optimization, multinomial logistic regression, manifold-based optimization, gradient descent, classification datasets

243. ❌ Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

作者: Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09258v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型（LLMs）的预训练优化问题，因此与’Large Language Models’和’Pre-training’高度相关（10分）。论文在复杂推理任务（如GSM8k）上评估性能，与’Chain of Thought’有一定关联（5分）。其他关键词如MoE、SFT、RAG、量化等未在摘要中提及或与论文主题无关，均给0分。

!!! tip deepseek-chat TL;DR

该论文研究了大型语言模型预训练中任务特定最小值之间的几何接近度对下游泛化能力的影响，并提出Nexus优化器来提升性能，实验表明在相同预训练损失下能显著改善下游任务表现。

摘要翻译

预训练是大语言模型（LLM）的基石，它占据了绝大部分的计算资源与数据，并作为模型能力的主要引擎。在预训练过程中，大语言模型从规模空前庞大且多样化的数据源中获取基础性知识，这些数据涵盖广泛领域，例如通用语言、数学、代码以及复杂推理。在本研究中，我们探讨了一个关于预训练收敛状态的有趣几何问题：模型是收敛至所有数据源上的一个公共极小值点（例如，\cref{fig:cwa_illustration:close}），还是仅仅收敛至总损失的一个极小值点（例如，\cref{fig:cwa_illustration:distant}）？我们假设，任务特定极小值点在几何上的“接近程度”与下游泛化能力存在内在关联。我们发现，标准优化器（例如 AdamW）通常收敛至任务特定极小值点彼此远离的位置。为解决此问题，我们提出了 Nexus 优化器，它通过在优化过程中最大化梯度相似性来促进这些极小值点的接近。在参数量从 130M 到 3B 的多种模型、不同数据混合方案及超参数调度上的实验表明，尽管 Nexus 取得了相同的预训练损失（参见 \cref{fig:demo:benchmark}），但它能显著提升下游任务性能。值得注意的是，在 3B 模型上，Nexus 将分布外损失降低了 0.012，并在复杂推理任务（例如 GSM8k）上实现了高达 15.0% 的准确率提升。这一发现挑战了仅依赖预训练损失作为模型评估唯一代理指标的做法，并揭示了隐含偏置对于释放下游泛化能力的重要性。

摘要 (Abstract)

Pretraining is the cornerstone of Large Language Models (LLMs), dominating the vast majority of computational budget and data to serve as the primary engine for their capabilities. During pretraining, LLMs acquire foundational knowledge from an unprecedentedly massive and diverse data sources, encompassing a vast array of domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: Does the model converge to a common minimizer across all data sources (e.g., \cref{fig:cwa_illustration:close}), or merely a minimizer of the summed loss (e.g., \cref{fig:cwa_illustration:distant})? We hypothesize that the geometric “closeness” of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, various data mixtures and hyperparameter schedules, show that Nexus \textit{significantly boosts downstream performance}, despite \textit{achieving the same pretraining loss} (see \cref{fig:demo:benchmark}). Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.

关键词: Large Language Models, Pretraining, Optimization, Generalization, Downstream Performance, Nexus Optimizer, Task-specific Minima, Complex Reasoning

244. ❌ A Predictive View on Streaming Hidden Markov Models

作者: Gerardo Duran-Martin 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09208v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于传统隐马尔可夫模型（HMM）的流式推理优化，提出了一种基于预测优先的框架和束搜索算法。所有评分关键词均涉及大模型、深度学习及其相关技术（如训练方法、推理优化、对齐、应用等），而本文研究的是经典概率图模型（HMM）的在线推断问题，未涉及任何神经网络、深度学习或大语言模型技术，也未应用于生物信息学等科学领域。因此，所有关键词与论文内容完全无关，均得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种用于流式隐马尔可夫模型的预测优先优化框架，通过将推断问题转化为预测分布空间中的约束投影问题，并推导出基于束搜索的确定性递归算法，在计算预算匹配下实现了与在线EM和序列蒙特卡洛方法竞争的前瞻性能。

摘要翻译

我们为流式隐马尔可夫模型开发了一种预测优先的优化框架。与经典方法在完全指定的生成模型下优先恢复完整后验不同，我们假设能够访问特定状态下的预测模型，其参数在线学习的同时保持状态间固定的转移先验。我们的目标是顺序识别潜在状态，同时保持准确的一步超前预测分布。由于可能的状态路径数量呈指数增长，精确滤波不可行。因此，我们将流式推断构建为预测分布空间中的约束投影问题：在固定假设预算下，我们通过由$S$条路径支撑的前向KL最优混合分布来近似完整的后验预测分布。其解为经过重归一化的前$S$条后验加权混合路径，这为隐马尔可夫模型的束搜索提供了理论推导。所得算法完全递归且确定性，通过闭式预测更新执行束式截断，既不需要期望最大化算法也不需要采样。在匹配计算预算下，与在线期望最大化算法和序贯蒙特卡罗方法的实证比较表明，该框架在序列预测性能上具有竞争力。

摘要 (Abstract)

We develop a predictive-first optimisation framework for streaming hidden Markov models. Unlike classical approaches that prioritise full posterior recovery under a fully specified generative model, we assume access to regime-specific predictive models whose parameters are learned online while maintaining a fixed transition prior over regimes. Our objective is to sequentially identify latent regimes while maintaining accurate step-ahead predictive distributions. Because the number of possible regime paths grows exponentially, exact filtering is infeasible. We therefore formulate streaming inference as a constrained projection problem in predictive-distribution space: under a fixed hypothesis budget, we approximate the full posterior predictive by the forward-KL optimal mixture supported on $S$ paths. The solution is the renormalised top-$S$ posterior-weighted mixture, providing a principled derivation of beam search for HMMs. The resulting algorithm is fully recursive and deterministic, performing beam-style truncation with closed-form predictive updates and requiring neither EM nor sampling. Empirical comparisons against Online EM and Sequential Monte Carlo under matched computational budgets demonstrate competitive prequential performance.

关键词: streaming hidden Markov models, predictive-first optimization, beam search, online inference, predictive distributions, forward-KL optimal mixture, prequential performance, constrained projection

245. ❌ DiffHLS: Differential Learning for High-Level Synthesis QoR Prediction with GNNs and LLM Code Embeddings

作者: Zedong Peng, Zeju Li, Qiang Xu, Jieru Zhao 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09240v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文DiffHLS提出了一种用于HLS QoR预测的差分学习框架，结合了GNN和预训练代码大语言模型（LLM）的代码嵌入。核心相关性体现在：1）明确使用了预训练代码LLM（关键词1得8分），属于大模型在特定领域（EDA/硬件设计）的应用；2）属于AI for Science范畴（关键词27得8分），具体是AI在电子设计自动化（EDA）领域的应用；3）使用了预训练模型（关键词5得5分），但未涉及持续预训练或领域适应细节。其他关键词如MoE、SFT、RAG等均未涉及，论文专注于硬件设计预测任务，未讨论大模型技术原理创新。

!!! tip deepseek-chat TL;DR

该论文提出DiffHLS框架，利用图神经网络和预训练代码大语言模型的嵌入来预测高级综合中的设计质量，在PolyBench和ForgeHLS数据集上验证了其优于纯GNN基线的性能。

摘要翻译

高层次综合（High-Level Synthesis，HLS）可将C/C++代码编译为寄存器传输级（RTL）描述，但探索基于编译指示（pragma）驱动的优化方案仍然成本高昂，因为每个设计点都需要耗时的综合过程。本文提出\textbf{\DiffHLS}——一种面向HLS结果质量（Quality-of-Result，QoR）预测的差分学习框架，该框架从内核-设计对中学习：即一个内核基准版本与一个插入编译指示的设计变体。\DiffHLS通过专用的图神经网络（Graph Neural Network，GNN）分支对内核及设计的中间表示图进行编码，并利用预训练代码大语言模型（Large Language Model，LLM）生成的代码嵌入来增强差分路径。该方法不直接回归绝对目标值，而是联合预测内核基准值与设计引入的差分值，并通过组合二者得到最终设计预测结果。在PolyBench测试集上，\DiffHLS在四种GNN骨干网络下均取得比GNN基线更低的平均绝对百分比误差（MAPE），且LLM代码嵌入的引入持续优于纯GNN的消融实验。我们进一步在ForgeHLS数据集上验证了该框架的可扩展性。

摘要 (Abstract)

High-Level Synthesis (HLS) compiles C/C++ into RTL, but exploring pragma-driven optimization choices remains expensive because each design point requires time-consuming synthesis. We propose \textbf{\DiffHLS}, a differential learning framework for HLS Quality-of-Result (QoR) prediction that learns from kernel–design pairs: a kernel baseline and a pragma-inserted design variant. \DiffHLSencodes kernel and design intermediate-representation graphs with dedicated graph neural network (GNN) branches, and augments the delta pathway with code embeddings from a pretrained code large language model (LLM). Instead of regressing absolute targets directly, we jointly predict the kernel baseline and the design-induced delta, and compose them to obtain the design prediction. On PolyBench, \DiffHLSattains lower average MAPE than GNN baselines under four GNN backbones, and LLM code embeddings consistently improve over a GNN-only ablation. We further validate scalability on the ForgeHLS dataset.

关键词: High-Level Synthesis, Quality-of-Result Prediction, Graph Neural Networks, Large Language Models, Code Embeddings, Differential Learning, Hardware Design, EDA

246. ❌ Automated Batch Distillation Process Simulation for a Large Hybrid Dataset for Deep Anomaly Detection

作者: Jennifer Werner, Justus Arweiler, Indra Jungjohann, Jochen Schmid, Fabian Jirasek, Hans Hasse, Michael Bortz 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09166v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于化学过程（批蒸馏）的异常检测，使用深度学习方法和过程模拟生成混合数据集。论文内容与绝大多数关键词（涉及大模型技术、训练方法、推理优化、对齐、代理等）完全无关，因为这些关键词特指大语言模型（LLM）及相关技术，而本文未涉及任何语言模型或自然语言处理。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，评分为5分，因为论文属于AI在科学（化学工程）领域的应用，但未明确提及生物信息学或化学信息学，且核心是过程模拟和数据集生成，而非AI方法本身的创新。

!!! tip deepseek-chat TL;DR

该研究解决了化学过程异常检测中缺乏大规模标注数据集的问题，通过开发自动化流程模拟器生成与实验数据对应的模拟数据，创建了一个用于深度异常检测的混合数据集。

摘要翻译

基于深度学习的化工过程异常检测虽前景广阔，却需要大规模、多样化且标注完善的训练数据集，而这类数据在工业运营中往往难以获取。在近期的一项工作中，我们发布了一个针对正常与异常操作条件下间歇精馏过程的大型全标注实验数据集。本研究通过引入相应的模拟数据集对该数据集进行了扩充，构建出一个新颖的混合数据集。模拟数据通过自动化工作流生成，该工作流采用一种新型基于Python的过程模拟器，并针对底层微分代数方程采用了定制化的指标约简策略。借助实验数据库丰富的元数据和结构化的异常标注，实验记录被自动转化为模拟场景。在基于单个参考实验完成校准后，其他实验的动态过程均得到了良好预测。这使得我们能够全自动、一致性地生成大量实验运行的时间序列数据，涵盖正常运行状态以及多种与执行器和控制相关的异常情况。最终形成的混合数据集已公开发布。从过程模拟的角度看，本研究以间歇精馏为例，展示了对大规模实验活动进行自动化、一致性模拟的方法。从数据驱动异常检测的角度看，该混合数据集为模拟到实验的风格迁移、伪实验数据生成以及化工过程监测中深度异常检测方法的未来研究提供了独特的基础。

摘要 (Abstract)

Anomaly detection (AD) in chemical processes based on deep learning offers significant opportunities but requires large, diverse, and well-annotated training datasets that are rarely available from industrial operations. In a recent work, we introduced a large, fully annotated experimental dataset for batch distillation under normal and anomalous operating conditions. In the present study, we augment this dataset with a corresponding simulation dataset, creating a novel hybrid dataset. The simulation data is generated in an automated workflow with a novel Python-based process simulator that employs a tailored index-reduction strategy for the underlying differential-algebraic equations. Leveraging the rich metadata and structured anomaly annotations of the experimental database, experimental records are automatically translated into simulation scenarios. After calibration to a single reference experiment, the dynamics of the other experiments are well predicted. This enabled the fully automated, consistent generation of time-series data for a large number of experimental runs, covering both normal operation and a wide range of actuator- and control-related anomalies. The resulting hybrid dataset is released openly. From a process simulation perspective, this work demonstrates the automated, consistent simulation of large-scale experimental campaigns, using batch distillation as an example. From a data-driven AD perspective, the hybrid dataset provides a unique basis for simulation-to-experiment style transfer, the generation of pseudo-experimental data, and future research on deep AD methods in chemical process monitoring.

关键词: anomaly detection, batch distillation, process simulation, hybrid dataset, deep learning, chemical processes, time-series data, differential-algebraic equations

247. ❌ Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling

作者: Xubin Zhou, Yipeng Yang, Zhan Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09159v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于强化学习中的策略表示问题，提出了一种基于截断整流流的生成策略（TRFP），以解决最大熵强化学习中高斯策略的单模态限制和多步采样的训练/推理挑战。虽然论文涉及生成模型（扩散/流匹配）和强化学习，但所有给定的关键词都明确指向大语言模型（LLMs）及其相关技术（如微调、对齐、推理、压缩、代理等）或特定科学AI应用。论文内容完全不涉及语言模型、自然语言处理或任何关键词中提到的具体技术，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对最大熵强化学习中高斯策略无法建模复杂多模态动作分布以及基于扩散/流匹配的生成策略存在似然/熵难解和多步采样不稳定的问题，提出了截断整流流策略（TRFP），通过混合确定性-随机架构实现了可处理的熵正则化优化，并在多目标环境和MuJoCo基准测试中有效捕获多模态行为，在单步采样下保持高性能。

摘要翻译

最大熵强化学习（MaxEnt RL）已成为序列决策的标准框架，但其标准的高斯策略参数化本质上是单峰的，限制了其建模复杂多峰动作分布的能力。这一局限性促使人们日益关注基于扩散和流匹配的生成式策略，将其作为更具表达力的替代方案。然而，将此类策略融入MaxEnt RL面临两大挑战：连续时间生成式策略的似然和熵通常难以处理，而多步采样会同时引入长时程反向传播的不稳定性和显著的推理延迟。为解决这些挑战，我们提出了截断整流流策略（Truncated Rectified Flow Policy, TRFP），这是一个构建于混合确定性-随机架构之上的框架。该设计通过梯度截断和流矫直，使得熵正则化优化变得可行，同时支持稳定的训练和高效的单步采样。在玩具多目标环境和10个MuJoCo基准测试上的实验结果表明，TRFP能有效捕捉多峰行为，在标准采样下于大多数基准测试中优于强基线，并在单步采样下保持高度竞争力。

摘要 (Abstract)

Maximum entropy reinforcement learning (MaxEnt RL) has become a standard framework for sequential decision making, yet its standard Gaussian policy parameterization is inherently unimodal, limiting its ability to model complex multimodal action distributions. This limitation has motivated increasing interest in generative policies based on diffusion and flow matching as more expressive alternatives. However, incorporating such policies into MaxEnt RL is challenging for two main reasons: the likelihood and entropy of continuous-time generative policies are generally intractable, and multi-step sampling introduces both long-horizon backpropagation instability and substantial inference latency. To address these challenges, we propose Truncated Rectified Flow Policy (TRFP), a framework built on a hybrid deterministic-stochastic architecture. This design makes entropy-regularized optimization tractable while supporting stable training and effective one-step sampling through gradient truncation and flow straightening. Empirical results on a toy multigoal environment and 10 MuJoCo benchmarks show that TRFP captures multimodal behavior effectively, outperforms strong baselines on most benchmarks under standard sampling, and remains highly competitive under one-step sampling.

关键词: reinforcement learning, maximum entropy RL, generative policies, rectified flow, multimodal action distributions, one-step sampling, policy optimization, MuJoCo benchmarks

248. ❌ A fast and Generic Energy-Shifting Transformer for Hybrid Monte Carlo Radiotherapy Calculation

作者: Chi-Hieu Pham, Didier Benoit, Vincent Bourbonne, Ulrike Schick, Julien Bert 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09157v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于医学放射治疗领域，提出了一种名为Energy-Shifting的深度学习框架，用于加速蒙特卡洛剂量计算，并设计了TransUNetSE3D架构。论文内容与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、智能体等）完全无关，因为这些关键词主要针对自然语言处理领域的大语言模型技术。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为该研究属于AI在科学（具体是医学物理/放射治疗）领域的应用，与生物信息学或化学信息学有相似的科学应用属性，因此给予10分（高度相关，核心内容）。

!!! tip deepseek-chat TL;DR

该研究提出了一种基于深度学习的Energy-Shifting框架和TransUNetSE3D架构，用于加速前列腺放射治疗中的蒙特卡洛剂量计算，实现了超过98%的Gamma通过率并保持了实时计算速度。

摘要翻译

我们提出了一种名为能量偏移的新型学习框架，用于加速蒙特卡罗剂量计算。该方法利用深度学习，在相同射束配置下，直接从单能输入合成6 MV TrueBeam直线加速器的剂量分布。传统的去噪技术依赖于噪声较大的低计数剂量图，这会损害射束剖面完整性；与之不同，我们的方法通过将高保真解剖纹理和特定源项射束相似性整合到模型输入空间中，在未见数据集上实现了卓越的跨域泛化能力。此外，我们提出了一种名为TransUNetSE3D的新型三维架构，该架构采用Transformer模块捕获全局上下文信息，并引入残差压缩激励模块实现自适应通道特征重校准。这些模块的层次化表示与主要剂量图参数一同融合到网络的隐空间中，从而实现基于物理原理的重建。这种混合设计在空间精度和结构保持方面均优于现有的基于UNet和Transformer的基准模型，同时保持了实时应用所需的执行速度。在针对前列腺放射治疗的治疗计划系统框架内评估，我们提出的流程相较于蒙特卡罗参考标准，实现了超过98%（3%/3mm）的伽马通过率。这些结果为自适应放射治疗中的快速体积剂量学提供了一个稳健的解决方案。

摘要 (Abstract)

We introduce a novel learning framework for accelerated Monte Carlo (MC) dose calculation termed Energy-Shifting. This approach leverages deep learning to synthesize 6 MV TrueBeam Linear Accelerator (LINAC) dose distributions directly from monoenergetic inputs under identical beam configurations. Unlike conventional denoising techniques, which rely on noisy low-count dose maps that compromise beam profile integrity, our method achieves superior cross-domain generalization on unseen datasets by integrating high-fidelity anatomical textures and source-specific beam similarity into the model’s input space. Furthermore, we propose a novel 3D architecture termed TransUNetSE3D, featuring Transformer blocks for global context and Residual Squeeze-and-Excitation (SE) modules for adaptive channel-wise feature recalibration. Hierarchical representations of these blocks are fused into the network’s latent space alongside the primary dose-map parameters, allowing physics-aware reconstruction. This hybrid design outperforms existing UNet and Transformer-based benchmarks in both spatial precision and structural preservation, while maintaining the execution speed necessary for real-time use. Our proposed pipeline achieves a Gamma Passing Rate exceeding 98% (3%/3mm) compared to the MC reference, evaluated within the framework of a treatment planning system (TPS) for prostate radiotherapy. These results offer a robust solution for fast volumetric dosimetry in adaptive radiotherapy.

关键词: Monte Carlo dose calculation, Energy-Shifting, TransUNetSE3D, radiotherapy, deep learning, prostate cancer, Gamma Passing Rate, real-time dosimetry

249. ❌ Score-Driven Rating System for Sports

作者: Vladimír Holý, Michal Černý 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09143v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是体育评分系统（Elo系统的推广），属于统计学和体育分析领域，与所有给定的大模型、深度学习、AI技术关键词完全无关。论文未涉及任何人工智能、机器学习或大模型相关内容，所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于分数的评分系统，作为经典Elo评分系统的推广，使用对数似然梯度作为更新机制，能够处理多种比赛结果类型，并证明了该系统具有理论一致性和回归特性。

摘要翻译

本文介绍了一种分数驱动的评级系统，该系统是经典埃洛（Elo）评级系统的推广，其采用分数——即对数似然函数的梯度——作为运动员和团队评级的更新机制。所提出的框架超越了简单的胜/负比赛结果，能够适应多种比赛结果类型，例如分差、胜/平/负结果或完整排名。本文推导了分数的理论性质，证明其期望值为零、在所有参与者中总和为零，并随着参与者评级的升高而减小，从而确保了内部一致性与公平性。此外，分数驱动的评级系统展现出一种回归特性，即评级会随时间推移而趋近于潜在的、未被观测到的真实技能水平。该框架为现有的体育表现动态模型提供了理论依据，并为构建新模型提供了一种系统化方法。

摘要 (Abstract)

This paper introduces a score-driven rating system, a generalization of the classical Elo rating system that employs the score, i.e. the gradient of the log-likelihood, as the updating mechanism for player and team ratings. The proposed framework extends beyond simple win/loss game outcomes and accommodates a wide range of game results, such as point differences, win/draw/loss outcomes, or complete rankings. Theoretical properties of the score are derived, showing that it has zero expected value, sums to zero across all players, and decreases with increasing value of a player’s rating, thereby ensuring internal consistency and fairness. Furthermore, the score-driven rating system exhibits a reversion property, meaning that ratings tend to follow the underlying unobserved true skills over time. The proposed framework provides a theoretical rationale for existing dynamic models of sports performance and offers a systematic approach for constructing new ones.

关键词: score-driven rating system, Elo rating system, gradient of log-likelihood, player ratings, team ratings, game outcomes, dynamic models, sports performance

250. ❌ Identifying Causal Effects Using a Single Proxy Variable

作者: Silvan Vollmer, Niklas Pfister, Sebastian Weichwald 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09135v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于因果推断中的统计方法，特别是使用代理变量解决未观测混杂问题，并提出了SPICE-Net神经网络估计框架。论文内容与大多数关键词（如LLM、MoE、RLHF、RAG等）完全无关，因为这些关键词涉及大模型技术、训练方法、推理优化等具体AI技术。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于科学应用中的AI方法（因果推断在科学领域的应用），但并非核心生物信息学或化学信息学内容，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文研究了在存在未观测混杂变量的情况下，如何利用单个代理变量识别因果效应，提出了SPICE可识别性条件并开发了SPICE-Net神经网络估计框架。

摘要翻译

在科学应用中估计处理对结果的因果效应时，未观测混杂因素是一个关键挑战。本研究假设我们观测到未观测混杂因素的一个潜在多维代理变量，且已知该代理变量由混杂因素生成的机制。在此机制的完备性假设下——我们称之为因果效应的单代理可识别性（Single Proxy Identifiability of Causal Effects，简称SPICE），我们证明了因果效应是可识别的。我们将Kuroki和Pearl（2014）以及Pearl（2010）基于代理变量的因果可识别性结果扩展至更高维度、更灵活的函数关系以及更广泛的分布类别。此外，我们开发了一种基于神经网络的估计框架——SPICE-Net，用于估计因果效应，该框架适用于离散和连续处理。

摘要 (Abstract)

Unobserved confounding is a key challenge when estimating causal effects from a treatment on an outcome in scientific applications. In this work, we assume that we observe a single, potentially multi-dimensional proxy variable of the unobserved confounder and that we know the mechanism that generates the proxy from the confounder. Under a completeness assumption on this mechanism, which we call Single Proxy Identifiability of Causal Effects or simply SPICE, we prove that causal effects are identifiable. We extend the proxy-based causal identifiability results by Kuroki and Pearl (2014); Pearl (2010) to higher dimensions, more flexible functional relationships and a broader class of distributions. Further, we develop a neural network based estimation framework, SPICE-Net, to estimate causal effects, which is applicable to both discrete and continuous treatments.

关键词: causal effects, unobserved confounding, proxy variable, identifiability, neural network estimation, SPICE, treatment effect, causal inference

251. ❌ MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs

作者: Enrico Russo, Mohamed Amine Hamdi, Alessandro Ottaviano, Francesco Conti, Angelo Garofalo, Daniele Jahier Pagliari, Maurizio Palesi, Luca Benini, Alessio Burrello 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09124v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文MATCHA专注于深度神经网络在异构边缘SoC上的高效部署框架，涉及编译器优化、内存分配和调度技术。所有评分关键词均与大模型、深度学习技术原理或科学AI应用直接相关，而本文研究的是通用DNN部署框架，不涉及大模型特定技术（如LLM、MoE、对齐、推理加速等）或科学领域应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

论文提出了MATCHA框架，通过约束编程优化内存分配和调度，在异构多加速器边缘SoC上实现深度神经网络的高并发部署，相比现有技术将推理延迟降低达35%。

摘要翻译

在配备多异构加速引擎的系统级芯片（SoC）上部署深度神经网络（DNN）具有挑战性，现有的大多数部署框架无法充分利用其异构性。我们提出MATCHA，一个统一的DNN部署框架，可为并行异构加速器生成高并发调度方案，并利用约束编程优化L3/L2内存分配与调度。通过模式匹配、分块处理以及在单个硬件单元间的映射，该框架实现了并行执行与高加速器利用率。在MLPerf Tiny基准测试中，基于搭载两个异构加速器的SoC，相较于最先进的MATCH编译器，MATCHA将加速器利用率提升了最高35%，并显著降低了推理延迟。

摘要 (Abstract)

Deploying DNNs on System-on-Chips (SoC) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the the state-of-the-art MATCH compiler.

关键词: DNN deployment, heterogeneous accelerators, constraint programming, memory allocation, scheduling optimization, inference latency, edge SoC, MLPerf Tiny

252. ❌ GeoPAS: Geometric Probing for Algorithm Selection in Continuous Black-Box Optimisation

作者: Jiabao Brad Wang, Xiang Shi, Yiliang Yuan, Mustafa Misir 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09095v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是连续黑盒优化中的算法选择问题，提出了一种基于几何探测的方法GeoPAS。该研究属于优化算法和机器学习应用领域，但完全不涉及大语言模型、深度学习技术原理、AI for Science等关键词。所有关键词都与论文内容无关，因此全部评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为GeoPAS的几何探测方法，用于连续黑盒优化中的算法选择，通过在COCO/BBOB基准测试上的实验表明，该方法在多种评估设置下均优于单一最佳求解器。

摘要翻译

连续黑盒优化中的自动化算法选择通常依赖于在有限探测预算下计算的固定景观描述符，然而此类描述符在问题分割或跨基准评估中可能失效。我们提出GeoPAS——一种几何探测方法，该方法通过在多个位置、方向和对数尺度上采样多个粗略的二维切片来表征问题实例。一个共享的有效性感知卷积编码器将每个切片映射为嵌入向量，根据切片尺度和振幅统计量对其进行条件化处理，并以置换不变的方式聚合生成的特征，进而通过对数尺度性能预测（同时对尾部失败施加显式惩罚）实现风险感知的求解器选择。在COCO/BBOB基准测试中，使用包含12个求解器的组合在2-10维空间进行实验，GeoPAS在留实例外、分组随机和留问题外评估中均优于单一最佳求解器。这些结果表明，多尺度几何切片为算法选择提供了可迁移的静态有效信号，尽管少量重尾区域仍然存在并持续主导整体均值。我们的代码发布于$\href{https://github.com/BradWangW/GeoPAS}{GitHub}$。

摘要 (Abstract)

Automated algorithm selection in continuous black-box optimisation typically relies on fixed landscape descriptors computed under a limited probing budget, yet such descriptors can degrade under problem-split or cross-benchmark evaluation. We propose GeoPAS, a geometric probing approach that represents a problem instance by multiple coarse two-dimensional slices sampled across locations, orientations, and logarithmic scales. A shared validity-aware convolutional encoder maps each slice to an embedding, conditions it on slice-scale and amplitude statistics, and aggregates the resulting features permutation-invariantly for risk-aware solver selection via log-scale performance prediction with an explicit penalty on tail failures. On COCO/BBOB with a 12-solver portfolio in dimensions 2–10, GeoPAS improves over the single best solver under leave-instance-out, grouped random, and leave-problem-out evaluation. These results suggest that multi-scale geometric slices provide a useful transferable static signal for algorithm selection, although a small number of heavy-tail regimes remain and continue to dominate the mean. Our code is available at $\href{https://github.com/BradWangW/GeoPAS}{GitHub}$.

关键词: algorithm selection, continuous black-box optimisation, geometric probing, convolutional encoder, performance prediction, COCO/BBOB, solver portfolio, risk-aware selection

253. ❌ Synthesizing real-world distributions from high-dimensional Gaussian Noise with Fully Connected Neural Network

作者: Joanna Komorniczak 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09091v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是基于全连接神经网络和随机损失函数的高效合成数据生成方法，用于将高斯噪声转换为近似真实世界数据分布。论文内容聚焦于传统机器学习中的生成模型、数据增强和隐私保护，未涉及大语言模型（LLM）、深度学习技术原理创新或AI在科学领域的应用。所有评分关键词均与大模型、深度学习技术或AI科学应用相关，而本文研究的是通用合成数据生成方法，与这些关键词无直接关联，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于全连接神经网络和随机损失函数的高效合成数据生成方法，能够快速将高斯噪声转换为近似真实世界数据分布，在25个多样化表格数据集上超越了现有生成方法并显著提升了计算效率。

摘要翻译

在机器学习应用与研究中，合成数据的使用具有诸多优势，包括通过数据增强提升模型性能、保护原始样本的隐私，以及利用完全合成数据进行可靠的方法评估。本研究提出一种基于全连接神经网络和随机化损失函数的高效合成数据生成方法，该方法通过将随机高斯分布转换为对目标真实数据集的近似分布。在25个多样化的表格型真实数据集上进行的实验表明，所提出的方法超越了当前最先进的生成方法，并以远高于现代深度学习解决方案的速度达到参考MMD（最大均值差异）分数。实验内容涵盖分布相似性分析、对分类质量影响的评估，以及使用主成分分析（PCA, Principal Component Analysis）进行降维——这进一步增强了数据隐私性，并能在降低时间和内存复杂度的同时提升分类质量。

摘要 (Abstract)

The use of synthetic data in machine learning applications and research offers many benefits, including performance improvements through data augmentation, privacy preservation of original samples, and reliable method assessment with fully synthetic data. This work proposes a time-efficient synthetic data generation method based on a fully connected neural network and a randomized loss function that transforms a random Gaussian distribution to approximate a target real-world dataset. The experiments conducted on 25 diverse tabular real-world datasets confirm that the proposed solution surpasses the state-of-the-art generative methods and achieves reference MMD scores orders of magnitude faster than modern deep learning solutions. The experiments involved analyzing distributional similarity, assessing the impact on classification quality, and using PCA for dimensionality reduction, which further enhances data privacy and can boost classification quality while reducing time and memory complexity.

关键词: synthetic data generation, fully connected neural network, Gaussian noise, data augmentation, privacy preservation, tabular datasets, MMD scores, PCA dimensionality reduction

254. ❌ Temporal Patch Shuffle (TPS): Leveraging Patch-Level Shuffling to Boost Generalization and Robustness in Time Series Forecasting

作者: Jafar Bakhshaliyev, Johannes Burchert, Niels Landwehr, Lars Schmidt-Thieme 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09067v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于时间序列预测领域的数据增强方法（Temporal Patch Shuffle），研究内容与所有评分关键词均无直接关联。论文涉及深度学习模型（如TSMixer、DLinear等）在时间序列预测中的应用，但未涉及大语言模型（LLMs）、模型架构创新（如MoE）、训练技术（如RLHF、PEFT）、推理优化、AI代理或科学AI应用等评分关键词所涵盖的主题。因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Temporal Patch Shuffle（TPS）的数据增强方法，用于提升时间序列预测模型的泛化能力和鲁棒性，并在多个数据集和模型上验证了其有效性。

摘要翻译

数据增强是提升模型泛化能力与鲁棒性的关键技术，尤其在训练数据有限的深度学习模型中更为重要。尽管目前已开发出多种针对时间序列分类的数据增强方法，但由于需要保持时间连贯性，这些方法大多无法直接适用于时间序列预测任务。本研究提出了一种简单且与模型无关的预测数据增强方法——时序片段混洗（Temporal Patch Shuffle, TPS）。该方法通过提取重叠的时序片段，基于方差排序这一保守启发式策略选择性地混洗部分片段，并通过平均重叠区域重建序列。这一设计在提升样本多样性的同时，保持了与预测任务一致的局部时序结构。我们在九个长期预测数据集上使用五种近期模型架构（TSMixer、DLinear、PatchTST、TiDE和LightTS），并在四个短期预测数据集上使用PatchTST模型对TPS进行了广泛评估，均观察到一致的性能提升。全面的消融实验进一步验证了所提方法的有效性、鲁棒性及其设计原理。

摘要 (Abstract)

Data augmentation is a crucial technique for improving model generalization and robustness, particularly in deep learning models where training data is limited. Although many augmentation methods have been developed for time series classification, most are not directly applicable to time series forecasting due to the need to preserve temporal coherence. In this work, we propose Temporal Patch Shuffle (TPS), a simple and model-agnostic data augmentation method for forecasting that extracts overlapping temporal patches, selectively shuffles a subset of patches using variance-based ordering as a conservative heuristic, and reconstructs the sequence by averaging overlapping regions. This design increases sample diversity while preserving forecast-consistent local temporal structure. We extensively evaluate TPS across nine long-term forecasting datasets using five recent model families (TSMixer, DLinear, PatchTST, TiDE, and LightTS), and across four short-term forecasting datasets using PatchTST, observing consistent performance improvements. Comprehensive ablation studies further demonstrate the effectiveness, robustness, and design rationale of the proposed method.

关键词: Temporal Patch Shuffle, data augmentation, time series forecasting, generalization, robustness, patch-level shuffling, temporal coherence, deep learning models

作者: Yu Chen, Weijun Lv, Yue Huang, Xiaozhao Fang, Jie Wen, Yong Xu, Guanbin Li 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09064v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是部分多标签学习（PML）问题，提出了一种基于特征-标签模态对齐的方法来处理候选标签中的噪声标签。论文的核心是传统机器学习中的多标签分类问题，使用了低秩正交分解、模态对齐、原型学习等技术。所有评分关键词都涉及大模型、深度学习技术原理或特定AI应用领域（如生物信息学），而该论文完全不涉及这些内容：没有提到任何语言模型、预训练/微调技术、推理方法、模型优化技术或特定科学领域的AI应用。因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于特征-标签模态对齐的部分多标签学习方法，通过过滤噪声标签生成伪标签并保持特征-标签一致性，显著提高了分类准确性和噪声鲁棒性。

摘要翻译

在偏多标签学习（PML）中，每个实例关联一组包含真实标签和噪声标签的候选标签集。噪声标签的存在破坏了特征与标签之间的对应关系，导致分类性能下降。为应对这一挑战，我们提出一种基于特征-标签模态对齐的新型PML方法（PML-MA），该方法将特征和标签视为两种互补的模态，并通过系统性对齐恢复其一致性。具体而言，PML-MA首先采用低秩正交分解生成伪标签，通过过滤噪声标签来逼近真实标签分布；随后通过全局投影至公共子空间与局部保持邻域结构两种方式对齐特征与伪标签；最后，多峰值类原型学习机制利用实例同时属于多个类别的多标签特性，以伪标签作为软隶属度权重来增强判别性。通过将模态对齐与原型引导的优化相结合，PML-MA确保伪标签能更好地反映真实分布，同时保持对标签噪声的鲁棒性。在真实数据集与合成数据集上的大量实验表明，PML-MA显著优于现有先进方法，实现了更优的分类精度和噪声鲁棒性。

摘要 (Abstract)

In partial multi-label learning (PML), each instance is associated with a set of candidate labels containing both ground-truth and noisy labels. The presence of noisy labels disrupts the correspondence between features and labels, degrading classification performance. To address this challenge, we propose a novel PML method based on feature-label modal alignment (PML-MA), which treats features and labels as two complementary modalities and restores their consistency through systematic alignment. Specifically, PML-MA first employs low-rank orthogonal decomposition to generate pseudo-labels that approximate the true label distribution by filtering noisy labels. It then aligns features and pseudo-labels through both global projection into a common subspace and local preservation of neighborhood structures. Finally, a multi-peak class prototype learning mechanism leverages the multi-label nature where instances simultaneously belong to multiple categories, using pseudo-labels as soft membership weights to enhance discriminability. By integrating modal alignment with prototype-guided refinement, PML-MA ensures pseudo-labels better reflect the true distribution while maintaining robustness against label noise. Extensive experiments on both real-world and synthetic datasets demonstrate that PML-MA significantly outperforms state-of-the-art methods, achieving superior classification accuracy and noise robustness.

关键词: partial multi-label learning, feature-label modal alignment, noisy labels, pseudo-labels, low-rank orthogonal decomposition, class prototype learning, multi-label classification, robust learning

256. ❌ The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge

作者: Gyuwon Park, DongIl Shin, SolGil Oh, SangGi Ryu, Byung-Hak Kim 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09034v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLaMa2 70B模型在单GPU资源约束下的高效微调，直接涉及LLMs、SFT、PEFT/LoRA、FlashAttention和Quantization等关键词，其中QLoRA是量化与LoRA的结合，Flash Attention 2被明确提及。其他关键词如MoE、SLMs、Scaling Laws、RAG、Agents等未在摘要中体现，与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文研究了在单A100 40GB GPU的严格资源约束下，通过QLoRA微调和Flash Attention 2等技术优化LLaMa2 70B模型，实现了高效且准确的性能，证明了大规模模型在资源受限环境中的可行性。

摘要翻译

大型语言模型（LLM）的快速发展对自然语言处理领域产生了深远影响，但其日益增长的复杂性引发了关于资源消耗和模型透明度的担忧。为应对这些挑战，我们参与了NeurIPS LLM效率挑战赛，旨在严格限制条件下对基础模型进行微调。我们的研究对象是LLaMa2 700亿参数模型，需在单张A100 40GB GPU上于24小时内完成优化。我们的方法核心在于构建了一个定制数据集，该数据集精心整合了多种开源资源和基准测试数据，符合挑战赛的开源精神。我们采用量化低秩自适应（QLoRA）微调技术，并结合Flash Attention 2等先进注意力机制。通过对LoRA技术不同配置的实验，我们优化了计算效率与模型精度之间的平衡。微调策略建立在多次数据集组合的创建与迭代测试基础上，最终选定的版本在多样化任务和基准测试中均表现出稳健性能。我们的成果是一个在单GPU限制下高效微调的LLaMa2 70B模型，该模型不仅显著降低了资源消耗，还在多项问答基准测试中展现出高精度。本研究证明了在资源受限环境下优化大规模模型的可行性，凸显了大型语言模型在实际应用中的潜力。

摘要 (Abstract)

The rapid evolution of Large Language Models (LLMs) has significantly impacted the field of natural language processing, but their growing complexity raises concerns about resource usage and transparency. Addressing these challenges, we participated in the NeurIPS LLM Efficiency Challenge, aiming to fine-tune a foundation model within stringent constraints. Our focus was the LLaMa2 70 billion model, optimized on a single A100 40GB GPU within a 24-hour limit. Our methodology hinged on a custom dataset, carefully assembled from diverse open-source resources and benchmark tests, aligned with the challenge’s open-source ethos. Our approach leveraged Quantized-Low Rank Adaptation (QLoRA) Fine tuning, integrated with advanced attention mechanisms like Flash Attention 2. We experimented with various configurations of the LoRA technique, optimizing the balance between computational efficiency and model accuracy. Our fine-tuning strategy was underpinned by the creation and iterative testing of multiple dataset compositions, leading to the selection of a version that demonstrated robust performance across diverse tasks and benchmarks. The culmination of our efforts was an efficiently fine-tuned LLaMa2 70B model that operated within the constraints of a single GPU, showcasing not only a significant reduction in resource utilization but also high accuracy across a range of QA benchmarks. Our study serves as a testament to the feasibility of optimizing large-scale models in resource-constrained environments, emphasizing the potential of LLMs in real-world applications.

关键词: Large Language Models, LLaMa2 70B, Fine-tuning, QLoRA, Flash Attention 2, Resource-constrained Optimization, Single GPU, Efficiency Challenge

257. ❌ Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference

作者: Yueyuan Sui, Payal Mohapatra, Doğaç Eldenk, Haodong Yang, Yiting Zhang, Haoyan Zhang, Qi Zhu, Stephen Xia 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08971v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于边缘设备上的高效多模态推理，核心贡献是零样本剪枝和稀疏注意力机制。与关键词的相关性分析：1）高度相关（8-10分）：‘Small Language Models/On-device AI’（核心应用场景）、‘Quantization/Model Compression’（核心压缩技术）、‘Speculative Decoding/Inference Acceleration’（核心加速目标）、‘Mixture of Experts/Sparse Models’（使用稀疏注意力）、‘KV Cache Compression/Linear Attention/FlashAttention’（涉及注意力优化）。2）无关（0分）：其余关键词主要涉及大语言模型训练、对齐、推理方法、科学AI应用等，与论文的硬件效率优化主题无直接关联。

!!! tip deepseek-chat TL;DR

该论文提出了SentryFuse框架，通过模态感知的零样本剪枝和稀疏分组查询注意力，解决了边缘设备上多模态推理在动态功耗和传感器丢失条件下的效率问题，实现了无需微调的模型压缩和推理加速。

摘要翻译

边缘设备日益运行多模态感知流水线，这些流水线必须在波动的功耗预算和不可预测的传感器失效情况下保持准确性。现有的剪枝方法在这些条件下均告失效：它们通常需要在压缩后进行微调，消耗超过部署阶段10倍以上的能量，并且它们分配的是静态重要性分数，无法感知哪些传感器处于可用状态。我们提出了SentryFuse框架，该框架通过两个关键组件共同应对上述挑战。首先，SentryGate在训练期间通过一阶显著性监督学习模态条件化的重要性分数，随后在部署时无需微调即可剪枝注意力头和前馈通道。其次，SentryAttend将当代多模态架构中的关键瓶颈——密集自注意力（self-attention）——替换为稀疏分组查询注意力（grouped-query attention），从而在三种不同的多模态架构上实现了总计15%的GFLOPs降低。在三种应用和多模态骨干网络上，SentryGate相比最强的剪枝基线实现了平均12.7%的准确率提升，在模态失效条件下提升最高可达18%。总体而言，SentryFuse无需额外微调即可减少28.2%的内存占用，并将延迟降低高达1.63倍，从而确立了模态感知的零样本压缩作为在异构边缘硬件上实现多模态智能的一条实用路径。

摘要 (Abstract)

Edge devices increasingly run multimodal sensing pipelines that must remain accurate despite fluctuating power budgets and unpredictable sensor dropout. Existing pruning methods fail under these conditions: they generally require fine-tuning after compression, consuming over $10\times$ the deployment energy, and they assign static importance scores that are blind to which sensors are present. We present the SentryFuse framework, which addresses both challenges jointly through two key components. First, SentryGate learns modality-conditioned importance scores during training via first-order saliency supervision and then prunes attention heads and feed-forward channels at deployment without fine-tuning. Second, SentryAttend replaces dense self-attention, a key bottleneck in contemporary multimodal architectures, with sparse grouped-query attention, yielding a net 15% reduction in GFLOPs across three different multimodal architectures. Across three applications and multimodal backbones, SentryGate achieves a 12.7% average accuracy improvement over the strongest pruning baseline, and upto to 18% under modality dropout conditions. Together, SentryFuse reduces memory by 28.2% and lowers latency by up to $1.63\times$ without further fine-tuning, establishing modality-aware zero-shot compression as a practical path to multimodal intelligence on heterogeneous edge hardware.

关键词: edge inference, multimodal models, zero-shot pruning, sparse attention, modality-aware compression, efficient inference, attention heads pruning, grouped-query attention

258. ❌ Online Quantile Regression for Nonparametric Additive Models

作者: Haoran Zhan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08969v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是在线非参数加性分位数回归模型的训练算法（P-FGD），属于传统统计机器学习领域，与所有评分关键词（均围绕大模型、深度学习技术及其应用）完全无关。论文未涉及任何大模型架构、训练方法、推理优化、对齐技术、代理系统或科学AI应用。

!!! tip deepseek-chat TL;DR

该论文提出了一种用于在线非参数加性分位数回归模型的投影函数梯度下降算法（P-FGD），证明了其在计算效率和最小化最优一致性率方面的优越性。

摘要翻译

本文提出了一种用于在线环境下训练非参数可加性分位数回归模型的投影函数梯度下降算法（P-FGD）。该算法将函数随机梯度下降框架拓展至分位数损失函数。P-FGD的优势在于无需存储历史数据，同时保持每步$O(J_t\ln J_t)$的计算复杂度，其中$J_t$表示基函数的数量。此外，在时刻$t$进行分位数函数预测仅需$O(J_t)$的计算时间。这些特性表明P-FGD在在线学习中明显优于常用的再生核希尔伯特空间（RKHS）方法。通过运用一种新颖的希尔伯特空间投影恒等式，我们证明了所提出的在线分位数函数估计器（P-FGD）能够达到极小极大最优一致性速率$O(t^{-\frac{2s}{2s+1}})$，其中$t$为当前时刻，$s$表示分位数函数的平滑度。本文还建立了该算法向小批量学习场景的拓展。

摘要 (Abstract)

This paper introduces a projected functional gradient descent algorithm (P-FGD) for training nonparametric additive quantile regression models in online settings. This algorithm extends the functional stochastic gradient descent framework to the pinball loss. An advantage of P-FGD is that it does not need to store historical data while maintaining $O(J_t\ln J_t)$ computational complexity per step where $J_t$ denotes the number of basis functions. Besides, we only need $O(J_t)$ computational time for quantile function prediction at time $t$. These properties show that P-FGD is much better than the commonly used RKHS in online learning. By leveraging a novel Hilbert space projection identity, we also prove that the proposed online quantile function estimator (P-FGD) achieves the minimax optimal consistency rate $O(t^{-\frac{2s}{2s+1}})$ where $t$ is the current time and $s$ denotes the smoothness degree of the quantile function. Extensions to mini-batch learning are also established.

关键词: quantile regression, nonparametric additive models, online learning, functional gradient descent, projection algorithm, minimax optimality, computational complexity

259. ❌ Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

作者: Zhiqiang Dong, Teng Pang, Rongjian Xu, Guoqiang Wu 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08960v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于离线目标条件强化学习（GCRL），提出了一种基于平均速度场的分层策略模型和LeJEPA损失来改进目标表示。所有评分关键词均与大语言模型、深度学习技术原理或AI在科学领域的应用直接相关，而本文研究的是强化学习中的策略优化问题，与这些关键词没有直接关联。

!!! tip deepseek-chat TL;DR

本文提出了一种用于离线目标条件强化学习的高效分层隐式流Q学习方法，通过引入平均速度场策略和LeJEPA损失来改进目标表示和策略生成，在OGBench基准测试中取得了优异性能。

摘要翻译

离线目标条件强化学习（GCRL）是一种实用的强化学习范式，其目标是从无奖励的离线数据中学习目标条件策略。尽管近期在分层架构（如HIQL）方面取得了进展，但由于高斯策略的表达能力有限以及高层策略无法生成有效的子目标，离线GCRL中的长程控制仍然面临挑战。为解决这些限制，我们提出了目标条件平均流策略，该方法将平均速度场引入离线GCRL的分层策略建模中。具体而言，平均流策略通过学习得到的平均速度场，为高层和低层策略捕获复杂的目标分布，从而通过一步采样实现高效的动作生成。此外，考虑到目标表征的不足，我们引入了LeJEPA损失函数，该函数在训练过程中排斥目标表征嵌入，从而鼓励更具区分性的表征并提升泛化能力。实验结果表明，我们的方法在OGBench基准测试中，在基于状态和基于像素的任务上均取得了强劲的性能。

摘要 (Abstract)

Offline goal-conditioned reinforcement learning (GCRL) is a practical reinforcement learning paradigm that aims to learn goal-conditioned policies from reward-free offline data. Despite recent advances in hierarchical architectures such as HIQL, long-horizon control in offline GCRL remains challenging due to the limited expressiveness of Gaussian policies and the inability of high-level policies to generate effective subgoals. To address these limitations, we propose the goal-conditioned mean flow policy, which introduces an average velocity field into hierarchical policy modeling for offline GCRL. Specifically, the mean flow policy captures complex target distributions for both high-level and low-level policies through a learned average velocity field, enabling efficient action generation via one-step sampling. Furthermore, considering the insufficiency of goal representation, we introduce a LeJEPA loss that repels goal representation embeddings during training, thereby encouraging more discriminative representations and improving generalization. Experimental results show that our method achieves strong performance across both state-based and pixel-based tasks in the OGBench benchmark.

关键词: Offline Goal-conditioned Reinforcement Learning, Hierarchical Policy, Mean Flow Policy, LeJEPA Loss, Goal Representation, OGBench Benchmark, State-based Tasks, Pixel-based Tasks

260. ❌ Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models

作者: Binesh Sadanandan, Vahid Behzadan 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08941v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文研究医学视觉语言模型（VLMs）的校准和重述敏感性，属于AI在科学（医学）领域的应用，与’AI for Science’高度相关（10分）。论文评估了LoRA集成方法，与’PEFT/LoRA’相关（8分）。研究涉及模型置信度校准和事实性，与’Hallucination Mitigation/Factuality’相关（8分）。论文使用MedGemma 4BIT模型，涉及量化技术，与’Quantization’有一定关联（5分）。论文涉及视觉语言模型，与’Large Language Models’有一定关联（5分）。其他关键词与论文内容无直接关系，得0分。

!!! tip deepseek-chat TL;DR

该论文研究发现医学视觉语言模型中的校准不足和问题重述敏感性具有共同原因（接近决策边界），并通过实验证明简单的预测熵方法能有效识别不可靠和重述敏感的预测。

摘要翻译

医学视觉语言模型（VLMs）存在两种威胁安全部署的失效模式：置信度校准不足以及对问题重述的敏感性。我们通过评估五种不确定性量化方法（在MedGemma 4BIT模型上，使用分布内MIMIC-CXR和分布外PadChest胸部X光数据集进行测试，并在LLaVA-RAD-7B上进行跨架构验证）证明这两种失效模式具有共同成因：接近决策边界。对于校准良好的单模型方法，单次前向传播产生的预测熵能够预测哪些样本会在问题重述时发生结果翻转（MedGemma上AUROC为0.711，LLaVA-RAD上为0.878，p<10⁻⁴），这使得单一熵阈值即可同时标记不可靠预测和重述敏感预测。由五个成员组成的LoRA集成方法在MIMIC到PadChest的分布偏移下失效（预期校准误差ECE为42.9，准确率34.1），但LLaVA-RAD的集成并未崩溃（ECE为69.1）。蒙特卡洛丢弃法（MC Dropout）实现了最佳校准效果（ECE为4.3）和选择性预测覆盖（在5%风险下覆盖率为21.5），而单次前向传播获得的总熵在错误检测（AUROC 0.743对比0.657）和重述筛查方面均优于集成方法。简单方法胜出。

摘要 (Abstract)

Medical Vision Language Models VLMs suffer from two failure modes that threaten safe deployment mis calibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma 4BIT across in distribution MIMIC CXR and outof distribution PadChest chest X ray datasets, with cross architecture validation on LLaVA RAD7B. For well calibrated single model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing AUROC 0.711 on MedGemma, 0.878 on LLaVARAD p 10 4, enabling a single entropy threshold to flag both unreliable and rephrase sensitive predictions. A five member LoRA ensemble fails under the MIMIC PadChest shift 42.9 ECE, 34.1 accuracy, though LLaVA RAD s ensemble does not collapse 69.1. MC Dropout achieves the best calibration ECE 4.3 and selective prediction coverage 21.5 at 5 risk, yet total entropy from a single forward pass outperforms the ensemble for both error detection AUROC 0.743 vs 0.657 and paraphrase screening. Simple methods win.

关键词: Medical Vision Language Models, Calibration, Paraphrase Sensitivity, Predictive Entropy, Uncertainty Quantification, LoRA Ensemble, MC Dropout, Selective Prediction

261. ❌ Delve into the Applicability of Advanced Optimizers for Multi-Task Learning

作者: Zhipeng Zhou, Linxiao Cao, Pengcheng Wu, Peilin Zhao, Chunyan Miao 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08939v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于多任务学习（MTL）中的优化器应用，特别是研究高级优化器（如Muon）在MTL框架中的适用性问题，并提出APT框架来平衡优化器与MTL方法。所有评分关键词均与大模型、深度学习技术原理或科学AI应用直接相关，而本文的核心是传统机器学习中的多任务优化问题，未涉及大模型、LLM、MoE、缩放定律、训练技术、推理优化、智能体、模型压缩等任何评分关键词领域。因此，所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

本文研究了高级优化器在多任务学习中的适用性问题，发现现有方法因梯度作用有限而效果不佳，并提出了APT框架和轻量方向保持方法，实验证明能显著提升多任务学习性能。

摘要翻译

多任务学习（Multi-Task Learning, MTL）是机器学习领域的一个基础性问题，在过去十年中得到了广泛发展。近年来，各类基于优化的多任务学习方法被提出，旨在通过改变优化轨迹来同时学习多个任务。尽管这些方法致力于协调与平衡任务间的冲突，但我们通过实证研究发现，当采用先进优化器时，其效果常受一个被忽视的因素削弱：即时计算得到的梯度在实际参数更新中仅起到边缘作用。这种差异阻碍了多任务学习框架在优化动态中充分发挥潜力。此外，我们观察到，近期出现的先进优化器Muon本质上已具备多任务学习器的功能，这突显了其正交化过程中所用梯度的关键重要性。为解决这些问题，我们提出了APT（先进优化器适用性框架），该框架采用一种简单的自适应动量机制，旨在平衡先进优化器与多任务学习方法之间的优势。同时，我们引入了一种轻量级方向保持方法，以促进Muon的正交化过程。在四个主流多任务学习数据集上的大量实验表明，APT能够持续增强现有多任务学习方法，并带来显著的性能提升。

摘要 (Abstract)

Multi-Task Learning (MTL) is a foundational machine learning problem that has seen extensive development over the past decade. Recently, various optimization-based MTL approaches have been proposed to learn multiple tasks simultaneously by altering the optimization trajectory. Although these methods strive to de-conflict and re-balance tasks, we empirically identify that their effectiveness is often undermined by an overlooked factor when employing advanced optimizers: the instant-derived gradients play only a marginal role in the actual parameter updates. This discrepancy prevents MTL frameworks from fully releasing its power on learning dynamics. Furthermore, we observe that Muon-a recently emerged advanced optimizer-inherently functions as a multi-task learner, which underscores the critical importance of the gradients used for its orthogonalization. To address these issues, we propose APT (Applicability of advanced oPTimizers), a framework featuring a simple adaptive momentum mechanism designed to balance the strengths between advanced optimizers and MTL. Additionally, we introduce a light direction preservation method to facilitate Muon’s orthogonalization. Extensive experiments across four mainstream MTL datasets demonstrate that APT consistently augments existing MTL approaches, yielding substantial performance improvements.

关键词: Multi-Task Learning, Optimizer, Muon, APT, Adaptive Momentum, Gradient Orthogonalization, Performance Improvement

262. ❌ A novel hybrid approach for positive-valued DAG learning

作者: Yao Zhao 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08935v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文专注于因果发现和DAG学习，特别是针对正值数据（如基因表达、经济数据）的算法开发，属于传统机器学习/统计领域。所有关键词均与大模型、深度学习技术原理或应用直接相关，而本文未涉及任何大模型、深度学习、LLM相关技术，也未使用这些技术。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文提到在基因组学中的应用，但论文核心是算法本身，而非AI在科学领域的深度应用或创新，因此给予5分（有一定关联）。其他关键词完全无关，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为H-MRS的新型混合方法，用于从正值数据中学习有向无环图，通过结合矩比评分和对数尺度回归，在合成数据上展示了竞争性的精度和召回率，适用于基因组学和经济学等领域。

摘要翻译

从观测数据中发现因果关系仍然是机器学习与统计学中的一个基础性挑战，尤其当变量表示本质上为正值的量时——例如基因表达水平、资产价格、公司收入或人口计数——这些数据通常遵循乘性而非加性动态。我们提出混合矩比评分（Hybrid Moment-Ratio Scoring, H-MRS）算法，这是一种通过结合基于矩的评分与对数尺度回归，从正值数据中学习有向无环图（Directed Acyclic Graphs, DAGs）的新方法。其核心思想在于，对于正值变量，矩比 $\frac{\mathbb{E}[X_j^2]}{\mathbb{E}[(\mathbb{E}[X_j \mid S])^2]}$ 为因果排序提供了一个有效准则，其中 $S$ 表示候选父节点集。H-MRS 将对数尺度岭回归用于矩比估计，并与基于原始尺度矩比的贪婪排序过程相结合，随后通过基于弹性网络的父节点选择来恢复最终的 DAG 结构。在合成对数线性数据上的实验显示出具有竞争力的精确率与召回率。该方法计算效率高，且天然符合正值约束，适用于基因组学与经济学等领域。这些结果表明，将对数尺度建模与原始尺度矩比相结合，为正值领域的因果发现提供了一个实用框架。

摘要 (Abstract)

Causal discovery from observational data remains a fundamental challenge in machine learning and statistics, particularly when variables represent inherently positive quantities such as gene expression levels, asset prices, company revenues, or population counts, which often follow multiplicative rather than additive dynamics. We propose the Hybrid Moment-Ratio Scoring (H-MRS) algorithm, a novel method for learning directed acyclic graphs (DAGs) from positive-valued data by combining moment-based scoring with log-scale regression. The key idea is that for positive-valued variables, the moment ratio $\frac{\mathbb{E}[X_j^2]}{\mathbb{E}[(\mathbb{E}[X_j \mid S])^2]}$ provides an effective criterion for causal ordering, where $S$ denotes candidate parent sets. H-MRS integrates log-scale Ridge regression for moment-ratio estimation with a greedy ordering procedure based on raw-scale moment ratios, followed by Elastic Net-based parent selection to recover the final DAG structure. Experiments on synthetic log-linear data demonstrate competitive precision and recall. The proposed method is computationally efficient and naturally respects positivity constraints, making it suitable for applications in genomics and economics. These results suggest that combining log-scale modeling with raw-scale moment ratios provides a practical framework for causal discovery in positive-valued domains.

关键词: causal discovery, directed acyclic graphs, positive-valued data, moment-ratio scoring, log-scale regression, genomics, economics, Hybrid Moment-Ratio Scoring

263. ❌ Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning

作者: Taojie Zhu, Dongyang Xu, Ding Zou, Sen Zhao, Qiaobo Hao, Zhiguo Yang, Yonghong He 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08926v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM后训练范式中的SFT和RL优化问题，提出DYPO框架统一两者。与SFT、RLHF/DPO高度相关（10分），涉及推理改进（CoT、System 2 Thinking得8分），与Alignment、Self-Improvement有一定关联（5分）。其他关键词如MoE、SLMs、RAG等未涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型后训练中监督微调（SFT）和强化学习（RL）的偏差-方差权衡问题，提出了动态策略优化（DYPO）框架，通过组对齐损失、多教师蒸馏和动态门控机制，在复杂推理任务上实现了显著性能提升。

摘要翻译

大型语言模型（LLM）的后训练范式，主要包括监督微调（Supervised Fine-Tuning, SFT）和强化学习（Reinforcement Learning, RL），面临一个根本性的困境：SFT 提供了稳定性（低方差），但存在较高的拟合偏差；而 RL 能够进行探索（低偏差），却受困于较高的梯度方差。现有的统一优化策略通常采用简单的损失加权方法，忽视了这两种不同梯度信号之间的统计冲突。本文对此偏差-方差权衡进行了严格的理论分析，并提出了 DYPO（动态策略优化），一个旨在从结构上缓解此冲突的统一框架。DYPO 整合了三个核心组件：（1）一种群体对齐损失，它利用内在的群体动态显著降低 RL 梯度方差；（2）一种多教师蒸馏机制，通过多样化的推理路径来纠正 SFT 的拟合偏差；（3）一种动态利用-探索门控机制，基于奖励反馈在稳定的 SFT 和探索性的 RL 之间进行自适应仲裁。理论分析证实，DYPO 能线性降低拟合偏差并最小化总体方差。大量实验表明，DYPO 显著优于传统的顺序训练流程，在复杂推理基准测试中平均提升 4.8%，在分布外任务上平均提升 13.3%。我们的代码公开在 https://github.com/Tocci-Zhu/DYPO。

摘要 (Abstract)

Post-training paradigms for Large Language Models (LLMs), primarily Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), face a fundamental dilemma: SFT provides stability (low variance) but suffers from high fitting bias, while RL enables exploration (low bias) but grapples with high gradient variance. Existing unified optimization strategies often employ naive loss weighting, overlooking the statistical conflict between these distinct gradient signals. In this paper, we provide a rigorous theoretical analysis of this bias-variance trade-off and propose \textbf{DYPO} (Dynamic Policy Optimization), a unified framework designed to structurally mitigate this conflict. DYPO integrates three core components: (1) a \textit{Group Alignment Loss (GAL)} that leverages intrinsic group dynamics to significantly reduce RL gradient variance; (2) a \textit{Multi-Teacher Distillation} mechanism that corrects SFT fitting bias via diverse reasoning paths; and (3) a \textit{Dynamic Exploitation-Exploration Gating} mechanism that adaptively arbitrates between stable SFT and exploratory RL based on reward feedback. Theoretical analysis confirms that DYPO linearly reduces fitting bias and minimizes overall variance. Extensive experiments demonstrate that DYPO significantly outperforms traditional sequential pipelines, achieving an average improvement of 4.8% on complex reasoning benchmarks and 13.3% on out-of-distribution tasks. Our code is publicly available at https://github.com/Tocci-Zhu/DYPO.

关键词: Large Language Models, Supervised Fine-Tuning, Reinforcement Learning, Dynamic Policy Optimization, Bias-Variance Trade-off, Complex Reasoning, Group Alignment Loss, Multi-Teacher Distillation

264. ❌ Using Synthetic Data for Machine Learning-based Childhood Vaccination Prediction in Narok, Kenya

作者: Jimmy Bach, Yang Li, Yaqi Liu, John Sankok, Rose Kimani, Carrie B. Dolan, Julius N. Odhiambo, Haipeng Chen 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08902v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文主要研究使用机器学习模型（Logistic Regression和XGBoost）和基于表格扩散的合成数据生成（TabSyn）来预测肯尼亚儿童疫苗接种风险，属于医疗健康领域的AI应用。论文内容与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、对齐、代理系统等）完全无关，因为这些关键词特指大语言模型（LLM）及相关技术，而本文使用的是传统机器学习模型。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为该研究属于AI在生物医学/健康信息学（Bioinformatics/Health Informatics）领域的应用，但并非核心创新点（主要创新在于合成数据方法而非AI模型本身），因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该研究使用机器学习模型和合成数据生成技术，在肯尼亚纳罗克地区预测儿童疫苗接种风险，结果表明该方法能有效识别风险儿童且合成数据不会降低预测性能。

摘要翻译

背景：在资源匮乏地区，有限的数据利用对疫苗交付生态系统构成障碍，破坏了实现公平免疫覆盖的努力。在游牧人群中，个体在儿童时期错过关键疫苗接种剂次的风险更高。肯尼亚纳罗克郡的马赛人群便是此类典型，该地区缺乏大量高质量数据，这阻碍了准确的覆盖率评估，影响了资源的高效分配，并削弱了实施及时干预的能力。此外，在敏感数据有限的群体中，数据隐私问题尤为突出。目标：首先，我们旨在通过大规模人群识别面临错过关键疫苗风险的儿童，以提供及时、基于证据的干预措施，从而提高疫苗接种覆盖率。其次，我们致力于更好地保护脆弱人群敏感健康数据的隐私。方法：我们将卫生部510登记册中8年的儿童疫苗接种记录（n=6,913）数字化，并应用机器学习模型（逻辑回归和XGBoost）来识别风险儿童。此外，我们采用了一种基于表格扩散模型的合成数据生成新方法（TabSyn），以在模型内部保护患者隐私。结果：我们的研究结果表明，分类技术能够可靠且成功地预测面临漏种疫苗风险的儿童，对于部分建模疫苗，其召回率、精确率和F1分数均超过90%。此外，使用合成数据而非真实数据训练这些模型（从而保护原始数据集中个体的隐私）并不会导致预测性能下降。结论：这些结果支持在数字基础设施有限的诊所中，将合成数据应用于健康信息学策略，从而为儿童免疫覆盖率提供可保护隐私、可扩展的预测方案。

摘要 (Abstract)

Background: Limited data utilization in low-resource settings poses a barrier to the vaccine delivery ecosystem, undermining efforts to achieve equitable immunization coverage. In nomadic populations, individuals face an increased risk of missing crucial vaccination doses as children. One such population is the Maasai in Narok County, Kenya, where the absence of high-volume, quality data hampers accurate coverage estimates, impedes efficient resource allocation, and weakens the ability to deliver timely interventions. Additionally, data privacy concerns are heightened in groups with limited sensitive data. Objectives: First, we aim to identify children at risk of missing key vaccines across a large population to provide timely, evidence-based interventions that support increased vaccination coverage. Second, we aim to better protect the privacy of sensitive health data in a vulnerable population. Methods: We digitized 8 years of child vaccination records from the MOH 510 registry (n=6,913) and applied machine learning models (Logistic Regression and XGBoost) to identify children at risk. Additionally, we utilize a novel approach to tabular diffusion-based synthetic data generation (TabSyn) to protect patient privacy within the models. Results: Our findings show that classification techniques can reliably and successfully predict children at risk of missing a vaccine, with recall, precision, and F1-scores exceeding 90% for some vaccines modeled. Additionally, training these models with synthetic data rather than real data, thus preserving the privacy of individuals within the original dataset, does not lead to a loss in predictive performance. Conclusion: These results support the use of synthetic data implementation in health informatics strategies for clinics with limited digital infrastructure, enabling privacy-preserving, scalable forecasting for childhood immunization coverage.

关键词: childhood vaccination prediction, machine learning, synthetic data generation, health informatics, privacy preservation, low-resource settings, tabular diffusion, vaccine coverage

265. ❌ Adaptive Candidate Point Thompson Sampling for High-Dimensional Bayesian Optimization

作者: Donney Fan, Geoff Pleiss 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08891v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究贝叶斯优化中的Thompson采样方法，提出了一种自适应候选点采样算法（ACTS），属于传统机器学习优化算法领域。所有评分关键词均与大模型、深度学习技术原理、AI应用等主题相关，而该论文完全不涉及这些内容，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对高维贝叶斯优化中Thompson采样候选点稀疏的问题，提出了一种自适应候选点Thompson采样方法（ACTS），通过梯度引导在子空间中生成候选点，在合成和真实基准测试中实现了更好的优化性能。

摘要翻译

在贝叶斯优化中，汤普森采样通过从目标函数最大化器的后验分布中采样来选择评估点。由于对于高斯过程代理模型而言，该采样问题是难以处理的，后验分布通常被限制在固定的离散化候选点上，而这些候选点会随着维度的增加呈指数级稀疏化。尽管先前的研究致力于通过可扩展的高斯过程近似来提高候选点密度，我们采用了一种正交方法，通过在采样过程中自适应地缩小搜索空间来增加密度。具体而言，我们提出了自适应候选汤普森采样，该方法在代理模型样本梯度引导的子空间中生成候选点。ACTS 是现有汤普森采样方法（包括那些使用信赖域或其他局部近似的方法）的一种简单即插即用替代方案，它能够生成更优的最大值样本，并在合成与真实世界基准测试中实现更好的优化效果。

摘要 (Abstract)

In Bayesian optimization, Thompson sampling selects the evaluation point by sampling from the posterior distribution over the objective function maximizer. Because this sampling problem is intractable for Gaussian process (GP) surrogates, the posterior distribution is typically restricted to fixed discretizations (i.e., candidate points) that become exponentially sparse as dimensionality increases. While previous works aim to increase candidate point density through scalable GP approximations, our orthogonal approach increases density by adaptively reducing the search space during sampling. Specifically, we introduce Adaptive Candidate Thompson Sampling (ACTS), which generates candidate points in subspaces guided by the gradient of a surrogate model sample. ACTS is a simple drop-in replacement for existing TS methods – including those that use trust regions or other local approximations – producing better samples of maxima and improved optimization across synthetic and real-world benchmarks.

关键词: Bayesian Optimization, Thompson Sampling, Gaussian Process, High-Dimensional Optimization, Adaptive Candidate Points, Surrogate Model, Optimization Algorithms

266. ❌ Uncertainty-Aware Transformers: Conformal Prediction for Language Models

作者: Abhiram Vellore, Niraj K. Jha 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08885v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究基于Transformer的语言模型的不确定性量化框架CONFIDE，通过保形预测提高模型的可信度和解释性。与’Large Language Models’相关度8分，因为论文明确提到大语言模型及其变体，并针对BERT/RoBERTa等模型进行研究。与’Mechanistic Interpretability OR Explainable AI’相关度10分，因为论文核心是解决模型黑盒问题，提供实例级解释和可解释性，属于解释性AI范畴。其他关键词如MoE、SFT、RAG、量化等均未涉及，相关度为0分。

!!! tip deepseek-chat TL;DR

该论文提出了CONFIDE框架，通过保形预测对基于Transformer的语言模型进行不确定性量化，提高了模型在资源受限和高风险任务中的鲁棒性和可解释性，并在BERT-tiny上实现了最高4.09%的测试准确率提升。

摘要翻译

Transformer模型对人工智能领域产生了深远影响，尤其在大语言模型及其变体方面。然而，正如神经网络曾面临的情况一样，其黑箱特性限制了在高风险场景中的可信度与部署应用。为使模型在关键应用中真正具备实用性和可信度，它们必须提供超越单纯预测的更多信息：即向用户清晰阐明支撑其决策的内在推理过程。本文提出了一种基于Transformer的语言模型不确定性量化框架。该框架名为CONFIDE（面向微调深度语言模型的共形预测），将共形预测应用于仅编码器架构（如BERT和RoBERTa）的内部嵌入表示，同时支持超参数调优。CONFIDE利用[CLS]标记嵌入或展平的隐藏状态构建类条件非共形分数，从而生成具有统计有效性的预测集并提供实例级解释。实验表明，在BERT-tiny模型上，CONFIDE将测试准确率最高提升4.09%，且相较于NM2和VanillaNN等现有方法，获得了更高的正确效率（即预测集包含真实标签时的期望大小）。我们证明，Transformer的早期和中间层通常能为共形预测提供校准效果更佳、语义信息更丰富的表征。在资源受限的模型和标签模糊的高风险任务中，当基于softmax的不确定性度量失效时，CONFIDE能提供更强的鲁棒性和可解释性。我们将CONFIDE定位为一个实用诊断框架，其在效率与鲁棒性方面较现有共形预测基线方法实现了显著提升。

摘要 (Abstract)

Transformers have had a profound impact on the field of artificial intelligence, especially on large language models and their variants. However, as was the case with neural networks, their black-box nature limits trust and deployment in high-stakes settings. For models to be genuinely useful and trustworthy in critical applications, they must provide more than just predictions: they must supply users with a clear understanding of the reasoning that underpins their decisions. This article presents an uncertainty quantification framework for transformer-based language models. This framework, called CONFIDE (CONformal prediction for FIne-tuned DEep language models), applies conformal prediction to the internal embeddings of encoder-only architectures, like BERT and RoBERTa, while enabling hyperparameter tuning. CONFIDE uses either [CLS] token embeddings or flattened hidden states to construct class-conditional nonconformity scores, enabling statistically valid prediction sets with instance-level explanations. Empirically, CONFIDE improves test accuracy by up to 4.09% on BERT-tiny and achieves greater correct efficiency (i.e., the expected size of the prediction set conditioned on it containing the true label) compared to prior methods, including NM2 and VanillaNN. We show that early and intermediate transformer layers often yield better-calibrated and more semantically meaningful representations for conformal prediction. In resource-constrained models and high-stakes tasks with ambiguous labels, CONFIDE offers robustness and interpretability where softmax-based uncertainty fails. We position CONFIDE as a framework for practical diagnostic and efficiency/robustness improvement over prior conformal baselines.

关键词: Transformers, Large Language Models, Uncertainty Quantification, Conformal Prediction, Interpretability, BERT, RoBERTa, High-stakes Applications

267. ❌ How does Chain of Thought decompose complex tasks?

作者: Amrut Nadgir, Vijay Balasubramanian, Pratik Chaudhari 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08872v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究链式思维（CoT）在LLM任务分解中的理论分析，与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’高度相关（15分），涉及’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’（8分），并明确使用LLM（10分）。其他关键词如MoE、SFT、RAG等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了链式思维（CoT）如何通过将复杂任务分解为树状结构的小分类问题来降低LLM预测误差，并发现存在一个临界阈值和最优深度以最小化误差。

摘要翻译

许多语言任务可被建模为分类问题，其中大型语言模型（LLM）接收一个提示并从多个可能答案中选择其一。我们发现，此类问题中的分类误差随类别数量呈幂律关系缩放。这带来一个显著结果：通过将整体任务拆分为一系列具有相同类别数（即“度”）的较小分类问题，可大幅降低预测误差。这种树状分解模拟了思维链（CoT）机制。已有研究观察到，基于CoT的预测器在“思考”时——即构建更深层树结构从而将问题分解为更多步骤时——表现更优。我们识别出一个关于“度”的临界阈值：低于该阈值时思考反而有害，而高于该阈值时则存在一个使误差最小化的最优思考深度。若继续增加思考深度，将无法突破这一最小误差极限。

摘要 (Abstract)

Many language tasks can be modeled as classification problems where a large language model (LLM) is given a prompt and selects one among many possible answers. We show that the classification error in such problems scales as a power law in the number of classes. This has a dramatic consequence: the prediction error can be reduced substantially by splitting the overall task into a sequence of smaller classification problems, each with the same number of classes (“degree”). This tree-structured decomposition models chain-of-thought (CoT). It has been observed that CoT-based predictors perform better when they “think’”, i.e., when they develop a deeper tree, thus decomposing the problem into a larger number of steps. We identify a critical threshold for the degree, below which thinking is detrimental, and above which there exists an optimal depth that minimizes the error. It is impossible to surpass this minimal error by increasing the depth of thinking.

关键词: Chain of Thought, large language model, classification error, task decomposition, tree-structured, optimal depth, power law scaling, prediction error

268. ❌ Finite-Sample Analysis of Nonlinear Independent Component Analysis:Sample Complexity and Identifiability Bounds

作者: Yuwen Jiang 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.08850v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于非线性独立成分分析（ICA）的有限样本理论分析，包括样本复杂性和可识别性边界。研究内容属于经典的无监督学习、统计学习和理论机器学习范畴，使用了神经网络编码器。所有评分关键词均与大模型、深度学习技术原理创新或其在科学领域的应用直接相关，而本文的核心主题（非线性ICA的理论分析）与这些关键词没有直接关联。论文摘要中提到的“scaling laws”是指其理论分析中推导出的样本复杂度与维度、多样性之间的缩放规律，并非指大模型训练中的Scaling Laws。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

本文研究了非线性独立成分分析（ICA）的有限样本统计性质，首次建立了匹配的上下界，完整刻画了使用神经网络编码器时的样本复杂性和可识别性，并证明了在标准条件下SGD优化也能达到相同的样本效率。

摘要翻译

独立成分分析（ICA）是一种基础的无监督学习技术，旨在通过将混合信号分离为独立源来揭示数据中的潜在结构。尽管在建立非线性ICA的渐近可识别性保证方面已取得显著进展，但学习算法的有限样本统计特性仍未被充分理解。这一空白对实践者构成了重大挑战，因为他们必须确定可靠的源恢复所需的适当样本量。本文对基于神经网络编码器的非线性ICA进行了全面的有限样本分析，首次通过匹配的上界与下界提供了完整理论刻画。我们的理论发展包含三项关键的技术贡献。首先，我们在超额风险与识别误差之间建立了直接关系，绕过了参数空间论证，从而避免了原本会导致次优缩放比例的速率退化。其次，我们证明了匹配的信息论下界，确认了样本复杂度结果的最优性。第三，我们将分析扩展到实际的随机梯度下降优化，证明在标准的优化地形假设下，有限迭代的梯度下降法可实现相同的样本效率。我们通过精心设计的模拟实验验证了理论预测。这一空白指明了未来对神经网络训练有限样本行为进行有价值研究的方向，并凸显了我们已验证的维度与多样性缩放规律的重要性。

摘要 (Abstract)

Independent Component Analysis (ICA) is a fundamental unsupervised learning technique foruncovering latent structure in data by separating mixed signals into their independent sources. While substantial progress has been made in establishing asymptotic identifiability guarantees for nonlinear ICA, the finite-sample statistical properties of learning algorithms remain poorly understood. This gap poses significant challenges for practitioners who must determine appropriate sample sizes for reliable source recovery. This paper presents a comprehensive finite-sample analysis of nonlinear ICA with neural network encoders, providing the first complete characterization with matching upper and lower bounds. Our theoretical development introduces three key technical contributions. First, we establish a direct relationship between excess risk and identification error that bypasses parameter-space arguments, thereby avoiding the rate degradation that would otherwise yield suboptimal scaling. Second, we prove matching information-theoretic lower bounds that confirm the optimality of our sample complexity results. Third, we extend our analysis to practical SGD optimization, showing that the same sample efficiency can be achieved with finite-iteration gradient descent under standard landscape assumptions. We validate our theoretical predictions through carefully designed simulation experiments. This gap points toward valuable future research on finite-sample behavior of neural network training and highlights the importance of our validated scaling laws for dimension and diversity.

关键词: Nonlinear Independent Component Analysis, Finite-sample Analysis, Sample Complexity, Identifiability Bounds, Neural Network Encoders, Excess Risk, Information-theoretic Lower Bounds, SGD Optimization

269. ❌ Integral-equation analysis of transient diffusion-limited currents at disk electrodes: Asymptotic expansion and compact approximation

作者: Kazuhiko Seki, Yuko Yokoyama, Masahiro Yamamoto 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09274v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究电化学中盘电极的瞬态扩散限制电流，属于物理化学和电分析化学领域，采用积分方程、渐近展开和帕德近似等数学方法。所有评分关键词均涉及大模型、深度学习及相关技术（如训练方法、推理优化、对齐、代理等），而论文完全不涉及这些主题，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文通过积分方程和渐近展开方法，分析了盘电极在电位阶跃后的瞬态扩散限制电流，并提出了一个紧凑的解析表达式，用于准确描述实验相关时间范围内的电流响应。

摘要翻译

本文分析了盘电极在电位阶跃引发界面离子浓度变化后的瞬态扩散限制电流，该研究与计时电流测量直接相关。该混合边界扩散问题在拉普拉斯域中建立，并简化为可直接确定法拉第电流的弗雷德霍姆积分方程。其稳态极限还原了Saito方程，而系统性的长时间渐近展开量化了趋近稳态的过程。通过帕德逼近式，得到了时域中的一个紧凑解析表达式，该表达式能在实验相关的时间范围内准确描述电流。与现有基于渐近-多项式混合近似的高精度数值方法相比，本框架提供了一个显式且紧凑的解析表示，便于理论解释和实际应用。其短期响应表现出具有盘电极边缘效应特征的Cottrell方程。总体而言，该框架为分析瞬态电流、提取扩散参数以及评估盘电极计时电流法中广泛使用的解析近似之准确性，提供了实用工具。

摘要 (Abstract)

The transient diffusion-limited current at a disk electrode following a change in interfacial ion concentration induced by a potential step is analyzed with direct relevance to chronoamperometric measurements. The mixed-boundary diffusion problem is formulated in the Laplace domain and reduced to a Fredholm integral equation that directly determines the Faradaic current. The steady-state limit recovers Saito’s equation, while a systematic long-time asymptotic expansion quantifies the approach to steady state. A Padé approximant yields a compact analytical expression in the time domain that accurately describes the current over experimentally relevant time ranges. In contrast to existing high-accuracy numerical procedures based on hybrid asymptotic and polynomial approximations, the present formulation provides an explicit and compact analytical representation that facilitates interpretation and practical implementation. The short-time response exhibits Cottrell’s equation with edge effects characteristic of disk electrodes. Overall, the framework provides practical tools for analyzing transient currents, extracting diffusion parameters, and assessing the accuracy of widely used analytical approximations in disk-electrode chronoamperometry.

关键词: disk electrode, transient diffusion-limited current, chronoamperometry, integral equation, asymptotic expansion, Padé approximant, analytical approximation, Faradaic current

270. ❌ Limitations of MRSF-TDDFT for Applications in Photochemistry

作者: Jiří Janoš, Andrew J. Orr-Ewing, Basile F. E. Curchod, Petr Slavíček 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09230v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究的是计算化学中的MRSF-TDDFT方法在光化学应用中的局限性，属于传统计算化学领域，与所有大模型、深度学习、AI技术相关的关键词均无直接关联。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为该论文属于计算化学研究，可视为广义的’AI for Science’（科学计算应用），但论文本身并未涉及任何人工智能或机器学习方法，因此给予5分（有一定关联）。其他所有关键词均完全无关，得0分。

!!! tip deepseek-chat TL;DR

该论文评估了MRSF-TDDFT方法在光化学应用中的局限性，发现其存在缺失单激发构型和三重态参考变化导致激发态能量不可靠的问题，并提出了检测这些局限性的策略。

摘要翻译

混合参考自旋翻转时间依赖密度泛函理论（MRSF-TDDFT）近年来已成为研究光化学过程的一种备受关注的电子结构方法，因为它兼具单参考方法的计算效率与多参考方法的广泛适用性。本文中，我们批判性地评估了MRSF-TDDFT在光化学领域的普适性，并指出了两个重要的局限性。首先，MRSF-TDDFT中包含的双激发组态是以缺失部分单激发组态为代价的。其次，当该方法的核心基础——三重态参考态——的性质发生突变时（例如T$_1$与T$_2$三重态接近简并并交换电子特性），MRSF-TDDFT所提供的激发态能量将不可靠。这种三重态参考态特性的改变，可能在核构型空间中难以预料的区域引起响应态的电子势能曲线出现不连续或剧烈畸变。我们提出了相应的策略与诊断方法，以在使用MRSF-TDDFT探索势能面和非绝热分子动力学时识别这些局限性。

摘要 (Abstract)

Mixed-reference spin-flip time-dependent density functional theory (MRSF-TDDFT) has recently emerged as an attractive electronic-structure method for studying photochemical processes, given that it bridges the computational efficiency of single-reference approaches with the versatility of multireference methods. In the following, we critically assess the general applicability of MRSF-TDDFT to photochemistry and identify two important limitations. First, the doubly-excited configurations included in MRSF-TDDFT come at the cost of missing some singly-excited configurations. Second, MRSF-TDDFT provides unreliable excited-state energies when its triplet reference - a cornerstone of the method - abruptly changes its nature, e.g., when the T$_1$ and T$_2$ triplet states become nearly degenerate and exchange electronic character. This change of character of the triplet reference can induce discontinuities or sharp distortions in electronic potential energy curves of the response states in unsuspected regions of the nuclear configuration space. We propose strategies and diagnostics to detect these limitations in the exploration of potential energy surfaces and nonadiabatic molecular dynamics using MRSF-TDDFT.

关键词: MRSF-TDDFT, photochemistry, electronic-structure method, excited-state energies, potential energy surfaces, nonadiabatic molecular dynamics, computational chemistry, density functional theory

271. ❌ Linking Calendar and Cycle Ageing in Lithium-Ion Batteries through Consistent Parameterisation of an Electrochemical-Thermal-Degradation Model

作者: Ganesh Madabattula 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09217v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于锂离子电池的电化学-热-降解模型参数化研究，属于电池科学和工程领域。所有关键词均与大模型、深度学习技术原理或AI应用相关，但论文内容完全不涉及这些技术。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于科学计算和建模领域，但并未使用AI方法，而是基于物理模型和模拟，因此给予5分（有一定关联）。其他关键词均完全无关，得0分。

!!! tip deepseek-chat TL;DR

该研究提出了一种参数化框架，用于耦合锂离子电池的日历和循环老化降解机制，基于电化学-热-降解模型预测了在不同使用条件下的容量衰减轨迹、健康状态和剩余使用寿命。

摘要翻译

锂离子电池耦合衰减机制的参数化是一项重大挑战。各机制间的相互作用取决于使用条件：倍率（C-rate）、静置态荷电状态（SoC）、放电深度（DoD）以及温度。本研究基于衰减模式分析数据，提出了一个统一参数化关键衰减模式——固体电解质界面（SEI）膜生长、锂析出及两极活性物质损失——的框架。该工作采用P2D电化学-热-衰减耦合模型，预测了镍锰钴（NMC）基锂离子电池在日历老化及日历-循环复合老化下的容量衰减轨迹。通过PyBaMM平台，研究预测了电池在81种工况组合（温度：10°C、25°C、40°C；倍率：0.1 C、0.3 C、1.0 C；静置SoC：10%、60%、100%；DoD：50%、70%、90%）下的健康状态（SoH）、剩余使用寿命（RUL）及内部衰减模式。预测显示，达到75% SoH的循环寿命范围在0.8至14年之间。该研究从机理层面揭示了循环过程中日历老化与循环老化之间的竞争效应。模型表明，根据使用条件的不同，容量衰减可呈现亚线性、线性及超线性/加速衰减的特征。所有工况的模拟数据集均已公开。

摘要 (Abstract)

Parameterisation of coupled degradation mechanisms in lithium-ion batteries is a major challenge. Interactions between the mechanisms depend on usage conditions: C-rate, rest state-of-charge (SoC), depth-of-discharge (DoD) and temperature. This work presents a framework to consistently parameterise key degradation modes–solid-electrolyte interphase (SEI) growth, lithium plating, and active material loss in both electrodes–using insights derived from degradation mode analysis data. The work predicts capacity fade trajectories of a NMC-based lithium-ion cell under both calendar and combined calendar-cyclic ageing, using a P2D electrochemical-thermal-degradation model. The work predicts state-of-health (SoH), remaining-useful-life (RUL) and internal degradation modes of the cell–under 81 combinations of temperature (10$^o$C, 25$^o$C, 40$^o$C), C-rate (0.1 C, 0.3 C and 1.0 C), rest SoC (10%, 60%, and 100%) and DoD (50%, 70%, and 90%)–using PyBaMM. The predicted cycle-life varies between 0.8 to 14 years to reach 75% of SoH. The work provides mechanistic insights into competing effects between calendar and cyclic ageing, during cycling. The model demonstrates sub-linear, linear, and sup-linear/accelerated capacity fade based on the usage conditions. The simulated dataset for all the cases is made available.

关键词: lithium-ion batteries, degradation mechanisms, electrochemical-thermal-degradation model, calendar ageing, cyclic ageing, capacity fade, state-of-health, remaining-useful-life

272. ❌ Experimental proof of strong $Π$-$Σ$ mixing in the Renner-Teller and Pseudo-Jahn-Teller affected CCH$^+$ ($^3Π$) ion

作者: Kim Steenbakkers, P. Bryan Changala, Weslley G. D. P. Silva, John F. Stanton, Filippo Lipparini, Jürgen Gauss, Oskar Asvany, Gerrit C. Groenenboom, Britta Redlich, Stephan Schlemmer, Sandra Brünken 期刊/来源: arxiv 发布日期: 2026-04-10 arXiv链接: http://arxiv.org/abs/2604.09203v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文是实验物理化学领域的基础研究，专注于CCH+离子的光谱分析和非绝热效应，与所有大模型/深度学习技术关键词完全无关。唯一可能的相关性是’AI for Science’，但论文未使用任何AI方法，仅涉及传统光谱实验和理论模型，因此给予5分（有一定关联，因为属于科学领域研究）。

!!! tip deepseek-chat TL;DR

该研究通过宽带振动光谱实验证明了CCH+离子中强烈的Π-Σ混合效应，揭示了Renner-Teller和伪Jahn-Teller耦合对振动谱的复杂分裂模式影响。

摘要翻译

乙炔基自由基阳离子CCH⁺（³Π）因其开壳层线性结构及低位³Σ⁻态的存在，为非绝热效应的基础光谱研究提供了一个独特体系。该³Σ⁻态在（振）转光谱中引发了显著扰动。为探究这些效应，我们采用泄漏光谱技术记录了CCH⁺在350-3450 cm⁻¹范围内的宽带振动光谱。光谱显示CCH弯曲模存在复杂的分裂模式，这归因于³Π与³Σ⁻电子态之间的Renner-Teller效应和伪Jahn-Teller耦合效应。通过一个三态非绝热模型（已利用CH伸缩模的高分辨率红外数据验证），我们对宽带红外光谱进行了谱线归属，包括在上述高分辨率光谱中观测到的一个额外Π型振动电子特征。我们的研究结果表明，分裂模式对Π-Σ能隙具有显著敏感性，其耦合效应极强，甚至弯曲振动的零点振动运动足以破坏该离子的振动电子结构。这种结构紧凑的离子兼具强耦合效应和高质量光谱数据，可作为评估非绝热模型的典范体系。

摘要 (Abstract)

The ethynyl radical cation, CCH$^+$ ($^3Π$), offers a unique system for fundamental spectroscopic studies of non-adiabatic effects due to its open-shell linear structure and the presence of a low-lying $^3Σ^-$ state, which induces notable perturbations in the (ro-)vibrational spectrum. To probe these effects, we recorded the broadband vibrational spectrum of CCH$^+$ from 350-3450 cm$^{-1}$ using leak-out spectroscopy. The spectrum reveals a complex splitting pattern in the CCH bending mode, attributed to Renner-Teller and pseudo-Jahn-Teller coupling effects between the $^3 Π$ and $^3 Σ^-$ electronic states. A three-state diabatic model, validated here against high-resolution IR data of the CH stretching mode, facilitated assignments within the broadband infrared (IR) spectrum, including an additional $Π$ vibronic feature observed in the aforementioned high-resolution spectrum. Our results highlight a pronounced sensitivity of the splitting pattern to the $Π$-$Σ$ energy gap, with couplings so large that even the zero-point vibrational motion of the bending vibration is sufficient to disrupt the vibronic structure of this ion. This compact ion, with strong coupling effects and high-quality spectroscopic data, serves as an exemplary system for evaluating non-adiabatic models.

关键词: CCH+, Renner-Teller effect, pseudo-Jahn-Teller coupling, vibrational spectrum, non-adiabatic effects, spectroscopic studies, ethynyl radical cation, vibronic structure

Token 消耗统计

总计: 828,411 tokens（输入 558,495 / 输出 269,916）

模型	输入	输出	合计
deepseek-chat	488,157	269,916	758,073
glm-4.7	70,338	0	70,338

📊 ArXiv 研究报告 (2026-04-14)

📌 配置信息

关键词列表（共 27 个，总权重 27.0）

评分设置

📈 论文统计

⭐ 及格论文详细分析

1. From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

2. Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication

3. OASIS: Online Activation Subspace Learning for Memory-Efficient Training

4. SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

5. Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs

6. Generalization and Scaling Laws for Mixture-of-Experts Transformers

📋 所有论文列表

1. ✅ From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

2. ✅ Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication

3. ✅ OASIS: Online Activation Subspace Learning for Memory-Efficient Training

4. ✅ SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

5. ✅ Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs

6. ✅ Generalization and Scaling Laws for Mixture-of-Experts Transformers

7. ❌ ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

8. ❌ Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

9. ❌ Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching

10. ❌ Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks

11. ❌ Text-Conditioned Multi-Expert Regression Framework for Fully Automated Multi-Abutment Design

12. ❌ VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

13. ❌ Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy

14. ❌ Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

15. ❌ Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision

16. ❌ Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

17. ❌ Envisioning the Future, One Step at a Time

18. ❌ VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

19. ❌ Strategic Algorithmic Monoculture:Experimental Evidence from Coordination Games

20. ❌ Semantic Rate-Distortion for Bounded Multi-Agent Communication: Capacity-Derived Semantic Spaces and the Communication Cost of Alignment

21. ❌ VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

22. ❌ BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

23. ❌ XFED: Non-Collusive Model Poisoning Attack Against Byzantine-Robust Federated Classifiers

24. ❌ RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

25. ❌ SafeMind: A Risk-Aware Differentiable Control Framework for Adaptive and Safe Quadruped Locomotion

26. ❌ Process Reward Agents for Steering Knowledge-Intensive Reasoning

27. ❌ E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

28. ❌ SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

29. ❌ ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

30. ❌ Many-Tier Instruction Hierarchy in LLM Agents

31. ❌ Physics-guided surrogate learning enables zero-shot control of turbulent wings

32. ❌ TME-PSR: Time-aware, Multi-interest, and Explanation Personalization for Sequential Recommendation

33. ❌ On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework

34. ❌ Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

35. ❌ Three Modalities, Two Design Probes, One Prototype, and No Vision: Experience-Based Co-Design of a Multi-modal 3D Data Visualization Tool

36. ❌ Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?

37. ❌ Yes, But Not Always. Generative AI Needs Nuanced Opt-in

38. ❌ PhysInOne: Visual Physics Learning and Reasoning in One Suite

39. ❌ HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

40. ❌ The AI Codebase Maturity Model: From Assisted Coding to Self-Sustaining Systems

41. ❌ BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

42. ❌ LLM-Rosetta: A Hub-and-Spoke Intermediate Representation for Cross-Provider LLM API Translation

43. ❌ Visually-Guided Policy Optimization for Multimodal Reasoning

44. ❌ Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym

45. ❌ Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents

46. ❌ SatQNet: Satellite-assisted Quantum Network Entanglement Routing Using Directed Line Graph Neural Networks

47. ❌ SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

48. ❌ SAGE: A Service Agent Graph-guided Evaluation Benchmark

49. ❌ Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

50. ❌ DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

51. ❌ DDSP-QbE++: Improving Speech Quality for Speech Anonymisation for Atypical Speech

52. ❌ Statistical Properties of the King Wen Sequence: An Anti-Habituation Structure That Does Not Improve Neural Network Training

53. ❌ Neural Distribution Prior for LiDAR Out-of-Distribution Detection

54. ❌ The Fast Lane Hypothesis: Von Economo Neurons Implement a Biological Speed-Accuracy Tradeoff

55. ❌ GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

56. ❌ On the Role of DAG topology in Energy-Aware Cloud Scheduling : A GNN-Based Deep Reinforcement Learning Approach

57. ❌ Artificial intelligence can persuade people to take political actions

58. ❌ Vision Transformers for Preoperative CT-Based Prediction of Histopathologic Chemotherapy Response Score in High-Grade Serous Ovarian Carcinoma

59. ❌ Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation

60. ❌ Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

61. ❌ Persona-E$^2$: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events

62. ❌ Structuring versus Problematizing: How LLM-based Agents Scaffold Learning in Diagnostic Reasoning

63. ❌ CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation

64. ❌ EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers

65. ❌ Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

66. ❌ PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

67. ❌ TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training