📊 ArXiv 研究报告 (2026-04-09)

生成时间: 2026-04-09 09:25:42 数据源: ArXiv

📌 配置信息

关键词列表（共 27 个，总权重 27.0）

关键词	权重	类型
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	主要
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	主要
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	主要
“Scaling Laws” AND “Data Quality”	1.0	主要
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	主要
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	主要
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	主要
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	主要
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	主要
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	主要
“Context Window Extension” OR “Long Context LLMs”	1.0	主要
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	主要
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	主要
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	主要
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	主要
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	主要
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	主要
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	主要
“Multi-agent Systems” OR “Agent Coordination”	1.0	主要
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	主要
“Speculative Decoding” OR “Inference Acceleration”	1.0	主要
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	主要
“Mechanistic Interpretability” OR “Explainable AI”	1.0	主要
“World Models” AND “General World Models”	1.0	主要
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	主要
“In-context Learning” OR “Many-shot Learning”	1.0	主要
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	主要

评分设置

每个关键词最大分: 15
及格分公式: 5.0 + 0.8 × 总权重
当前及格分: 26.6

📈 论文统计

总抓取: 345 篇
及格论文: 11 篇 (3.2%)

⭐ 及格论文详细分析

1. UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learni

作者: Xiaolong Wei, Zerun Zhu, Simin Niu, Xingyu Zhang, Peiying Yu, Changxuan Xiao, Yuchen Li, Jicheng Yang, Zhejun Zhao, Chong Meng, Long Xia, Daiting Shi 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05517v1

评分: 61.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文UniCreative提出了一种用于创意写作的统一强化学习框架，核心贡献是AC-GenRM奖励模型和ACPO优化算法，旨在解决长文本连贯性与短文本表达性之间的平衡问题。该研究与LLM高度相关（10分），因为它专注于大模型在创意写作中的应用；与Alignment和RLHF/DPO高度相关（10分），因为它提出了无参考的强化学习对齐方法；与Post-training/SFT相关（8分），因为它涉及模型对齐而不依赖监督微调；与Long Context LLMs相关（8分），因为它处理长文本生成中的规划问题；与CoT Reasoning、System 2 Thinking和Self-Correction有一定关联（5分），因为模型展现出元认知能力，能区分任务类型。其他关键词如MoE、SLMs、Scaling Laws、PEFT、RAG等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出UniCreative框架，通过自适应约束感知奖励模型（AC-GenRM）和无参考强化学习算法（ACPO），解决了创意写作中长文本连贯性与短文本表达性的平衡问题，显著提升了模型在多样化写作任务中的性能，并使其获得了自主区分任务类型的元认知能力。

摘要翻译

创造性写作的一个根本性挑战在于调和长文本叙事所需的全局连贯性与短文本创作所追求的局部表现力之间的固有张力。长文本生成需要明确的宏观规划，而短篇创作则往往要求自发且无约束的表达。然而，现有的对齐范式通常采用静态奖励信号，并严重依赖高质量监督数据，这类数据成本高昂且难以扩展。为此，我们提出了 \textbf{UniCreative}，一个统一的无需参考的强化学习框架。我们首先引入了 \textbf{AC-GenRM}，这是一个自适应约束感知奖励模型，它能动态合成针对特定查询的评判标准，以提供细粒度的偏好判断。利用这些信号，我们提出了 \textbf{ACPO}，一种策略优化算法，该算法能够在无需监督微调和真实参考文本的情况下，使模型在内容质量和结构范式两方面与人类偏好对齐。实证结果表明，AC-GenRM 与专家评估高度一致，而 ACPO 则能显著提升模型在多样化写作任务上的性能。关键的是，我们的分析揭示了一种新兴的元认知能力：模型学会了自主区分需要严格规划的任务与适合直接生成的任务，从而验证了我们直接对齐方法的有效性。

摘要 (Abstract)

A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale. To address this, we propose \textbf{UniCreative}, a unified reference-free reinforcement learning framework. We first introduce \textbf{AC-GenRM}, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments. Leveraging these signals, we propose \textbf{ACPO}, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms without supervised fine-tuning and ground-truth references. Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks. Crucially, our analysis reveals an emergent meta-cognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.

关键词: creative writing, reinforcement learning, alignment, reward model, policy optimization, long-form generation, short-form generation, meta-cognitive ability

2. Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small

作者: Yinan Liu, Dongying Lin, Sigang Luo, Xiaochun Yang, Bin Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05875v1

评分: 55.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文提出JCQL框架，结合LLM和SLM解决知识库补全(KBC)和知识库问答(KBQA)的联合任务。核心相关关键词：LLMs（10分，论文核心使用LLM进行推理）、SLMs（10分，论文核心使用SLM进行KBC）、LLM Agents（10分，明确使用LLM agent-based KBQA模型）、Hallucination Mitigation（10分，明确解决LLM幻觉问题）。中等相关关键词：Supervised Fine-tuning（5分，提到增量微调KBC模型）、Chain of Thought（5分，涉及推理路径）、Tool Use（5分，将KBC模型作为agent动作使用）。其他关键词与论文内容无直接关联，得0分。

!!! tip deepseek-chat TL;DR

该论文提出JCQL框架，通过结合大语言模型(LLM)和小语言模型(SLM)的优势，以迭代方式解决知识库补全(KBC)和知识库问答(KBQA)的联合任务，实验表明该框架在两个基准数据集上均优于现有基线方法。

摘要翻译

知识库（Knowledge Bases，KBs）在各种应用中发挥着关键作用。作为两项具有代表性的知识库相关任务，知识库补全（Knowledge Base Completion，KBC）与知识库问答（Knowledge Base Question Answering，KBQA）紧密关联且本质互补。因此，联合解决KBC与KBQA任务以使其相互促进将大有裨益。然而，现有研究通常依赖小语言模型（Small Language Model，SLM）来联合增强这两项任务，而忽略了大语言模型（Large Language Model，LLM）强大的推理能力。本文通过结合LLM与SLM的优势，提出了一种新颖的框架JCQL，该框架能够以迭代方式使这两项任务相互增强。为使KBC增强KBQA，我们通过将基于SLM训练的KBC模型作为智能体的一项行动，来增强基于LLM智能体的KBQA模型的推理路径，从而缓解LLM在KBQA中存在的幻觉问题与高计算成本问题。为使KBQA增强KBC，我们利用KBQA的推理路径作为补充训练数据，对KBC模型进行增量微调，从而提升SLM在KBC任务中的能力。在两个公开基准数据集上的大量实验表明，JCQL在KBC与KBQA任务上均超越了所有基线方法。

摘要 (Abstract)

Knowledge Bases (KBs) play a key role in various applications. As two representative KB-related tasks, knowledge base completion (KBC) and knowledge base question answering (KBQA) are closely related and inherently complementary with each other. Thus, it will be beneficial to solve the task of joint KBC and KBQA to make them reinforce each other. However, existing studies usually rely on the small language model (SLM) to enhance them jointly, and the large language model (LLM)’s strong reasoning ability is ignored. In this paper, by combining the strengths of the LLM with the SLM, we propose a novel framework JCQL, which can make these two tasks enhance each other in an iterative manner. To make KBC enhance KBQA, we augment the LLM agent-based KBQA model’s reasoning paths by incorporating an SLM-trained KBC model as an action of the agent, alleviating the LLM’s hallucination and high computational costs issue in KBQA. To make KBQA enhance KBC, we incrementally fine-tune the KBC model by leveraging KBQA’s reasoning paths as its supplementary training data, improving the ability of the SLM in KBC. Extensive experiments over two public benchmark data sets demonstrate that JCQL surpasses all baselines for both KBC and KBQA tasks.

关键词: Large Language Models, Small Language Models, Knowledge Base Completion, Knowledge Base Question Answering, LLM Agents, Hallucination Mitigation, Joint Framework, Iterative Enhancement

3. Improving Sparse Memory Finetuning

作者: Satyam Goyal, Anirudh Kanchi, Garv Shah, Prakhar Gupta 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05248v1

评分: 48.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	8.0/10	8.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究稀疏内存微调（SMF）方法，直接涉及LLMs、稀疏模型、参数高效微调（PEFT/LoRA）和微调（SFT）等关键词，其中LLMs、PEFT和SFT是核心内容（10分），稀疏模型相关（8分），小模型和预训练/适应有一定关联（5分），其余关键词如对齐、推理、代理等未涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型（LLMs）在持续学习中的灾难性遗忘问题，提出了一种基于KL散度的稀疏内存微调（SMF）方法，通过局部化参数更新有效学习新知识并最小化遗忘。

摘要翻译

大型语言模型（LLMs）通常在训练后处于静态，而实际应用需要持续适应新知识，同时不削弱现有能力。更新模型的标准方法（如全参数微调或参数高效方法，例如LoRA）面临一个根本性的权衡：灾难性遗忘。这些方法修改了共享的稠密表征，导致任务间相互干扰。稀疏记忆微调（Sparse Memory Finetuning, SMF）通过将更新定位到显式记忆层中的一小部分参数，提供了一种有前景的替代方案。在本工作中，我们提出了一个开源流程，为现有的预训练模型（Qwen-2.5-0.5B）加装稀疏记忆模块，从而能够在消费级硬件上实现有效的持续学习。我们通过引入一种基于库尔巴克-莱布勒（Kullback-Leibler, KL）散度、具有理论依据的槽位选择机制，扩展了先前的工作。该机制优先针对相对于背景分布信息上“意外”的标记进行记忆更新。我们的实验表明，经过改装的模型能够以对保留能力的最小遗忘来获取新的事实知识，从而在实际场景中验证了稀疏更新假说。

摘要 (Abstract)

Large Language Models (LLMs) are typically static after training, yet real-world applications require continual adaptation to new knowledge without degrading existing capabilities. Standard approaches to updating models, like full finetuning or parameter-efficient methods (e.g., LoRA), face a fundamental trade-off: catastrophic forgetting. They modify shared dense representations, causing interference across tasks. Sparse Memory Finetuning (SMF) offers a promising alternative by localizing updates to a small subset of parameters in explicit memory layers. In this work, we present an open-source pipeline to retrofit existing pretrained models (Qwen-2.5-0.5B) with sparse memory modules, enabling effective continual learning on consumer hardware. We extend prior work by introducing a theoretically grounded slot-selection mechanism based on Kullback-Leibler (KL) divergence, which prioritizes memory updates for informationally “surprising” tokens relative to a background distribution. Our experiments demonstrate that our retrofitted models can acquire new factual knowledge with minimal forgetting of held-out capabilities, validating the sparse update hypothesis in a practical setting.

关键词: Sparse Memory Finetuning, Large Language Models, Continual Learning, Catastrophic Forgetting, Parameter-efficient Fine-tuning, KL Divergence, Memory Updates, Qwen-2.5-0.5B

4. BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detect

作者: Zhongxing Zhang, Emily K. Vraga, Jisu Huh, Jaideep Srivastava 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06022v1

评分: 47.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文BiMind专注于错误信息检测，提出了一种双头推理框架，结合了内容内部推理和知识增强推理。与关键词的相关性分析如下：1）与"Large Language Models"有一定关联（5分），因为该研究属于大模型在信息验证领域的应用，但未明确使用LLMs；2）与"Retrieval-Augmented Generation"（8分）高度相关，因为论文提出了自检索知识机制，通过kNN检索构建语义记忆；3）与"Chain of Thought"和"System 2 Thinking"（各8分）相关，因为双头推理框架体现了多步推理和深度推理思想；4）与"Hallucination Mitigation"（10分）高度相关，因为错误信息检测直接针对事实性和真实性验证；5）与"Mechanistic Interpretability"（8分）相关，因为论文提出了可解释性诊断和VoX度量。其他关键词如MoE、量化、指令调优等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

论文提出了一种名为BiMind的双头推理模型，通过注意力几何适配器和自检索知识机制来解决错误信息检测中内容验证与外部知识整合的平衡问题，实验表明该模型在公开数据集上优于先进方法并提供了可解释的诊断。

摘要翻译

错误信息通过破坏内容的真实性与完整性构成重大挑战，然而多数检测方法难以在注意力几何结构坍塌的情况下，协同平衡文本内容验证与外部知识修正。为解决此问题，我们提出一种双头推理框架BiMind，该框架将内容内部推理与知识增强推理进行解耦。在BiMind中，我们引入了三项核心创新：（i）注意力几何适配器，通过令牌条件偏移重塑注意力逻辑值并缓解注意力坍塌；（ii）自检索知识机制，通过k近邻检索构建领域内语义记忆，并借助特征级线性调制注入检索到的邻近信息；（iii）不确定性感知融合策略，包括熵门控融合与可训练的共识头，并通过对称KL散度共识正则化器进行稳定。为量化知识贡献，我们定义了一种新颖的指标——经验价值，用于度量知识增强推理带来的实例级逻辑值增益。在公开数据集上的实验结果表明，我们的BiMind模型优于先进的检测方法，并能对知识何时及为何重要提供可解释的诊断分析。

摘要 (Abstract)

Incorrect information poses significant challenges by disrupting content veracity and integrity, yet most detection approaches struggle to jointly balance textual content verification with external knowledge modification under collapsed attention geometries. To address this issue, we propose a dual-head reasoning framework, BiMind, which disentangles content-internal reasoning from knowledge-augmented reasoning. In BiMind, we introduce three core innovations: (i) an attention geometry adapter that reshapes attention logits via token-conditioned offsets and mitigates attention collapse; (ii) a self-retrieval knowledge mechanism, which constructs an in-domain semantic memory through kNN retrieval and injects retrieved neighbors via feature-wise linear modulation; (iii) the uncertainty-aware fusion strategies, including entropy-gated fusion and a trainable agreement head, stabilized by a symmetric Kullback-Leibler agreement regularizer. To quantify the knowledge contributions, we define a novel metric, Value-of-eXperience (VoX), to measure instance-wise logit gains from knowledge-augmented reasoning. Experiment results on public datasets demonstrate that our BiMind model outperforms advanced detection approaches and provides interpretable diagnostics on when and why knowledge matters.

关键词: incorrect information detection, dual-head reasoning, attention geometry adapter, self-retrieval knowledge, knowledge-augmented reasoning, interpretable diagnostics, Value-of-eXperience (VoX), attention collapse mitigation

5. Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation

作者: Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05795v1

评分: 45.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

评分理由: 论文核心研究大语言模型（LLMs）在心理健康领域的应用评估，因此与"Large Language Models"和"AI for Science"高度相关（10分）。提出的CARE框架包含知识蒸馏的链式思维推理，与"Chain of Thought"高度相关（10分）。论文涉及评估AI响应与治疗原则的"Alignment"，以及框架中的对比范例检索与"Retrieval-Augmented Generation"概念相关，各给5分。框架的推理过程涉及深度分析，与"System 2 Thinking"有一定关联（5分）。其他关键词如MoE、SFT、RLHF等未在论文中涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文研究了如何评估AI生成的心理健康治疗师式回应的临床适当性和有效性，提出了一个名为CARE的多阶段评估框架，该框架整合了对话内上下文、对比范例检索和知识蒸馏的链式思维推理，实验表明CARE在评估治疗保真度方面显著优于强基线模型。

摘要翻译

大型语言模型在心理健康应用中的日益普及，要求建立能够超越表层流畅性、评估其与心理治疗最佳实践契合度的原则性评估框架。尽管现有系统展现出对话能力，但它们缺乏结构化机制来评估其对核心治疗原则的遵循程度。本文研究了如何评估人工智能生成的治疗师式回应在临床基础上的适当性与有效性。我们依据一个细粒度序数量表，从六项治疗原则维度评估每个治疗师话语：非评判性接纳、温暖、尊重自主性、积极倾听、反思性理解以及情境适当性。我们提出了FAITH-M基准数据集，该数据集包含专家标注的序数评分，并提出了CARE评估框架——一个整合了对话内部语境、对比范例检索以及知识蒸馏链式思维推理的多阶段评估框架。实验表明，CARE的F1分数达到63.34，而作为其基础模型的强基线Qwen3的F1分数为38.56，这代表了64.26%的性能提升，表明性能增益源于结构化推理与语境建模，而非仅依赖基础模型能力。专家评估及外部数据集测试进一步证明了该框架在领域转移下的鲁棒性，同时也凸显了在建模隐含临床细微差别方面存在的挑战。总体而言，CARE为评估人工智能心理健康系统的治疗保真度提供了一个基于临床的框架。

摘要 (Abstract)

The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fluency. While recent systems exhibit conversational competence, they lack structured mechanisms to evaluate adherence to core therapeutic principles. In this paper, we study the problem of evaluating AI-generated therapist-like responses for clinically grounded appropriateness and effectiveness. We assess each therapists utterance along six therapeutic principles: non-judgmental acceptance, warmth, respect for autonomy, active listening, reflective understanding, and situational appropriateness using a fine-grained ordinal scale. We introduce FAITH-M, a benchmark annotated with expert-assigned ordinal ratings, and propose CARE, a multi-stage evaluation framework that integrates intra-dialogue context, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning. Experiments show that CARE achieves an F-1 score of 63.34 versus the strong baseline Qwen3 F-1 score of 38.56 which is a 64.26 improvement, which also serves as its backbone, indicating that gains arise from structured reasoning and contextual modeling rather than backbone capacity alone. Expert assessment and external dataset evaluations further demonstrate robustness under domain shift, while highlighting challenges in modelling implicit clinical nuance. Overall, CARE provides a clinically grounded framework for evaluating therapeutic fidelity in AI mental health systems.

关键词: Large Language Models, Mental Health, Therapeutic Principles, Evaluation Framework, Chain-of-Thought Reasoning, AI for Science, Clinical Appropriateness, Benchmark FAITH-M

6. MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language

作者: Han Jang, Junhyeok Lee, Heeseong Eum, Kyu Sung Choi 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05738v1

评分: 41.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

评分理由: 论文专注于医学视觉语言模型（Med-VLMs）的专家-大众语义对齐，属于大模型在生物医学领域的应用。核心贡献是MedLayBench-V基准数据集，用于训练和评估能弥合临床专家与患者沟通鸿沟的下一代Med-VLMs。因此，与"AI for Science"高度相关（10分），涉及医学AI应用。与"Alignment"相关（8分），因为研究专家与大众语义对齐。与"Hallucination Mitigation"相关（8分），因为数据集构建方法（SCGR）旨在避免简化导致的幻觉风险。与"Large Language Models"、“Pre-training”、“Post-training"有一定关联（各5分），因Med-VLMs基于大模型技术，且基准用于模型训练（包括预训练和微调）。其他关键词如MoE、量化、推理加速等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对医学视觉语言模型在患者沟通中存在的专业术语障碍问题，提出了首个大规模多模态基准数据集MedLayBench-V，通过结构化概念锚定细化方法确保语义等效性，为训练能实现专家-大众语义对齐的下一代模型提供了验证基础。

摘要翻译

医学视觉-语言模型（Med-VLMs）在解读诊断影像方面已达到专家水平。然而，当前模型主要基于专业文献进行训练，限制了其以患者为中心护理所需的通俗表达方式传递发现结果的能力。尽管以文本为中心的研究已积极开发简化医学术语的资源，但目前仍严重缺乏旨在促进通俗化医学图像理解的大规模多模态基准。为填补这一资源空白，我们推出了MedLayBench-V——首个致力于专家-通俗语义对齐的大规模多模态基准。与可能产生幻觉的简单简化方法不同，我们的数据集通过结构化概念锚定优化（SCGR）流程构建。该方法通过整合统一医学语言系统（UMLS）概念唯一标识符（CUIs）与微观实体约束，强制实现严格的语义等效性。MedLayBench-V为训练和评估新一代Med-VLMs提供了经过验证的基础，这些模型将能够弥合临床专家与患者之间的沟通鸿沟。

摘要 (Abstract)

Medical Vision-Language Models (Med-VLMs) have achieved expert-level proficiency in interpreting diagnostic imaging. However, current models are predominantly trained on professional literature, limiting their ability to communicate findings in the lay register required for patient-centered care. While text-centric research has actively developed resources for simplifying medical jargon, there is a critical absence of large-scale multimodal benchmarks designed to facilitate lay-accessible medical image understanding. To bridge this resource gap, we introduce MedLayBench-V, the first large-scale multimodal benchmark dedicated to expert-lay semantic alignment. Unlike naive simplification approaches that risk hallucination, our dataset is constructed via a Structured Concept-Grounded Refinement (SCGR) pipeline. This method enforces strict semantic equivalence by integrating Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) with micro-level entity constraints. MedLayBench-V provides a verified foundation for training and evaluating next-generation Med-VLMs capable of bridging the communication divide between clinical experts and patients.

关键词: Medical Vision-Language Models, Med-VLMs, expert-lay semantic alignment, multimodal benchmark, MedLayBench-V, Structured Concept-Grounded Refinement, UMLS Concept Unique Identifiers, patient-centered care

7. HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

作者: Bowen Zeng, Feiyang Ren, Jun Zhang, Xiaoling Gu, Ke Chen, Lidan Shou, Huan Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05887v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	15.0/10	15.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究KV缓存压缩技术以提升多模态大语言模型推理效率，与关键词"KV Cache Compression"高度相关（15分），直接解决大模型推理中的内存和延迟问题，属于大模型技术原理创新。论文涉及大模型（10分）和推理加速（10分），与模型压缩有一定关联（5分）。其他关键词如MoE、SLMs、对齐、RAG等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型推理中KV缓存内存开销大、延迟高的问题，提出了HybridKV混合压缩框架，实现了最高7.9倍的内存减少和1.52倍的解码加速，且几乎不影响模型性能。

摘要翻译

多模态大语言模型（MLLMs）在文本、图像和视频的统一推理方面取得了进展，但其推理过程受到键值（KV）缓存快速增长的阻碍。每个视觉输入会扩展为数千个标记，导致缓存随上下文长度线性增长，并在整个解码过程中驻留在GPU内存中，这即使在高端GPU上也会带来极高的内存开销和延迟。一种常见的解决方案是在固定的分配预算下以不同粒度压缩缓存：标记级均匀丢弃重要性较低的标记，层级在不同层间调整保留比例，而头级则在各个注意力头之间重新分配预算。然而，这些方法仅停留在分配层面，忽视了注意力头具有异质性行为，需要不同的压缩策略。我们提出了HybridKV，一种混合KV缓存压缩框架，它在三个阶段整合了互补策略：首先使用以文本为中心的注意力将头部分为静态或动态类型；然后采用自上而下的预算分配方案分层分配KV预算；最后，静态头通过文本优先剪枝进行压缩，动态头则通过分块检索进行压缩。在Qwen2.5-VL-7B模型上对11个多模态基准进行的实验表明，HybridKV将KV缓存内存降低了高达$7.9\times$，解码速度提升了$1.52\times$，且相较于全缓存MLLM，性能几乎无下降甚至有所提升。

摘要 (Abstract)

Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at different granularities: token-level uniformly discards less important tokens, layer-level varies retention across layers, and head-level redistributes budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads that require distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates complementary strategies in three stages: heads are first classified into static or dynamic types using text-centric attention; then a top-down budget allocation scheme hierarchically assigns KV budgets; finally, static heads are compressed by text-prior pruning and dynamic heads by chunk-wise retrieval. Experiments on 11 multimodal benchmarks with Qwen2.5-VL-7B show that HybridKV reduces KV cache memory by up to $7.9\times$ and achieves $1.52\times$ faster decoding, with almost no performance drop or even higher relative to the full-cache MLLM.

关键词: KV cache compression, Multimodal Large Language Models, inference efficiency, memory overhead, attention heads, pruning, retrieval, decoding acceleration

8. Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

作者: Peixi Peng, Housheng Xie, Yanling Wei, Guangcong Ruan, Xiaoyang Zou, Qian Cao, Yongjian Nian, Guoyan Zheng 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05649v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

评分理由: 论文提出RATNet，一个用于胃肠道内窥镜诊断的基础模型，属于大模型在生物医学领域的应用。核心相关关键词：1）“Large Language Models” OR “LLMs” OR “Foundation Models”：论文明确开发了一个基础模型（foundation model），高度相关（10分）。2）“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”：论文采用循环预训练策略（cyclic pre-training strategy）从异构数据中获取知识，高度相关（10分）。3）“Post-training” OR “Supervised Fine-tuning” OR “SFT”：模型支持微调（fine-tuning），高度相关（10分）。4）“AI for Science” OR “Bioinformatics” OR “Cheminformatics”：论文专注于胃肠道疾病诊断的AI应用，属于生物信息学/科学AI范畴，高度相关（10分）。其他关键词如MoE、量化、推理加速、幻觉缓解等，论文未涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文针对胃肠道内窥镜图像诊断中现有AI模型泛化性、适应性和鲁棒性不足的问题，提出了一个基于类比推理的基础模型RATNet，通过循环预训练从异构数据中获取知识，在多种诊断场景下优于现有模型。

摘要翻译

胃肠道疾病正日益加重全球健康负担，而内窥镜检查是早期诊断的主要工具。然而，常规内窥镜图像解读仍存在病变漏检和效率有限的问题。尽管人工智能辅助诊断展现出潜力，但由于医学数据有限、领域偏移以及标注异质性，现有模型往往缺乏泛化性、适应性、鲁棒性和可扩展性。为应对这些挑战，我们开发了RATNet——一个基于类比推理的胃肠道内窥镜影像基础模型。RATNet通过循环预训练策略，从五个胃肠道内窥镜数据集的异质性专家标注中获取并迁移知识。其架构包含编码器、关联知识获取与迁移（Relevance-knowledge Acquisition and Transfer, RAT）模块、投影器以及多任务头，支持微调、线性探测和零样本迁移。评估表明，在六种场景下——常见胃肠道疾病诊断、罕见疾病的少样本学习、向新医疗机构的零样本迁移、长尾疾病分布下的鲁棒性、对新疾病的适应性，以及通过联邦学习实现隐私保护部署——RATNet均优于包括GastroNet和GastroVision在内的现有基础模型。其优势源于类比推理机制：该机制将图像衍生的后验知识与学习到的先验知识库进行匹配，并迁移关联知识以指导诊断，从而提升泛化能力并增强对偏见的抵抗性。RATNet具有开放性和成本效益，支持自动整合异质性标注而无需人工标签统一，同时降低了数据获取成本，这使其成为智能胃肠道诊断——特别是在资源有限环境中——的实用基础框架。

摘要 (Abstract)

Gastrointestinal diseases impose a growing global health burden, and endoscopy is a primary tool for early diagnosis. However, routine endoscopic image interpretation still suffers from missed lesions and limited efficiency. Although AI-assisted diagnosis has shown promise, existing models often lack generalizability, adaptability, robustness, and scalability because of limited medical data, domain shift, and heterogeneous annotations. To address these challenges, we develop RATNet, a foundation model for gastrointestinal endoscopy imaging based on analogical reasoning. RATNet acquires and transfers knowledge from heterogeneous expert annotations across five gastrointestinal endoscopy datasets through a cyclic pre-training strategy. Its architecture consists of an encoder, a relevance-knowledge acquisition and transfer (RAT) module, a projector, and a multi-task head, and supports fine-tuning, linear probing, and zero-shot transfer. Evaluations show that RATNet outperforms existing foundation models, including GastroNet and GastroVision, across six scenarios: diagnosis of common gastrointestinal diseases, few-shot learning for rare diseases, zero-shot transfer to new medical sites, robustness under long-tailed disease distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning. Its advantage comes from an analogical reasoning mechanism that matches image-derived posterior knowledge to a learned prior knowledge base and transfers relative knowledge to guide diagnosis, improving generalization and resistance to bias. RATNet is open and cost-effective, supports automatic integration of heterogeneous annotations without manual label unification, and reduces data acquisition costs, making it a practical foundation for intelligent gastrointestinal diagnosis, especially in resource-limited settings.

关键词: foundation model, gastrointestinal endoscopy, analogical reasoning, cyclic pre-training, heterogeneous annotations, multi-task learning, medical AI, diagnosis

9. Mechanistic Circuit-Based Knowledge Editing in Large Language Models

作者: Tianyi Zhao, Yinhan He, Wendy Zheng, Chen Chen 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05876v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型（LLMs）的知识编辑问题，特别是解决多步推理中的"推理鸿沟”。因此，与"Large Language Models"高度相关（10分）。论文专注于多步推理链（multi-step reasoning chains），与"Chain of Thought"和"System 2 Thinking"高度相关（各10分）。论文提出基于机制电路（mechanistic circuit）的方法，与"Mechanistic Interpretability"高度相关（10分）。其他关键词如MoE、SLMs、Scaling Laws、Pre-training、Alignment、RAG、Quantization、AI for Science等，论文未涉及，故给0分。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型在动态环境中更新知识时存在的"推理鸿沟"问题，提出了一种基于机制电路的知识编辑框架MCircKE，通过识别并精准修改负责特定推理任务的因果电路，有效提升了模型在多步推理链中利用编辑知识的能力。

摘要翻译

在现实动态环境中部署大语言模型（LLM）带来了更新其预训练知识的挑战。现有的知识编辑方法虽然能够可靠地修正孤立事实，但常常面临“推理鸿沟”问题：模型能回忆起被编辑的事实，却无法在多步推理链中有效运用该事实。为弥合这一鸿沟，我们提出MCircKE（基于机理电路的知识编辑），这是一种新颖的框架，能够实现精确的“定位-适配”编辑流程。MCircKE首先识别负责特定推理任务的因果电路，该电路既捕获事实的存储，也捕捉其逻辑结论的传递路径；随后，仅在此定位出的电路内部进行精准的参数更新。在MQuAKE-3K基准测试上的大量实验表明，该方法在知识编辑的多跳推理任务中具有显著有效性。

摘要 (Abstract)

Deploying Large Language Models (LLMs) in real-world dynamic environments raises the challenge of updating their pre-trained knowledge. While existing knowledge editing methods can reliably patch isolated facts, they frequently suffer from a “Reasoning Gap”, where the model recalls the edited fact but fails to utilize it in multi-step reasoning chains. To bridge this gap, we introduce MCircKE (\underline{M}echanistic \underline{Circ}uit-based \underline{K}nowledge \underline{E}diting), a novel framework that enables a precise “map-and-adapt” editing procedure. MCircKE first identifies the causal circuits responsible for a specific reasoning task, capturing both the storage of the fact and the routing of its logical consequences. It then surgically update parameters exclusively within this mapped circuit. Extensive experiments on the MQuAKE-3K benchmark demonstrate the effectiveness of the proposed method for multi-hop reasoning in knowledge editing.

关键词: Large Language Models, Knowledge Editing, Mechanistic Circuits, Multi-step Reasoning, Reasoning Gap, Causal Circuits, Parameter Update, MQuAKE-3K Benchmark

10. The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

作者: Prashant C. Raju 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.04155v1

评分: 35.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

评分理由: 论文核心研究科学基础模型（特别是生物学和物理学领域）中连续几何表示与离散标记化之间的根本矛盾，提出了"几何对齐税"概念。与"Foundation Models"高度相关（10分），因为论文明确研究科学基础模型；与"AI for Science"高度相关（10分），因为研究生物学和物理学领域的AI应用；与"Mechanistic Interpretability"高度相关（10分），因为论文深入分析模型内部表示和几何失真问题；与"Pre-training"有一定关联（5分），因为涉及模型训练和表示学习；其他关键词如MoE、SFT、RAG等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文研究发现科学基础模型在离散标记化过程中会损失连续几何结构，提出了"几何对齐税"概念，并通过实验证明连续目标能显著减少几何失真，同时评估了14个生物基础模型并识别了三种失败机制。

摘要翻译

生物学与物理学基础模型虽能优化预测准确性，但其内部表征系统性地无法保持所建模系统的连续几何结构。我们揭示了根本原因：几何对齐税——即强制将连续流形通过离散分类瓶颈所产生的固有代价。在合成动力系统上的受控消融实验表明，在相同编码器上用连续输出头替代交叉熵损失，可将几何失真降低高达8.5倍；而学习得到的码本则呈现非单调的双重约束现象：更精细的量化虽能改善重建效果，却会恶化几何保持。在连续目标下，三种架构的差异仅为1.3倍；而在离散标记化条件下，其差异扩大至3000倍。通过率失真理论与互信息神经估计（MINE）对14个生物基础模型进行评估，我们识别出三种失效机制：局部-全局解耦、表征压缩与几何空泛。受控实验证实，Evo 2模型在真实DNA数据上表现出的反向互补稳健性反映的是保守的序列组成特征，而非习得的对称性。所有模型均未能同时实现低失真、高互信息与全局连贯性。

摘要 (Abstract)

Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2’s reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.

关键词: Foundation Models, Geometric Alignment Tax, Continuous Geometry, Tokenization, Scientific Models, Biological Foundation Models, Representational Distortion, Rate-Distortion Theory

11. From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails i

作者: Christopher Koch 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05229v1

评分: 33.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文专注于Agentic AI系统的治理和控制框架，与"LLM Agents/Autonomous Agents/Agentic Workflow"高度相关（10分），因为这是论文的核心研究对象。与"Tool Use/Function Calling/API Tool Use"相关（8分），因为论文提到Agentic AI系统使用工具并产生外部影响。与"Large Language Models/LLMs/Foundation Models"和"Instruction Tuning/Alignment/Value Alignment"有一定关联（各5分），因为Agentic AI通常基于大模型，且治理涉及价值对齐。与"Multi-agent Systems/Agent Coordination"有一定关联（5分），因为治理框架可能适用于多代理系统。其他关键词如MoE、Scaling Laws、训练方法、推理技术、模型压缩等与论文的治理和控制主题无关，得0分。

!!! tip deepseek-chat TL;DR

该论文针对Agentic AI系统在运行时产生的治理风险，提出了一种分层翻译方法，将治理标准转化为可执行的运行时护栏，并通过采购代理案例研究进行了验证。

摘要翻译

具身化人工智能系统能够进行规划、使用工具、维持状态，并产生具有外部效应的多步骤轨迹。这些特性催生了一个与单轮生成式人工智能存在本质差异的治理问题：重大风险在执行过程中涌现，而不仅仅出现在模型开发或部署阶段。因此，诸如ISO/IEC 42001、ISO/IEC 23894、ISO/IEC 42005、ISO/IEC 5338、ISO/IEC 38507等标准以及NIST人工智能风险管理框架与具身化AI高度相关，但它们本身并不能直接转化为可实施的运行时护栏。本文提出一种分层转化方法，将源于标准的治理目标连接到四个控制层：治理目标、设计时约束、运行时协调以及保障反馈。该方法区分了治理目标、技术控制、运行时护栏和保障证据；引入了用于层级分配的控制元组和运行时可执行性评估准则；并通过一个采购代理案例研究展示了该方法。核心主张是审慎的：标准应指导控制在架构、运行时策略、人工介入升级和审计等环节的部署，而运行时护栏应仅保留给那些具备可观测性、确定性和足够时效性，从而有理由在执行时进行干预的控制措施。

摘要 (Abstract)

Agentic AI systems plan, use tools, maintain state, and produce multi-step trajectories with external effects. Those properties create a governance problem that differs materially from single-turn generative AI: important risks emerge dur- ing execution, not only at model development or deployment time. Governance standards such as ISO/IEC 42001, ISO/IEC 23894, ISO/IEC 42005, ISO/IEC 5338, ISO/IEC 38507, and the NIST AI Risk Management Framework are therefore highly relevant to agentic AI, but they do not by themselves yield implementable runtime guardrails. This paper proposes a layered translation method that connects standards-derived governance objectives to four control layers: governance objectives, design- time constraints, runtime mediation, and assurance feedback. It distinguishes governance objectives, technical controls, runtime guardrails, and assurance evidence; introduces a control tuple and runtime-enforceability rubric for layer assignment; and demonstrates the method in a procurement-agent case study. The central claim is modest: standards should guide control placement across architecture, runtime policy, human escalation, and audit, while runtime guardrails are reserved for controls that are observable, determinate, and time-sensitive enough to justify execution-time intervention.

关键词: Agentic AI, runtime guardrails, governance standards, control layers, runtime mediation, procurement agent, enforceable controls, AI risk management

📋 所有论文列表

1. ✅ UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning

评分: 61.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文提出UniCreative框架，通过自适应约束感知奖励模型（AC-GenRM）和无参考强化学习算法（ACPO），解决了创意写作中长文本连贯性与短文本表达性的平衡问题，显著提升了模型在多样化写作任务中的性能，并使其获得了自主区分任务类型的元认知能力。

摘要翻译

创造性写作的一个根本性挑战在于调和长文本叙事所需的全局连贯性与短文本创作所追求的局部表现力之间的固有张力。长文本生成需要明确的宏观规划，而短篇创作则往往要求自发且无约束的表达。然而，现有的对齐范式通常采用静态奖励信号，并严重依赖高质量监督数据，这类数据成本高昂且难以扩展。为此，我们提出了 \textbf{UniCreative}，一个统一的无需参考的强化学习框架。我们首先引入了 \textbf{AC-GenRM}，这是一个自适应约束感知奖励模型，它能动态合成针对特定查询的评判标准，以提供细粒度的偏好判断。利用这些信号，我们提出了 \textbf{ACPO}，一种策略优化算法，该算法能够在无需监督微调和真实参考文本的情况下，使模型在内容质量和结构范式两方面与人类偏好对齐。实证结果表明，AC-GenRM 与专家评估高度一致，而 ACPO 则能显著提升模型在多样化写作任务上的性能。关键的是，我们的分析揭示了一种新兴的元认知能力：模型学会了自主区分需要严格规划的任务与适合直接生成的任务，从而验证了我们直接对齐方法的有效性。

摘要 (Abstract)

A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale. To address this, we propose \textbf{UniCreative}, a unified reference-free reinforcement learning framework. We first introduce \textbf{AC-GenRM}, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments. Leveraging these signals, we propose \textbf{ACPO}, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms without supervised fine-tuning and ground-truth references. Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks. Crucially, our analysis reveals an emergent meta-cognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.

关键词: creative writing, reinforcement learning, alignment, reward model, policy optimization, long-form generation, short-form generation, meta-cognitive ability

2. ✅ Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models

作者: Yinan Liu, Dongying Lin, Sigang Luo, Xiaochun Yang, Bin Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05875v1

评分: 55.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文提出JCQL框架，通过结合大语言模型(LLM)和小语言模型(SLM)的优势，以迭代方式解决知识库补全(KBC)和知识库问答(KBQA)的联合任务，实验表明该框架在两个基准数据集上均优于现有基线方法。

摘要翻译

知识库（Knowledge Bases，KBs）在各种应用中发挥着关键作用。作为两项具有代表性的知识库相关任务，知识库补全（Knowledge Base Completion，KBC）与知识库问答（Knowledge Base Question Answering，KBQA）紧密关联且本质互补。因此，联合解决KBC与KBQA任务以使其相互促进将大有裨益。然而，现有研究通常依赖小语言模型（Small Language Model，SLM）来联合增强这两项任务，而忽略了大语言模型（Large Language Model，LLM）强大的推理能力。本文通过结合LLM与SLM的优势，提出了一种新颖的框架JCQL，该框架能够以迭代方式使这两项任务相互增强。为使KBC增强KBQA，我们通过将基于SLM训练的KBC模型作为智能体的一项行动，来增强基于LLM智能体的KBQA模型的推理路径，从而缓解LLM在KBQA中存在的幻觉问题与高计算成本问题。为使KBQA增强KBC，我们利用KBQA的推理路径作为补充训练数据，对KBC模型进行增量微调，从而提升SLM在KBC任务中的能力。在两个公开基准数据集上的大量实验表明，JCQL在KBC与KBQA任务上均超越了所有基线方法。

摘要 (Abstract)

Knowledge Bases (KBs) play a key role in various applications. As two representative KB-related tasks, knowledge base completion (KBC) and knowledge base question answering (KBQA) are closely related and inherently complementary with each other. Thus, it will be beneficial to solve the task of joint KBC and KBQA to make them reinforce each other. However, existing studies usually rely on the small language model (SLM) to enhance them jointly, and the large language model (LLM)’s strong reasoning ability is ignored. In this paper, by combining the strengths of the LLM with the SLM, we propose a novel framework JCQL, which can make these two tasks enhance each other in an iterative manner. To make KBC enhance KBQA, we augment the LLM agent-based KBQA model’s reasoning paths by incorporating an SLM-trained KBC model as an action of the agent, alleviating the LLM’s hallucination and high computational costs issue in KBQA. To make KBQA enhance KBC, we incrementally fine-tune the KBC model by leveraging KBQA’s reasoning paths as its supplementary training data, improving the ability of the SLM in KBC. Extensive experiments over two public benchmark data sets demonstrate that JCQL surpasses all baselines for both KBC and KBQA tasks.

关键词: Large Language Models, Small Language Models, Knowledge Base Completion, Knowledge Base Question Answering, LLM Agents, Hallucination Mitigation, Joint Framework, Iterative Enhancement

3. ✅ Improving Sparse Memory Finetuning

作者: Satyam Goyal, Anirudh Kanchi, Garv Shah, Prakhar Gupta 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05248v1

评分: 48.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	8.0/10	8.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文针对大语言模型（LLMs）在持续学习中的灾难性遗忘问题，提出了一种基于KL散度的稀疏内存微调（SMF）方法，通过局部化参数更新有效学习新知识并最小化遗忘。

摘要翻译

大型语言模型（LLMs）通常在训练后处于静态，而实际应用需要持续适应新知识，同时不削弱现有能力。更新模型的标准方法（如全参数微调或参数高效方法，例如LoRA）面临一个根本性的权衡：灾难性遗忘。这些方法修改了共享的稠密表征，导致任务间相互干扰。稀疏记忆微调（Sparse Memory Finetuning, SMF）通过将更新定位到显式记忆层中的一小部分参数，提供了一种有前景的替代方案。在本工作中，我们提出了一个开源流程，为现有的预训练模型（Qwen-2.5-0.5B）加装稀疏记忆模块，从而能够在消费级硬件上实现有效的持续学习。我们通过引入一种基于库尔巴克-莱布勒（Kullback-Leibler, KL）散度、具有理论依据的槽位选择机制，扩展了先前的工作。该机制优先针对相对于背景分布信息上“意外”的标记进行记忆更新。我们的实验表明，经过改装的模型能够以对保留能力的最小遗忘来获取新的事实知识，从而在实际场景中验证了稀疏更新假说。

摘要 (Abstract)

Large Language Models (LLMs) are typically static after training, yet real-world applications require continual adaptation to new knowledge without degrading existing capabilities. Standard approaches to updating models, like full finetuning or parameter-efficient methods (e.g., LoRA), face a fundamental trade-off: catastrophic forgetting. They modify shared dense representations, causing interference across tasks. Sparse Memory Finetuning (SMF) offers a promising alternative by localizing updates to a small subset of parameters in explicit memory layers. In this work, we present an open-source pipeline to retrofit existing pretrained models (Qwen-2.5-0.5B) with sparse memory modules, enabling effective continual learning on consumer hardware. We extend prior work by introducing a theoretically grounded slot-selection mechanism based on Kullback-Leibler (KL) divergence, which prioritizes memory updates for informationally “surprising” tokens relative to a background distribution. Our experiments demonstrate that our retrofitted models can acquire new factual knowledge with minimal forgetting of held-out capabilities, validating the sparse update hypothesis in a practical setting.

关键词: Sparse Memory Finetuning, Large Language Models, Continual Learning, Catastrophic Forgetting, Parameter-efficient Fine-tuning, KL Divergence, Memory Updates, Qwen-2.5-0.5B

4. ✅ BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection

作者: Zhongxing Zhang, Emily K. Vraga, Jisu Huh, Jaideep Srivastava 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06022v1

评分: 47.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

论文提出了一种名为BiMind的双头推理模型，通过注意力几何适配器和自检索知识机制来解决错误信息检测中内容验证与外部知识整合的平衡问题，实验表明该模型在公开数据集上优于先进方法并提供了可解释的诊断。

摘要翻译

错误信息通过破坏内容的真实性与完整性构成重大挑战，然而多数检测方法难以在注意力几何结构坍塌的情况下，协同平衡文本内容验证与外部知识修正。为解决此问题，我们提出一种双头推理框架BiMind，该框架将内容内部推理与知识增强推理进行解耦。在BiMind中，我们引入了三项核心创新：（i）注意力几何适配器，通过令牌条件偏移重塑注意力逻辑值并缓解注意力坍塌；（ii）自检索知识机制，通过k近邻检索构建领域内语义记忆，并借助特征级线性调制注入检索到的邻近信息；（iii）不确定性感知融合策略，包括熵门控融合与可训练的共识头，并通过对称KL散度共识正则化器进行稳定。为量化知识贡献，我们定义了一种新颖的指标——经验价值，用于度量知识增强推理带来的实例级逻辑值增益。在公开数据集上的实验结果表明，我们的BiMind模型优于先进的检测方法，并能对知识何时及为何重要提供可解释的诊断分析。

摘要 (Abstract)

Incorrect information poses significant challenges by disrupting content veracity and integrity, yet most detection approaches struggle to jointly balance textual content verification with external knowledge modification under collapsed attention geometries. To address this issue, we propose a dual-head reasoning framework, BiMind, which disentangles content-internal reasoning from knowledge-augmented reasoning. In BiMind, we introduce three core innovations: (i) an attention geometry adapter that reshapes attention logits via token-conditioned offsets and mitigates attention collapse; (ii) a self-retrieval knowledge mechanism, which constructs an in-domain semantic memory through kNN retrieval and injects retrieved neighbors via feature-wise linear modulation; (iii) the uncertainty-aware fusion strategies, including entropy-gated fusion and a trainable agreement head, stabilized by a symmetric Kullback-Leibler agreement regularizer. To quantify the knowledge contributions, we define a novel metric, Value-of-eXperience (VoX), to measure instance-wise logit gains from knowledge-augmented reasoning. Experiment results on public datasets demonstrate that our BiMind model outperforms advanced detection approaches and provides interpretable diagnostics on when and why knowledge matters.

5. ✅ Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation

作者: Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05795v1

评分: 45.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

该论文研究了如何评估AI生成的心理健康治疗师式回应的临床适当性和有效性，提出了一个名为CARE的多阶段评估框架，该框架整合了对话内上下文、对比范例检索和知识蒸馏的链式思维推理，实验表明CARE在评估治疗保真度方面显著优于强基线模型。

摘要翻译

大型语言模型在心理健康应用中的日益普及，要求建立能够超越表层流畅性、评估其与心理治疗最佳实践契合度的原则性评估框架。尽管现有系统展现出对话能力，但它们缺乏结构化机制来评估其对核心治疗原则的遵循程度。本文研究了如何评估人工智能生成的治疗师式回应在临床基础上的适当性与有效性。我们依据一个细粒度序数量表，从六项治疗原则维度评估每个治疗师话语：非评判性接纳、温暖、尊重自主性、积极倾听、反思性理解以及情境适当性。我们提出了FAITH-M基准数据集，该数据集包含专家标注的序数评分，并提出了CARE评估框架——一个整合了对话内部语境、对比范例检索以及知识蒸馏链式思维推理的多阶段评估框架。实验表明，CARE的F1分数达到63.34，而作为其基础模型的强基线Qwen3的F1分数为38.56，这代表了64.26%的性能提升，表明性能增益源于结构化推理与语境建模，而非仅依赖基础模型能力。专家评估及外部数据集测试进一步证明了该框架在领域转移下的鲁棒性，同时也凸显了在建模隐含临床细微差别方面存在的挑战。总体而言，CARE为评估人工智能心理健康系统的治疗保真度提供了一个基于临床的框架。

摘要 (Abstract)

The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fluency. While recent systems exhibit conversational competence, they lack structured mechanisms to evaluate adherence to core therapeutic principles. In this paper, we study the problem of evaluating AI-generated therapist-like responses for clinically grounded appropriateness and effectiveness. We assess each therapists utterance along six therapeutic principles: non-judgmental acceptance, warmth, respect for autonomy, active listening, reflective understanding, and situational appropriateness using a fine-grained ordinal scale. We introduce FAITH-M, a benchmark annotated with expert-assigned ordinal ratings, and propose CARE, a multi-stage evaluation framework that integrates intra-dialogue context, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning. Experiments show that CARE achieves an F-1 score of 63.34 versus the strong baseline Qwen3 F-1 score of 38.56 which is a 64.26 improvement, which also serves as its backbone, indicating that gains arise from structured reasoning and contextual modeling rather than backbone capacity alone. Expert assessment and external dataset evaluations further demonstrate robustness under domain shift, while highlighting challenges in modelling implicit clinical nuance. Overall, CARE provides a clinically grounded framework for evaluating therapeutic fidelity in AI mental health systems.

关键词: Large Language Models, Mental Health, Therapeutic Principles, Evaluation Framework, Chain-of-Thought Reasoning, AI for Science, Clinical Appropriateness, Benchmark FAITH-M

6. ✅ MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models

作者: Han Jang, Junhyeok Lee, Heeseong Eum, Kyu Sung Choi 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05738v1

评分: 41.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

该论文针对医学视觉语言模型在患者沟通中存在的专业术语障碍问题，提出了首个大规模多模态基准数据集MedLayBench-V，通过结构化概念锚定细化方法确保语义等效性，为训练能实现专家-大众语义对齐的下一代模型提供了验证基础。

摘要翻译

医学视觉-语言模型（Med-VLMs）在解读诊断影像方面已达到专家水平。然而，当前模型主要基于专业文献进行训练，限制了其以患者为中心护理所需的通俗表达方式传递发现结果的能力。尽管以文本为中心的研究已积极开发简化医学术语的资源，但目前仍严重缺乏旨在促进通俗化医学图像理解的大规模多模态基准。为填补这一资源空白，我们推出了MedLayBench-V——首个致力于专家-通俗语义对齐的大规模多模态基准。与可能产生幻觉的简单简化方法不同，我们的数据集通过结构化概念锚定优化（SCGR）流程构建。该方法通过整合统一医学语言系统（UMLS）概念唯一标识符（CUIs）与微观实体约束，强制实现严格的语义等效性。MedLayBench-V为训练和评估新一代Med-VLMs提供了经过验证的基础，这些模型将能够弥合临床专家与患者之间的沟通鸿沟。

摘要 (Abstract)

Medical Vision-Language Models (Med-VLMs) have achieved expert-level proficiency in interpreting diagnostic imaging. However, current models are predominantly trained on professional literature, limiting their ability to communicate findings in the lay register required for patient-centered care. While text-centric research has actively developed resources for simplifying medical jargon, there is a critical absence of large-scale multimodal benchmarks designed to facilitate lay-accessible medical image understanding. To bridge this resource gap, we introduce MedLayBench-V, the first large-scale multimodal benchmark dedicated to expert-lay semantic alignment. Unlike naive simplification approaches that risk hallucination, our dataset is constructed via a Structured Concept-Grounded Refinement (SCGR) pipeline. This method enforces strict semantic equivalence by integrating Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) with micro-level entity constraints. MedLayBench-V provides a verified foundation for training and evaluating next-generation Med-VLMs capable of bridging the communication divide between clinical experts and patients.

7. ✅ HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

作者: Bowen Zeng, Feiyang Ren, Jun Zhang, Xiaoling Gu, Ke Chen, Lidan Shou, Huan Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05887v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	15.0/10	15.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型推理中KV缓存内存开销大、延迟高的问题，提出了HybridKV混合压缩框架，实现了最高7.9倍的内存减少和1.52倍的解码加速，且几乎不影响模型性能。

摘要翻译

多模态大语言模型（MLLMs）在文本、图像和视频的统一推理方面取得了进展，但其推理过程受到键值（KV）缓存快速增长的阻碍。每个视觉输入会扩展为数千个标记，导致缓存随上下文长度线性增长，并在整个解码过程中驻留在GPU内存中，这即使在高端GPU上也会带来极高的内存开销和延迟。一种常见的解决方案是在固定的分配预算下以不同粒度压缩缓存：标记级均匀丢弃重要性较低的标记，层级在不同层间调整保留比例，而头级则在各个注意力头之间重新分配预算。然而，这些方法仅停留在分配层面，忽视了注意力头具有异质性行为，需要不同的压缩策略。我们提出了HybridKV，一种混合KV缓存压缩框架，它在三个阶段整合了互补策略：首先使用以文本为中心的注意力将头部分为静态或动态类型；然后采用自上而下的预算分配方案分层分配KV预算；最后，静态头通过文本优先剪枝进行压缩，动态头则通过分块检索进行压缩。在Qwen2.5-VL-7B模型上对11个多模态基准进行的实验表明，HybridKV将KV缓存内存降低了高达$7.9\times$，解码速度提升了$1.52\times$，且相较于全缓存MLLM，性能几乎无下降甚至有所提升。

摘要 (Abstract)

Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at different granularities: token-level uniformly discards less important tokens, layer-level varies retention across layers, and head-level redistributes budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads that require distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates complementary strategies in three stages: heads are first classified into static or dynamic types using text-centric attention; then a top-down budget allocation scheme hierarchically assigns KV budgets; finally, static heads are compressed by text-prior pruning and dynamic heads by chunk-wise retrieval. Experiments on 11 multimodal benchmarks with Qwen2.5-VL-7B show that HybridKV reduces KV cache memory by up to $7.9\times$ and achieves $1.52\times$ faster decoding, with almost no performance drop or even higher relative to the full-cache MLLM.

关键词: KV cache compression, Multimodal Large Language Models, inference efficiency, memory overhead, attention heads, pruning, retrieval, decoding acceleration

8. ✅ Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

该论文针对胃肠道内窥镜图像诊断中现有AI模型泛化性、适应性和鲁棒性不足的问题，提出了一个基于类比推理的基础模型RATNet，通过循环预训练从异构数据中获取知识，在多种诊断场景下优于现有模型。

摘要翻译

胃肠道疾病正日益加重全球健康负担，而内窥镜检查是早期诊断的主要工具。然而，常规内窥镜图像解读仍存在病变漏检和效率有限的问题。尽管人工智能辅助诊断展现出潜力，但由于医学数据有限、领域偏移以及标注异质性，现有模型往往缺乏泛化性、适应性、鲁棒性和可扩展性。为应对这些挑战，我们开发了RATNet——一个基于类比推理的胃肠道内窥镜影像基础模型。RATNet通过循环预训练策略，从五个胃肠道内窥镜数据集的异质性专家标注中获取并迁移知识。其架构包含编码器、关联知识获取与迁移（Relevance-knowledge Acquisition and Transfer, RAT）模块、投影器以及多任务头，支持微调、线性探测和零样本迁移。评估表明，在六种场景下——常见胃肠道疾病诊断、罕见疾病的少样本学习、向新医疗机构的零样本迁移、长尾疾病分布下的鲁棒性、对新疾病的适应性，以及通过联邦学习实现隐私保护部署——RATNet均优于包括GastroNet和GastroVision在内的现有基础模型。其优势源于类比推理机制：该机制将图像衍生的后验知识与学习到的先验知识库进行匹配，并迁移关联知识以指导诊断，从而提升泛化能力并增强对偏见的抵抗性。RATNet具有开放性和成本效益，支持自动整合异质性标注而无需人工标签统一，同时降低了数据获取成本，这使其成为智能胃肠道诊断——特别是在资源有限环境中——的实用基础框架。

摘要 (Abstract)

Gastrointestinal diseases impose a growing global health burden, and endoscopy is a primary tool for early diagnosis. However, routine endoscopic image interpretation still suffers from missed lesions and limited efficiency. Although AI-assisted diagnosis has shown promise, existing models often lack generalizability, adaptability, robustness, and scalability because of limited medical data, domain shift, and heterogeneous annotations. To address these challenges, we develop RATNet, a foundation model for gastrointestinal endoscopy imaging based on analogical reasoning. RATNet acquires and transfers knowledge from heterogeneous expert annotations across five gastrointestinal endoscopy datasets through a cyclic pre-training strategy. Its architecture consists of an encoder, a relevance-knowledge acquisition and transfer (RAT) module, a projector, and a multi-task head, and supports fine-tuning, linear probing, and zero-shot transfer. Evaluations show that RATNet outperforms existing foundation models, including GastroNet and GastroVision, across six scenarios: diagnosis of common gastrointestinal diseases, few-shot learning for rare diseases, zero-shot transfer to new medical sites, robustness under long-tailed disease distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning. Its advantage comes from an analogical reasoning mechanism that matches image-derived posterior knowledge to a learned prior knowledge base and transfers relative knowledge to guide diagnosis, improving generalization and resistance to bias. RATNet is open and cost-effective, supports automatic integration of heterogeneous annotations without manual label unification, and reduces data acquisition costs, making it a practical foundation for intelligent gastrointestinal diagnosis, especially in resource-limited settings.

关键词: foundation model, gastrointestinal endoscopy, analogical reasoning, cyclic pre-training, heterogeneous annotations, multi-task learning, medical AI, diagnosis

9. ✅ Mechanistic Circuit-Based Knowledge Editing in Large Language Models

作者: Tianyi Zhao, Yinhan He, Wendy Zheng, Chen Chen 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05876v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文针对大语言模型在动态环境中更新知识时存在的"推理鸿沟"问题，提出了一种基于机制电路的知识编辑框架MCircKE，通过识别并精准修改负责特定推理任务的因果电路，有效提升了模型在多步推理链中利用编辑知识的能力。

摘要翻译

在现实动态环境中部署大语言模型（LLM）带来了更新其预训练知识的挑战。现有的知识编辑方法虽然能够可靠地修正孤立事实，但常常面临“推理鸿沟”问题：模型能回忆起被编辑的事实，却无法在多步推理链中有效运用该事实。为弥合这一鸿沟，我们提出MCircKE（基于机理电路的知识编辑），这是一种新颖的框架，能够实现精确的“定位-适配”编辑流程。MCircKE首先识别负责特定推理任务的因果电路，该电路既捕获事实的存储，也捕捉其逻辑结论的传递路径；随后，仅在此定位出的电路内部进行精准的参数更新。在MQuAKE-3K基准测试上的大量实验表明，该方法在知识编辑的多跳推理任务中具有显著有效性。

摘要 (Abstract)

Deploying Large Language Models (LLMs) in real-world dynamic environments raises the challenge of updating their pre-trained knowledge. While existing knowledge editing methods can reliably patch isolated facts, they frequently suffer from a “Reasoning Gap”, where the model recalls the edited fact but fails to utilize it in multi-step reasoning chains. To bridge this gap, we introduce MCircKE (\underline{M}echanistic \underline{Circ}uit-based \underline{K}nowledge \underline{E}diting), a novel framework that enables a precise “map-and-adapt” editing procedure. MCircKE first identifies the causal circuits responsible for a specific reasoning task, capturing both the storage of the fact and the routing of its logical consequences. It then surgically update parameters exclusively within this mapped circuit. Extensive experiments on the MQuAKE-3K benchmark demonstrate the effectiveness of the proposed method for multi-hop reasoning in knowledge editing.

关键词: Large Language Models, Knowledge Editing, Mechanistic Circuits, Multi-step Reasoning, Reasoning Gap, Causal Circuits, Parameter Update, MQuAKE-3K Benchmark

10. ✅ The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

作者: Prashant C. Raju 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.04155v1

评分: 35.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

该论文研究发现科学基础模型在离散标记化过程中会损失连续几何结构，提出了"几何对齐税"概念，并通过实验证明连续目标能显著减少几何失真，同时评估了14个生物基础模型并识别了三种失败机制。

摘要翻译

生物学与物理学基础模型虽能优化预测准确性，但其内部表征系统性地无法保持所建模系统的连续几何结构。我们揭示了根本原因：几何对齐税——即强制将连续流形通过离散分类瓶颈所产生的固有代价。在合成动力系统上的受控消融实验表明，在相同编码器上用连续输出头替代交叉熵损失，可将几何失真降低高达8.5倍；而学习得到的码本则呈现非单调的双重约束现象：更精细的量化虽能改善重建效果，却会恶化几何保持。在连续目标下，三种架构的差异仅为1.3倍；而在离散标记化条件下，其差异扩大至3000倍。通过率失真理论与互信息神经估计（MINE）对14个生物基础模型进行评估，我们识别出三种失效机制：局部-全局解耦、表征压缩与几何空泛。受控实验证实，Evo 2模型在真实DNA数据上表现出的反向互补稳健性反映的是保守的序列组成特征，而非习得的对称性。所有模型均未能同时实现低失真、高互信息与全局连贯性。

摘要 (Abstract)

Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2’s reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.

关键词: Foundation Models, Geometric Alignment Tax, Continuous Geometry, Tokenization, Scientific Models, Biological Foundation Models, Representational Distortion, Rate-Distortion Theory

11. ✅ From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI

作者: Christopher Koch 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05229v1

评分: 33.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文针对Agentic AI系统在运行时产生的治理风险，提出了一种分层翻译方法，将治理标准转化为可执行的运行时护栏，并通过采购代理案例研究进行了验证。

摘要翻译

具身化人工智能系统能够进行规划、使用工具、维持状态，并产生具有外部效应的多步骤轨迹。这些特性催生了一个与单轮生成式人工智能存在本质差异的治理问题：重大风险在执行过程中涌现，而不仅仅出现在模型开发或部署阶段。因此，诸如ISO/IEC 42001、ISO/IEC 23894、ISO/IEC 42005、ISO/IEC 5338、ISO/IEC 38507等标准以及NIST人工智能风险管理框架与具身化AI高度相关，但它们本身并不能直接转化为可实施的运行时护栏。本文提出一种分层转化方法，将源于标准的治理目标连接到四个控制层：治理目标、设计时约束、运行时协调以及保障反馈。该方法区分了治理目标、技术控制、运行时护栏和保障证据；引入了用于层级分配的控制元组和运行时可执行性评估准则；并通过一个采购代理案例研究展示了该方法。核心主张是审慎的：标准应指导控制在架构、运行时策略、人工介入升级和审计等环节的部署，而运行时护栏应仅保留给那些具备可观测性、确定性和足够时效性，从而有理由在执行时进行干预的控制措施。

摘要 (Abstract)

Agentic AI systems plan, use tools, maintain state, and produce multi-step trajectories with external effects. Those properties create a governance problem that differs materially from single-turn generative AI: important risks emerge dur- ing execution, not only at model development or deployment time. Governance standards such as ISO/IEC 42001, ISO/IEC 23894, ISO/IEC 42005, ISO/IEC 5338, ISO/IEC 38507, and the NIST AI Risk Management Framework are therefore highly relevant to agentic AI, but they do not by themselves yield implementable runtime guardrails. This paper proposes a layered translation method that connects standards-derived governance objectives to four control layers: governance objectives, design- time constraints, runtime mediation, and assurance feedback. It distinguishes governance objectives, technical controls, runtime guardrails, and assurance evidence; introduces a control tuple and runtime-enforceability rubric for layer assignment; and demonstrates the method in a procurement-agent case study. The central claim is modest: standards should guide control placement across architecture, runtime policy, human escalation, and audit, while runtime guardrails are reserved for controls that are observable, determinate, and time-sensitive enough to justify execution-time intervention.

关键词: Agentic AI, runtime guardrails, governance standards, control layers, runtime mediation, procurement agent, enforceable controls, AI risk management

12. ❌ A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

作者: Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein, Minu A. Aghevli, Kamonica L. Craig, Elizabeth M. Oliva, Joseph Erdos, Jodie Trafton, Ioana Danciu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06028v1

评分: 25.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

评分理由: 论文核心是开发一个用于临床信息提取的LLM验证框架，属于LLM在生物医学领域的应用。因此，与"Large Language Models"和"AI for Science"高度相关（10分）。论文关注验证和减少错误，与"Hallucination Mitigation"有一定关联（5分）。其他关键词涉及具体技术细节（如MoE、SFT、RAG等）或推理方法（如CoT、MCTS），论文未涉及，得0分。

!!! tip deepseek-chat TL;DR

该研究提出了一个多阶段验证框架，用于在弱监督下评估基于大语言模型的临床信息提取，并在近百万份临床笔记中成功提取物质使用障碍诊断，证明了无需密集人工标注即可实现可扩展、可信赖的部署。

摘要翻译

大型语言模型（LLM）在从非结构化健康记录中提取具有临床意义的信息方面展现出潜力，但其在真实场景中的应用因缺乏可扩展且可信赖的验证方法而受到限制。传统评估方法严重依赖标注密集的参考标准或不完整的结构化数据，限制了在人群规模上的可行性。我们提出了一种基于LLM的临床信息提取多阶段验证框架，该框架能够在弱监督下实现严格评估。该框架整合了提示校准、基于规则的合理性过滤、语义基础性评估、使用独立高性能裁判LLM进行的针对性验证性评估、选择性专家评审以及外部预测效度分析，从而无需大量人工标注即可量化不确定性并描述错误模式。我们将此框架应用于从919,783份临床记录中提取涵盖11种物质类别的物质使用障碍（SUD）诊断。基于规则的过滤和语义基础性评估移除了14.59%缺乏支持、不相关或结构上不合理的LLM阳性提取结果。对于高不确定性案例，裁判LLM的评估结果与领域专家评审显示出高度一致性（Gwet’s AC1=0.80）。以裁判评估的输出为参考，主LLM在宽松匹配标准下实现了0.80的F1分数。LLM提取的SUD诊断在预测后续参与SUD专科治疗方面，也比基于结构化数据的基线方法更准确（AUC=0.80）。这些发现表明，无需依赖标注密集的评估，即可实现基于LLM的临床信息提取的可扩展且可信赖的部署。

摘要 (Abstract)

Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation. We applied this framework to extraction of substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes. Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM’s assessments showed substantial agreement with subject matter expert review (Gwet’s AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses also predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80). These findings demonstrate that scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation.

关键词: Large language models, Clinical information extraction, Validation framework, Weak supervision, Substance use disorder, Trustworthy deployment, Multi-stage validation, Predictive validity

13. ❌ How LLMs Follow Instructions: Skillful Coordination, Not a Universal Mechanism

作者: Elisabetta Rocchetti, Alfio Ferrara 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06015v1

评分: 25.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 该论文核心研究指令调优（Instruction Tuning）在大型语言模型（LLMs）中的工作机制，直接高度相关于关键词"Instruction Tuning"和"Large Language Models"。论文通过诊断性探测、因果消融等方法探究模型内部机制，与"Mechanistic Interpretability"有一定关联。论文未涉及其他关键词如MoE、SFT、RAG、推理方法、代理、压缩等具体技术或应用领域。

!!! tip deepseek-chat TL;DR

该研究通过实验探究指令调优后的大型语言模型遵循指令的机制，发现其并非依赖单一通用机制，而是通过动态协调多种语言技能来实现。

摘要翻译

指令调优通常被认为赋予了语言模型遵循指令的领域通用能力，但其内在机制仍不甚明晰。指令遵循究竟依赖于一种通用机制，还是组合性技能的调用？我们通过在三个经过指令调优的模型中，对九项不同任务进行诊断性探测来研究此问题。我们的分析提供了反对通用机制存在的聚合证据。首先，在所有任务上训练的通用探测器的表现持续低于针对特定任务的专用探测器，这表明表征共享有限。其次，跨任务迁移能力较弱，且依据技能相似性形成聚类。第三，因果消融实验揭示了稀疏且不对称的依赖关系，而非共享表征。此外，任务依据复杂度在不同网络层中分层呈现：结构性约束在早期层形成，而语义任务在较深层显现。最后，时序分析表明，约束满足在生成过程中作为动态监控运作，而非在生成前进行规划。这些发现表明，将指令遵循理解为多种语言能力的熟练协调，而非调用单一的抽象约束检查过程，是更为准确的描述。

摘要 (Abstract)

Instruction tuning is commonly assumed to endow language models with a domain-general ability to follow instructions, yet the underlying mechanism remains poorly understood. Does instruction-following rely on a universal mechanism or compositional skill deployment? We investigate this through diagnostic probing across nine diverse tasks in three instruction-tuned models. Our analysis provides converging evidence against a universal mechanism. First, general probes trained across all tasks consistently underperform task-specific specialists, indicating limited representational sharing. Second, cross-task transfer is weak and clustered by skill similarity. Third, causal ablation reveals sparse asymmetric dependencies rather than shared representations. Tasks also stratify by complexity across layers, with structural constraints emerging early and semantic tasks emerging late. Finally, temporal analysis shows constraint satisfaction operates as dynamic monitoring during generation rather than pre-generation planning. These findings indicate that instruction-following is better characterized as skillful coordination of diverse linguistic capabilities rather than deployment of a single abstract constraint-checking process.

关键词: Instruction Tuning, Large Language Models, Mechanism, Skill Coordination, Diagnostic Probing, Causal Ablation, Representational Sharing, Constraint Satisfaction

14. ❌ Exclusive Unlearning

作者: Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06154v1

评分: 25.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

评分理由: 论文核心研究LLMs在工业应用（如医疗、教育）中的安全问题，提出了一种新的机器遗忘方法（Exclusive Unlearning）来广泛移除有害内容，同时保留特定领域（如医学、数学）的知识。因此，与"Large Language Models"高度相关（10分），因为论文明确研究LLMs的应用挑战和解决方案；与"Instruction Tuning"或"Alignment"有一定关联（5分），因为涉及模型安全对齐和指令响应；与"Hallucination Mitigation"或"Factuality"有一定关联（5分），因为处理有害内容生成与事实性风险；与"AI for Science"或"Bioinformatics"有一定关联（5分），因为论文提到医疗等科学领域应用。其他关键词如MoE、SFT、RAG等未在摘要中提及或无关，得0分。

!!! tip deepseek-chat TL;DR

该论文针对大型语言模型在工业应用中生成有害内容的风险，提出了一种名为Exclusive Unlearning的新方法，通过广泛遗忘除需保留知识外的所有内容，实现了对广泛输入（包括越狱攻击）的安全保障，同时保持了在特定领域（如医学和数学）的指令响应能力。

摘要翻译

将大型语言模型（LLM）引入医疗、教育等工业应用时，生成有害内容的风险成为重大挑战。尽管现有的机器遗忘方法能够消除特定的有害知识与表达，但有害内容的多样性使得全面清除变得困难。本研究并未逐一列举需要遗忘的目标，而是提出了排他性遗忘（Exclusive Unlearning, EU），该方法旨在通过广泛遗忘我们希望保留的知识与表达之外的所有内容，实现广泛的有害信息移除。我们证明，通过排他性遗忘，可以获得一个能够确保对包括越狱攻击在内的广泛输入保持安全性，同时维持对医学、数学等特定领域多样化指令响应能力的模型。

摘要 (Abstract)

When introducing Large Language Models (LLMs) into industrial applications, such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, diverse harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except for the knowledge and expressions we wish to retain. We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to specific domains such as medicine and mathematics.

关键词: Large Language Models, Machine Unlearning, Harmful Content, Safety, Healthcare Applications, Jailbreak Attacks, Domain-specific Knowledge, Instruction Response

15. ❌ FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

作者: Mengtian Li, Kunyan Dai, Yi Ding, Ruobing Ni, Ying Zhang, Wenwu Wang, Zhifeng Xie 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05731v1

评分: 24.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	8.0/10	8.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文提出FoleyDesigner框架，用于电影剪辑的沉浸式立体声Foley生成。核心相关性在于：1）明确使用large language model (LLM)驱动混合机制（摘要中提到），因此与"Large Language Models"相关度8分；2）采用多智能体架构（multi-agent architecture）进行时空分析，与"LLM Agents"和"Multi-agent Systems"相关度各8分。其他关键词如MoE、Scaling Laws、RLHF、RAG等均未在论文中涉及，相关度为0。论文属于大模型在音频生成领域的应用，符合研究背景中"大模型在不同领域的研究应用"的要求。

!!! tip deepseek-chat TL;DR

该论文针对电影中手动创建时空对齐音频劳动密集型的问题，提出了FoleyDesigner框架，通过多智能体架构和LLM驱动的混合机制，实现了精确时空对齐的沉浸式立体声Foley生成，并引入了首个专业立体声音频数据集FilmStereo。

摘要翻译

拟音艺术在提升电影沉浸式听觉体验中起着关键作用，但手动创作时空对齐的音频仍是一项劳动密集型任务。我们提出FoleyDesigner——一个受专业拟音工作流程启发的新型框架，集成了影片片段分析、时空可控的拟音生成以及专业音频混音能力。该框架采用多智能体架构以实现精确的时空分析，通过基于视频帧提取的时空线索训练的潜在扩散模型实现时空对齐，并结合由大语言模型驱动的混合机制来模拟电影行业后期制作实践。针对电影领域高质量立体声音频数据集的缺乏，我们推出了FilmStereo——首个包含空间元数据、精确时间戳及八类常见拟音类别语义标注的专业立体声音频数据集。在应用层面，该框架支持交互式用户控制，同时保持与专业制作流程的无缝集成，包括符合ITU-R BS.775标准的5.1声道杜比全景声系统，从而提供广泛的创作灵活性。大量实验表明，相较于现有基线方法，我们的方案在时空对齐方面表现更优，且能无缝兼容专业电影制作标准。项目页面详见：https://gekiii996.github.io/FoleyDesigner/。

摘要 (Abstract)

Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing capabilities. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in film industry. To address the lack of high-quality stereo audio datasets in film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For applications, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with seamless compatibility with professional film production standards. The project page is available at https://gekiii996.github.io/FoleyDesigner/ .

关键词: Foley generation, spatio-temporal alignment, multi-agent architecture, latent diffusion models, large language model (LLM), stereo audio dataset, film production, audio mixing

16. ❌ Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

作者: Purva Chiniya, Kevin Scaria, Sagar Chaturvedi 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05179v1

评分: 23.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究LLMs的安全防护机制，提出Gradient-Controlled Decoding (GCD)方法防御jailbreak和prompt-injection攻击，因此与"Large Language Models"高度相关（10分）。论文涉及模型安全对齐和有害内容防范，与"Instruction Tuning" OR “Alignment"有一定关联（5分），与"Hallucination Mitigation” OR “Factuality"有较强关联（8分），因为都关注输出可靠性和安全性。其他关键词如MoE、Scaling Laws、Pre-training等未在论文中涉及，均为0分。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型易受越狱和提示注入攻击的问题，提出了一种无需训练的梯度控制解码安全护栏方法，在降低误报率的同时有效防御攻击并保证首令牌安全。

摘要翻译

大型语言模型（LLMs）仍易受到越狱攻击和直接提示注入攻击的影响，而最强的防御过滤器往往过度拒绝良性查询，损害用户体验。先前关于越狱和提示注入检测的研究（如GradSafe）通过单一的“全接受”锚定标记（“Sure”）来检测不安全提示，但其阈值较为脆弱，且无法在解码开始后确定性地保证不生成有害内容。我们提出梯度控制解码（Gradient-Controlled Decoding, GCD），这是一种无需训练的防护机制，它结合了接受锚定标记（“Sure”）和拒绝锚定标记（“Sorry”），从而收紧决策边界并显著降低误报率。在缓解阶段，若提示被标记为有害，GCD会在自回归解码恢复前预设注入一个或两个拒绝标记（“Sorry, I can’t…"），确保无论采用何种采样策略，首个生成标记的安全性。在ToxicChat、XSTest-v2和AdvBench测试集上，GCD在召回率相当的情况下将误报率较GradSafe降低了52%，攻击成功率较最强的纯解码基线降低了最高10%，在V100实例上平均增加低于15-20毫秒的延迟，可迁移至LLaMA-2-7B、Mixtral-8x7B和Qwen-2-7B模型，且仅需20个示例模板。

摘要 (Abstract)

Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt injection detection such as GradSafe, detects unsafe prompts with a single “accept all” anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token (“Sure”) and refusal anchor token (“Sorry”) tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens (“Sorry, I can’t…”) before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds under 15-20 ms latency on an average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.

关键词: Large Language Models, Safety Guardrail, Jailbreak Defense, Prompt Injection, Gradient-Controlled Decoding, False Positive Reduction, First-token Safety, Training-free Method

17. ❌ On the Role of Fault Localization Context for LLM-Based Program Repair

作者: Melika Sepidband, Hung Viet Pham, Hadi Hemmati 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05481v1

评分: 20.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究LLM在自动程序修复（APR）中的应用，特别是故障定位（FL）上下文的作用，因此与"Large Language Models"高度相关（10分）。研究涉及检索不同级别的上下文（文件、元素、行）来增强LLM的修复能力，这与"Retrieval-Augmented Generation"有一定关联（5分），因为RAG也涉及检索外部信息来增强生成，但论文更侧重于特定上下文的检索而非一般RAG。论文评估LLM在不同上下文配置下的性能，这隐含了上下文学习的概念，与"In-context Learning"有一定关联（5分）。其他关键词如MoE、SFT、对齐等未在论文中涉及，均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了在基于LLM的自动程序修复中，故障定位上下文（包括文件、元素和行级别）对修复性能的影响，发现更多上下文并不总是提高性能，文件级定位是关键因素，而LLM检索通常优于启发式方法。

摘要翻译

故障定位（Fault Localization, FL）是基于大语言模型（Large Language Model, LLM）的自动程序修复（Automated Program Repair, APR）的关键组成部分，但其影响尚未得到充分探究。具体而言，目前尚不清楚需要多大程度的定位、超出预测错误位置之外的额外上下文是否有益，以及应如何检索此类上下文。我们使用GPT-5-mini对500个SWE-bench Verified实例进行了大规模实证研究，评估了61种不同配置，这些配置在文件级、元素级和行级上下文上有所变化。我们的结果表明，增加上下文并不能持续提升修复性能。文件级定位是主导因素，相较于无文件基线实现了15-17倍的性能提升。扩展文件上下文通常与性能改善相关，在包含约6-10个相关文件的配置中最常观察到成功修复。元素级上下文扩展提供了有条件的增益，其效果高度依赖于文件上下文的质量，而行级上下文扩展则常因噪声放大而导致性能下降。基于LLM的检索方法通常优于结构性启发式方法，同时使用更少的文件和令牌。总体而言，最有效的故障定位上下文策略通常结合了较高抽象层次的广泛语义理解与精确的行级定位。这些发现挑战了我们关于增加定位上下文会统一提升自动程序修复性能的假设，并为设计基于大语言模型的故障定位策略提供了实用指导。

摘要 (Abstract)

Fault Localization (FL) is a key component of Large Language Model (LLM)-based Automated Program Repair (APR), yet its impact remains underexplored. In particular, it is unclear how much localization is needed, whether additional context beyond the predicted buggy location is beneficial, and how such context should be retrieved. We conduct a large-scale empirical study on 500 SWE-bench Verified instances using GPT-5-mini, evaluating 61 configurations that vary file-level, element-level, and line-level context. Our results show that more context does not consistently improve repair performance. File-level localization is the dominant factor, yielding a 15-17x improvement over a no-file baseline. Expanding file context is often associated with improved performance, with successful repairs most commonly observed in configurations with approximately 6-10 relevant files. Element-level context expansion provides conditional gains that depend strongly on the file context quality, while line-level context expansion frequently degrades performance due to noise amplification. LLM-based retrieval generally outperforms structural heuristics while using fewer files and tokens. Overall, the most effective FL context strategy typically combines a broad semantic understanding at higher abstraction levels with precise line-level localization. These findings challenge our assumption that increasing the localization context uniformly improves APR, and provide practical guidance for designing LLM-based FL strategies.

关键词: Large Language Models, Automated Program Repair, Fault Localization, Context Retrieval, Empirical Study, GPT-5-mini, SWE-bench, Performance Evaluation

18. ❌ Disentangling MLP Neuron Weights in Vocabulary Space

作者: Asaf Avrahamy, Yoav Gur-Arieh, Mor Geva 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06005v1

评分: 20.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文《Disentangling MLP Neuron Weights in Vocabulary Space》提出了一种名为ROTATE的数据无关方法，用于在权重空间中解耦MLP神经元，以解释语言模型中的信息编码。该研究直接针对大语言模型（Llama-3.1-8B-Instruct和Gemma-2-2B-it）进行实验，因此与"Large Language Models"高度相关（10分）。同时，论文的核心是模型解释性，属于"Mechanistic Interpretability"范畴，因此也高度相关（10分）。其他关键词如MoE、SLMs、训练方法、推理加速、AI for Science等均未在论文中涉及或提及，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为ROTATE的数据无关方法，通过优化神经元权重在词汇空间中的峰度来解耦MLP神经元，从而恢复可解释的词汇通道，为语言模型的机制解释提供了可扩展的细粒度构建块。

摘要翻译

解读模型权重中编码的信息始终是机制可解释性领域的核心挑战。本研究提出ROTATE（权重空间中的旋转优化词元对齐），这是一种无需前向传播、直接在权重空间中解耦MLP神经元的无数据方法。我们的方法基于一个关键统计现象：当编码连贯、单义概念的神经元投影到模型的词汇空间时，会呈现出高峰态特征。通过优化神经元权重的旋转以最大化其在词汇空间中的峰度，本方法能够提取稀疏且可解释的方向向量，我们将其命名为词汇通道。在Llama-3.1-8B-Instruct和Gemma-2-2B-it模型上的实验表明，ROTATE能够稳定地还原出与神经元行为高度一致的词汇通道。对单个通道进行消融会选择性抑制相应的输入激活或阻碍特定概念的生成。此外，聚合通道层面的描述可生成全面的神经元解释，在直接对比评估中，其性能较基于激活优化的基线方法提升2-3倍。通过提供神经元权重的无数据分解方案，ROTATE为大规模语言模型（LMs）的可解释性研究提供了可扩展、细粒度的基础构建模块。

摘要 (Abstract)

Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method requiring no forward passes that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model’s vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron’s behavior. ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Moreover, aggregating channel-level descriptions yields comprehensive neuron descriptions that outperform optimized activation-based baselines by 2-3x in head-to-head comparisons. By providing a data-free decomposition of neuron weights, ROTATE offers a scalable, fine-grained building block for interpreting LMs.

关键词: mechanistic interpretability, MLP neurons, weight space, vocabulary channels, data-free method, language models, ROTATE, kurtosis optimization

19. ❌ Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset

作者: Tinko Sebastian Bartels, Ruixiang Wu, Xinyu Lu, Yikai Lu, Fanzeng Xia, Haoxiang Yang, Yue Chen, Tongxin Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05429v1

评分: 20.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

评分理由: 该论文的核心贡献是开发了一个名为OpenCEM的仿真器和数据集，旨在将丰富的非结构化上下文信息（如事件日程、系统日志、用户意图）与可再生能源系统的定量动态相结合，以促进智能、上下文感知的能源管理研究。论文摘要中明确提到该平台特别适用于利用大型语言模型（LLMs）开发新型控制算法和预测模型，因此与关键词"Large Language Models” OR “LLMs” OR “Foundation Models"高度相关（10分）。同时，该研究属于大模型在科学领域（具体为能源系统管理）的应用，与关键词"AI for Science” OR “Bioinformatics” OR “Cheminformatics"高度相关（10分）。论文未涉及其他关键词所描述的具体技术原理、方法或应用场景（如MoE、量化、对齐、RAG等），因此这些关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对可再生能源系统中智能、上下文感知的能源管理需求，提出了首个开源数字孪生平台OpenCEM，它通过整合丰富的非结构化上下文信息与定量能源动态，为利用大型语言模型等开发新型控制算法和预测模型提供了高保真仿真环境和数据集。

摘要翻译

针对可再生能源系统对智能、情境感知能源管理的迫切需求，我们推出 OpenCEM 模拟器与数据集：这是首个明确设计用于将丰富的非结构化情境信息与定量可再生能源动态相集成的开源数字孪生平台。传统能源管理严重依赖数值时间序列，从而忽视了人类生成情境（如事件日程、系统日志、用户意图）中所蕴含的重要预测能力。OpenCEM 通过提供一个独特平台来弥合这一差距，该平台包含来自真实世界光伏-电池微电网安装的、经过精细对齐且富含语言描述的数据集，以及一个能够原生处理这种多模态情境的模块化模拟器。OpenCEM 模拟器为开发和验证新型控制算法与预测模型（尤其是那些利用大语言模型的方法）提供了一个高保真环境。我们详细阐述了其基于组件的架构、混合数据驱动与基于物理的建模能力，并通过实际案例（包括情境感知负荷预测和在线最优电池充电控制策略的实现）展示了其实用性。通过公开此平台，OpenCEM 旨在加速对下一代智能、可持续且真正具备情境感知能力的能源系统的研究。

摘要 (Abstract)

Addressing the critical need for intelligent, context-aware energy management in renewable systems, we introduce the \textbf{OpenCEM Simulator and Dataset}: the first open-source digital twin explicitly designed to integrate rich, unstructured contextual information with quantitative renewable energy dynamics. Traditional energy management relies heavily on numerical time series, thereby neglecting the significant predictive power embedded in human-generated context (e.g., event schedules, system logs, user intentions). OpenCEM bridges this gap by offering a unique platform comprising both a meticulously aligned, language-rich dataset from a real-world PV-and-battery microgrid installation and a modular simulator capable of natively processing this multi-modal context. The OpenCEM Simulator provides a high-fidelity environment for developing and validating novel control algorithms and prediction models, particularly those leveraging Large Language Models. We detail its component-based architecture, hybrid data-driven and physics-based modelling capabilities, and demonstrate its utility through practical examples, including context-aware load forecasting and the implementation of online optimal battery charging control strategies. By making this platform publicly available, OpenCEM aims to accelerate research into the next generation of intelligent, sustainable, and truly context-aware energy systems.

关键词: OpenCEM Simulator, context-aware energy management, renewable energy systems, digital twin, Large Language Models, microgrid dynamics, multi-modal context, load forecasting

20. ❌ Target Policy Optimization

作者: Jean Kaddour 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06159v1

评分: 18.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文提出了一种新的强化学习方法TPO，用于优化语言模型的策略。与关键词"RLHF"等高度相关（10分），因为TPO是RLHF中策略优化阶段的一种替代方法，论文在LLM RLVR任务上进行了评估。与"Large Language Models"相关（8分），因为论文在billion-parameter LLM上进行了实验。其他关键词如MoE、SFT、RAG等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

论文提出了一种新的强化学习算法Target Policy Optimization（TPO），通过分离目标分布构建和策略拟合来解决传统策略梯度方法在稀疏奖励任务中的不足，并在包括十亿参数大语言模型在内的多个任务上验证了其有效性。

摘要翻译

在强化学习中，给定一个提示，我们从模型中采样一组补全结果并为其评分。随之产生两个问题：哪些补全结果应获得概率质量的增加，以及参数应如何移动以实现这种变化？标准的策略梯度方法同时回答这两个问题，因此更新可能因学习率、裁剪和其他优化器选择而出现超调或欠调。我们引入\emph{目标策略优化}（Target Policy Optimization，TPO），它将这两个问题分离。给定已评分的补全结果，TPO构建一个目标分布$q_i \propto p_i^{,\mathrm{old}} \exp(u_i)$，并通过交叉熵将策略拟合至该分布。在采样补全对数上的损失梯度为$p^θ- q$，一旦策略与目标匹配，梯度即消失。在表格老虎机、Transformer序列任务以及数十亿参数大语言模型的RLVR任务中，TPO在简单任务上与PG、PPO、GRPO和DG表现相当，而在稀疏奖励条件下显著优于它们。代码发布于https://github.com/JeanKaddour/tpo。

摘要 (Abstract)

In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce \emph{Target Policy Optimization} (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution $q_i \propto p_i^{,\mathrm{old}} \exp(u_i)$ and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is $p^θ- q$, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.

关键词: Target Policy Optimization, TPO, reinforcement learning, policy optimization, sparse reward, LLM RLVR, cross-entropy loss, billion-parameter LLM

21. ❌ FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation–Full Version

作者: Dat Nguyen-Cong, Tung Kieu, Hoang Thanh-Tung 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05551v1

评分: 13.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文研究连续扩散语言模型在少步采样下的自条件机制问题，提出FastDiSS训练框架以提高推理速度和样本质量。与关键词相关性分析：1）与"Large Language Models"有中等关联（5分），因论文研究扩散语言模型，属于大语言模型的一种变体；2）与"Speculative Decoding” OR “Inference Acceleration"高度相关（8分），因论文核心贡献是提升扩散模型的推理速度（400倍加速），直接对应推理加速主题；3）其他关键词（如MoE、SFT、RAG等）均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对连续扩散语言模型在少步采样时自条件机制失效导致样本质量下降的问题，提出了FastDiSS训练框架，通过扰动自条件信号和引入噪声感知机制，在提升样本质量的同时实现了高达400倍的推理加速。

摘要翻译

自调节机制是连续扩散语言模型取得成功的关键，它使模型能够修正先前生成的错误。然而，该机制的性能恰恰在扩散模型最具部署吸引力的场景下出现退化：即采用少量步数采样以实现快速推理时。本研究表明，当模型仅进行少量去噪步骤时，不准确的自调节会导致显著的近似误差；这种错误会在去噪步骤中不断累积，最终主导生成样本的质量。为解决这一问题，我们提出了一种新颖的训练框架，该框架在学习过程中通过扰动自调节信号以匹配推理噪声，从而提升模型对先验估计误差的鲁棒性。此外，我们引入了一种词元级别的噪声感知机制，防止训练过程陷入饱和，进而优化模型训练效果。在多项条件生成基准测试上的广泛实验表明，我们的框架超越了标准连续扩散模型，同时实现了高达400倍的推理加速，并且与其他一步式扩散框架相比仍保持竞争力。

摘要 (Abstract)

Self-conditioning has been central to the success of continuous diffusion language models, as it allows models to correct previous errors. Yet its ability degrades precisely in the regime where diffusion is most attractive for deployment: few-step sampling for fast inference. In this study, we show that when models only have a few denoising steps, inaccurate self-conditioning induces a substantial approximation gap; this mistake compounds across denoising steps and ultimately dominate the sample quality. To address this, we propose a novel training framework that handles these errors during learning by perturbing the self-conditioning signal to match inference noise, improving robustness to prior estimation errors. In addition, we introduce a token-level noise-awareness mechanism that prevents training from saturation, hence improving optimization. Extensive experiments across conditional generation benchmarks demonstrate that our framework surpasses standard continuous diffusion models while providing up to 400x faster inference speed, and remains competitive against other one-step diffusion frameworks.

关键词: diffusion language models, self-conditioning, few-step sampling, inference acceleration, denoising steps, training framework, conditional generation, fast inference

22. ❌ Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

作者: Baoshun Tong, Haoran He, Ling Pan, Yang Liu, Liang Lin 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05595v1

评分: 13.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文研究Vision-Language-Action (VLA)模型在机器人操作中的鲁棒性测试，提出DAERT框架通过强化学习生成多样化对抗性指令来暴露模型对语言细微差别的脆弱性。与关键词的相关性分析：1) 与"Large Language Models"有一定关联（5分），因为VLA模型包含语言组件，但论文不聚焦于LLM本身的技术原理；2) 与"LLM Agents"高度相关（8分），因为论文研究的是具身AI代理（embodied AI agents）的安全测试和脆弱性发现，属于智能代理的研究范畴；3) 其他关键词（如MoE、Scaling Laws、RLHF等）均未在论文中涉及，得0分。论文主要贡献在于代理安全测试方法，而非大模型核心技术或科学应用。

!!! tip deepseek-chat TL;DR

该论文针对Vision-Language-Action模型在机器人操作中对语言细微差别鲁棒性不足的安全问题，提出了一个多样性感知的具身红队测试框架，能够生成多样化的对抗性指令，将任务平均成功率从93.33%降低到5.85%，有效暴露了模型的安全盲点。

摘要翻译

视觉-语言-动作（Vision-Language-Action，VLA）模型在机器人操作任务中取得了显著成功。然而，其对语言细微差异的鲁棒性仍是一个关键且尚未充分探索的安全问题，这给实际部署带来了重大安全风险。红队测试（red teaming），即识别可能引发灾难性行为的环境场景，是确保具身人工智能（embodied AI）智能体安全部署的重要环节。强化学习（Reinforcement Learning，RL）已成为自动化红队测试中一种有前景的方法，旨在揭示这些潜在漏洞。然而，基于强化学习的标准对抗方法由于其奖励最大化的特性，常遭受严重的模式坍塌（mode collapse）问题，倾向于收敛到一组狭窄的、琐碎或重复的失败模式，从而无法揭示有意义风险的完整图景。为弥补这一差距，我们提出了一种新颖的多样性感知具身红队测试（Diversity-Aware Embodied Red Teaming，DAERT）框架，以暴露VLA模型在语言变化下的脆弱性。我们的设计基于评估一个均匀策略（uniform policy），该策略能够生成一系列多样化的、具有挑战性的指令，同时确保其攻击有效性（通过在物理模拟器中的执行失败率来衡量）。我们在多个机器人基准测试中，针对包括$π_0$和OpenVLA在内的两种先进VLA模型进行了广泛实验。我们的方法持续发现了更广泛且更有效的对抗性指令，将平均任务成功率从93.33%降低至5.85%，证明了一种可扩展的压力测试VLA智能体的方法，并能在实际部署前暴露关键的安全盲点。

摘要 (Abstract)

Vision-Language-Action (VLA) models have achieved remarkable success in robotic manipulation. However, their robustness to linguistic nuances remains a critical, under-explored safety concern, posing a significant safety risk to real-world deployment. Red teaming, or identifying environmental scenarios that elicit catastrophic behaviors, is an important step in ensuring the safe deployment of embodied AI agents. Reinforcement learning (RL) has emerged as a promising approach in automated red teaming that aims to uncover these vulnerabilities. However, standard RL-based adversaries often suffer from severe mode collapse due to their reward-maximizing nature, which tends to converge to a narrow set of trivial or repetitive failure patterns, failing to reveal the comprehensive landscape of meaningful risks. To bridge this gap, we propose a novel \textbf{D}iversity-\textbf{A}ware \textbf{E}mbodied \textbf{R}ed \textbf{T}eaming (\textbf{DAERT}) framework, to expose the vulnerabilities of VLAs against linguistic variations. Our design is based on evaluating a uniform policy, which is able to generate a diverse set of challenging instructions while ensuring its attack effectiveness, measured by execution failures in a physical simulator. We conduct extensive experiments across different robotic benchmarks against two state-of-the-art VLAs, including $π_0$ and OpenVLA. Our method consistently discovers a wider range of more effective adversarial instructions that reduce the average task success rate from 93.33% to 5.85%, demonstrating a scalable approach to stress-testing VLA agents and exposing critical safety blind spots before real-world deployment.

关键词: Vision-Language-Action Models, Robotic Manipulation, Red Teaming, Linguistic Fragility, Diversity-Aware Adversarial Testing, Embodied AI Agents, Safety Evaluation, Reinforcement Learning

23. ❌ Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation

作者: Jungwon Park, Jungmin Ko, Dongnam Byun, Wonjong Rhee 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05906v1

评分: 10.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文研究的是文本到图像（T2I）生成模型（特别是扩散模型）中交叉注意力图的解释性问题，提出了一种通过选择性聚合注意力头来提升视觉可解释性和分割性能的方法。该研究专注于计算机视觉领域的扩散模型，而非大语言模型（LLMs）或通用大模型。因此，绝大多数关键词（涉及LLMs、训练技术、推理、代理、压缩等）完全不相关，得0分。唯一相关的关键词是"Mechanistic Interpretability” OR “Explainable AI”，因为论文的核心是提升扩散模型的可解释性（通过注意力图分析），这属于可解释AI范畴，但并非针对LLMs的机制可解释性，故给予10分（高度相关，但非核心）。

!!! tip deepseek-chat TL;DR

该论文研究了如何通过选择性聚合文本到图像扩散模型中与目标概念最相关的注意力头的交叉注意力图，来提升视觉可解释性和分割性能，实验表明该方法相比基线DAAM取得了更高的平均IoU分数，并能更好地诊断提示误解。

摘要翻译

大量关于文本到图像生成模型的研究已利用交叉注意力图来提升应用性能并解释模型行为。然而，不同注意力头所生成注意力图的独特特性仍未得到充分探索。本研究证明，选择性聚合与目标概念最相关的注意力头所产生的交叉注意力图，能够提升视觉可解释性。与基于扩散模型的分割方法DAAM相比，我们的方法获得了更高的平均交并比分数。我们还发现，最相关的注意力头比最不相关的注意力头能更准确地捕捉概念特定特征，且选择性聚合有助于诊断提示词的误解情况。这些发现表明，注意力头选择为提高文本到图像生成的可解释性与可控性提供了有前景的研究方向。

摘要 (Abstract)

Numerous studies on text-to-image (T2I) generative models have utilized cross-attention maps to boost application performance and interpret model behavior. However, the distinct characteristics of attention maps from different attention heads remain relatively underexplored. In this study, we show that selectively aggregating cross-attention maps from heads most relevant to a target concept can improve visual interpretability. Compared to the diffusion-based segmentation method DAAM, our approach achieves higher mean IoU scores. We also find that the most relevant heads capture concept-specific features more accurately than the least relevant ones, and that selective aggregation helps diagnose prompt misinterpretations. These findings suggest that attention head selection offers a promising direction for improving the interpretability and controllability of T2I generation.

关键词: text-to-image generation, diffusion models, cross-attention maps, attention head selection, visual interpretability, segmentation, DAAM, prompt misinterpretation

24. ❌ SignalClaw: LLM-Guided Evolutionary Synthesis of Interpretable Traffic Signal Control Skills

作者: Da Lei, Feng Xiao, Lu Li, Yuzhan Liu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05535v1

评分: 10.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文的核心是使用LLMs作为进化技能生成器来合成可解释的交通信号控制技能，因此与"Large Language Models"高度相关（10分）。论文未涉及其他关键词的具体技术细节，如MoE、SLMs、训练方法、推理优化、代理系统、模型压缩等，这些关键词评分为0分。

!!! tip deepseek-chat TL;DR

该研究提出SignalClaw框架，利用大语言模型作为进化技能生成器，为自适应交通信号控制合成和优化可解释的控制技能，在常规和事件注入场景下实现了接近最优的延迟性能，并显著降低了紧急车辆和公共交通的延迟。

摘要翻译

交通信号控制（Traffic Signal Control, TSC）需要兼具高效性与可解释性的策略以用于实际部署，然而强化学习生成的是不透明的神经策略，而程序合成则依赖于限制性强的领域特定语言。本文提出SIGNALCLAW框架，该框架利用大语言模型（Large Language Models, LLMs）作为进化技能生成器，以合成并优化用于自适应交通信号控制的可解释控制技能。每项技能均包含原理阐述、选择指导与可执行代码，从而使策略具备人类可检视性与自文档化特性。在每一代进化中，来自仿真指标（如排队百分位数、延误趋势与停滞状况）的进化信号被转化为自然语言反馈，以指导技能改进。SignalClaw还引入了事件驱动的组合式进化机制：通过TraCI，事件检测器可识别紧急车辆、公交优先、突发事件及拥堵状况；优先级调度器则据此选择专用技能。每项技能独立进化，并通过优先级链实现无需重新训练的运行时组合。我们在常规场景与事件注入的SUMO场景中，将SignalClaw与四种基线方法进行比较评估。在常规场景下，其平均延误为7.8至9.2秒，与最佳方法差距在3%至10%之间，且在不同随机种子下表现出低方差。在事件场景下，SignalClaw实现了最低的紧急车辆延误（11.2至18.5秒，而MaxPressure为42.3至72.3秒，DQN为78.5至95.3秒）与最低的公交乘客延误（9.8至11.5秒，而MaxPressure为38.7至45.2秒）。在混合事件场景中，调度器能有效组合技能，同时保持整体延误稳定。进化后的技能从简单的线性规则逐步发展为具有多特征交互的条件策略，同时保持完全的可解释性，并可由交通工程师直接修改。

摘要 (Abstract)

Traffic signal control TSC requires strategies that are both effective and interpretable for deployment, yet reinforcement learning produces opaque neural policies while program synthesis depends on restrictive domain-specific languages. We present SIGNALCLAW, a framework that uses large language models LLMs as evolutionary skill generators to synthesize and refine interpretable control skills for adaptive TSC. Each skill includes rationale, selection guidance, and executable code, making policies human-inspectable and self-documenting. At each generation, evolution signals from simulation metrics such as queue percentiles, delay trends, and stagnation are translated into natural language feedback to guide improvement. SignalClaw also introduces event-driven compositional evolution: an event detector identifies emergency vehicles, transit priority, incidents, and congestion via TraCI, and a priority dispatcher selects specialized skills. Each skill is evolved independently, and a priority chain enables runtime composition without retraining. We evaluate SignalClaw on routine and event-injected SUMO scenarios against four baselines. On routine scenarios, it achieves average delay of 7.8 to 9.2 seconds, within 3 to 10 percent of the best method, with low variance across random seeds. Under event scenarios, it yields the lowest emergency delay 11.2 to 18.5 seconds versus 42.3 to 72.3 for MaxPressure and 78.5 to 95.3 for DQN, and the lowest transit person delay 9.8 to 11.5 seconds versus 38.7 to 45.2 for MaxPressure. In mixed events, the dispatcher composes skills effectively while maintaining stable overall delay. The evolved skills progress from simple linear rules to conditional strategies with multi-feature interactions, while remaining fully interpretable and directly modifiable by traffic engineers.

关键词: Large Language Models, Evolutionary Synthesis, Interpretable Traffic Signal Control, Event-driven Composition, Skill Generation, Traffic Simulation, Adaptive Control, Human-inspectable Policies

作者: Yifeng He, Ziye Tang, Hao Chen 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05461v1

评分: 10.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文的核心是提出一个名为ContentFuzz的框架，该框架利用大型语言模型（LLM）来重写社交媒体帖子，以改变其被立场检测模型分类的标签，从而帮助内容突破信息茧房。因此，论文与关键词"Large Language Models” OR “LLMs” OR “Foundation Models"高度相关（评分为10分），因为LLM是该方法的核心工具。论文未涉及其他关键词所描述的具体技术（如MoE、量化、RAG、RLHF等）或应用领域（如生物信息学），也未提及任何指定的专家作者，因此这些关键词的评分均为0分。

!!! tip deepseek-chat TL;DR

该研究提出了一种基于大型语言模型的置信度引导模糊测试框架（ContentFuzz），用于重写社交媒体帖子，在保持语义完整性的同时改变其被机器立场检测模型分类的标签，从而帮助内容突破信息茧房，扩大其受众范围。

摘要翻译

社交媒体中的信息茧房限制了用户接触多元观点的内容。现代平台将立场检测作为推荐与排序流程的重要信号，这可能导致内容主要推送给观点相似的受众，从而减少跨立场内容的曝光。这种做法限制了异议观点的传播范围，并阻碍了建设性对话的开展。本研究从内容创作者视角出发，探讨如何通过内容修改使其突破现有的同质化受众圈层。我们提出了ContentFuzz——一个基于置信度引导的模糊测试框架，该框架能在保持人类可理解意图的前提下重写内容，并诱导机器学习模型生成不同的立场推断标签。ContentFuzz旨在帮助内容突破原始的信息茧房。我们的方法通过立场检测模型提供的置信度反馈，引导大语言模型生成保持原意的改写文本。通过在两种语言、三个数据集上对四个代表性立场检测模型的评估，ContentFuzz能有效改变机器分类的立场标签，同时保持与原始内容的语义完整性。

摘要 (Abstract)

Information cocoons on social media limit users’ exposure to posts with diverse viewpoints. Modern platforms use stance detection as an important signal in recommendation and ranking pipelines, which can route posts primarily to like-minded audiences and reduce cross-cutting exposure. This restricts the reach of dissenting opinions and hinders constructive discourse. We take the creator’s perspective and investigate how content can be revised to reach beyond existing affinity clusters. We present ContentFuzz, a confidence-guided fuzzing framework that rewrites posts while preserving their human-interpreted intent and induces different machine-inferred stance labels. ContentFuzz aims to route posts beyond their original cocoons. Our method guides a large language model (LLM) to generate meaning-preserving rewrites using confidence feedback from stance detection models. Evaluated on four representative stance detection models across three datasets in two languages, ContentFuzz effectively changes machine-classified stance labels, while maintaining semantic integrity with respect to the original content.

关键词: information cocoons, stance detection, content rewriting, large language model (LLM), confidence-guided fuzzing, social media, cross-cutting exposure, semantic integrity

26. ❌ Turbulence-like 5/3 spectral scaling in contextual representations of language as a complex system

作者: Zhongxin Yang, Chun Bao, Yuanwei Bin, Xiang I. A. Yang, Shiyi Chen 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05536v1

评分: 8.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 该论文研究自然语言作为复杂系统的统计特性，使用基于Transformer的语言模型生成高维嵌入空间来表示文本，并分析其幂律谱特征。论文的核心是分析语言模型（特别是Transformer-based models）生成的上下文嵌入的统计规律，因此与"Large Language Models” OR “LLMs” OR “Foundation Models"有直接关联（评分8分）。论文未涉及其他关键词的具体技术、方法或应用，如MoE、训练技术、推理优化、对齐、代理、量化等，因此这些关键词评分为0分。论文属于基础研究，探讨语言模型的表示特性，而非具体的应用或技术创新。

!!! tip deepseek-chat TL;DR

该论文研究了基于Transformer的语言模型生成的上下文嵌入中是否存在类似湍流的5/3幂律谱标度，发现这种标度普遍存在于人类书写和AI生成的文本中，揭示了语义信息在语言尺度上的无标度、自相似整合方式。

摘要翻译

自然语言是一个展现稳健统计规律性的复杂系统。本研究将文本表示为基于Transformer的语言模型生成的高维嵌入空间中的轨迹，并利用嵌入步长信号量化沿标记序列的尺度依赖性涨落。在多种语言与语料库中，所得功率谱在宽广的频率范围内呈现出稳健的幂律分布，其指数接近$5/3$。这种标度律在人类书写文本和人工智能生成文本的上下文嵌入中均被一致观测到，但在静态词嵌入中并不存在，且会因标记顺序随机化而被破坏。这些结果表明，观测到的标度特性反映的是多尺度、上下文依赖的组织方式，而非仅源于词汇统计。通过与湍流中的柯尔莫哥洛夫谱类比，我们的发现表明语义信息在语言尺度上以无标度、自相似的方式被整合，并为研究语言表征的复杂结构提供了一个定量化、模型无关的基准。

摘要 (Abstract)

Natural language is a complex system that exhibits robust statistical regularities. Here, we represent text as a trajectory in a high-dimensional embedding space generated by transformer-based language models, and quantify scale-dependent fluctuations along the token sequence using an embedding-step signal. Across multiple languages and corpora, the resulting power spectrum exhibits a robust power law with an exponent close to $5/3$ over an extended frequency range. This scaling is observed consistently in contextual embeddings from both human-written and AI-generated text, but is absent in static word embeddings and is disrupted by randomization of token order. These results show that the observed scaling reflects multiscale, context-dependent organization rather than lexical statistics alone. By analogy with the Kolmogorov spectrum in turbulence, our findings suggest that semantic information is integrated in a scale-free, self-similar manner across linguistic scales, and provide a quantitative, model-agnostic benchmark for studying complex structure in language representations.

关键词: language models, transformer, contextual embeddings, power law scaling, complex system, semantic information, scale-free, self-similar

27. ❌ Channel-wise Retrieval for Multivariate Time Series Forecasting

作者: Junhyeok Kang, Jun Seo, Soyeon Park, Sangjun Han, Seohui Bae, Hyeokjun Choe, Soonyoung Lee 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05543v1

评分: 5.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文专注于多元时间序列预测，提出了一种基于检索增强的框架CRAFT。该研究属于传统机器学习/深度学习领域，而非大模型（LLM）研究。唯一相关的关键词是"Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”，因为论文使用了检索增强技术（retrieval-augmented forecasting），但这是应用于时间序列预测，而非大语言模型的文本生成。其他所有关键词均与大模型技术原理、训练方法、对齐、推理优化、代理系统等无关。论文未涉及任何大模型、深度学习技术原理创新或科学AI应用。

!!! tip deepseek-chat TL;DR

该论文针对多元时间序列预测中固定回看窗口难以捕获长期依赖的问题，提出了一种通道级检索增强预测框架CRAFT，通过独立检索每个通道的历史片段并利用时频域两阶段处理，在多个基准测试中实现了优于现有方法的预测精度和推理效率。

摘要翻译

多元时间序列预测常因固定的回看窗口而难以捕捉长期依赖关系。检索增强预测通过从记忆中检索历史片段来解决这一问题，但现有方法依赖于通道无关策略，即对所有变量应用相同的参考片段。这种做法忽略了变量间的异质性——不同通道往往表现出不同的周期性和频谱特征。本文提出CRAFT（通道式检索增强预测框架），这是一种为每个通道独立执行检索的新型框架。为确保效率，CRAFT采用两阶段流程：首先在时域构建稀疏关系图以剪枝无关候选片段，继而在频域通过频谱相似度对参考片段排序，强调主导周期成分并抑制噪声。在七个公开基准数据集上的实验表明，CRAFT在保持实用推理效率的同时，其预测精度超越了当前最先进的预测基线模型。

摘要 (Abstract)

Multivariate time series forecasting often struggles to capture long-range dependencies due to fixed lookback windows. Retrieval-augmented forecasting addresses this by retrieving historical segments from memory, but existing approaches rely on a channel-agnostic strategy that applies the same references to all variables. This neglects inter-variable heterogeneity, where different channels exhibit distinct periodicities and spectral profiles. We propose CRAFT (Channel-wise retrieval-augmented forecasting), a novel framework that performs retrieval independently for each channel. To ensure efficiency, CRAFT adopts a two-stage pipeline: a sparse relation graph constructed in the time domain prunes irrelevant candidates, and spectral similarity in the frequency domain ranks references, emphasizing dominant periodic components while suppressing noise. Experiments on seven public benchmarks demonstrate that CRAFT outperforms state-of-the-art forecasting baselines, achieving superior accuracy with practical inference efficiency.

关键词: Multivariate Time Series Forecasting, Retrieval-Augmented Forecasting, Channel-wise Retrieval, Long-range Dependencies, Spectral Similarity, CRAFT, Two-stage Pipeline, Inference Efficiency

28. ❌ Effective Dynamics and Transition Pathways from Koopman-Inspired Neural Learning of Collective Variables

作者: Alexander Sikorski, Luca Donati, Marcus Weber, Christof Schütte 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05778v1

评分: 5.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

评分理由: 论文《Effective Dynamics and Transition Pathways from Koopman-Inspired Neural Learning of Collective Variables》提出了一种名为ISOKANN的数据驱动框架，用于从复杂分子系统中提取集体变量（CVs）和有效动力学。该研究结合了Koopman算子理论、类Krylov子空间算法和降维动力学建模，旨在描述高维系统中的亚稳态转变。论文的核心是分子动力学模拟、降维技术和计算化学，属于科学计算和计算物理/化学领域。所有关键词（共27个）中，只有第27个关键词“AI for Science” OR “Bioinformatics” OR “Cheminformatics”与论文有一定关联，因为论文属于AI在科学（具体是计算化学/分子动力学）中的应用，但论文并未明确涉及生物信息学或化学信息学，也未使用大语言模型（LLM）或深度学习技术，因此该关键词评分为5分（有一定关联）。其余26个关键词均与大语言模型、深度学习技术原理或应用直接相关，而本文未涉及这些内容，因此评分为0分（完全无关）。

!!! tip deepseek-chat TL;DR

该研究提出了一种基于Koopman算子和神经网络的ISOKANN框架，用于从分子动力学模拟数据中提取集体变量并计算有效动力学，从而预测亚稳态转变的速率和路径。

摘要翻译

ISOKANN（基于人工神经网络学习的库普曼算子不变子空间）框架提供了一种数据驱动方法，用于从复杂分子系统中提取集体变量与有效动力学。本研究将库普曼算子的理论基础与类克雷洛夫子空间算法及降维动力学建模相结合，构建了一个基于集体变量描述高维系统亚稳态转变的连贯理论体系。从基于主导不变子空间识别集体变量出发，我们推导了潜在空间上的对应有效动力学，并将其与跃迁速率、跃迁时间、承诺函数及转变路径相联系。基于库普曼算子的学习方法与降维有效动力学的结合，形成了一个从模拟数据计算跃迁速率与路径的理论框架。在一维、二维和三维基准势能面上的数值实验表明，ISOKANN能够重构粗粒化动力学，并复现跨越焓垒与熵垒的转变时间。

摘要 (Abstract)

The ISOKANN (Invariant Subspaces of Koopman Operators Learned by Artificial Neural Networks) framework provides a data-driven route to extract collective variables (CVs) and effective dynamics from complex molecular systems. In this work, we integrate the theoretical foundation of Koopman operators with Krylov-like subspace algorithms, and reduced dynamical modeling to build a coherent picture of how to describe metastable transitions in high-dimensional systems based on CVs. Starting from the identification of CVs based on dominant invariant subspaces, we derive the corresponding effective dynamics on the latent space and connect these to transition rates and times, committor functions, and transition pathways. The combination of Koopman-based learning and reduced-dimensional effective dynamics yields a principled framework for computing transition rates and pathways from simulation data. Numerical experiments on one-, two-, and three-dimensional benchmark potentials illustrate the ability of ISOKANN to reconstruct the coarse-grained kinetics and reproduce transition times across enthalpic and entropic barriers.

关键词: Koopman operators, collective variables, effective dynamics, transition pathways, molecular systems, ISOKANN, reduced-dimensional modeling, metastable transitions

29. ❌ MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

作者: Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng, Hongsheng Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06156v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文MMEmb-R1主要研究多模态大语言模型（MLLMs）在嵌入任务中的应用，核心创新在于将思维链（Chain of Thought）推理机制引入多模态嵌入学习，并提出了自适应控制方法以减少不必要的推理开销。因此，与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’高度相关（10分），与’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’有一定关联（8分），因为论文涉及深度推理机制。论文基于MLLMs，与’Large Language Models OR LLMs OR Foundation Models’相关（8分）。论文提到减少推理延迟，与’Speculative Decoding OR Inference Acceleration’有弱关联（5分）。其他关键词如MoE、SLMs、Scaling Laws、Alignment、RAG等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型在嵌入任务中直接引入思维链推理会导致结构错位和计算冗余的问题，提出了MMEmb-R1框架，通过配对感知推理选择和强化学习自适应控制，在MMEB-V2基准上以4B参数取得了71.2分的SOTA性能，同时显著降低了推理开销和延迟。

摘要翻译

多模态大语言模型（MLLMs）已成功应用于多模态嵌入任务，但其生成式推理能力仍未得到充分利用。将思维链推理直接引入嵌入学习面临两个根本性挑战。首先，实例级推理与成对对比监督之间的结构错位可能导致模型产生捷径行为，即仅学习推理的表面形式。其次，推理并非对所有嵌入任务均有益。强制对所有输入进行推理可能引入不必要的计算与延迟，甚至在处理简单案例时掩盖显著的语义信号。为解决这些问题，我们提出了MMEmb-R1——一种基于自适应推理的多模态嵌入框架。我们将推理建模为隐变量，并引入基于成对感知的推理选择机制，该机制利用反事实干预来识别有助于查询-目标对齐的推理路径。此外，我们采用强化学习方法，仅在必要时选择性调用推理。在MMEB-V2基准测试上的实验表明，我们的模型仅以40亿参数即获得71.2的评分，创造了新的最优性能记录，同时显著降低了推理开销与推断延迟。

摘要 (Abstract)

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format of reasoning. Second, reasoning is not universally beneficial for embedding tasks. Enforcing reasoning for all inputs may introduce unnecessary computation and latency, and can even obscure salient semantic signals for simple cases. To address these issues, we propose MMEmb-R1, an adaptive reasoning-based multimodal embedding framework. We formulate reasoning as a latent variable and introduce pair-aware reasoning selection that employs counterfactual intervention to identify reasoning paths beneficial for query-target alignment. Furthermore, we adopt reinforcement learning to selectively invoke reasoning only when necessary. Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.

关键词: Multimodal Large Language Models, Chain-of-Thought Reasoning, Multimodal Embedding, Pair-aware Selection, Adaptive Control, Reinforcement Learning, Inference Latency, MMEB-V2 Benchmark

30. ❌ DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models

作者: Zhengming Yu, Li Ma, Mingming He, Leo Isikdogan, Yuancheng Xu, Dmitriy Smirnov, Pablo Salamanca, Dao Mi, Pablo Delgado, Ning Yu, Julien Philip, Xin Li, Wenping Wang, Paul Debevec 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06161v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文DiffHDR专注于视频扩散模型在LDR到HDR转换中的应用，属于计算机视觉和视频处理领域。虽然使用了预训练的视频扩散模型（与’Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联，得5分），但论文核心并非大语言模型（LLMs）或深度学习技术原理的创新，而是应用现有扩散模型解决特定视觉任务。其他关键词均与大语言模型、推理、对齐、压缩、科学AI等主题无关，因此得0分。

!!! tip deepseek-chat TL;DR

该论文提出DiffHDR框架，利用预训练的视频扩散模型将低动态范围（LDR）视频转换为高动态范围（HDR）视频，通过生成式辐射修复在过曝和欠曝区域合成逼真的HDR细节，显著提升了辐射保真度和时间稳定性。

摘要翻译

大多数数字视频以8比特低动态范围（LDR）格式存储，由于饱和与量化，原始高动态范围（HDR）场景辐亮度的大量信息在此过程中丢失。这种高光与阴影细节的缺失阻碍了向HDR显示器准确映射亮度的可能，并限制了后期制作流程中有意义的再曝光操作。尽管已有技术提出通过动态范围扩展将LDR图像转换为HDR，但这些方法难以在过曝与欠曝区域恢复真实的细节。为此，我们提出DiffHDR框架，该框架将LDR到HDR的转换构建为视频扩散模型潜在空间中的生成式辐亮度修复任务。通过在Log-Gamma色彩空间中操作，DiffHDR利用预训练视频扩散模型的时空生成先验，在过曝与欠曝区域合成合理的HDR辐亮度，同时恢复量化像素的连续场景辐亮度。我们的框架进一步支持通过文本提示或参考图像引导的可控LDR到HDR视频转换。针对成对HDR视频数据稀缺的问题，我们开发了一套从静态HDRI（高动态范围图像）贴图合成高质量HDR视频训练数据的流程。大量实验表明，DiffHDR在辐亮度保真度与时间稳定性方面显著优于现有先进方法，能够生成具有充分再曝光余量的逼真HDR视频。

摘要 (Abstract)

Most digital videos are stored in 8-bit low dynamic range (LDR) formats, where much of the original high dynamic range (HDR) scene radiance is lost due to saturation and quantization. This loss of highlight and shadow detail precludes mapping accurate luminance to HDR displays and limits meaningful re-exposure in post-production workflows. Although techniques have been proposed to convert LDR images to HDR through dynamic range expansion, they struggle to restore realistic detail in the over- and underexposed regions. To address this, we present DiffHDR, a framework that formulates LDR-to-HDR conversion as a generative radiance inpainting task within the latent space of a video diffusion model. By operating in Log-Gamma color space, DiffHDR leverages spatio-temporal generative priors from a pretrained video diffusion model to synthesize plausible HDR radiance in over- and underexposed regions while recovering the continuous scene radiance of the quantized pixels. Our framework further enables controllable LDR-to-HDR video conversion guided by text prompts or reference images. To address the scarcity of paired HDR video data, we develop a pipeline that synthesizes high-quality HDR video training data from static HDRI maps. Extensive experiments demonstrate that DiffHDR significantly outperforms state-of-the-art approaches in radiance fidelity and temporal stability, producing realistic HDR videos with considerable latitude for re-exposure.

关键词: video diffusion models, LDR-to-HDR conversion, generative radiance inpainting, temporal stability, HDR video synthesis, Log-Gamma color space, re-exposure, paired HDR video data

31. ❌ In-Place Test-Time Training

作者: Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06169v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs的Test-Time Training（TTT）框架，直接涉及LLMs和参数高效微调（PEFT）技术，并显著扩展上下文长度至128k，因此与’Large Language Models’（10分）、‘PEFT’（5分）和’Context Window Extension’（8分）高度相关。其他关键词如MoE、SLMs、Scaling Laws、RAG、RLHF等未在摘要中提及或与论文主题无关，故评0分。

!!! tip deepseek-chat TL;DR

该论文提出了In-Place Test-Time Training框架，通过动态更新LLMs的权重来适应实时信息流，解决了传统静态训练范式的限制，使模型在长上下文任务中实现性能提升。

摘要翻译

静态的“训练后部署”范式从根本上限制了大语言模型（LLM）根据现实任务中固有的持续新信息流动态调整其权重的能力。测试时训练（Test-Time Training, TTT）提供了一种引人注目的替代方案，通过在推理时更新模型参数的一个子集（快速权重），但其在当前LLM生态系统中的潜力受到关键障碍的阻碍，包括架构不兼容、计算效率低下，以及用于语言建模的快速权重目标不匹配。在本工作中，我们提出了原位测试时训练（In-Place TTT），一个能够无缝赋予LLM测试时训练能力的框架。In-Place TTT将普遍存在的MLP模块中的最终投影矩阵作为其可适应的快速权重，从而实现对LLM的“即插即用”式增强，无需进行成本高昂的从头训练。此外，我们用一种定制的、有理论依据的目标取代了TTT通用的重构目标，该目标明确与主导自回归语言建模的下一个词元预测任务对齐。这一原则性目标，结合高效的块状更新机制，产生了一种高度可扩展且兼容上下文并行性的算法。大量实验验证了我们框架的有效性：作为一种原位增强，它使一个40亿参数的模型能够在长达128k上下文的任务上取得卓越性能；当从头开始预训练时，它持续优于相关的竞争性TTT方法。消融研究结果进一步为我们设计选择提供了更深入的见解。总体而言，我们的研究结果确立了In-Place TTT作为迈向LLM持续学习范式的一个有前景的步骤。

摘要 (Abstract)

The static train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT’s generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework’s effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.

关键词: Test-Time Training, Large Language Models, In-Place TTT, Continual Learning, Context Window Extension, Parameter-efficient Fine-tuning, Next-Token-Prediction, Autoregressive Language Modeling

32. ❌ Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

作者: Yanis Labrak, David Grünert, Séverin Baroudi, Jiyun Chun, Pawel Cyrta, Sergio Burdisso, Ahmed Hassoon, David Liu, Adam Rothschild, Reed Van Deusen, Petr Motlicek, Andrew Perrault, Ricard Marxer, Thomas Schaaf 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06138v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心是使用LLMs生成合成医患对话数据，用于长上下文音频摘要任务。与’Large Language Models’高度相关（10分），因为整个数据生成管道基于LLMs；与’Context Window Extension’高度相关（10分），因为论文明确针对长上下文音频推理问题；与’AI for Science’有一定关联（5分），因为涉及医疗领域应用；其他关键词如MoE、SFT、RAG等未在论文中涉及，均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于LLMs的合成数据生成管道，用于创建长上下文医患对话音频和SOAP笔记，以解决长上下文音频推理任务中训练数据和评估的不足，并发现级联方法在现有开放权重系统中仍显著优于端到端模型。

摘要翻译

长上下文音频推理在训练数据和评估方面均存在不足。现有基准主要针对短上下文任务，而与长上下文推理最相关的开放式生成任务对自动评估提出了众所周知的挑战。我们提出了一种合成数据生成流程，旨在同时作为训练资源和受控评估环境，并将其具体应用于以生成SOAP病历为任务的初诊医患对话场景。该流程包含三个阶段：基于人物设定的对话生成、包含重叠/停顿建模、房间声学及声音事件的多说话人音频合成，以及基于大语言模型的参考SOAP病历生成，整个流程完全基于开源模型构建。我们发布了8,800段合成对话，包含1.3千小时对应音频及参考病历。通过对当前开源系统的评估，我们发现级联方法仍显著优于端到端模型。

摘要 (Abstract)

Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant to long-context reasoning pose well-known challenges for automatic evaluation. We propose a synthetic data generation pipeline designed to serve both as a training resource and as a controlled evaluation environment, and instantiate it for first-visit doctor-patient conversations with SOAP note generation as the task. The pipeline has three stages, persona-driven dialogue generation, multi-speaker audio synthesis with overlap/pause modeling, room acoustics, and sound events, and LLM-based reference SOAP note production, built entirely on open-weight models. We release 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes. Evaluating current open-weight systems, we find that cascaded approaches still substantially outperform end-to-end models.

关键词: synthetic data generation, doctor-patient conversations, long-context audio, SOAP note generation, LLM-based pipeline, audio summarization, open-weight models, cascaded approaches

33. ❌ Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

作者: Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao, Wei Chen, Naipeng Chao 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06155v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs内部世界模型的一致性，直接涉及’Large Language Models’和’World Models’关键词（10分）。论文提出方法解决’structural hallucinations’，与’Hallucination Mitigation’相关（5分）。论文分析梯度归纳偏置和表示对齐，与’Mechanistic Interpretability’有一定关联（5分）。其他关键词如MoE、SFT、RAG、Agents等未在摘要中提及，评为0分。

!!! tip deepseek-chat TL;DR

该论文研究了大型语言模型（LLMs）内部世界模型的一致性问题，通过分析多令牌预测（MTP）的梯度归纳偏置，揭示了其易产生结构幻觉的缺陷，并提出了一种新的潜在语义增强MTP方法（LSE-MTP），该方法通过将预测锚定到真实隐藏状态轨迹，有效减少了结构幻觉并提高了表示对齐和鲁棒性。

摘要翻译

大语言模型是否形成了连贯的内部世界模型，这一核心问题仍存争议。传统的下一词元预测侧重于单步监督，而多词元预测在学习结构化表征方面展现出潜力。本研究从理论视角分析了多词元预测的梯度归纳偏置，并结合实证证据表明，多词元预测通过梯度耦合诱导表征收缩性，从而促进模型向内部信念状态收敛。然而，我们发现标准多词元预测常存在结构性幻觉问题——离散词元监督会促使潜在空间中出现违反环境约束的非法捷径。为解决此问题，我们提出了一种新方法：潜在语义增强多词元预测，该方法将预测锚定在真实隐藏状态轨迹上。在合成图与真实世界曼哈顿出租车行程数据集上的实验表明，潜在语义增强多词元预测能有效弥合离散词元与连续状态表征之间的鸿沟，显著提升表征对齐度，减少结构性幻觉，并增强对扰动的鲁棒性。

摘要 (Abstract)

Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) focuses on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical perspective analyzing the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes the convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden state trajectories. Experiments on synthetic graphs and real-world Manhattan Taxi Ride show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.

关键词: Large Language Models, World Models, Multi-Token Prediction, Structural Hallucinations, Latent Semantic Enhancement, Representation Alignment, Gradient Inductive Bias, Internal Belief States

34. ❌ Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries

作者: Andrew Kurtz, Klaudia Krawiecka 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06148v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于AI系统的机器身份治理、风险分类和监管框架，属于AI治理和政策领域，而非大模型技术原理、深度学习创新或科学应用。所有关键词均涉及大模型技术、训练方法、推理优化、应用部署等具体技术方面，与论文的治理主题完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了机器身份治理分类法（MIGT）来解决AI系统在企业跨国运营中的身份治理盲点，包括风险分类、治理框架、威胁模型和监管对齐结构。

摘要翻译

人工智能治理存在一个盲区：人工智能系统用以执行操作的机器身份。当前在企业环境中，AI代理、服务账户、API令牌和自动化工作流的数量已超过人类身份，比例高达80:1以上，然而目前尚无综合框架对其进行治理。2024年CrowdStrike服务中断事件中，单个未受治理的自动化代理造成了54至100亿美元的损失；包括Silk Typhoon和Salt Typhoon在内的国家级行为体已将未受治理的机器凭证作为针对关键基础设施的主要间谍活动载体投入实战。本文提出四项原创贡献。其一，AI身份风险分类体系（AI-Identity Risk Taxonomy, AIRT）：一个涵盖八大领域、37个子风险类别的全面枚举框架，每个类别均基于已记录事件、监管认知、从业者普遍性数据及威胁情报建立。其二，机器身份治理分类体系（Machine Identity Governance Taxonomy, MIGT）：一个整合了六大领域的治理框架，能同时应对技术治理缺口、监管合规缺口及跨司法辖区协调缺口——这些缺口在现有框架中仅被孤立处理。其三，针对企业身份治理的外国国家级行为体威胁模型，论证了Silk Typhoon、Salt Typhoon、Volt Typhoon及朝鲜AI增强型身份欺诈行动已将AI身份漏洞作为活跃攻击载体投入实战。其四，跨司法辖区监管协调结构，同步映射欧盟、美国及中国框架下的企业AI身份治理义务，识别不可调和的冲突并提供管理这些冲突的治理机制。最后，通过四阶段实施路线图将MIGT转化为可执行的企业方案。

摘要 (Abstract)

The governance of artificial intelligence has a blind spot: the machine identities that AI systems use to act. AI agents, service accounts, API tokens, and automated workflows now outnumber human identities in enterprise environments by ratios exceeding 80 to 1, yet no integrated framework exists to govern them. A single ungoverned automated agent produced $5.4-10 billion in losses in the 2024 CrowdStrike outage; nation-state actors including Silk Typhoon and Salt Typhoon have operationalized ungoverned machine credentials as primary espionage vectors against critical infrastructure. This paper makes four original contributions. First, the AI-Identity Risk Taxonomy (AIRT): a comprehensive enumeration of 37 risk sub-categories across eight domains, each grounded in documented incidents, regulatory recognition, practitioner prevalence data, and threat intelligence. Second, the Machine Identity Governance Taxonomy (MIGT): an integrated six-domain governance framework simultaneously addressing the technical governance gap, the regulatory compliance gap, and the cross-jurisdictional coordination gap that existing frameworks address only in isolation. Third, a foreign state actor threat model for enterprise identity governance, establishing that Silk Typhoon, Salt Typhoon, Volt Typhoon, and North Korean AI-enhanced identity fraud operations have already operationalized AI identity vulnerabilities as active attack vectors. Fourth, a cross-jurisdictional regulatory alignment structure mapping enterprise AI identity governance obligations under EU, US, and Chinese frameworks simultaneously, identifying irreconcilable conflicts and providing a governance mechanism for managing them. A four-phase implementation roadmap translates the MIGT into actionable enterprise programs.

关键词: machine identity governance, AI governance, enterprise AI, cross-jurisdictional, threat model, regulatory compliance, AI agents, identity risk taxonomy

35. ❌ Shot-Based Quantum Encoding: A Data-Loading Paradigm for Quantum Neural Networks

作者: Basil Kyriacou, Viktoria Patapovich, Maniraman Periyasamy, Alexey Melnikov 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06135v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究量子机器学习中的数据加载问题，提出了一种新的量子编码方法（Shot-Based Quantum Encoding），属于量子计算与机器学习的交叉领域。论文内容与绝大多数关键词（涉及大模型、深度学习技术原理、训练方法、推理优化、对齐、代理等）完全无关，因为这些关键词都特指经典深度学习和大语言模型技术。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为量子机器学习可视为AI在科学计算（量子物理）中的一个应用，但论文并未明确涉及生物信息学或化学信息学，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文针对量子机器学习中数据加载效率低的问题，提出了一种基于硬件原生资源（shots）的量子编码方法SBQE，在Fashion MNIST和Semeion数据集上实现了优于传统编码方法的分类准确率，且无需数据编码门。

摘要翻译

高效数据加载仍是近期量子机器学习发展的瓶颈。现有编码方案（角度编码、振幅编码和基编码）要么未能充分利用指数级希尔伯特空间容量，要么所需的电路深度超出了噪声中等规模量子硬件的相干时间预算。本文提出基于量子态测量的编码方法，这是一种数据嵌入策略，它将硬件的原生资源——测量次数——依据数据相关的经典概率分布分配到多个初始量子态上。通过将测量次数作为可学习的自由度，该方法生成一种混合态表示，其期望值对经典概率呈线性关系，因而可与非线性激活函数结合使用。我们证明该编码在结构上等价于一个权重由量子电路实现的多层感知机，并描述了一种硬件兼容的实现方案。在Fashion MNIST和Semeion手写数字数据集上的基准测试（每个模型独立初始化十次）表明：该方法在Semeion数据集上达到89.1% ± 0.9%的测试准确率（较振幅编码错误率降低5.3%，并与宽度匹配的经典网络性能持平），在Fashion MNIST数据集上达到80.95% ± 0.10%的准确率（较振幅编码提升2.0%，较线性多层感知机提升1.3%），且全程无需使用任何数据编码量子门。

摘要 (Abstract)

Efficient data loading remains a bottleneck for near-term quantum machine-learning. Existing schemes (angle, amplitude, and basis encoding) either underuse the exponential Hilbert-space capacity or require circuit depths that exceed the coherence budgets of noisy intermediate-scale quantum hardware. We introduce Shot-Based Quantum Encoding (SBQE), a data embedding strategy that distributes the hardware’s native resource, shots, according to a data-dependent classical distribution over multiple initial quantum states. By treating the shot counts as a learnable degree of freedom, SBQE produces a mixed-state representation whose expectation values are linear in the classical probabilities and can therefore be composed with non-linear activation functions. We show that SBQE is structurally equivalent to a multilayer perceptron whose weights are realised by quantum circuits, and we describe a hardware-compatible implementation protocol. Benchmarks on Fashion MNIST and Semeion handwritten digits, with ten independent initialisations per model, show that SBQE achieves 89.1% +/- 0.9% test accuracy on Semeion (reducing error by 5.3% relative to amplitude encoding and matching a width-matched classical network) and 80.95% +/- 0.10% on Fashion MNIST (exceeding amplitude encoding by +2.0% and a linear multilayer perceptron by +1.3%), all without any data-encoding gates.

关键词: Quantum Machine Learning, Data Loading, Quantum Encoding, Shot-Based Quantum Encoding, Quantum Neural Networks, Mixed-State Representation, Hardware-Compatible, Fashion MNIST

36. ❌ PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

作者: David Picard, Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Davide Allegro, Tom Ravaud, Yohann Perron, Corentin Sautier, Zeynep Sonat Baltaci, Fei Meng, Syrine Kalleli, Marta López-Rauhut, Thibaut Loiseau, Ségolène Albouy, Raphael Baena, Elliot Vincent, Loic Landrieu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06129v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心是提出一种线性复杂度的Polynomial Mixer（PoM）作为自注意力的替代机制，属于大模型底层架构创新。与’KV Cache Compression OR Linear Attention OR FlashAttention’高度相关（8分），因为PoM直接替代注意力机制，实现线性复杂度，属于注意力优化技术。与’Context Window Extension OR Long Context LLMs’（5分）相关，因为线性复杂度有利于处理长序列。与’Large Language Models OR LLMs OR Foundation Models’（5分）相关，因为PoM作为Transformer组件，可应用于大模型。与’Speculative Decoding OR Inference Acceleration’（5分）相关，因为降低计算成本可加速推理。与’AI for Science OR Bioinformatics OR Cheminformatics’（5分）相关，因为论文在Earth observation等科学领域有应用。其他关键词如MoE、SFT、RAG等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Polynomial Mixer（PoM）的线性复杂度token混合机制，作为自注意力的直接替代品，在保持Transformer通用近似能力的同时，显著降低了长序列处理的计算成本，并在文本生成、图像生成、地球观测等多个领域验证了其性能。

摘要翻译

本文介绍了一种具有线性复杂度的新型令牌混合机制——多项式混合器（Polynomial Mixer，简称PoM），它可作为自注意力（self-attention）的直接替代方案。PoM通过一个可学习的多项式函数将输入令牌聚合为紧凑表示，每个令牌从中检索上下文信息。我们证明了PoM满足上下文映射特性，从而确保配备PoM的变换器（transformers）仍然是通用的序列到序列近似器。我们在五个不同领域——文本生成、手写文本识别、图像生成、三维建模和地球观测——中用PoM替代了标准的自注意力机制。在处理长序列时，PoM在性能上与基于注意力的模型相当，同时显著降低了计算成本。代码发布于https://github.com/davidpicard/pom。

摘要 (Abstract)

This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at https://github.com/davidpicard/pom.

关键词: Polynomial Mixer, attention replacement, linear complexity, token mixing, Transformer, computational efficiency, long sequences, universal approximator

37. ❌ Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

作者: Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06132v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM作为自主代理的评估框架，与’Large Language Models’和’LLM Agents’高度相关（10分）。涉及代理在软件环境中的工作流程，与’Tool Use’有一定关联（5分）。评估包括安全性和鲁棒性，与’Hallucination Mitigation’有一定关联（5分）。其他关键词如MoE、SFT、RAG等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了Claw-Eval评估框架，解决了现有自主代理基准在轨迹透明度、安全鲁棒性评估和多模态覆盖方面的不足，实验发现轨迹不透明评估会遗漏大量安全违规和鲁棒性失败，且多模态性能差异显著。

摘要翻译

大型语言模型正日益作为自主代理被部署于现实软件环境中执行多步骤工作流。然而，现有代理基准存在三个关键局限：(1) 仅检查最终输出的轨迹不透明评分机制；(2) 安全性与鲁棒性评估规范不足；(3) 模态覆盖与交互范式狭窄。为此，我们提出Claw-Eval——一个端到端的评估套件，旨在全面解决上述缺陷。该套件包含300项经人工验证的任务，涵盖三大类别（通用服务编排、多模态感知与生成、多轮专业对话）下的9个领域。每个代理行为均通过三个独立证据通道（执行轨迹、审计日志、环境快照）记录，支持基于2,159个细粒度评分项进行轨迹感知式评分。评分协议从完成度、安全性与鲁棒性三个维度展开，通过三轮试验报告平均分、Pass@k与Pass^k指标，以区分真实能力与偶然成功。在14个前沿模型上的实验表明：(1) 轨迹不透明评估存在系统性不可靠问题，会遗漏我们混合流程所能捕获的44%安全性违规与13%鲁棒性故障；(2) 受控错误注入主要影响一致性而非峰值能力，表现为Pass^3最多下降24%而Pass@3保持稳定；(3) 多模态性能差异显著，大多数模型在视频任务上表现弱于文档或图像任务，且没有单一模型能在所有模态上全面领先。除基准测试外，Claw-Eval为代理开发指明了可实践方向，揭示了构建不仅具备能力且可可靠部署的智能代理所需的关键要素。

摘要 (Abstract)

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing poorer on video than on document or image, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.

关键词: autonomous agents, large language models, evaluation framework, safety evaluation, robustness evaluation, multimodal tasks, trajectory-aware grading, agent benchmarking

38. ❌ ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

作者: Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma, Shouren Wang, Zhihao Dou, Yuli Zhou, Vipin Chaudhary, Xiaotian Han 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06111v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于智能体评估基准的设计，与大多数大模型技术关键词无直接关系。核心相关关键词：‘LLM Agents’（10分）——论文明确评估LLM智能体；‘Chain of Thought’和’System 2 Thinking’（各8分）——论文评估智能体推理能力，涉及多步规划和深度推理；‘Tool Use’（5分）——涉及工具调用（JSON文件）；‘Large Language Models’（5分）——实验涉及多种LLM模型。其他关键词如MoE、量化、对齐等均未涉及。

!!! tip deepseek-chat TL;DR

该论文提出了ACE-Bench基准，通过可扩展任务范围和可控难度设计，解决了现有智能体评估中环境交互开销高和任务分布不平衡的问题，实现了对智能体推理能力的可解释和可控评估。

摘要翻译

现有智能体基准测试存在两个关键局限：环境交互开销过高（占总评估时间比例高达41%），以及任务时间跨度和难度分布不均衡导致综合评分不可靠。为解决这些问题，我们提出基于统一网格规划任务的ACE-Bench，要求智能体在满足局部槽位约束和全局约束的前提下，为部分完成的日程表填充隐藏槽位。该基准通过两个正交维度实现细粒度控制：可扩展时间跨度（由隐藏槽位数量$H$调控）和可控难度（由决定全局误导性干扰项数量的诱饵预算$B$调控）。在轻量化环境设计框架下，所有工具调用均通过静态JSON文件解析，既消除了环境配置开销，又实现了适用于训练时验证的快速可复现评估。我们首先验证了$H$和$B$能可靠调控任务时间跨度与难度，并证明ACE-Bench具有强领域一致性和模型区分度。随后通过对6个领域13个不同规模与架构的模型进行全面实验，发现显著的跨模型性能差异，证实ACE-Bench能为智能体推理能力提供可解释且可控的评估。

摘要 (Abstract)

Existing Agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose ACE-Bench built around a unified grid-based planning task, where agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots $H$, and Controllable Difficulty, governed by a decoy budget $B$ that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that H and B provide reliable control over task horizon and difficulty, and that ACE-Bench exhibits strong domain consistency and model discriminability. We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.

关键词: Agent Evaluation, Benchmark Design, Scalable Horizons, Controllable Difficulty, Lightweight Environment, Reasoning Assessment, Grid-based Planning, Model Discriminability

39. ❌ Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

作者: Hao Chen, Fang Qiu, Fangchao Dong, Defei Yang, Eve Bohnett, Li An 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06124v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文研究轻量级多模态适应框架，将RGB预训练的视觉语言模型（VLMs）迁移到热红外图像，用于物种识别和栖息地上下文解释。核心相关关键词：1）‘Pre-training OR Continual Pre-training OR Domain Adaptation’（10分）：论文涉及领域适应，将RGB预训练模型迁移到热图像；2）‘Post-training OR Supervised Fine-tuning OR SFT’（10分）：使用热数据集对VLMs进行微调；3）‘PEFT OR LoRA OR Parameter-efficient Fine-tuning’（10分）：采用轻量级投影器对齐，属于参数高效微调；4）‘AI for Science OR Bioinformatics OR Cheminformatics’（10分）：应用于生态监测，属于科学AI领域。其他关键词如LLMs、MoE、推理方法等与论文内容无关，得0分。加权总分：30.0（10×1.0 + 10×1.0 + 10×1.0 + 10×1.0）。

!!! tip deepseek-chat TL;DR

该论文提出了一种轻量级多模态适应框架，通过投影器对齐和微调，将RGB预训练的视觉语言模型迁移到热红外无人机图像，有效实现了物种识别和栖息地上下文解释，在生态监测中展示了实用价值。

摘要翻译

本研究提出一种轻量级多模态适配框架，旨在弥合基于RGB预训练的视觉语言模型（VLMs）与热红外影像之间的表征差异，并利用真实无人机采集数据集验证其实际效用。研究基于无人机采集影像构建了热红外数据集，通过多模态投影器对齐方法对VLMs进行微调，实现了从RGB视觉表征到热辐射度量输入的信息迁移。在封闭集与开放集提示条件下，对InternVL3-8B-Instruct、Qwen2.5-VL-7B-Instruct及Qwen3-VL-8B-Instruct三种代表性模型进行了物种识别与个体计数任务的性能评估。测试模型中，采用开放集提示的Qwen3-VL-8B-Instruct取得最佳综合性能：对鹿、犀牛、象的识别F1分数分别为0.935、0.915、0.968；个体计数的±1误差范围内准确率分别为0.779、0.982、1.000。此外，通过融合热红外影像与同步采集的RGB影像，模型能够生成包含土地覆盖特征、关键景观要素及可见人为干扰在内的生境背景信息。总体而言，本研究表明基于轻量级投影器的适配方法为将RGB预训练VLMs迁移至热红外无人机影像提供了有效且实用的路径，将其应用范围从对象级识别扩展至生态监测中的生境背景解析。

摘要 (Abstract)

This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models, including InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. In addition, combining thermal imagery with simultaneously collected RGB imagery enabled the model to generate habitat-context information, including land-cover characteristics, key landscape features, and visible human disturbance. Overall, the findings demonstrate that lightweight projector-based adaptation provides an effective and practical route for transferring RGB-pretrained VLMs to thermal drone imagery, expanding their utility from object-level recognition to habitat-context interpretation in ecological monitoring.

关键词: lightweight multimodal adaptation, vision language models, thermal infrared imagery, species recognition, habitat context interpretation, drone imagery, parameter-efficient fine-tuning, ecological monitoring

40. ❌ Gym-Anything: Turn any Software into an Agent Environment

作者: Pranjal Aggarwal, Graham Neubig, Sean Welleck 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06126v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心贡献是开发Gym-Anything框架，将任意软件转化为计算机使用环境，并构建CUA-World基准。与关键词相关性分析：1. 高度相关（10分）：‘LLM Agents/Autonomous Agents’、‘Tool Use/Function Calling’、‘Multi-agent Systems’ - 论文直接研究计算机使用代理（agents），涉及多代理系统（coding agent和audit agent）和工具使用（软件环境）。2. 中等相关（5分）：‘Large Language Models’ - 论文提到使用vision-language model进行蒸馏和评估；‘AI for Science’ - 基准涵盖医学、天文学等科学领域应用。3. 无关（0分）：其余关键词涉及大模型技术细节（如MoE、量化、注意力机制等）、训练方法（如RLHF、PEFT）或推理技术（如CoT、MCTS），论文未深入探讨这些技术原理，主要聚焦代理框架和基准构建。

!!! tip deepseek-chat TL;DR

该论文提出了Gym-Anything框架，通过多代理系统将任意软件转化为交互式计算机使用环境，并构建了涵盖广泛领域的CUA-World基准，其中蒸馏到2B视觉语言模型的性能优于更大模型，且测试时审核机制提升了代理性能。

摘要翻译

计算机使用智能体展现出协助广泛数字经济活动的潜力。然而，当前研究主要集中于经济价值有限、软件范围受限的短周期任务，例如基础电子商务和操作系统配置任务。一个关键原因在于，为复杂软件创建环境需要大量时间和人力投入，因此难以规模化。为解决这一问题，我们提出了Gym-Anything框架，该框架可将任意软件转换为交互式计算机使用环境。我们将环境创建本身构建为多智能体任务：编码智能体编写设置脚本、下载真实世界数据并配置软件，同时生成正确设置的证据；随后，独立的审计智能体依据质量检查清单验证环境设置的证据。基于美国GDP数据构建的经济价值职业分类体系，我们将此流程应用于200个覆盖广泛职业领域的软件应用程序，最终构建出CUA-World——一个包含超过1万个长周期任务的集合，涵盖从医学、天文学到工程学和企业系统等多个领域，每个任务均配置了真实数据及训练/测试集划分。CUA-World还包含CUA-World-Long挑战性长周期基准测试，其中任务通常需要超过500个步骤，远超现有基准标准。从训练集中提炼成功轨迹并蒸馏至20亿参数视觉语言模型后，其表现优于规模为其两倍的模型。我们还在测试阶段应用相同的审计原则：独立的视觉语言模型审查已完成轨迹并提供剩余任务反馈，将Gemini-3-Flash在CUA-World-Long上的成功率从11.5%提升至14.0%。我们公开所有代码、基础设施和基准数据，以促进现实场景计算机使用智能体的未来研究。

摘要 (Abstract)

Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2$\times$ its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.

关键词: computer-use agents, multi-agent systems, software environments, long-horizon tasks, vision-language models, benchmark creation, agent auditing, real-world applications

41. ❌ Artificial Intelligence and the Structure of Mathematics

作者: Maissam Barkeshli, Michael R. Douglas, Michael H. Freedman 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06107v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文是一篇哲学性、概念性的论述文章，探讨AI如何帮助理解数学的全局结构和形式证明的本质，以及数学是发现还是发明的哲学问题。文章没有涉及任何具体的大模型技术、深度学习算法、训练方法、优化技术或具体应用领域，所有关键词都是关于具体技术、方法或应用领域的，与这篇纯概念性、哲学性的论文完全无关。

!!! tip deepseek-chat TL;DR

这篇论文探讨了人工智能如何通过形式证明和结构超图来帮助理解数学的全局结构，并提出了AI模型实现自动化数学发现所需满足的标准，最终旨在回答数学是发现还是发明的哲学问题。

摘要翻译

人工智能（AI）的最新进展正在为数学领域释放变革性能力。人们寄予厚望，期待AI能够帮助解决重大开放性问题，并自主发现新的数学概念。本文进一步探讨了AI如何通过开辟一条新路径——作为对数学逻辑的补充——来理解形式证明（formal proof）的整体结构，从而为数学开启一个宏大的视角。我们首先从通用证明和结构超图的角度勾勒了数学的形式结构框架，并讨论了由此引发的关于数学基础结构的问题。随后，我们概述了实现自动化数学发现所需AI模型的主要构成要素，并提出了一组应满足的标准。当我们派遣AI智能体去探索柏拉图式数学世界（Platonic mathematical worlds）时，我们预期它们将帮助我们理解数学的本质：无论是其整体面貌，还是那些易于人类理解的小范围片段。或许它们将有助于阐明这个古老的问题：“数学是被发现的还是被发明的？”我们能否真正理解这些柏拉图式世界的地形全貌？

摘要 (Abstract)

Recent progress in artificial intelligence (AI) is unlocking transformative capabilities for mathematics. There is great hope that AI will help solve major open problems and autonomously discover new mathematical concepts. In this essay, we further consider how AI may open a grand perspective on mathematics by forging a new route, complementary to mathematical\textbf{ logic,} to understanding the global structure of formal \textbf{proof}\textbf{s}. We begin by providing a sketch of the formal structure of mathematics in terms of universal proof and structural hypergraphs and discuss questions this raises about the foundational structure of mathematics. We then outline the main ingredients and provide a set of criteria to be satisfied for AI models capable of automated mathematical discovery. As we send AI agents to traverse Platonic mathematical worlds, we expect they will teach us about the nature of mathematics: both as a whole, and the small ribbons conducive to human understanding. Perhaps they will shed light on the old question: “Is mathematics discovered or invented?” Can we grok the terrain of these \textbf{Platonic worlds}?

关键词: artificial intelligence, mathematics, formal proofs, structural hypergraphs, automated mathematical discovery, AI agents, Platonic worlds, mathematical structure

42. ❌ LLM4CodeRE: Generative AI for Code Decompilation Analysis and Reverse Engineering

作者: Hamed Jelodar, Samita Bai, Tochukwu Emmanuel Nwankwo, Parisa Hamedi, Mohammad Meymani, Roozbeh Razavi-Far, Ali A. Ghorbani 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06095v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心研究LLM在代码反编译分析中的应用，属于大模型在特定领域（网络安全/软件工程）的应用创新。高度相关的关键词包括：LLMs（论文明确使用）、Fine-tuning（提出两种微调策略）、Domain Adaptation（强调领域自适应）、PEFT（Multi-Adapter方法属于参数高效微调）。AI for Science得5分，因为代码反编译属于计算机科学应用，但非传统生物/化学科学。其余关键词如MoE、SLMs、RAG等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对恶意软件逆向工程中代码反编译的挑战，提出了一个领域自适应的LLM框架LLM4CodeRE，通过两种微调策略实现了双向代码翻译，并在实验中超越了现有工具和通用代码模型。

摘要翻译

代码反编译分析是恶意软件逆向工程中基础且具有挑战性的任务，尤其因复杂混淆技术的普遍使用而变得困难。尽管近期的大型语言模型（LLMs）在将低级表示转换为高级源代码方面展现出潜力，但现有方法大多依赖通用代码预训练，缺乏对恶意软件的针对性适配。我们提出LLM4CodeRE，一个面向双向代码逆向工程的领域自适应LLM框架，该框架在统一模型内同时支持汇编到源代码的反编译和源代码到汇编的翻译。为实现有效的任务适应，我们引入了两种互补的微调策略：（i）用于任务特定句法和语义对齐的多适配器方法，以及（ii）利用任务条件前缀以强化端到端生成约束的序列到序列统一方法。实验结果表明，LLM4CodeRE在反编译准确性和鲁棒性上优于现有反编译工具与通用代码模型，实现了稳健的双向泛化能力。

摘要 (Abstract)

Code decompilation analysis is a fundamental yet challenging task in malware reverse engineering, particularly due to the pervasive use of sophisticated obfuscation techniques. Although recent large language models (LLMs) have shown promise in translating low-level representations into high-level source code, most existing approaches rely on generic code pretraining and lack adaptation to malicious software. We propose LLM4CodeRE, a domain-adaptive LLM framework for bidirectional code reverse engineering that supports both assembly-to-source decompilation and source-to-assembly translation within a unified model. To enable effective task adaptation, we introduce two complementary fine-tuning strategies: (i) a Multi-Adapter approach for task-specific syntactic and semantic alignment, and (ii) a Seq2Seq Unified approach using task-conditioned prefixes to enforce end-to-end generation constraints. Experimental results demonstrate that LLM4CodeRE outperforms existing decompilation tools and general-purpose code models, achieving robust bidirectional generalization.

关键词: LLM, code decompilation, reverse engineering, domain adaptation, fine-tuning, malware analysis, bidirectional translation, assembly-to-source

作者: Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang, Jong C. Park 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06091v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究LLM代理在多智能体环境中的决策可靠性，核心关注社会动态对代理决策的影响。高度相关关键词：‘Large Language Models’（论文明确研究LLM代理）、‘LLM Agents’（研究LLM代理作为人类代表）、‘Multi-agent Systems’（研究多代理环境中的协调与决策）。中等相关关键词：‘Hallucination Mitigation’（涉及决策准确性和可靠性问题）、‘Mechanistic Interpretability’（涉及理解代理决策机制和社会影响）。其他关键词与论文的技术细节（如模型架构、训练方法、优化技术、特定应用领域）无关，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了在多智能体环境中，社会动态（如从众、感知专业知识、主导发言者效应和修辞说服）如何削弱作为人类代表的LLM代理的客观决策能力，实验表明随着社会压力增加，代理的准确性持续下降。

摘要翻译

大语言模型（LLM）智能体在多智能体环境中日益扮演人类代理的角色，其中代表性智能体需整合多方同伴观点以做出最终决策。受社会心理学启发，本研究探讨了网络社会情境如何削弱该代表性智能体的可靠性。我们定义了四种关键现象——社会从众性、感知专业性、主导发言者效应与修辞说服力，并系统性地操控了对抗者数量、相对智能水平、论证长度及论证风格等变量。实验表明，随着社会压力增加，代表性智能体的决策准确率持续下降：更大规模的对抗群体、更高能力的同伴以及更冗长的论证均会导致其性能显著恶化。此外，强调可信度或逻辑性的修辞策略会依据具体情境进一步左右智能体的判断。这些发现揭示出，多智能体系统不仅对个体推理敏感，更对其配置中的社会动态具有敏感性，凸显了AI代理中存在的关键脆弱性——这与人类群体决策中观察到的心理偏差如出一辙。

摘要 (Abstract)

Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision. Drawing inspiration from social psychology, we investigate how the reliability of this representative agent is undermined by the social context of its network. We define four key phenomena-social conformity, perceived expertise, dominant speaker effect, and rhetorical persuasion-and systematically manipulate the number of adversaries, relative intelligence, argument length, and argumentative styles. Our experiments demonstrate that the representative agent’s accuracy consistently declines as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all lead to significant performance degradation. Furthermore, rhetorical strategies emphasizing credibility or logic can further sway the agent’s judgment, depending on the context. These findings reveal that multi-agent systems are sensitive not only to individual reasoning but also to the social dynamics of their configuration, highlighting critical vulnerabilities in AI delegates that mirror the psychological biases observed in human group decision-making.

关键词: LLM agents, multi-agent systems, social dynamics, decision-making, social conformity, adversarial groups, performance degradation, psychological biases

44. ❌ LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

作者: Olexander Mazurets, Olexander Barmak, Leonid Bedratyuk, Iurii Krak 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06086v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心贡献是提出LAG-XAI几何框架，用于Transformer潜在空间的解释性分析，与’Mechanistic Interpretability OR Explainable AI’高度相关（10分）。论文在LLM幻觉检测中应用该框架，与’Hallucination Mitigation OR Factuality OR Truthfulness’有一定关联（8分）。论文研究Transformer模型，与’Large Language Models OR LLMs OR Foundation Models’有一般关联（5分）。其他关键词（如MoE、SFT、RAG等）在论文中未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于李仿射几何的LAG-XAI框架，用于解释Transformer模型中的释义过程，并通过几何检查在LLM幻觉检测中实现了95.3%的失真检测率。

摘要翻译

基于Transformer的现代语言模型在自然语言处理任务中表现出色，但其潜在语义空间在很大程度上仍是难以解释的黑箱。本文提出LAG-XAI（面向可解释人工智能的李仿射几何框架），这是一种新颖的几何框架，它将复述建模为嵌入空间中的结构化仿射变换，而非离散的词语替换。通过将复述概念化为语义流形上的连续几何流动，我们受局部李群作用的启发，提出了一种计算高效的平均场近似方法。这使得我们能够将复述转换分解为几何可解释的组成部分：旋转、形变和平移。在使用Sentence-BERT编码的嘈杂PIT-2015 Twitter语料库上的实验揭示了一种“线性透明性”现象。所提出的仿射算子达到了0.7713的AUC值。通过与随机基准（AUC 0.5）进行归一化比较，该模型捕获了非线性基线（AUC 0.8405）约80%的有效分类能力，以绝对精度的微小下降为代价，提供了明确的参数可解释性。该模型识别出基本的几何不变量，包括稳定的矩阵重配置角（约27.84°）和接近零的形变，表明存在局部等距特性。通过在独立的TURL数据集上进行直接的跨语料库验证，确认了其跨领域泛化能力。此外，LAG-XAI的实用性在大语言模型幻觉检测中得到验证：利用一种“廉价的几何检查”，该模型通过记录超出允许语义走廊的偏差，在HaluEval数据集上自动检测出95.3%的事实性扭曲。这种方法为Transformer的机制可解释性提供了一条基于数学原理且资源高效的路径。

摘要 (Abstract)

Modern Transformer-based language models achieve strong performance in natural language processing tasks, yet their latent semantic spaces remain largely uninterpretable black boxes. This paper introduces LAG-XAI (Lie Affine Geometry for Explainable AI), a novel geometric framework that models paraphrasing not as discrete word substitutions, but as a structured affine transformation within the embedding space. By conceptualizing paraphrasing as a continuous geometric flow on a semantic manifold, we propose a computationally efficient mean-field approximation, inspired by local Lie group actions. This allows us to decompose paraphrase transitions into geometrically interpretable components: rotation, deformation, and translation. Experiments on the noisy PIT-2015 Twitter corpus, encoded with Sentence-BERT, reveal a “linear transparency” phenomenon. The proposed affine operator achieves an AUC of 0.7713. By normalizing against random chance (AUC 0.5), the model captures approximately 80% of the non-linear baseline’s effective classification capacity (AUC 0.8405), offering explicit parametric interpretability in exchange for a marginal drop in absolute accuracy. The model identifies fundamental geometric invariants, including a stable matrix reconfiguration angle (~27.84°) and near-zero deformation, indicating local isometry. Cross-domain generalization is confirmed via direct cross-corpus validation on an independent TURL dataset. Furthermore, the practical utility of LAG-XAI is demonstrated in LLM hallucination detection: using a “cheap geometric check,” the model automatically detected 95.3% of factual distortions on the HaluEval dataset by registering deviations beyond the permissible semantic corridor. This approach provides a mathematically grounded, resource-efficient path toward the mechanistic interpretability of Transformers.

关键词: Interpretable AI, Transformer latent spaces, Geometric framework, Paraphrasing, Lie affine geometry, Hallucination detection, Mechanistic interpretability, Semantic manifold

45. ❌ Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors

作者: Junbin Zhang, Meng Cao, Feng Tan, Yikai Lin, Yuexian Zou 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06074v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文Graph-PiT专注于计算机视觉领域的图像合成任务，通过图神经网络建模部件间的空间语义关系来提升结构一致性。所有评分关键词均与大语言模型、深度学习技术原理或科学AI应用相关，而本文研究的是视觉生成中的图结构建模，与这些关键词无直接关联。

!!! tip deepseek-chat TL;DR

该论文针对基于部件的图像合成中结构一致性不足的问题，提出了Graph-PiT框架，通过图先验和分层图神经网络建模部件间关系，显著提升了生成图像的结构合理性和可控性。

摘要翻译

实现细粒度且结构合理的可控性是高级视觉生成的基石。现有基于部件的框架将用户提供的部件视为无序集合，因而忽略了其内在的空间与语义关系，这往往导致生成的组合缺乏结构完整性。为弥补这一差距，我们提出了Graph-PiT框架，该框架利用图先验显式建模视觉组件的结构依赖关系。具体而言，我们将视觉部件表示为节点，将其空间-语义关系表示为边。我们方法的核心是一个分层图神经网络（Hierarchical Graph Neural Network, HGNN）模块，该模块在粗粒度部件级超节点与细粒度IP+令牌子节点之间执行双向消息传递，从而在部件嵌入进入生成流程前对其进行优化。我们还引入了图拉普拉斯平滑损失和边重建损失，以确保相邻部件获得兼容且关系感知的嵌入。在受控合成领域（字符、产品、室内布局和拼图）的定量实验，以及对真实网络图像的定性迁移结果表明，Graph-PiT在保持与原始IP-Prior流程兼容的同时，相较于原始PiT显著提升了结构连贯性。消融实验证实，显式关系推理对于强制执行用户指定的邻接约束至关重要。我们的方法不仅增强了生成概念的合理性，还为复杂多部件图像合成提供了可扩展且可解释的机制。代码发布于https://github.com/wolf-bailang/Graph-PiT。

摘要 (Abstract)

Achieving fine-grained and structurally sound controllability is a cornerstone of advanced visual generation. Existing part-based frameworks treat user-provided parts as an unordered set and therefore ignore their intrinsic spatial and semantic relationships, which often results in compositions that lack structural integrity. To bridge this gap, we propose Graph-PiT, a framework that explicitly models the structural dependencies of visual components using a graph prior. Specifically, we represent visual parts as nodes and their spatial-semantic relationships as edges. At the heart of our method is a Hierarchical Graph Neural Network (HGNN) module that performs bidirectional message passing between coarse-grained part-level super-nodes and fine-grained IP+ token sub-nodes, refining part embeddings before they enter the generative pipeline. We also introduce a graph Laplacian smoothness loss and an edge-reconstruction loss so that adjacent parts acquire compatible, relation-aware embeddings. Quantitative experiments on controlled synthetic domains (character, product, indoor layout, and jigsaw), together with qualitative transfer to real web images, show that Graph-PiT improves structural coherence over vanilla PiT while remaining compatible with the original IP-Prior pipeline. Ablation experiments confirm that explicit relational reasoning is crucial for enforcing user-specified adjacency constraints. Our approach not only enhances the plausibility of generated concepts but also offers a scalable and interpretable mechanism for complex, multi-part image synthesis. The code is available at https://github.com/wolf-bailang/Graph-PiT.

关键词: part-based image synthesis, graph prior, structural coherence, hierarchical graph neural network, spatial-semantic relationships, controllable generation, IP-Prior pipeline

46. ❌ Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

作者: Juekai Lin, Yun Zhu, Honglin Lin, Sijing Li, Tianwei Lin, Zheng Liu, Xiaoyang Wang, Wenqiao Zhang, Lijun Wu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06079v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究大模型在科学图形程序合成中的应用，属于AI for Science领域（高度相关10分）。论文使用多模态大语言模型（LLMs）处理视觉数据（高度相关10分）。论文明确解决数据质量问题（Scaling Laws AND Data Quality相关8分）。论文提出Dual Self-Consistency Reinforcement Learning优化范式，涉及自我一致性改进（Self-Correction相关8分）。其他关键词如MoE、SLMs、微调方法、推理加速等未在论文中涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文解决了科学图形程序合成中数据质量不足和评估标准缺失的问题，通过构建高质量数据集SciTikZ-230K、基准测试SciTikZ-Bench和创新的Dual Self-Consistency Reinforcement Learning优化方法，使模型SciTikZer-8B在科学图形生成任务上超越了Gemini-2.5-Pro等大型模型。

摘要翻译

图形程序合成对于解释和编辑视觉数据至关重要，它能有效促进将静态视觉内容逆向工程转化为可编辑的TikZ代码。尽管TikZ因其编程灵活性而成为科学示意图的事实标准，但其对严格空间精度的要求对多模态大语言模型构成了重大挑战。当前进展主要受限于两大缺口：(1) 数据质量缺口：现有的图像-TikZ语料库往往缺乏严格的可执行性和可靠的视觉对齐性；(2) 评估缺口：缺乏针对结构保真度和视觉保真度的基准测试。为解决这些问题，我们提出了一个闭环框架，其核心包括：SciTikZ-230K——一个通过我们以执行为中心的数据引擎构建的大规模高质量数据集，涵盖11个不同的科学学科；SciTikZ-Bench——一个多维度基准测试集，涵盖从基础几何构造到复杂层次结构示意图，用于评估视觉保真度和结构逻辑。为了进一步拓宽视觉-代码优化方法的研究范围，我们引入了一种新颖的双重自一致性强化学习优化范式，该范式利用往返验证来惩罚退化代码并提升整体自一致性。在这些工作的支持下，我们训练的模型SciTikZer-8B实现了最先进的性能，持续超越Gemini-2.5-Pro等专有巨头模型以及Qwen3-VL-235B-A22B-Instruct等大规模模型。

摘要 (Abstract)

Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.

关键词: Graphics Program Synthesis, Multimodal Large Language Models, Scientific Schematics, Data Quality, Dual Self-Consistency Reinforcement Learning, SciTikZ-230K, SciTikZ-Bench, TikZ Code Generation

47. ❌ Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

作者: Ben Wigler, Maria Tsfasman, Tiffany Matej Hrkalovic 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06071v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs在人格模拟和评估中的应用，直接涉及LLMs和Alignment（人格对齐），与Pre-training有一定关联（提及预训练捕获的人格-语言关系），其他关键词如MoE、SLMs、Scaling Laws、SFT、RLHF、RAG等均未涉及。

!!! tip deepseek-chat TL;DR

该研究通过让LLMs基于真实心理测量档案生成人生故事，再让独立LLMs从故事中恢复人格分数，证明了LLMs能够稳健地编码和解码个体差异，其生成文本的人格特征与真实人类行为模式一致。

摘要翻译

人格特质在自然语言中有着丰富的编码，基于人类文本训练的大型语言模型（LLM）在给定人物描述的条件下能够模拟人格。然而，现有的评估主要依赖于条件化模型的问卷式自我报告，架构多样性有限，且很少使用真实的人类心理测量数据。若不解决这些局限，人格条件化究竟能产生具有心理测量学意义的个体差异表征，还是仅实现与特质描述符的表面对齐，这一问题仍不明确。为测试LLM将人格编码至长文本的稳健性，我们基于290名参与者的真实心理测量档案对LLM进行条件化处理，生成第一人称生活故事叙述，并让独立的LLM仅通过这些叙述来还原人格分数。研究表明，从生成叙述中还原人格分数的水平可接近人类重测信度（平均r = 0.750，达到人类基准的85%），且该还原能力在涵盖6家提供商的10种LLM叙述生成器和3种LLM人格评分模型中均表现稳健。对系统性偏差的分解显示，评分模型在实现准确性的同时抵消了由对齐诱导的默认倾向。对生成叙述的内容分析表明，人格条件化产生了行为差异化的文本：十项编码特征中有九项与参与者真实对话中的同项特征显著相关，且叙述中人格驱动的情感反应模式在真实对话数据中得以复现。这些发现证明，预训练过程中捕捉到的人格-语言关系支持对个体差异（包括在真实人类行为中复现的特征性情感变异模式）进行稳健的编码与解码。

摘要 (Abstract)

Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions. However, existing evaluations rely predominantly on questionnaire self-report by the conditioned model, are limited in architectural diversity, and rarely use real human psychometric data. Without addressing these limitations, it remains unclear whether personality conditioning produces psychometrically informative representations of individual differences or merely superficial alignment with trait descriptors. To test how robustly LLMs can encode personality into extended text, we condition LLMs on real psychometric profiles from 290 participants to generate first-person life story narratives, and then task independent LLMs to recover personality scores from those narratives alone. We show that personality scores can be recovered from the generated narratives at levels approaching human test-retest reliability (mean r = 0.750, 85% of the human ceiling), and that recovery is robust across 10 LLM narrative generators and 3 LLM personality scorers spanning 6 providers. Decomposing systematic biases reveals that scoring models achieve their accuracy while counteracting alignment-induced defaults. Content analysis of the generated narratives shows that personality conditioning produces behaviourally differentiated text: nine of ten coded features correlate significantly with the same features in participants’ real conversations, and personality-driven emotional reactivity patterns in narratives replicate in real conversational data. These findings provide evidence that the personality-language relationship captured during pretraining supports robust encoding and decoding of individual differences, including characteristic emotional variability patterns that replicate in real human behaviour.

关键词: Large Language Models, Personality Conditioning, Psychometric Profiles, Life Story Narratives, Personality Recovery, Individual Differences, Pre-training, Alignment

48. ❌ CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments

作者: Gustav Keppler, Moritz Gstür, Veit Hagenmeyer 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06019v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心研究LLM在网络安全（特别是工业控制系统）中的应用评估，与’Large Language Models’、‘LLM Agents’、‘Tool Use’高度相关（10分），涉及推理能力评估与’Chain of Thought’、‘System 2 Thinking’有一定关联（5分），属于AI在特定领域（工业/科学）的应用，与’AI for Science’部分相关（5分）。其他关键词如MoE、量化、训练方法等未涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文提出了CritBench框架，用于评估大型语言模型在IEC 61850数字变电站环境中的网络安全能力，发现当前模型在静态任务上表现可靠，但在动态任务中需要领域特定工具支持才能有效操作。

摘要翻译

大型语言模型（LLM）的发展引发了对其在网络安全领域双重用途潜力的担忧。现有评估框架绝大多数集中于信息技术（IT）环境，未能充分考虑运营技术（OT）的约束条件和专用协议。为弥补这一不足，我们提出了CritBench，这是一个新颖的框架，旨在评估IEC 61850数字化变电站环境中LLM智能体的网络安全能力。我们评估了五种最先进的模型，包括OpenAI的GPT-5系列模型和开源权重模型，测试范围涵盖81项领域特定任务，涉及静态配置分析、网络流量侦察以及实时虚拟机交互。为实现对工业协议的交互，我们开发了一套领域专用工具脚手架。实证结果表明，智能体能够可靠地执行静态结构化文件分析和单工具网络枚举，但在动态任务上表现下降。尽管当前模型显示出对IEC 61850标准术语明确、内化的知识，但在缺乏专用工具的情况下，它们难以完成操作实时系统所需的持续性顺序推理和状态追踪。为智能体配备我们的领域专用工具脚手架，能显著缓解这一操作瓶颈。代码与评估脚本发布于：https://github.com/GKeppler/CritBench

摘要 (Abstract)

The advancement of Large Language Models (LLMs) has raised concerns regarding their dual-use potential in cybersecurity. Existing evaluation frameworks overwhelmingly focus on Information Technology (IT) environments, failing to capture the constraints, and specialized protocols of Operational Technology (OT). To address this gap, we introduce CritBench, a novel framework designed to evaluate the cybersecurity capabilities of LLM agents within IEC 61850 Digital Substation environments. We assess five state-of-the-art models, including OpenAI’s GPT-5 suite and open-weight models, across a corpus of 81 domain-specific tasks spanning static configuration analysis, network traffic reconnaissance, and live virtual machine interaction. To facilitate industrial protocol interaction, we develop a domain-specific tool scaffold. Our empirical results show that agents reliably execute static structured-file analysis and single-tool network enumeration, but their performance degrades on dynamic tasks. Despite demonstrating explicit, internalized knowledge of the IEC 61850 standards terminology, current models struggle with the persistent sequential reasoning and state tracking required to manipulate live systems without specialized tools. Equipping agents with our domain-specific tool scaffold significantly mitigates this operational bottleneck. Code and evaluation scripts are available at: https://github.com/GKeppler/CritBench

关键词: Large Language Models, Cybersecurity, IEC 61850, Digital Substation, LLM Agents, Tool Use, Evaluation Framework, Operational Technology

49. ❌ Governance and Regulation of Artificial Intelligence in Developing Countries: A Case Study of Nigeria

作者: Uloma Okoro, Tammy Mckenzie, Branislav Radeljic 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06018v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究尼日利亚法律专业人士对AI治理的看法，聚焦于伦理风险、监管差距和制度准备度，属于AI治理和政策研究范畴。所有评分关键词均涉及大模型/深度学习的技术原理、方法、应用或性能优化，而本文完全不涉及这些技术内容，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该研究通过访谈尼日利亚法律专业人士，探讨了发展中国家AI治理的挑战，发现存在数据隐私风险、法律框架缺失和制度能力不足等问题，并强调需要本地化的治理模式而非直接套用外国框架。

摘要翻译

本研究以尼日利亚为例，探讨了法律专业人士对发展中国家人工智能治理的认知。研究聚焦于伦理风险、监管缺失与机构准备度三个维度。采用质性案例研究设计，通过对尼日利亚27位法律从业者的半结构化访谈收集数据，并额外组织了一场聚焦小组讨论，参与者涵盖金融、保险及公司法等领域的7名法律从业者。运用主题分析法识别受访者反馈中的关键模式。研究发现，受访者对数据隐私风险及缺乏可执行法律框架普遍存在担忧。参与者对现有机构能力信心不足，并强调需要建立适应当地情况的治理模式，而非直接套用国外框架。尽管部分受访者对人工智能的潜力持乐观态度，但该态度以存在强有力的法律监督与公共问责机制为前提。本研究通过聚焦法律专业人士的视角，为日益增长的发展中国家人工智能治理讨论提供了新见解。它凸显了监管路径的重要性：这些路径需具备情境针对性、包容性，并能弥合全球伦理准则与本土现实之间的鸿沟。这些发现为政策制定者、监管机构及学者在类似环境中构建负责任的人工智能治理体系提供了实践指导。

摘要 (Abstract)

This study examines the perception of legal professionals on the governance of AI in developing countries, using Nigeria as a case study. The study focused on ethical risks, regulatory gaps, and institutional readiness. The study adopted a qualitative case study design. Data were collected through 27 semi-structured interviews with legal practitioners in Nigeria. A focus group discussion was also held with seven additional legal practitioners across sectors such as finance, insurance, and corporate law. Thematic analysis was employed to identify key patterns in participant responses. Findings showed that there were concerns about data privacy risks and the lack of enforceable legal frameworks. Participants expressed limited confidence in institutional capacity and emphasized the need for locally adapted governance models rather than direct adoption of foreign frameworks. While some expressed optimism about AI’s potential, this was conditional on the presence of strong legal oversight and public accountability. The study contributes to the growing discourse on AI governance in developing countries by focusing on the perspectives of legal professionals. It highlights the importance of regulatory approaches that are context-specific, inclusive, and capable of bridging the gap between global ethical principles and local realities. These insights offer practical guidance for policymakers, regulators, and scholars working to shape responsible AI governance in similar environments.

关键词: AI governance, developing countries, Nigeria case study, legal professionals, regulatory gaps, data privacy risks, institutional capacity, context-specific regulation

50. ❌ Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

作者: Michael Cuccarese 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06013v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究LLM在生物信息学（AI for Science）领域的应用，特别是构建一个使用LLM进行推理的agentic系统（LLM Agents）来分析多个生物数据集进行药物靶点优先排序。论文主要关注LLM推理过程中数据驱动推理与记忆先验的混合问题，提出了epistemic blinding协议来审计这种污染，这直接涉及LLM的事实性（Hallucination Mitigation）和可解释性（Mechanistic Interpretability）问题。论文中提到的LLM引导的进化优化和agentic reasoning体现了多步推理（Chain of Thought）和深度推理（System 2 Thinking）的特点。其他关键词如MoE、SFT、RAG、量化等与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文提出了epistemic blinding协议，用于审计LLM辅助分析中数据驱动推理与模型记忆先验的混合问题，在肿瘤药物靶点优先排序中应用该协议改变了16%的预测结果，同时保持了已验证靶点的相同恢复率。

摘要翻译

本文提出了一种在智能体系统中应用的认知遮蔽方法，该系统利用大语言模型对多组生物数据集进行推理以实现药物靶点优先排序。在开发过程中发现，大语言模型的输出会无声地将数据驱动推理与对命名实体的记忆先验知识相融合——这种融合是不可见的：无法从单一输出中判断多少内容源于当前数据，多少源于模型训练记忆。认知遮蔽是一种简单的推理时协议，其通过在提示前将实体标识符替换为匿名编码，并将输出结果与未遮蔽的对照组进行比较。该协议虽不能使大语言模型推理变得确定，但恢复了可审计性的关键维度：衡量输出内容有多少来自所提供数据，多少来自模型的参数化知识。本文完整描述了靶点识别系统——包括大语言模型引导的评分函数进化优化以及用于靶点合理性论证的遮蔽智能体推理——并证明两个阶段均在无法获取实体身份的条件下运行。在针对四种癌症类型的肿瘤学药物靶点优先排序中，遮蔽处理改变了16%的前20位预测结果，同时保持对已验证靶点的完全识别能力。研究证实这种污染问题具有跨生物学领域的普适性：在标普500股票筛选中，品牌认知偏差导致五个随机种子下30-40%的前20位排名发生重构。为降低应用门槛，该协议已发布为开源工具及Claude Code技能，可在智能体工作流中通过单一命令实现认知遮蔽。需要强调的是，遮蔽分析并非必然产生更优结果，但若无遮蔽机制，研究者将无法确知智能体在多大程度上遵循其设计的分析流程。

摘要 (Abstract)

This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization. During development, it became apparent that LLM outputs silently blend data-driven inference with memorized priors about named entities - and the blend is invisible: there is no way to determine, from a single output, how much came from the data on the page and how much came from the model’s training memory. Epistemic blinding is a simple inference-time protocol that replaces entity identifiers with anonymous codes before prompting, then compares outputs against an unblinded control. The protocol does not make LLM reasoning deterministic, but it restores one critical axis of auditability: measuring how much of an output came from the supplied data versus the model’s parametric knowledge. The complete target identification system is described - including LLM-guided evolutionary optimization of scoring functions and blinded agentic reasoning for target rationalization - with demonstration that both stages operate without access to entity identity. In oncology drug target prioritization across four cancer types, blinding changes 16% of top-20 predictions while preserving identical recovery of validated targets. The contamination problem is shown to generalize beyond biology: in S&P 500 equity screening, brand-recognition bias reshapes 30-40% of top-20 rankings across five random seeds. To lower the barrier to adoption, the protocol is released as an open-source tool and as a Claude Code skill that enables one-command epistemic blinding within agentic workflows. The claim is not that blinded analysis produces better results, but that without blinding, there is no way to know to what degree the agent is adhering to the analytical process the researcher designed.

关键词: epistemic blinding, large language models, agentic system, drug target prioritization, prior contamination, auditability, bioinformatics, oncology

51. ❌ The Model Agreed, But Didn’t Learn: Diagnosing Surface Compliance in Large Language Models

作者: Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou, Kai Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05995v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文核心研究大语言模型（LLMs）的知识编辑问题，直接涉及’Large Language Models OR LLMs OR Foundation Models’关键词，评分为10分。论文提出在in-context learning（ICL）设置下进行诊断，与’In-context Learning OR Many-shot Learning’关键词高度相关，评分为10分。其他关键词如MoE、SLMs、Scaling Laws、Pre-training、RLHF、RAG、CoT、Agents、Quantization、Hallucination等均未在论文标题或摘要中提及或直接相关，评分为0分。

!!! tip deepseek-chat TL;DR

该论文研究发现大语言模型知识编辑中存在表面顺从现象，即模型在基准测试中能模仿目标输出但未真正修改内部信念，且递归修改会积累残留表示并损害记忆可逆性。

摘要翻译

大型语言模型（LLMs）将海量世界知识内化为参数化记忆，但不可避免地继承了其源语料的陈旧性与错误。因此，确保这些内部表征的可靠性与可塑性对于实现可信的实际部署至关重要。知识编辑提供了一种无需重新训练即可精准修改记忆的关键范式。然而，尽管近期提出的编辑方法在标准基准测试中展现出高成功率，但当前依赖特定提示条件下输出评估的验证框架能否可靠地确认真实记忆修改，仍存疑问。本研究引入了一个简单的诊断框架，使模型在上下文学习（ICL）设置下进行判别性自我评估，该设置能更好地反映实际应用环境，并专门设计用于审视记忆修改所引发的细微行为变化。这一探查揭示了一种普遍存在的“表面遵从”现象：编辑方法仅通过模仿目标输出而非从结构上覆盖内部信念，即可在基准测试中获得高分。此外，我们发现递归修改会累积表征残留，引发认知不稳定性，并永久削弱模型记忆状态的可逆性。这些发现揭示了当前编辑范式的潜在风险，并强调了稳健的记忆修改在构建可信、长期可持续的大型语言模型系统中的关键作用。代码发布于 https://github.com/XiaojieGu/SA-MCQ。

摘要 (Abstract)

Large Language Models (LLMs) internalize vast world knowledge as parametric memory, yet inevitably inherit the staleness and errors of their source corpora. Consequently, ensuring the reliability and malleability of these internal representations is imperative for trustworthy real-world deployment. Knowledge editing offers a pivotal paradigm for surgically modifying memory without retraining. However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate genuine memory modification. In this work, we introduce a simple diagnostic framework that subjects models to discriminative self-assessment under in-context learning (ICL) settings that better reflect real-world application environments, specifically designed to scrutinize the subtle behavioral nuances induced by memory modifications. This probing reveals a pervasive phenomenon of Surface Compliance, where editors achieve high benchmark scores by merely mimicking target outputs without structurally overwriting internal beliefs. Moreover, we find that recursive modifications accumulate representational residues, triggering cognitive instability and permanently diminishing the reversibility of the model’s memory state. These insights underscore the risks of current editing paradigms and highlight the pivotal role of robust memory modification in building trustworthy, long-term sustainable LLM systems. Code is available at https://github.com/XiaojieGu/SA-MCQ.

关键词: Large Language Models, Knowledge Editing, Surface Compliance, In-context Learning, Memory Modification, Representational Residues, Reversibility, Trustworthy LLM Systems

52. ❌ Flowr – Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

作者: Eranga Bandara, Ross Gore, Sachin Shetty, Piumi Siyambalapitiya, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Ravi Mukkamala, Peter Foytik, Safdar H. Bouk, Abdul Rahman, Xueping Liang, Amin Hass, Tharaka Hewa, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05987v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出Flowr框架，使用微调的专业领域大语言模型（LLMs）作为核心AI代理，通过中央推理LLM协调多代理系统（Multi-agent Systems）实现供应链自动化。摘要明确提到’fine-tuned, domain-specialized large language models’，因此与’Large Language Models’和’Post-training/SFT’高度相关（10分）。框架基于’agentic AI’和’AI agents’，与’LLM Agents’和’Multi-agent Systems’高度相关（10分）。其他关键词如MoE、SLMs、RAG、CoT等未在摘要中提及或与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了Flowr框架，通过基于微调大语言模型的多代理系统自动化大型超市连锁店的端到端零售供应链操作，显著减少了人工协调开销并改善了供需匹配。

摘要翻译

大型连锁超市的零售供应链运营涉及从需求预测、采购、供应商协调到库存补货等一系列持续、大批量的手动工作流程，这些流程具有重复性、决策密集且难以在缺乏大量人力投入的情况下实现规模化扩展。尽管数据分析领域的投资日益增长，但这些工作流程中的决策与协调环节仍主要依赖人工操作，呈现被动响应状态，并在门店、配送中心与供应商网络之间处于割裂状态。本文提出Flowr——一种创新的智能体人工智能框架，旨在实现大规模超市运营中端到端零售供应链工作流程的自动化。Flowr将人工供应链操作系统地分解为多个专业化的AI智能体，每个智能体承担明确定义的认知角色，从而实现对以往依赖持续人工协调的流程的自动化。为确保任务准确性并遵循负责任的人工智能原则，该框架采用由经过微调、领域专业化的大型语言模型组成的协作体系，并由一个核心推理型大型语言模型进行协调。该框架的核心在于人机协同编排模型：供应链管理者通过支持模型上下文协议的界面，在工作流各阶段进行监督与干预，从而保持问责制与组织控制力。评估表明，Flowr能显著降低人工协调成本，改善供需匹配度，并实现主动化异常处理，其处理规模远超人工流程能力范围。该框架已与一家大型连锁超市合作完成验证，且具备领域无关性，为大规模企业环境中基于智能体人工智能的供应链自动化提供了可推广的蓝图。

摘要 (Abstract)

Retail supply chain operations in supermarket chains involve continuous, high-volume manual workflows spanning demand forecasting, procurement, supplier coordination, and inventory replenishment, processes that are repetitive, decision-intensive, and difficult to scale without significant human effort. Despite growing investment in data analytics, the decision-making and coordination layers of these workflows remain predominantly manual, reactive, and fragmented across outlets, distribution centers, and supplier networks. This paper introduces Flowr, a novel agentic AI framework for automating end-to-end retail supply chain workflows in large-scale supermarket operations. Flowr systematically decomposes manual supply chain operations into specialized AI agents, each responsible for a clearly defined cognitive role, enabling automation of processes previously dependent on continuous human coordination. To ensure task accuracy and adherence to responsible AI principles, the framework employs a consortium of fine-tuned, domain-specialized large language models coordinated by a central reasoning LLM. Central to the framework is a human-in-the-loop orchestration model in which supply chain managers supervise and intervene across workflow stages via a Model Context Protocol (MCP)-enabled interface, preserving accountability and organizational control. Evaluation demonstrates that Flowr significantly reduces manual coordination overhead, improves demand-supply alignment, and enables proactive exception handling at a scale unachievable through manual processes. The framework was validated in collaboration with a large-scale supermarket chain and is domain-independent, offering a generalizable blueprint for agentic AI-driven supply chain automation across large-scale enterprise settings.

关键词: agentic AI, large language models, supply chain automation, multi-agent systems, retail operations, fine-tuned LLMs, human-in-the-loop, workflow orchestration

53. ❌ A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms

作者: Nirajan Acharya, Gaurav Kumar Gupta 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05969v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于MCP-based AI agents的安全框架，与LLM agents、tool use高度相关（10分），涉及multi-agent systems（5分），但未涉及其他技术原理或科学应用关键词。

!!! tip deepseek-chat TL;DR

该论文针对MCP-based AI agents缺乏统一安全框架的问题，提出了MCPSHIELD框架，包括威胁分类、验证模型和防御机制，理论覆盖率达到91%的威胁场景。

摘要翻译

模型上下文协议（Model Context Protocol，MCP）由Anthropic于2024年11月提出，现由Linux基金会的Agentic AI Foundation管理，已迅速成为连接基于大语言模型（LLM）的智能体与外部工具及数据源的事实标准，其SDK月下载量超过9700万次，注册工具数量逾17.7万个。然而，这种爆发式应用暴露了一个关键缺陷：缺乏一个统一的形式化安全框架，能够系统性地表征、分析和缓解基于MCP的智能体生态系统所面临的各种威胁。现有安全研究仍分散于独立的攻击论文、孤立的基准测试和单点防御机制中。本文提出了MCPSHIELD，一个针对基于MCP的AI智能体的全面形式化安全框架。我们做出了四项主要贡献：（1）一个分层的威胁分类法，包含7个威胁类别和23个不同的攻击向量，组织在四个攻击面上，其基础是对超过17.7万个MCP工具的分析；（2）一个基于带信任边界标注的标签转移系统的形式化验证模型，支持对MCP工具交互链进行静态和运行时分析；（3）对12种现有防御机制的系统性比较评估，揭示了其在我们威胁分类法中的覆盖缺口；（4）一个纵深防御参考架构，整合了基于能力的访问控制、密码学工具证明、信息流追踪和运行时策略执行。我们的分析表明，现有单一防御最多只能覆盖已识别威胁版图的34%，而MCPSHIELD的集成架构理论上可实现91%的覆盖率。我们进一步指出了七个待解决的开放性研究挑战，这些挑战是确保下一代自主AI系统安全所必须应对的。

摘要 (Abstract)

The Model Context Protocol (MCP), introduced by Anthropic in November 2024 and now governed by the Linux Foundation’s Agentic AI Foundation, has rapidly become the de facto standard for connecting large language model (LLM)-based agents to external tools and data sources, with over 97 million monthly SDK downloads and more than 177000 registered tools. However, this explosive adoption has exposed a critical gap: the absence of a unified, formal security framework capable of systematically characterizing, analyzing, and mitigating the diverse threats facing MCP-based agent ecosystems. Existing security research remains fragmented across individual attack papers, isolated benchmarks, and point defense mechanisms. This paper presents MCPSHIELD, a comprehensive formal security framework for MCP-based AI agents. We make four principal contributions: (1) a hierarchical threat taxonomy comprising 7 threat categories and 23 distinct attack vectors organized across four attack surfaces, grounded in the analysis of over 177000 MCP tools; (2) a formal verification model based on labeled transition systems with trust boundary annotations that enables static and runtime analysis of MCP tool interaction chains; (3) a systematic comparative evaluation of 12 existing defense mechanisms, identifying coverage gaps across our threat taxonomy; and (4) a defense in depth reference architecture integrating capability based access control, cryptographic tool attestation, information flow tracking, and runtime policy enforcement. Our analysis reveals that no existing single defense covers more than 34 percent of the identified threat landscape, whereas MCPSHIELD’s integrated architecture achieves theoretical coverage of 91 percent. We further identify seven open research challenges that must be addressed to secure the next generation of agentic AI systems.

关键词: MCP, AI agents, security framework, threat taxonomy, formal verification, defense mechanisms, tool interaction, trust boundary

54. ❌ Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

作者: Renxuan Tan, Rongpeng Li, Zhifeng Zhao, Honggang Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05965v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM的多目标偏好对齐（MPA），属于LLM对齐和RLHF/DPO等偏好优化技术的范畴，因此与’Large Language Models’、‘Instruction Tuning/Alignment’和’RLHF/RLAIF/DPO’高度相关（10分）。论文未涉及其他关键词如MoE、SLMs、Scaling Laws、PEFT、RAG、推理加速、AI for Science等具体技术或应用领域，故相关度为0分。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型在多目标偏好对齐中易陷入局部最优的问题，提出了一个基于博弈论的Pareto-Lenient Consensus框架，通过动态容忍局部退化来探索更优的Pareto前沿，在实验中超越了现有基线方法。

摘要翻译

超越单一偏好范式，将大语言模型与多样化人类价值观对齐对于稳健部署至关重要。当代多目标偏好对齐方法主要依赖静态线性标量化或刚性梯度投影来权衡这些目标。然而，通过强制严格的冲突避免或同步下降，这些范式往往过早收敛至局部驻点。尽管这些点在数学上是稳定的，但它们代表了一种保守的妥协——模型为避免暂时的局部权衡而牺牲了潜在的全局帕累托改进。为打破这一僵局，我们提出帕累托宽容共识，这是一个博弈论框架，将对齐重新构想为一个动态协商过程。与刚性方法不同，PLC引入了共识驱动的宽容梯度校正机制，该机制在存在足够主导联盟盈余的条件下，动态容忍局部性能退化，从而使优化轨迹能够逃离局部次优均衡，探索远端的帕累托最优前沿。理论分析验证了PLC能够促进僵局逃逸并渐近收敛至帕累托共识均衡。此外，大量实验表明，PLC在固定偏好对齐和全局帕累托前沿质量方面均超越基线方法。这项工作凸显了协商驱动对齐作为MPA研究方向的潜力。我们的代码公开于 https://anonymous.4open.science/r/aaa-6BB8。

摘要 (Abstract)

Transcending the single-preference paradigm, aligning LLMs with diverse human values is pivotal for robust deployment. Contemporary Multi-Objective Preference Alignment (MPA) approaches predominantly rely on static linear scalarization or rigid gradient projection to navigate these trade-offs. However, by enforcing strict conflict avoidance or simultaneous descent, these paradigms often prematurely converge to local stationary points. While mathematically stable, these points represent a conservative compromise where the model sacrifices potential global Pareto improvements to avoid transient local trade-offs. To break this deadlock, we propose Pareto-Lenient Consensus (PLC), a game-theoretic framework that reimagines alignment as a dynamic negotiation process. Unlike rigid approaches, PLC introduces consensus-driven lenient gradient rectification, which dynamically tolerates local degradation provided there is a sufficient dominant coalition surplus, thereby empowering the optimization trajectory to escape local suboptimal equilibrium and explore the distal Pareto-optimal frontier. Theoretical analysis validates PLC can facilitate stalemate escape and asymptotically converge to a Pareto consensus equilibrium. Moreover, extensive experiments show that PLC surpasses baselines in both fixed-preference alignment and global Pareto frontier quality. This work highlights the potential of negotiation-driven alignment as a promising avenue for MPA. Our codes are available at https://anonymous.4open.science/r/aaa-6BB8.

关键词: Multi-Preference LLM Alignment, Pareto-Lenient Consensus, Multi-Objective Preference Alignment, Game-Theoretic Framework, Pareto-Optimal Frontier, Gradient Rectification, Consensus Equilibrium, Alignment Negotiation

55. ❌ Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

作者: Kai Yu, Zhenhao Zhou, Junhao Zeng, Ying Wang, Xueying Du, Zhiqiang Yuan, Junwei Liu, Ziyu Zhou, Yujia Wang, Chong Wang, Xin Peng 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05955v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM-based agents在软件工程问题解决中的表现评估，特别是设计约束合规性，与’Large Language Models’和’LLM Agents’高度相关（10分），因为论文直接评估LLM代理在代码补丁生成中的能力。其他关键词如MoE、SLMs、训练技术、推理优化、AI for Science等均未涉及，故评0分。

!!! tip deepseek-chat TL;DR

该论文研究发现，在基于LLM的软件问题解决中，仅凭测试通过率会高估补丁质量，因为超过一半的已解决问题未能满足项目特定的设计约束，揭示了当前代理能力在理解隐含设计要求方面的根本差距。

摘要翻译

代码库级别的问题解决基准已成为评估基于大语言模型智能体的标准测试平台，但成功与否仍主要依据测试通过率来衡量。然而在实践中，可接受的补丁还必须符合项目特定的设计约束，例如架构规范、错误处理策略和可维护性要求——这些约束很少被编码在测试中，通常仅隐含地记录在代码审查讨论中。本文提出了“设计感知问题解决”的概念，并介绍\bench{}基准，该基准使此类隐性设计约束变得显式且可量化。\bench{}通过从真实世界拉取请求中挖掘并验证设计约束、将其与具体问题实例关联，并利用基于大语言模型的验证器自动检查补丁合规性构建而成，最终涵盖六个代码库的495个问题及1,787条已验证约束，其构建标准与SWE-bench-Verified和SWE-bench-Pro保持一致。对前沿智能体的实验表明，基于测试的正确性评估会显著高估补丁质量：仅有不到半数已解决的问题完全满足设计约束，设计违规现象普遍存在，且功能正确性与设计满足度之间几乎不存在统计关联。尽管提供针对具体问题的设计指导能减少违规行为，但大量不合规现象依然存在，这揭示了当前智能体能力的根本缺陷，并推动我们需要超越功能正确性、进行设计感知的评估。

摘要 (Abstract)

Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces \textit{design-aware issue resolution} and presents \bench{}, a benchmark that makes such implicit design constraints explicit and measurable. \bench{} is constructed by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier, yielding 495 issues and 1,787 validated constraints across six repositories, aligned with SWE-bench-Verified and SWE-bench-Pro. Experiments with state-of-the-art agents show that test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. While providing issue-specific design guidance reduces violations, substantial non-compliance remains, highlighting a fundamental gap in current agent capabilities and motivating design-aware evaluation beyond functional correctness.

关键词: LLM-based agents, issue resolution, design constraints, benchmark evaluation, patch compliance, software engineering, test pass rates, code review

56. ❌ Polynomial-Time Algorithm for Thiele Voting Rules with Voter Interval Preferences

作者: Pasin Manurangsi, Krzysztof Sornat 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05953v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究计算社会选择理论中的Thiele投票规则算法问题，属于理论计算机科学领域，与所有评分关键词（均涉及大模型、深度学习技术原理及应用）完全无关。论文摘要中提到的’human-AI collaboration’和’Gemini Deep Think’仅指研究过程中使用了AI工具辅助，并非论文的研究内容或技术贡献。

!!! tip deepseek-chat TL;DR

该论文解决了Voter Interval偏好下Thiele投票规则的最优委员会计算问题，提出了多项式时间算法并证明了其正确性。

摘要翻译

我们提出了一种多项式时间算法，用于在选民区间域（即选民可被排序，使得每位候选人获得一段连续选民的支持）上的选举中，计算任意给定蒂勒投票规则下规模为$k$的最优委员会。我们的结果可推广至广义蒂勒规则，其中每位选民拥有独立的权重（计分）序列。这解决了一个存在十年的开放性问题，该问题最初针对比例批准投票提出，后来扩展至所有蒂勒规则（Elkind and Lackner, IJCAI 2015; Peters, AAAI 2018）。
我们的核心技术贡献是一个新的结构结果——区间族的凹性定理。该定理表明，给定两个不同规模的解，可以构造出任意中间规模的解，其得分至少等于两个得分的对应线性插值。由此推论，在选民区间配置下，最优总蒂勒得分是委员会规模的凹函数。我们基于自然整数线性规划公式的拉格朗日松弛优化框架，通过将基数约束移入目标函数，利用了这一凹性。在选民区间配置中，所得约束矩阵是完全幺模的，因此可在多项式时间内求解。
我们的主要算法及其证明是通过人机协作完成的。具体而言，算法所使用的主要结构定理的一个稍简版本，是通过单次调用Gemini Deep Think获得的。

摘要 (Abstract)

We present a polynomial-time algorithm for computing an optimal committee of size $k$ under any given Thiele voting rule for elections on the Voter Interval domain (i.e., when voters can be ordered so that each candidate is approved by a consecutive voters). Our result extends to the Generalized Thiele rule, in which each voter has an individual weight (scoring) sequence. This resolves a 10-year-old open problem that was originally posed for Proportional Approval Voting and later extended to every Thiele rule (Elkind and Lackner, IJCAI 2015; Peters, AAAI 2018). Our main technical ingredient is a new structural result – a concavity theorem for families of intervals. It shows that, given two solutions of different sizes, one can construct a solution of any intermediate size whose score is at least the corresponding linear interpolation of the two scores. As a consequence, on Voter Interval profiles, the optimal total Thiele score is a concave function of the committee size. We exploit this concavity within an optimization framework based on a Lagrangian relaxation of a natural integer linear program formulation, obtained by moving the cardinality constraint into the objective. On Voter Interval profiles, the resulting constraint matrix is totally unimodular, so it can be solved in polynomial time. Our main algorithm and its proof were obtained via human–AI collaboration. In particular, a slightly simplified version of the main structural theorem used by the algorithm was obtained in a single call to Gemini Deep Think.

关键词: Thiele voting rules, Voter Interval preferences, polynomial-time algorithm, optimal committee, computational social choice, concavity theorem, integer linear programming, Lagrangian relaxation

57. ❌ Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration

作者: Yi Yuan, Xuhong Wang, Shanzhe Lei 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05952v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	15.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文提出了一种深度研究智能体，专注于生成可信的研究报告，通过渐进式置信度估计和校准来解决幻觉和可信度问题。核心相关关键词包括：LLM Agents（15分，论文核心研究智能体系统）、Hallucination Mitigation（15分，直接解决幻觉和可信度问题）、Retrieval-Augmented Generation（10分，使用深度检索和多跳推理）、Chain of Thought（10分，多步推理过程）、Explainable AI（10分，增强透明度和可解释性）、Self-Correction（8分，置信度校准机制）、System 2 Thinking（8分，深思熟虑的搜索模型）、Large Language Models（8分，基于大模型的智能体）。其他关键词如AI for Science（5分，应用于研究领域）和Tool Use（5分，智能体工具使用）有间接关联。其余关键词与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文针对开放研究场景中深度研究智能体生成报告的可信度问题，提出了一种结合渐进式置信度估计和校准的新方法，通过深思熟虑的搜索模型和多步推理显著提高了生成报告的透明度和用户信任度。

摘要翻译

随着基于智能体的系统不断发展，深度研究智能体已能够在多领域自动生成研究式报告。尽管这类智能体有望简化信息整合与知识探索流程，但现有评估框架——通常基于主观维度——未能捕捉报告质量的一个关键方面：可信度。在缺乏标准答案的开放式研究场景中，当前评估方法无法有效衡量生成内容的认识论置信度，导致校准困难，并可能使用户受到误导性或虚构信息的影响。为突破这一局限，我们提出一种新型深度研究智能体，其在报告生成流程中融入了渐进式置信度估计与校准机制。该系统采用审慎搜索模型，通过深度检索与多跳推理将输出内容锚定于可验证证据，同时为每个独立主张分配置信度评分。结合精心设计的工作流程，该方法能够生成具有更高透明度的可信报告。实验结果与案例研究表明，我们的方法显著提升了报告的可解释性，并大幅增强了用户信任度。

摘要 (Abstract)

As agent-based systems continue to evolve, deep research agents are capable of automatically generating research-style reports across diverse domains. While these agents promise to streamline information synthesis and knowledge exploration, existing evaluation frameworks-typically based on subjective dimensions-fail to capture a critical aspect of report quality: trustworthiness. In open-ended research scenarios where ground-truth answers are unavailable, current evaluation methods cannot effectively measure the epistemic confidence of generated content, making calibration difficult and leaving users susceptible to misleading or hallucinated information. To address this limitation, we propose a novel deep research agent that incorporates progressive confidence estimation and calibration within the report generation pipeline. Our system leverages a deliberative search model, featuring deep retrieval and multi-hop reasoning to ground outputs in verifiable evidence while assigning confidence scores to individual claims. Combined with a carefully designed workflow, this approach produces trustworthy reports with enhanced transparency. Experimental results and case studies demonstrate that our method substantially improves interpretability and significantly increases user trust.

关键词: deep research agent, trustworthy report generation, progressive confidence estimation, confidence calibration, deliberative search model, multi-hop reasoning, hallucination mitigation, interpretability

58. ❌ MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

作者: Maria Nesterova, Mikhail Kolosov, Anton Andreychuk, Egor Cherepanov, Oleg Bulichev, Alexey Kovalev, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05943v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文MARL-GPT提出了一种基于GPT的通用多智能体强化学习基础模型，与’Large Language Models/Foundation Models’高度相关（10分），因为论文明确使用GPT-based模型作为基础架构。与’LLM Agents/Autonomous Agents’和’Multi-agent Systems/Agent Coordination’高度相关（10分），因为论文专注于多智能体强化学习环境中的智能体协调。与’Pre-training/Domain Adaptation’相关（8分），因为模型使用离线强化学习在大规模专家轨迹上进行训练，涉及预训练和跨领域适应。与’Scaling Laws AND Data Quality’有一定关联（5分），因为论文使用了大规模数据集（400M-1B轨迹），但未深入讨论数据质量与缩放规律的关系。与’World Models AND General World Models’有一定关联（5分），因为论文旨在构建通用的多智能体模型，类似于通用世界模型的概念。其他关键词如MoE、SFT、RLHF、RAG等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出MARL-GPT，一种基于GPT的通用多智能体强化学习基础模型，通过离线强化学习在大规模专家轨迹上训练，在StarCraft、Google Research Football和POGEMA等多种环境中实现了与专用基线模型竞争的性能。

摘要翻译

近年来，多智能体强化学习（MARL）在众多具有挑战性的领域和环境中取得了成功，但通常需要为每个任务设计专门的模型。在本研究中，我们提出了一种连贯的方法论，使得单个基于GPT的模型能够跨多种不同的MARL环境和任务（包括《星际争霸》多智能体挑战赛、谷歌研究足球和POGEMA）进行学习并表现优异。我们的方法MARL-GPT，通过离线强化学习在专家轨迹数据（SMACv2为4亿条，GRF为1亿条，POGEMA为10亿条）上进行大规模训练，并结合一个无需任务特定调优的、基于Transformer的单一观测编码器。实验表明，与各环境中专门的基线模型相比，MARL-GPT均取得了具有竞争力的性能。因此，我们的研究结果表明，确实有可能构建一个基于Transformer的多任务模型，以应对多种（显著不同的）多智能体问题，这为通向基础MARL模型（类似于自然语言建模中的ChatGPT、Llama、Mistral等）铺平了道路。

摘要 (Abstract)

Recent advances in multi-agent reinforcement learning (MARL) have demonstrated success in numerous challenging domains and environments, but typically require specialized models for each task. In this work, we propose a coherent methodology that makes it possible for a single GPT-based model to learn and perform well across diverse MARL environments and tasks, including StarCraft Multi-Agent Challenge, Google Research Football and POGEMA. Our method, MARL-GPT, applies offline reinforcement learning to train at scale on the expert trajectories (400M for SMACv2, 100M for GRF, and 1B for POGEMA) combined with a single transformer-based observation encoder that requires no task-specific tuning. Experiments show that MARL-GPT achieves competitive performance compared to specialized baselines in all tested environments. Thus, our findings suggest that it is, indeed, possible to build a multi-task transformer-based model for a wide variety of (significantly different) multi-agent problems paving the way to the fundamental MARL model (akin to ChatGPT, Llama, Mistral etc. in natural language modeling).

关键词: Multi-agent reinforcement learning, Foundation model, GPT-based model, Transformer-based encoder, Offline reinforcement learning, Expert trajectories, Multi-task learning, MARL environments

59. ❌ Context-Value-Action Architecture for Value-Driven Large Language Model Agents

作者: TianZe Zhang, Sirui Sun, Yuhang Xie, Xin Zhang, Zhiqiang Wu, Guojie Song 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05939v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM智能体（LLM Agents）的行为模拟和价值对齐（Value Alignment）问题，提出CVA架构来提升行为保真度和可解释性，因此这两个关键词高度相关（10分）。论文涉及推理过程（Chain of Thought/System 2 Thinking）但非核心，给8分。论文提到可解释性（Explainable AI），有一定关联给5分。其他关键词如MoE、SFT、RAG等未涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文针对现有大型语言模型智能体行为僵化的问题，提出了基于刺激-有机体-响应模型和施瓦茨基本人类价值理论的Context-Value-Action架构，通过解耦动作生成与认知推理并引入基于真实人类数据训练的价值验证器，在超过110万真实交互轨迹的CVABench上实验证明，该架构能有效缓解价值极化，同时提供更好的行为保真度和可解释性。

摘要翻译

大语言模型（LLM）在模拟人类行为方面展现出潜力，然而现有智能体常表现出行为僵化的问题，这一缺陷常被当前“以LLM为评判者”评估中的自我参照偏差所掩盖。通过基于实证基准进行评估，我们揭示了一个反直觉的现象：增强提示驱动的推理强度并不会提高行为拟真度，反而会加剧价值极化，导致群体多样性坍缩。为解决此问题，我们提出了基于刺激-机体-反应（S-O-R）模型和施瓦茨基本人类价值理论的语境-价值-行动（Context-Value-Action, CVA）架构。与依赖自我验证的方法不同，CVA通过一个基于真实人类数据训练的新型价值验证器，将行动生成与认知推理解耦，以显式建模动态价值激活机制。在包含超过110万条真实世界交互轨迹的CVABench数据集上的实验表明，CVA显著优于基线模型。我们的方法在有效缓解价值极化的同时，提供了更优的行为拟真度与可解释性。

摘要 (Abstract)

Large Language Models (LLMs) have shown promise in simulating human behavior, yet existing agents often exhibit behavioral rigidity, a flaw frequently masked by the self-referential bias of current “LLM-as-a-judge” evaluations. By evaluating against empirical ground truth, we reveal a counter-intuitive phenomenon: increasing the intensity of prompt-driven reasoning does not enhance fidelity but rather exacerbates value polarization, collapsing population diversity. To address this, we propose the Context-Value-Action (CVA) architecture, grounded in the Stimulus-Organism-Response (S-O-R) model and Schwartz’s Theory of Basic Human Values. Unlike methods relying on self-verification, CVA decouples action generation from cognitive reasoning via a novel Value Verifier trained on authentic human data to explicitly model dynamic value activation. Experiments on CVABench, which comprises over 1.1 million real-world interaction traces, demonstrate that CVA significantly outperforms baselines. Our approach effectively mitigates polarization while offering superior behavioral fidelity and interpretability.

关键词: Large Language Model Agents, Value Alignment, Behavioral Fidelity, Context-Value-Action Architecture, Value Polarization, Interpretability, Stimulus-Organism-Response Model, Schwartz’s Theory of Basic Human Values

60. ❌ Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning

作者: Jingbo Sun, Qichao Zhang, Songjun Tu, Xing Fang, Yupeng Zheng, Haoran Li, Ke Chen, Dongbin Zhao 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05931v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于视觉无监督强化学习（URL）中的表示学习和策略学习，特别是针对零样本泛化问题。论文的核心技术是后继表示（SR）和提出的SRCP框架，涉及强化学习、表示学习、视觉环境、零样本泛化等主题。所有给定的关键词均与大语言模型（LLM）及其相关技术（如微调、对齐、推理、代理等）或特定科学领域AI应用相关，而本文完全不涉及大语言模型或深度学习在科学领域的应用，也未讨论大模型技术原理的创新。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对视觉无监督强化学习中后继表示方法在零样本泛化上的局限性，提出了一个结合显著性引导表示和一致性策略学习的新框架SRCP，在多个基准任务上实现了最先进的零样本泛化性能。

摘要翻译

零样本无监督强化学习（URL）为构建能够泛化至未见任务且无需额外监督的通才智能体提供了一个有前景的方向。在现有方法中，后继表示（SR）因其在结构化、低维环境中的有效性而成为一个重要范式。然而，SR方法难以扩展至高维视觉环境。通过实证分析，我们识别了SR在视觉URL中的两个关键局限：（1）SR目标常导致次优表示，这些表示关注与动态无关的区域，从而产生不准确的后继度量并损害任务泛化能力；（2）这些有缺陷的表示阻碍了SR策略对多模态技能条件动作分布的建模，并影响了技能可控性。为应对这些局限，我们提出了基于显著性引导表示与一致性策略学习（SRCP）的新框架，以提升SR方法在视觉URL中的零样本泛化能力。SRCP通过引入显著性引导的动态任务来捕获与动态相关的表示，从而将表示学习与后继训练解耦，以此改进后继度量和任务泛化。此外，它整合了快速采样一致性策略，结合URL特定的无分类器引导机制和定制化训练目标，以提升技能条件策略的建模能力和可控性。在ExORL基准测试中，跨越4个数据集的16项任务上的大量实验表明，SRCP在视觉URL中实现了最先进的零样本泛化性能，并且与多种SR方法兼容。

摘要 (Abstract)

Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual URL: (1) SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures and degraded task generalization; and (2) these flawed representations hinder SR policies from modeling multi-modal skill-conditioned action distributions and ensuring skill controllability. To address these limitations, we propose Saliency-Guided Representation with Consistency Policy Learning (SRCP), a novel framework that improves zero-shot generalization of SR methods in visual URL. SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability. Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark demonstrate that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.

关键词: unsupervised reinforcement learning, successor representations, zero-shot generalization, visual environments, saliency-guided representation, consistency policy learning, skill controllability, ExORL benchmark

61. ❌ “I See What You Did There”: Can Large Vision-Language Models Understand Multimodal Puns?

作者: Naen Xu, Jiayi Sheng, Changjiang Li, Chunyi Zhou, Yuyuan Li, Tianyu Du, Jun Wang, Zhihui Fu, Jinbao Li, Shouling Ji 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05930v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究视觉语言模型（VLMs）对多模态双关语的理解能力，属于大模型在特定应用领域（多模态理解）的研究。论文核心是评估现有VLMs的局限性并提出改进方法，而非开发新的模型架构或训练技术。因此，仅与’Large Language Models OR LLMs OR Foundation Models’有一定关联（VLMs可视为大模型的一种），评5分；与其他关键词（如MoE、Scaling Laws、训练方法、推理优化、代理系统等）无直接关联，均评0分。加权总分计算为5.0分。

!!! tip deepseek-chat TL;DR

该论文研究了视觉语言模型理解多模态双关语的能力，发现现有模型难以区分真实双关语与对抗性干扰项，并通过提出新策略将F1分数平均提升了16.5%。

摘要翻译

双关是一种常见的修辞性文字游戏，它利用多义性和语音相似性来制造幽默效果。在多模态双关中，视觉与文本元素协同作用，同时建立字面意义并唤起比喻含义。尽管视觉语言模型（Vision-Language Models, VLMs）已广泛应用于多模态理解与生成任务，但由于缺乏严谨的基准测试，其理解双关的能力尚未得到系统研究。为此，我们首先提出了一种多模态双关生成流程。随后，我们引入了MultiPun数据集，该数据集包含多种类型的双关以及对抗性的非双关干扰项。评估结果表明，大多数模型难以区分真实双关与这些干扰项。此外，我们提出了提示层面和模型层面的策略以增强双关理解能力，平均F1分数提升了16.5%。我们的研究结果为开发未来能够通过跨模态推理掌握类人幽默微妙之处的视觉语言模型提供了重要参考。

摘要 (Abstract)

Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.

关键词: Vision-Language Models, multimodal puns, cross-modal reasoning, humor understanding, benchmark evaluation, adversarial distractors, pun comprehension, MultiPun dataset

62. ❌ ReLU Networks for Exact Generation of Similar Graphs

作者: Mamoona Ghafoor, Tatsuya Akutsu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05929v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究ReLU神经网络在精确生成具有特定图编辑距离约束的图方面的理论能力，属于图生成和图神经网络领域。论文中明确提到应用场景包括cheminformatics（化学信息学），这与关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联，因此给予5分。然而，论文的核心内容并非大模型、深度学习技术原理创新或大模型在不同领域的应用，而是专注于特定类型的神经网络（ReLU网络）在图生成中的理论性质，与其余所有关键词（均围绕大模型、LLM相关技术、训练方法、推理优化、对齐、代理等）完全无关，因此均给予0分。

!!! tip deepseek-chat TL;DR

该论文理论证明了存在常数深度和O(n²d)规模的ReLU神经网络，能够确定性地生成与给定输入图在编辑距离d内的有效图，为构建具有保证有效性的紧凑生成模型提供了理论基础。

摘要翻译

在化学信息学、网络异常合成和结构化数据增强等应用中，生成与源图保持特定图编辑距离约束的图具有重要意义。尽管在分子设计和网络扰动分析等领域对此类约束生成模型的需求日益增长，但能够可证明地在有界图编辑距离内生成图的神经架构仍很大程度上未被探索。此外，现有的图生成模型主要依赖于数据驱动，并严重依赖训练数据的可用性和质量，这可能导致生成的图无法满足所需的编辑距离约束。本文通过从理论上刻画能够生成与给定图保持规定图编辑距离的ReLU神经网络，应对这些挑战。具体而言，我们证明了存在常数深度和O(n^2 d)规模的ReLU网络，能够确定性地生成与具有n个顶点的给定输入图编辑距离不超过d的图，从而在保证生成图有效性的同时消除了对训练数据的依赖。实验评估表明，所提出的网络成功地为顶点数高达1400、编辑距离约束高达140的实例生成了有效图，而基线生成模型则无法生成满足所需编辑距离的图。这些结果为构建具有可保证有效性的紧凑生成模型提供了理论基础。

摘要 (Abstract)

Generation of graphs constrained by a specified graph edit distance from a source graph is important in applications such as cheminformatics, network anomaly synthesis, and structured data augmentation. Despite the growing demand for such constrained generative models in areas including molecule design and network perturbation analysis, the neural architectures required to provably generate graphs within a bounded graph edit distance remain largely unexplored. In addition, existing graph generative models are predominantly data-driven and depend heavily on the availability and quality of training data, which may result in generated graphs that do not satisfy the desired edit distance constraints. In this paper, we address these challenges by theoretically characterizing ReLU neural networks capable of generating graphs within a prescribed graph edit distance from a given graph. In particular, we show the existence of constant depth and O(n^2 d) size ReLU networks that deterministically generate graphs within edit distance d from a given input graph with n vertices, eliminating reliance on training data while guaranteeing validity of the generated graphs. Experimental evaluations demonstrate that the proposed network successfully generates valid graphs for instances with up to 1400 vertices and edit distance bounds up to 140, whereas baseline generative models fail to generate graphs with the desired edit distance. These results provide a theoretical foundation for constructing compact generative models with guaranteed validity.

关键词: Graph Generation, Graph Edit Distance, ReLU Neural Networks, Theoretical Guarantees, Cheminformatics, Network Anomaly Synthesis, Structured Data Augmentation, Deterministic Generation

63. ❌ Automatic dental superimposition of 3D intraorals and 2D photographs for human identification

作者: Antonio D. Villegas-Yeguas, Xavier Abreau-Freire, Guillermo R-García, Andrea Valsecchi, Teresa Pinho, Daniel Pérez-Mongiovi, Oscar Ibáñez, Oscar Cordón 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05877v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于牙科识别中的计算机视觉和优化技术，涉及3D-2D配准、形态学比较和自动化方法，与大多数大模型和深度学习技术关键词（如LLM、MoE、训练方法、推理优化、智能体等）完全无关。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文将AI应用于法医牙科科学领域，属于AI在科学（特别是生物信息学相关领域）的应用，但并非核心创新点，因此给予5分（有一定关联）。其他关键词均得0分。

!!! tip deepseek-chat TL;DR

该研究提出了一种自动化的3D牙科扫描与2D照片配准方法，用于法医人类识别中的形态学比较，通过计算机视觉和优化技术显著提升了匹配准确性和客观性。

摘要翻译

牙齿比对被视为与指纹和DNA分析同等级别的首要身份识别方法。该方法中一个关键但耗时的步骤是形态学比对。应用该方法的主要挑战之一是缺乏生前医疗记录，特别是在边境移民死亡案例和/或未实行全民医疗保健的国家中。社交媒体上可见牙齿的照片的可得性，促使许多法医牙科专家考虑利用这些照片进行形态学比对。然而，现有前沿方案存在显著局限性，包括缺乏对视点畸变的恰当建模，以及缺少能够量化形态差异的客观方法。
我们的方案采用一种三维（死后扫描）到二维（生前照片）的方法。利用计算机视觉和优化技术，我们通过三维模型复现生前图像以进行形态学比对。我们开发了两种自动化方法：i) 使用配对标志点；ii) 利用牙齿区域的分割来估计相机参数。两种方法在142个样本产生的20,164次交叉比对中均能获得极具前景的结果，平均排名值分别为1.6和1.5。这些结果明显优于自动牙科图表比对方法的筛选能力，同时提供了一个自动、客观且可量化的形态对应评分，并通过叠加图像的可视化，易于解释和分析。

摘要 (Abstract)

Dental comparison is considered a primary identification method, at the level of fingerprints and DNA profiling. One crucial but time-consuming step of this method is the morphological comparison. One of the main challenges to apply this method is the lack of ante-mortem medical records, specially on scenarios such as migrant death at the border and/or in countries where there is no universal healthcare. The availability of photos on social media where teeth are visible has led many odontologists to consider morphological comparison using them. However, state-of-the-art proposals have significant limitations, including the lack of proper modeling of perspective distortion and the absence of objective approaches that quantify morphological differences. Our proposal involves a 3D (post-mortem scan) - 2D (ante-mortem photos) approach. Using computer vision and optimization techniques, we replicate the ante-mortem image with the 3D model to perform the morphological comparison. Two automatic approaches have been developed: i) using paired landmarks and ii) using a segmentation of the teeth region to estimate camera parameters. Both are capable of obtaining very promising results over 20,164 cross comparisons from 142 samples, obtaining mean ranking values of 1.6 and 1.5, respectively. These results clearly outperform filtering capabilities of automatic dental chart comparison approaches, while providing an automatic, objective and quantitative score of the morphological correspondence, easily to interpret and analyze by visualizing superimposed images.

关键词: dental identification, 3D-2D superimposition, morphological comparison, computer vision, optimization techniques, forensic odontology, automated method, camera parameter estimation

64. ❌ JTON: A Token-Efficient JSON Superset with Zen Grid Tabular Encoding for Large Language Models

作者: Gowthamkumar Nandakishore 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05865v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs处理结构化数据时的序列化格式优化，与’Large Language Models’高度相关（10分）。通过减少token使用间接影响上下文窗口效率，与’Context Window Extension’有一定关联（5分）。论文测试了few-shot和zero-shot设置，与’In-context Learning’相关（5分）。其他关键词涉及模型架构、训练方法、推理技术、应用领域等，论文未涉及，均给0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为JTON的JSON超集格式，通过Zen Grid表格编码减少LLMs处理结构化数据时的token开销，在多个领域平均减少28.5%的token使用，同时保持或略微提升模型理解准确度，并实现了100%的语法有效性。

摘要翻译

当大语言模型处理结构化数据时，序列化格式直接影响成本与上下文利用率。标准JSON在表格数组的每一行中重复键名，造成了令牌浪费——这种开销随行数线性增长。本文提出JTON（JSON表格对象表示法），它是一种严格的JSON超集，其核心思想“Zen网格”将列标题提取至单行，并使用分号编码值，在消除冗余的同时保留了JSON的类型系统。在七个现实领域的数据集上，与紧凑JSON相比，Zen网格将令牌数量减少了15-60%（平均28.5%；使用bare_strings时达32%）。在10个大语言模型上进行的理解测试显示，相比JSON获得了净+0.3个百分点的准确率提升：四个模型表现改善，三个保持稳定，三个略有下降。在12个大语言模型上进行的生成测试表明，在少样本和零样本设置下均实现了100%的句法有效性。一个基于Rust/PyO3的参考实现通过SIMD加速解析，速度达到Python json模块的1.4倍。代码、包含683个向量的测试套件及所有实验数据均已公开。

摘要 (Abstract)

When LLMs process structured data, the serialization format directly affects cost and context utilization. Standard JSON wastes tokens repeating key names in every row of a tabular array–overhead that scales linearly with row count. This paper presents JTON (JSON Tabular Object Notation), a strict JSON superset whose main idea, Zen Grid, factors column headers into a single row and encodes values with semicolons, preserving JSON’s type system while cutting redundancy. Across seven real-world domains, Zen Grid reduces token counts by 15-60% versus JSON compact (28.5% average; 32% with bare_strings). Comprehension tests on 10 LLMs show a net +0.3 pp accuracy gain over JSON: four models improve, three hold steady, and three dip slightly. Generation tests on 12 LLMs yield 100% syntactic validity in both few-shot and zero-shot settings. A Rust/PyO3 reference implementation adds SIMD-accelerated parsing at 1.4x the speed of Python’s json module. Code, a 683-vector test suite, and all experimental data are publicly available.

关键词: JTON, JSON superset, Zen Grid, token efficiency, structured data, LLM processing, tabular encoding, context utilization

65. ❌ Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

作者: Fatih Uenal 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05872v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于评估大型语言模型（LLMs）在瑞士金融监管环境中的可靠性和对抗安全性，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。论文涉及评估模型的真实性（如Swiss TruthfulQA基准），与’Hallucination Mitigation OR Factuality OR Truthfulness’有一定关联（5分）。其他关键词（如MoE、SLMs、训练方法、推理技术、代理系统等）在摘要中未提及，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对瑞士金融监管环境，开发了Swiss-Bench 003评估框架，评估了10个前沿LLM在可靠性和对抗安全性方面的表现，发现自评可靠性得分（73-94%）远高于外部评估的安全性得分（20-61%），且所有模型在个人身份信息提取防御方面表现较弱。

摘要翻译

在瑞士金融与监管环境中部署大语言模型（LLM）需要同时获得生产可靠性与对抗安全性的实证依据，而现有以瑞士为中心的评估框架尚未将这两个维度共同纳入操作化评估。本文推出瑞士基准测试003（SBP-003），通过新增D7（自评可靠性代理指标）和D8（对抗安全性）两个维度，将HAAS（瑞士人工智能评估分数）体系从六维扩展至八维。我在四种语言（德语、法语、意大利语、英语）环境下，针对808项瑞士特定任务对十个前沿模型进行评估，任务集包含七个适配瑞士场景的基准测试（瑞士TruthfulQA、瑞士IFEval、瑞士SimpleQA、瑞士NIAH、瑞士PII-Scope、系统提示词泄露测试及瑞士德语理解测试），其设计目标对应瑞士金融市场监管局（FINMA）指引08/2024、修订版《联邦数据保护法》（nDSG）以及OWASP LLM十大风险。自评D7分数（73-94%）显著高于外部判定的D8安全分数（20-61%），但需注意这两个维度采用不可直接比较的评分机制。系统提示词泄露防御率分布在24.8%至88.2%之间，而所有模型的个人身份信息（PII）提取防御能力普遍薄弱（14-42%）。Qwen 3.5 Plus获得最高的自评D7分数（94.4%），GPT-oss 120B则以最低成本模型之姿取得最高D8分数（60.7%）。所有评估均在供应商默认设置下进行零样本测试；D7为模型自评结果，不构成经独立验证的准确率。本文提供概念映射表，阐明各基准测试维度与FINMA模型验证要求、nDSG数据保护义务及OWASP LLM风险类别的对应关系。

摘要 (Abstract)

The deployment of large language models (LLMs) in Swiss financial and regulatory contexts demands empirical evidence of both production reliability and adversarial security, dimensions not jointly operationalized in existing Swiss-focused evaluation frameworks. This paper introduces Swiss-Bench 003 (SBP-003), extending the HAAS (Helvetic AI Assessment Score) from six to eight dimensions by adding D7 (Self-Graded Reliability Proxy) and D8 (Adversarial Security). I evaluate ten frontier models across 808 Swiss-specific items in four languages (German, French, Italian, English), comprising seven Swiss-adapted benchmarks (Swiss TruthfulQA, Swiss IFEval, Swiss SimpleQA, Swiss NIAH, Swiss PII-Scope, System Prompt Leakage, and Swiss German Comprehension) targeting FINMA Guidance 08/2024, the revised Federal Act on Data Protection (nDSG), and OWASP Top 10 for LLMs. Self-graded D7 scores (73-94%) exceed externally judged D8 security scores (20-61%) by a wide margin, though these dimensions use non-comparable scoring regimes. System prompt leakage resistance ranges from 24.8% to 88.2%, while PII extraction defense remains weak (14-42%) across all models. Qwen 3.5 Plus achieves the highest self-graded D7 score (94.4%), while GPT-oss 120B achieves the highest D8 score (60.7%) despite being the lowest-cost model evaluated. All evaluations are zero-shot under provider default settings; D7 is self-graded and does not constitute independently validated accuracy. I provide conceptual mapping tables relating benchmark dimensions to FINMA model validation requirements, nDSG data protection obligations, and OWASP LLM risk categories.

关键词: Large Language Models, LLM evaluation, adversarial security, regulatory compliance, Swiss financial context, benchmark framework, reliability assessment, PII protection

66. ❌ When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

作者: Uljad Berdica, Fernando Acero, Anton Ipsen, Parisa Zehtabi, Michael Cashmore, Manuela Veloso 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05859v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs在上下文多臂老虎机（CMABs）决策问题中的应用，提出LLMP-UCB算法并比较LLM与轻量级数值老虎机的性能成本权衡，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。论文未涉及其他关键词的具体技术（如MoE、SLMs、训练方法、推理优化、代理系统等）或科学领域应用，这些关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文研究了在包含文本和数值信息的上下文多臂老虎机（CMABs）决策问题中，何时需要使用大型语言模型（LLMs）进行推理，并提出了一种基于嵌入的几何诊断方法，结果表明轻量级数值老虎机在成本效益上常优于LLM解决方案。

摘要翻译

本研究针对非片段式序列决策问题中的上下文多臂赌博机（Contextual Multi-Armed Bandits, CMABs）展开探讨，此类问题的上下文同时包含文本与数值信息（例如推荐系统、动态投资组合调整、优惠选择等金融领域常见场景）。尽管大语言模型（Large Language Models, LLMs）在此类场景中的应用日益增多，但在每个决策步骤均使用LLM进行推理的计算成本高昂，且难以获得不确定性估计。为解决这一问题，我们提出了LLMP-UCB算法，这是一种通过重复推理从LLM中获取不确定性估计的赌博机算法。然而，实验表明，基于文本嵌入（稠密嵌入或套娃嵌入）运行的轻量级数值赌博机，能以极低的成本达到甚至超越基于LLM解决方案的准确度。我们进一步证明，嵌入维度是调节探索-利用平衡的有效实践手段，可在不增加提示复杂度的前提下实现成本与性能的权衡。最后，为指导实践者，我们提出一种基于臂嵌入的几何诊断方法，以判断何时应采用LLM驱动推理，何时宜选用轻量级数值赌博机。本研究结果为构建具有成本效益且具备不确定性感知的决策系统提供了原则性部署框架，在金融服务的人工智能应用场景中具有广泛适用性。

摘要 (Abstract)

We study Contextual Multi-Armed Bandits (CMABs) for non-episodic sequential decision making problems where the context includes both textual and numerical information (e.g., recommendation systems, dynamic portfolio adjustments, offer selection; all frequent problems in finance). While Large Language Models (LLMs) are increasingly applied to these settings, utilizing LLMs for reasoning at every decision step is computationally expensive and uncertainty estimates are difficult to obtain. To address this, we introduce LLMP-UCB, a bandit algorithm that derives uncertainty estimates from LLMs via repeated inference. However, our experiments demonstrate that lightweight numerical bandits operating on text embeddings (dense or Matryoshka) match or exceed the accuracy of LLM-based solutions at a fraction of their cost. We further show that embedding dimensionality is a practical lever on the exploration-exploitation balance, enabling cost–performance tradeoffs without prompt complexity. Finally, to guide practitioners, we propose a geometric diagnostic based on the arms’ embedding to decide when to use LLM-driven reasoning versus a lightweight numerical bandit. Our results provide a principled deployment framework for cost-effective, uncertainty-aware decision systems with broad applicability across AI use cases in financial services.

关键词: Contextual Multi-Armed Bandits, Large Language Models, LLMP-UCB, text embeddings, exploration-exploitation balance, cost-performance tradeoffs, financial services, uncertainty-aware decision systems

67. ❌ Neural Network Pruning via QUBO Optimization

作者: Osama Orabi, Artur Zagitov, Hadi Salloum, Viktor A. Lobachev, Kasymkhan Khubiev, Yaroslav Kholodov 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05856v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于神经网络剪枝（一种模型压缩技术），与’Quantization OR Model Compression OR Low-bit Weights’高度相关（10分），因为剪枝是模型压缩的核心方法之一。与’Mixture of Experts OR MoE OR Sparse Models’有一定关联（8分），因为剪枝旨在创建稀疏模型，但论文未明确讨论MoE架构。其他关键词主要涉及大语言模型（LLM）的特定技术、训练方法、推理优化或科学应用，而本文研究通用的神经网络剪枝优化方法，未针对LLM或特定科学领域，因此相关性为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于二次无约束二进制优化（QUBO）的混合框架，用于解决神经网络剪枝中的组合优化问题，并通过实验证明该方法在图像去噪任务上优于传统的贪婪剪枝和L1-based QUBO方法。

摘要翻译

神经网络剪枝可被表述为一个组合优化问题，但现有方法大多依赖贪婪启发式策略，忽略了滤波器间复杂的相互作用。二次无约束二进制优化（Quadratic Unconstrained Binary Optimization, QUBO）等形式化优化方法提供了一种原则性的替代方案，但由于以往基于L1范数等指标的优化目标过于简化，其性能迄今仍不理想。本研究提出一个统一的混合QUBO框架，将启发式重要性评估与全局组合优化相融合。该框架在线性项中整合了梯度感知的敏感度指标——具体为一阶泰勒信息和二阶费舍尔信息，同时在二次项中利用数据驱动的激活相似性。这使得QUBO目标函数能够同时捕捉单个滤波器的重要性与滤波器间的功能冗余性。我们进一步引入动态容量驱动搜索机制，在不扭曲优化空间的前提下严格满足目标稀疏度要求。最后，我们采用包含张量链（Tensor-Train, TT）优化阶段的两阶段流程——该阶段采用无需梯度的优化器，直接依据真实评估指标对QUBO导出的解进行微调。在SIDD图像去噪数据集上的实验表明，所提出的混合QUBO方法显著优于贪婪泰勒剪枝和传统基于L1范数的QUBO方法，且TT优化阶段在适当的组合规模下能带来持续的性能提升。这凸显了混合组合优化框架在实现鲁棒、可扩展且可解释的神经网络压缩方面的潜力。

摘要 (Abstract)

Neural network pruning can be formulated as a combinatorial optimization problem, yet most existing approaches rely on greedy heuristics that ignore complex interactions between filters. Formal optimization methods such as Quadratic Unconstrained Binary Optimization (QUBO) provide a principled alternative but have so far underperformed due to oversimplified objective formulations based on metrics like the L1-norm. In this work, we propose a unified Hybrid QUBO framework that bridges heuristic importance estimation with global combinatorial optimization. Our formulation integrates gradient-aware sensitivity metrics - specifically first-order Taylor and second-order Fisher information - into the linear term, while utilizing data-driven activation similarity in the quadratic term. This allows the QUBO objective to jointly capture individual filter relevance and inter-filter functional redundancy. We further introduce a dynamic capacity-driven search to strictly enforce target sparsity without distorting the optimization landscape. Finally, we employ a two-stage pipeline featuring a Tensor-Train (TT) Refinement stage - a gradient-free optimizer that fine-tunes the QUBO-derived solution directly against the true evaluation metric. Experiments on the SIDD image denoising dataset demonstrate that the proposed Hybrid QUBO significantly outperforms both greedy Taylor pruning and traditional L1-based QUBO, with TT Refinement providing further consistent gains at appropriate combinatorial scales. This highlights the potential of hybrid combinatorial formulations for robust, scalable, and interpretable neural network compression.

关键词: Neural Network Pruning, QUBO Optimization, Model Compression, Sparse Models, Combinatorial Optimization, Taylor Pruning, Fisher Information, Tensor-Train Refinement

68. ❌ Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring

作者: Xiangyue Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05854v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是开发一个基于LLM的自主研究代理框架，用于24/7深度学习实验。高度相关的关键词包括：1) ‘Large Language Models’（论文明确使用LLM作为核心组件）；2) ‘LLM Agents’（框架本质是LLM驱动的自主代理）；3) ‘Tool Use’（代理使用工具进行实验操作）；4) ‘Multi-agent Systems’（采用领导者-工作者多代理架构。其他关键词如MoE、SFT、RAG等未在摘要中提及，与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文提出了一个名为Deep Researcher Agent的自主框架，利用LLM代理实现24/7的深度学习实验自动化，通过零成本监控、恒定大小内存和最小工具集架构，在30多天的部署中完成了500多个实验周期，并将LLM成本降至平均每24小时0.08美元。

摘要翻译

我们提出\textbf{深度研究智能体}，这是一个开源框架，旨在使大语言模型（LLM）智能体能够全天候自主进行深度学习实验。与现有专注于论文写作或代码生成的人工智能研究助手不同，我们的系统处理完整的实验生命周期：假设形成、代码实现、训练执行、结果分析以及迭代优化。该框架引入了三项关键创新：(1) \textbf{零成本监控}——一种监控范式，在模型训练期间通过仅依赖进程级检查和日志文件读取，实现零LLM API成本；(2) \textbf{双层恒定大小记忆}——一种记忆架构，无论运行时间长短，其容量上限约为5K字符，从而避免了长期运行智能体中常见的上下文无限增长问题；(3) \textbf{最小工具集领导者-工作者架构}——一种多智能体设计，其中每个工作者智能体仅配备3至5个工具，将每次调用的令牌开销降低了高达73%。在持续30天以上的部署中，该框架在四个并行研究项目中自主完成了500多个实验周期，并通过200多次自动化实验在其中一个项目中实现了基线指标52%的提升——每个24小时周期的平均LLM成本仅为0.08美元。代码发布于https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7。

摘要 (Abstract)

We present \textbf{Deep Researcher Agent}, an open-source framework that enables large language model (LLM) agents to autonomously conduct deep learning experiments around the clock. Unlike existing AI research assistants that focus on paper writing or code generation, our system addresses the full experiment lifecycle: hypothesis formation, code implementation, training execution, result analysis, and iterative refinement. The framework introduces three key innovations: (1) \textbf{Zero-Cost Monitoring} – a monitoring paradigm that incurs zero LLM API costs during model training by relying solely on process-level checks and log file reads; (2) \textbf{Two-Tier Constant-Size Memory} – a memory architecture capped at $\sim$5K characters regardless of runtime duration, preventing the unbounded context growth that plagues long-running agents; and (3) \textbf{Minimal-Toolset Leader-Worker Architecture} – a multi-agent design where each worker agent is equipped with only 3–5 tools, reducing per-call token overhead by up to 73%. In sustained deployments spanning 30+ days, the framework autonomously completed 500+ experiment cycles across four concurrent research projects, achieving a 52% improvement over baseline metrics in one project through 200+ automated experiments – all at an average LLM cost of $0.08 per 24-hour cycle. Code is available at https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7.

关键词: LLM agents, autonomous experimentation, deep learning experiments, zero-cost monitoring, multi-agent systems, tool use, 24/7 automation, cost-efficient AI research

69. ❌ Evaluating Learner Representations for Differentiation Prior to Instructional Outcomes

作者: Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05848v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究教育AI系统中的学习者表征评估方法，提出’独特性’指标来评估表征是否保留学习者之间的差异。虽然涉及AI在教育领域的应用，但论文内容聚焦于表征评估的通用方法学，并未具体涉及大模型、深度学习技术原理、科学领域应用或任何评分关键词中的具体技术。所有关键词均与大模型技术、训练方法、推理优化、应用领域等具体技术相关，而本文属于教育AI的方法论研究，与这些具体技术无直接关联。

!!! tip deepseek-chat TL;DR

本研究提出了一种名为'独特性'的评估指标，用于在缺乏教学结果的情况下评估教育AI系统中的学习者表征是否能够有效区分不同学生，并通过实验证明学习者级别的表征比交互级别的表征具有更好的区分能力。

摘要翻译

学习者表征在教育人工智能系统中扮演着核心角色，但在教学结果不可用或高度依赖情境时，这些表征是否保留了学生之间有意义的差异往往并不明确。本研究探讨了如何基于共享比较规则下学习者表征能否保持学习者间的区分度来评估这些表征。我们引入了区分性这一表征层面的度量标准，它通过成对距离评估每个学习者与同组其他学习者的差异，无需依赖聚类、标签或特定任务的评估。利用在线学习环境中通过对话式人工智能代理收集的学生自创问题，我们将基于单个问题构建的表征与聚合学生长期互动模式的表征进行了比较。结果表明，与交互层面的表征相比，学习者层面的表征具有更高的分离度、更强的聚类结构以及更可靠的成对判别能力。这些发现证明，学习者表征的评估可以独立于教学结果进行，并为实际部署前提供了一种实用标准——以区分性作为诊断指标，用于评估某一表征是否支持差异化建模或个性化教学。

摘要 (Abstract)

Learner representations play a central role in educational AI systems, yet it is often unclear whether they preserve meaningful differences between students when instructional outcomes are unavailable or highly context-dependent. This work examines how to evaluate learner representations based on whether they retain separation between learners under a shared comparison rule. We introduce distinctiveness, a representation-level measure that evaluates how each learner differs from others in the cohort using pairwise distances, without requiring clustering, labels, or task-specific evaluation. Using student-authored questions collected through a conversational AI agent in an online learning environment, we compare representations based on individual questions with representations that aggregate patterns across a student’s interactions over time. Results show that learner-level representations yield higher separation, stronger clustering structure, and more reliable pairwise discrimination than interaction-level representations. These findings demonstrate that learner representations can be evaluated independently of instructional outcomes and provide a practical pre-deployment criterion using distinctiveness as a diagnostic metric for assessing whether a representation supports differentiated modeling or personalization.

关键词: learner representations, educational AI, distinctiveness, evaluation metric, differentiation, personalization, student modeling

70. ❌ EEG-MFTNet: An Enhanced EEGNet Architecture with Multi-Scale Temporal Convolutions and Transformer Fusion for Cross-Session Motor Imagery Decoding

作者: Panagiotis Andrikopoulos, Siamak Mehrkanoon 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05843v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文EEG-MFTNet专注于脑机接口（BCI）中的运动想象（MI）解码，使用基于EEGNet的深度学习架构，结合多尺度时间卷积和Transformer编码器。所有关键词均与大模型（LLMs）或深度学习技术原理直接相关，但论文未涉及任何大模型技术、训练方法、推理优化、对齐、代理系统等。仅最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联，因为BCI属于AI在科学（神经科学/生物医学）领域的应用，但论文未明确提及生物信息学或化学信息学，且创新点在于特定深度学习架构而非通用大模型技术，因此给予5分（有一定关联）。其他关键词完全无关，得0分。

!!! tip deepseek-chat TL;DR

该研究针对脑电图（EEG）信号中运动想象解码的噪声和跨会话变异性挑战，提出了一种增强的EEG-MFTNet模型，结合多尺度时间卷积和Transformer融合，在SHU数据集上实现了58.9%的平均分类准确率，优于基线模型，并具有低计算复杂度和推理延迟，适用于实时脑机接口应用。

摘要翻译

脑机接口（Brain-Computer Interfaces, BCIs）实现了大脑与外部设备之间的直接通信，为运动功能障碍患者提供了关键支持。然而，由于噪声和跨会话变异性，从脑电图（Electroencephalography, EEG）中准确解码运动想象（Motor Imagery, MI）仍然具有挑战性。本研究提出了EEG-MFTNet，这是一种基于EEGNet架构的新型深度学习模型，通过多尺度时间卷积和Transformer编码器流进行增强。这些组件旨在捕捉EEG信号中的短期和长期时间依赖性。该模型在SHU数据集上采用被试内跨会话设置进行评估，其性能优于包括EEGNet及其近期衍生模型在内的基线模型。EEG-MFTNet实现了58.9%的平均分类准确率，同时保持了较低的计算复杂度和推理延迟。结果突显了该模型在实时BCI应用中的潜力，并强调了架构创新对于改进MI解码的重要性。这项工作有助于开发更鲁棒和自适应的BCI系统，对辅助技术和神经康复领域具有积极意义。

摘要 (Abstract)

Brain-computer interfaces (BCIs) enable direct communication between the brain and external devices, providing critical support for individuals with motor impairments. However, accurate motor imagery (MI) decoding from electroencephalography (EEG) remains challenging due to noise and cross-session variability. This study introduces EEG-MFTNet, a novel deep learning model based on the EEGNet architecture, enhanced with multi-scale temporal convolutions and a Transformer encoder stream. These components are designed to capture both short and long-range temporal dependencies in EEG signals. The model is evaluated on the SHU dataset using a subject-dependent cross-session setup, outperforming baseline models, including EEGNet and its recent derivatives. EEG-MFTNet achieves an average classification accuracy of 58.9% while maintaining low computational complexity and inference latency. The results highlight the model’s potential for real-time BCI applications and underscore the importance of architectural innovations in improving MI decoding. This work contributes to the development of more robust and adaptive BCI systems, with implications for assistive technologies and neurorehabilitation.

关键词: EEG-MFTNet, motor imagery decoding, brain-computer interface, multi-scale temporal convolutions, Transformer encoder, cross-session variability, EEGNet architecture, real-time BCI applications

作者: Hannah Sansford, Derek H. C. Law, Wei Liu, Abhishek Tripathi, Niresh Agarwal, Gerrit J. J. van den Burg 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05839v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM在代码生成中的应用，特别是前端网页开发领域，因此与’Large Language Models’高度相关（10分）。论文明确使用LoRA进行参数高效微调，与’PEFT OR LoRA’高度相关（10分）。论文提出基于视觉语言模型的自动化批评-循环框架进行迭代精炼，涉及自我改进机制，与’Self-Correction OR Self-Improvement’有一定关联（8分）。论文涉及监督微调以内部化批评反馈，与’Post-training OR Supervised Fine-tuning’有一定关联（5分）。其他关键词如MoE、SLMs、Scaling Laws、RAG、RLHF等与论文内容无直接关联，均给0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于视觉语言模型的自动化批评-循环框架，通过迭代精炼生成的前端代码，显著提高了代码生成质量，并使用LoRA微调将部分改进内部化到代码生成LLM中。

摘要翻译

基于大语言模型的代码生成通常依赖多阶段人工介入式优化，该方法虽有效但成本高昂——尤其在像前端网页开发这类解决方案质量取决于渲染视觉输出的领域。我们提出了一种全自动的"评审员在环"框架，其中视觉语言模型充当视觉评审员，对渲染后的网页提供结构化反馈，以指导生成代码的迭代优化。在WebDev Arena数据集中的真实用户需求测试中，该方法持续提升解决方案质量，经过三轮优化周期后性能最高提升达17.8%。随后，我们采用LoRA进行参数高效微调，以探究代码生成大模型能否内化评审员提供的改进能力。微调后实现了最佳"评审员在环"方案25%的性能增益，且未显著增加令牌消耗。我们的研究表明：基于视觉语言模型的自动化评审机制能为前端代码生成带来显著优于单次大模型推理的解决方案质量，同时凸显了迭代优化对于网页开发这类复杂视觉输出任务的重要性。

摘要 (Abstract)

Code generation with large language models often relies on multi-stage human-in-the-loop refinement, which is effective but very costly - particularly in domains such as frontend web development where the solution quality depends on rendered visual output. We present a fully automated critic-in-the-loop framework in which a vision-language model serves as a visual critic that provides structured feedback on rendered webpages to guide iterative refinement of generated code. Across real-world user requests from the WebDev Arena dataset, this approach yields consistent improvements in solution quality, achieving up to 17.8% increase in performance over three refinement cycles. Next, we investigate parameter-efficient fine-tuning using LoRA to understand whether the improvements provided by the critic can be internalized by the code-generating LLM. Fine-tuning achieves 25% of the gains from the best critic-in-the-loop solution without a significant increase in token counts. Our findings indicate that automated, VLM-based critique of frontend code generation leads to significantly higher quality solutions than can be achieved through a single LLM inference pass, and highlight the importance of iterative refinement for the complex visual outputs associated with web development.

关键词: code generation, large language models, vision-language model, iterative refinement, frontend web development, LoRA fine-tuning, automated critic, visual feedback

72. ❌ Reciprocal Trust and Distrust in Artificial Intelligence Systems: The Hard Problem of Regulation

作者: Martino Maggetti 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05826v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文讨论AI系统的信任、监管和治理问题，属于AI伦理、政策和治理领域，不涉及大模型技术原理、训练方法、推理优化、应用部署等具体技术内容，与所有技术关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文探讨了AI系统与人类之间的双向信任关系及其对AI监管的挑战，认为AI应被视为具有某种能动性的实体，并分析了这种关系给监管带来的关键困境。

摘要翻译

政策制定者、科学家与公众正日益面临关于人工智能系统监管的棘手问题。其中一个关键共性议题在于人工智能是否值得信任，以及哪些因素能使其在利益相关者和用户面前更具可信度。这至关重要，因为人工智能系统的可信度既是民主治理的基石，也是人工智能开发与部署的基础。本文通过论证人工智能系统至少应在某种程度上被视为能够行使某种主体性的技术制品，从而使其能够与人类建立信任或不信任关系，以此推进相关讨论。文章进一步探讨了这种双向信任动态对人工智能系统监管者的影响。最后，本文指出了这些动态关系为未来人工智能监管与治理带来的核心矛盾与未解困境。

摘要 (Abstract)

Policy makers, scientists, and the public are increasingly confronted with thorny questions about the regulation of artificial intelligence (AI) systems. A key common thread concerns whether AI can be trusted and the factors that can make it more trustworthy in front of stakeholders and users. This is indeed crucial, as the trustworthiness of AI systems is fundamental for both democratic governance and for the development and deployment of AI. This article advances the discussion by arguing that AI systems should also be recognized, as least to some extent, as artifacts capable of exercising a form of agency, thereby enabling them to engage in relationships of trust or distrust with humans. It further examines the implications of these reciprocal trust dynamics for regulators tasked with overseeing AI systems. The article concludes by identifying key tensions and unresolved dilemmas that these dynamics pose for the future of AI regulation and governance.

关键词: artificial intelligence, trust, distrust, regulation, governance, agency, policy, stakeholders

73. ❌ “OK Aura, Be Fair With Me”: Demographics-Agnostic Training for Bias Mitigation in Wake-up Word Detection

作者: Fernando López, Paula Delgado-Santos, Pablo Gómez, David Solans, Jordi Luque 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05830v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究语音唤醒词检测中的公平性问题，采用人口统计学无关的训练技术（数据增强和知识蒸馏）来减少性别、年龄和口音偏见。所有评分关键词均与大语言模型（LLM）、深度学习技术原理或AI在科学领域的应用直接相关，而本文专注于语音处理和公平性机器学习，未涉及LLM、MoE、缩放定律、训练技术、推理、代理、压缩、幻觉缓解、可解释性、世界模型、模型合并、上下文学习或科学AI等主题，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

本研究通过人口统计学无关的训练技术（数据增强和知识蒸馏）有效减少了语音唤醒词检测中的性别、年龄和口音偏见，显著提升了跨说话人群体的公平性。

摘要翻译

基于语音的交互界面已得到广泛应用，然而，由于持续存在的人口统计偏差，在不同说话人群体中实现公平的唤醒词检测仍是一个关键挑战。本研究评估了与人口统计信息无关的训练技术在减轻不同性别、年龄和口音说话人之间性能差异方面的有效性。我们使用OK Aura数据库进行实验，采用一种排除人口统计标签的训练方法，这些标签仅用于评估目的。我们探索了（i）数据增强技术以提升模型泛化能力，以及（ii）预训练基础语音模型的知识蒸馏。实验结果表明，这些与人口统计无关的训练技术显著降低了人口统计偏差，从而在不同说话人群体间实现了更公平的性能表现。具体而言，与基线相比，其中一项评估技术在性别维度上实现了39.94%的预测差异减少，在年龄维度上减少了83.65%，在口音维度上减少了40.48%。本研究强调了标签无关方法在促进唤醒词检测公平性方面的有效性。

摘要 (Abstract)

Voice-based interfaces are widely used; however, achieving fair Wake-up Word detection across diverse speaker populations remains a critical challenge due to persistent demographic biases. This study evaluates the effectiveness of demographics-agnostic training techniques in mitigating performance disparities among speakers of varying sex, age, and accent. We utilize the OK Aura database for our experiments, employing a training methodology that excludes demographic labels, which are reserved for evaluation purposes. We explore (i) data augmentation techniques to enhance model generalization and (ii) knowledge distillation of pre-trained foundational speech models. The experimental results indicate that these demographics-agnostic training techniques markedly reduce demographic bias, leading to a more equitable performance profile across different speaker groups. Specifically, one of the evaluated techniques achieves a Predictive Disparity reduction of 39.94% for sex, 83.65% for age, and 40.48% for accent when compared to the baseline. This study highlights the effectiveness of label-agnostic methodologies in fostering fairness in Wake-up Word detection.

关键词: Wake-up Word Detection, Demographic Bias, Fairness, Data Augmentation, Knowledge Distillation, Demographics-agnostic Training, Predictive Disparity, Speech Models

74. ❌ Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

作者: Shuai Zhen, Yanhua Yu, Ruopei Guo, Nan Cheng, Yang Deng 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05808v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心研究LLM agents在复杂交互决策任务中的效率问题，提出STEP-HRL框架，因此与’Large Language Models’和’LLM Agents’高度相关（10分）。论文在ScienceWorld和ALFWorld基准测试，属于科学AI应用，与’AI for Science’有一定关联（5分）。其他关键词如MoE、SFT、RAG等未在摘要中提及，完全无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对LLM agents依赖长交互历史导致计算成本高和可扩展性有限的问题，提出了STEP-HRL分层强化学习框架，通过步级学习和局部进度模块显著提升了性能、泛化能力并减少了token使用。

摘要翻译

大型语言模型（LLM）智能体已在复杂交互式决策任务中展现出强大能力。然而，现有LLM智能体通常依赖日益增长的交互历史记录，导致计算成本高昂且可扩展性受限。本文提出STEP-HRL——一种分层强化学习（HRL）框架，该框架通过仅以单步转移而非完整交互历史为条件，实现步进级学习。STEP-HRL采用分层结构组织任务，利用已完成的子任务表征整体任务的全局进展。通过引入局部进展模块，该框架还能迭代且选择性地总结每个子任务内的交互历史，生成紧凑的局部进展摘要。这些组件共同为高层策略与低层策略生成增强的步进级转移状态。在ScienceWorld和ALFWorld基准测试上的实验结果表明，STEP-HRL在性能与泛化能力方面均显著超越基线方法，同时有效降低了令牌使用量。代码已发布于https://github.com/TonyStark042/STEP-HRL。

摘要 (Abstract)

Large language model (LLM) agents have demonstrated strong capabilities in complex interactive decision-making tasks. However, existing LLM agents typically rely on increasingly long interaction histories, resulting in high computational cost and limited scalability. In this paper, we propose STEP-HRL, a hierarchical reinforcement learning (HRL) framework that enables step-level learning by conditioning only on single-step transitions rather than full interaction histories. STEP-HRL structures tasks hierarchically, using completed subtasks to represent global progress of overall task. By introducing a local progress module, it also iteratively and selectively summarizes interaction history within each subtask to produce a compact summary of local progress. Together, these components yield augmented step-level transitions for both high-level and low-level policies. Experimental results on ScienceWorld and ALFWorld benchmarks consistently demonstrate that STEP-HRL substantially outperforms baselines in terms of performance and generalization while reducing token usage. Our code is available at https://github.com/TonyStark042/STEP-HRL.

关键词: LLM agents, hierarchical reinforcement learning, step-level learning, interaction history, computational cost, ScienceWorld, ALFWorld, token usage

作者: Silja Keßler, Miriam Bautista-Salinero, Claudio Tennie, Charley M. Wu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05777v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究社会学习与文化传播机制，使用强化学习模拟个体如何通过观察他人行为（而非推断心理状态）来学习环境表示。论文主题属于认知科学、文化演化和社会学习理论，与提供的大模型、深度学习技术原理及应用关键词完全无关。所有关键词均涉及大模型架构、训练方法、推理技术、应用领域等具体技术方向，而本文未涉及任何大模型或深度学习技术内容，因此所有关键词相关度评分为0。

!!! tip deepseek-chat TL;DR

该研究通过强化学习模拟发现，即使不进行心理状态推断，简单的社会学习机制（如观察他人行为）也能使学习者的环境表示向专家收敛，揭示了文化传播可能源于利用非心理化过程的简单社会线索。

摘要翻译

在认知能力有限的情况下，人们如何通过他人获取关于环境的丰富而灵活的知识？传统观点认为人类依赖于计算成本高昂的心理化过程，例如推断他人信念。相比之下，文化演化理论强调行为传递可以通过简单的社会线索实现。通过强化学习模拟，我们展示了最低限度的社会学习如何间接传递更高层次的表征。我们模拟了一个在可重构环境中寻找奖励的初始智能体，其通过独立学习或观察专家行为进行学习——关键之处在于，该学习者并不推断心理状态，而是基于观察到的行为启发式地选择动作或增强价值表征。结果表明，这些线索会引导学习者的经验，使其表征逐渐趋近于专家的表征。基于模型的学习者从社会接触中获益最大，表现出更快的学习速度和更接近专家的表征。这些发现揭示了文化传递如何通过利用非社会性学习机制的简单非心理化过程得以实现。

摘要 (Abstract)

How do people acquire rich, flexible knowledge about their environment from others despite limited cognitive capacity? Humans are often thought to rely on computationally costly mentalizing, such as inferring others’ beliefs. In contrast, cultural evolution emphasizes that behavioral transmission can be supported by simple social cues. Using reinforcement learning simulations, we show how minimal social learning can indirectly transmit higher-level representations. We simulate a naïve agent searching for rewards in a reconfigurable environment, learning either alone or by observing an expert - crucially, without inferring mental states. Instead, the learner heuristically selects actions or boosts value representations based on observed actions. Our results demonstrate that these cues bias the learner’s experience, causing its representation to converge toward the expert’s. Model-based learners benefit most from social exposure, showing faster learning and more expert-like representations. These findings show how cultural transmission can arise from simple, non-mentalizing processes exploiting asocial learning mechanisms.

关键词: social learning, cultural transmission, reinforcement learning, model-based representations, non-mentalizing processes, behavioral transmission, expert-like representations, asocial learning mechanisms

76. ❌ What Models Know, How Well They Know It: Knowledge-Weighted Fine-Tuning for Learning When to Say “I Don’t Know”

作者: Joosung Lee, Hwiyeol Jo, Donghyeon Ko, Kyubyung Chae, Cheonbok Park, Jeonghoon Kim 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05779v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs的幻觉问题，提出基于知识加权的微调方法，让模型学会在缺乏知识时说"我不知道"。高度相关的关键词包括：LLMs（论文明确研究LLMs）、Supervised Fine-tuning（提出新的微调方法）、Hallucination Mitigation（直接解决幻觉问题）。中等相关的关键词包括：Pre-training（涉及预训练与微调的知识对齐问题）、Instruction Tuning/Alignment（涉及模型响应校准）、Self-Correction（涉及模型自我评估不确定性）、Explainable AI（涉及知识评分和不确定性评估）。其他关键词如MoE、SLMs、RAG、Quantization等与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型因预训练与微调知识不匹配导致的幻觉问题，提出了一种基于实例级知识评分的加权微调方法，使模型能够明确表达不确定性并在缺乏知识时说"我不知道"，同时保持已知问题的回答准确性。

摘要翻译

尽管大型语言模型（LLMs）在处理多样化用户查询时展现出强大能力，但其仍存在幻觉问题，这通常源于预训练与微调阶段之间的知识错位。为解决此类错位，我们通过多重采样推理可靠地估计出细粒度的实例级知识得分。利用该知识得分，我们依据模型已有知识对学习信号进行缩放，同时鼓励模型对超出范围的问题给出明确的“我不知道”回应。实验结果表明，该方法使模型在缺乏相关知识时能够明确表达不确定性，同时对其能够回答的问题保持准确性。此外，我们提出了针对不确定性的评估指标，证明准确区分已知与未知实例能持续提升模型性能。

摘要 (Abstract)

While large language models (LLMs) demonstrate strong capabilities across diverse user queries, they still suffer from hallucinations, often arising from knowledge misalignment between pre-training and fine-tuning. To address this misalignment, we reliably estimate a fine-grained, instance-level knowledge score via multi-sampled inference. Using the knowledge score, we scale the learning signal according to the model’s existing knowledge, while encouraging explicit “I don’t know” responses for out-of-scope queries. Experimental results show that this approach allows the model to explicitly express uncertainty when it lacks knowledge, while maintaining accuracy on questions it can answer. Furthermore, we propose evaluation metrics for uncertainty, showing that accurate discrimination between known and unknown instances consistently improves performance.

关键词: large language models, hallucination mitigation, fine-tuning, knowledge alignment, uncertainty estimation, I don’t know responses, instance-level knowledge score, multi-sampled inference

77. ❌ CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

作者: Tim Lukas Adam, Phongsakon Mark Konrad, Riccardo Terrenzi, Florian Girardo Lukas, Rahime Yilmaz, Krzysztof Sierszecki, Serkan Ayvaz 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05755v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是评估LLMs在云原生软件架构知识方面的能力，建立了CAKE基准测试。与’Large Language Models’高度相关（10分），因为全文围绕LLMs评估展开。与’Small Language Models’有一定关联（5分），因为评估了0.5B-70B参数模型，包括小模型。与’Chain of Thought’有一定关联（5分），因为提到了推理增强（+think）改进自由回答质量。与’Tool Use’有一定关联（5分），因为提到了工具增强（+tool）对小模型性能的影响。其他关键词如MoE、Scaling Laws、Alignment等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了CAKE基准测试来评估大型语言模型对云原生软件架构知识的理解，发现多项选择题准确率在3B参数以上趋于稳定，而自由回答分数持续区分模型能力，且推理增强能提升回答质量。

摘要翻译

在当今的软件架构中，大型语言模型（LLMs）扮演着软件架构协同驾驶员的角色。然而，目前尚缺乏评估大型语言模型对云原生软件架构实际理解能力的基准。为此，我们提出了一个名为CAKE的基准测试，该测试包含188个经过专家验证的问题，涵盖布鲁姆修订分类法的四个认知层次——记忆、分析、设计与实现——以及五个云原生主题。评估在四个LLM家族的22种模型配置（参数规模0.5B至70B）上进行，针对选择题（MCQs）采用三轮多数投票法，对自由回答题（FR）则使用LLM-as-a-judge评分机制。基于此次评估，我们得出了四项重要发现。首先，选择题的准确率在参数规模超过3B后趋于稳定，最佳模型达到99.2%。其次，自由回答题的得分在所有认知层次上均呈现稳定增长趋势。第三，两种题型捕捉了知识的不同侧面：选择题准确率接近天花板效应时，自由回答题仍能有效区分模型能力。最后，推理增强（+think）提升了自由回答的质量，而工具增强（+tool）会降低小规模模型的性能。这些结果表明，评估形式从根本上影响了我们衡量LLMs架构知识的方式。

摘要 (Abstract)

In today’s software architecture, large language models (LLMs) serve as software architecture co-pilots. However, no benchmark currently exists to evaluate large language models’ actual understanding of cloud-native software architecture. For this reason we present a benchmark called CAKE, which consists of 188 expert-validated questions covering four cognitive levels of Bloom’s revised taxonomy – recall, analyze, design, and implement – and five cloud-native topics. Evaluation is conducted on 22 model configurations (0.5B–70B parameters) across four LLM families, using three-run majority voting for multiple-choice questions (MCQs) and LLM-as-a-judge scoring for free-responses (FR). Based on this evaluation, four notable findings were identified. First, MCQ accuracy plateaus above 3B parameters, with the best model reaching 99.2%. Second, free-response scores scale steadily across all cognitive levels. Third, the two formats capture different facets of knowledge, as the MCQ accuracy approaches a ceiling while free-responses continue to differentiate models. Finally, reasoning augmentation (+think) improves free-response quality, while tool augmentation (+tool) degrades performance for small models. These results suggest that the evaluation format fundamentally shapes how we measure architectural knowledge in LLMs.

关键词: Large Language Models, Cloud-native Architecture, Benchmark Evaluation, Knowledge Assessment, Model Scaling, Reasoning Augmentation, Tool Augmentation, Free-response Scoring

78. ❌ On the Robustness of Diffusion-Based Image Compression to Bit-Flip Errors

作者: Amit Vaisman, Gal Pomerants, Raz Lapid 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05743v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究扩散模型在图像压缩中的抗比特翻转鲁棒性，属于计算机视觉和图像处理领域，与所有评分关键词（均聚焦于大语言模型、深度学习技术原理及其在科学领域的应用）完全无关。

!!! tip deepseek-chat TL;DR

该论文研究了扩散模型在图像压缩中对比特翻转错误的鲁棒性，发现基于反向信道编码的扩散压缩器比传统和学习的编解码器更稳健，并提出了一种改进的Turbo-DDCM变体，在保持率-失真-感知权衡的同时显著提升了鲁棒性。

摘要翻译

现代图像压缩方法通常针对码率-失真-感知权衡进行优化，而其对比特级损坏的鲁棒性却鲜少被检验。我们发现，基于反向信道编码范式的扩散压缩器相较于经典及学习型编解码器，对比特翻转具有显著更强的鲁棒性。我们进一步提出了一种更鲁棒的Turbo-DDCM变体，在仅对码率-失真-感知权衡产生极小影响的前提下，显著提升了鲁棒性。我们的研究结果表明，基于反向信道编码的压缩能够产生更具韧性的压缩表示，在高噪声环境中可能降低对纠错编码的依赖。

摘要 (Abstract)

Modern image compression methods are typically optimized for the rate–distortion–perception trade-off, whereas their robustness to bit-level corruption is rarely examined. We show that diffusion-based compressors built on the Reverse Channel Coding (RCC) paradigm are substantially more robust to bit flips than classical and learned codecs. We further introduce a more robust variant of Turbo-DDCM that significantly improves robustness while only minimally affecting the rate–distortion–perception trade-off. Our findings suggest that RCC-based compression can yield more resilient compressed representations, potentially reducing reliance on error-correcting codes in highly noisy environments.

关键词: diffusion-based compression, bit-flip errors, robustness, Reverse Channel Coding, Turbo-DDCM, rate-distortion-perception trade-off, error-correcting codes

79. ❌ Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

作者: Jiaren Peng, Zeqin Li, Chang You, Yan Wang, Hanlin Sun, Xuan Tian, Shuqiao Zhang, Junyi Liu, Jianguo Zhao, Renyang Liu, Haoran Ou, Yuqiang Sun, Jiancheng Zhang, Yutong Jiao, Kunshu Song, Chao Zhang, Fan Shi, Hongda Sun, Rui Yan, Cheng Huang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05719v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM在自动化渗透测试（AutoPT）中的应用，属于LLM在特定领域（网络安全）的研究应用。与"Large Language Models"高度相关（10分），因为全文围绕LLM-based AutoPT框架展开。与"LLM Agents"高度相关（10分），因为论文系统分析了AutoPT框架的agent架构、规划、记忆、执行等维度。与"Tool Use"和"Multi-agent Systems"有一定关联（各5分），因为AutoPT涉及工具使用（如攻击工具）和多agent协调（如攻击流程），但论文未深入讨论这些技术的具体实现。其他关键词（如MoE、SFT、RAG等）未在摘要中提及，与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

本文首次系统化分析了基于大语言模型的自动化渗透测试框架的架构设计，并通过大规模实验评估了13个开源框架的性能，为未来研究提供了分类基准和实证数据。

摘要翻译

大型语言模型（LLM）的快速发展为自动化渗透测试（AutoPT）创造了新的机遇，催生了众多旨在实现端到端自主攻击的框架。然而，尽管相关研究激增，现有工作普遍缺乏在统一基准下的系统性架构分析和大规模实证比较。为此，本文首次提出了一个知识系统化研究，聚焦于当前基于LLM的AutoPT框架的架构设计与综合实证评估。在系统化层面，我们从六个维度全面审视了现有框架设计：智能体架构、智能体规划、智能体记忆、智能体执行、外部知识以及基准测试。在实证层面，我们利用统一基准对13个代表性的开源AutoPT框架和2个基线框架进行了大规模实验。实验总计消耗了超过100亿个令牌，生成了超过1500份执行日志，这些日志由一个超过15名网络安全领域专家组成的研究小组耗时四个月进行了人工审查与分析。通过探究这一快速发展领域的最新进展，我们为研究者提供了一个结构化的分类体系以理解现有的基于LLM的AutoPT框架，并提供了一项大规模实证基准以及未来研究的潜在方向。

摘要 (Abstract)

The rapid advancement of Large Language Models (LLMs) has created new opportunities for Automated Penetration Testing (AutoPT), spawning numerous frameworks aimed at achieving end-to-end autonomous attacks. However, despite the proliferation of related studies, existing research generally lacks systematic architectural analysis and large-scale empirical comparisons under a unified benchmark. Therefore, this paper presents the first Systematization of Knowledge (SoK) focusing on the architectural design and comprehensive empirical evaluation of current LLM-based AutoPT frameworks. At systematization level, we comprehensively review existing framework designs across six dimensions: agent architecture, agent plan, agent memory, agent execution, external knowledge, and benchmarks. At empirical level, we conduct large-scale experiments on 13 representative open-source AutoPT frameworks and 2 baseline frameworks utilizing a unified benchmark. The experiments consumed over 10 billion tokens in total and generated more than 1,500 execution logs, which were manually reviewed and analyzed over four months by a panel of more than 15 researchers with expertise in cybersecurity. By investigating the latest progress in this rapidly developing field, we provide researchers with a structured taxonomy to understand existing LLM-based AutoPT frameworks and a large-scale empirical benchmark, along with promising directions for future research.

关键词: Large Language Models, Automated Penetration Testing, LLM-based AutoPT, Agent Architecture, Systematization of Knowledge, Empirical Evaluation, Cybersecurity, Benchmark

80. ❌ Can Large Language Models Reinvent Foundational Algorithms?

作者: Jian Zhao, Haoren Luo, Yu Wang, Yuhan Cao, Pingyue Sheng, Tianxing He 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05716v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心研究LLMs能否重新发明基础算法，直接涉及LLMs、推理能力（Chain of Thought、System 2 Thinking）和自我改进（Self-Correction），这些是论文的核心内容。AI for Science有中等关联，因为论文探讨LLMs在科学发现中的潜力，但未深入具体科学领域。其他关键词如MoE、SFT、RAG等未在摘要中提及，与论文内容无关。

!!! tip deepseek-chat TL;DR

该研究探讨大型语言模型能否重新发明计算机科学中的基础算法，通过Unlearn-and-Reinvent流程测试发现，最强模型Qwen3-4B-Thinking-2507在无提示下能成功重新发明50%的算法，而生成验证器在推理过程中起到关键作用以避免“思维崩溃”现象。

摘要翻译

大型语言模型已展现出推动科学发现的强大潜力。然而，它们是否具备基础性创新能力仍是一个悬而未决的问题。本研究聚焦于基础性创新的一个先决条件：大型语言模型能否重新发明计算机科学中的基础算法？我们提出的“遗忘与再创造”流程，首先通过大型语言模型遗忘技术，从模型的预训练知识中移除特定基础算法（例如Dijkstra算法或Euclid算法），随后在受控环境中测试模型能否重新发明该算法。为实现有效遗忘，我们采用了一种基于GRPO的策略性遗忘方法。通过对10种目标算法、3个强效开源模型及3种提示级别的实验，我们发现：（1）最强模型Qwen3-4B-Thinking-2507在无提示情况下成功再创造了50%的算法，在1级提示下达到70%，在2级提示下达到90%；（2）少量高层级提示能提升再创造成功率，但对于复杂算法，即使是逐步提示也难以成功；（3）在2级提示下，测试时强化学习能使Strassen算法成功实现再创造。通过对输出轨迹的分析与消融实验，我们发现再创造阶段的生成式验证器对维持模型的推理能力具有关键作用，有助于避免“思维崩溃”现象。这些发现为理解大型语言模型创新思维的潜力与当前局限提供了重要见解。

摘要 (Abstract)

LLMs have shown strong potential to advance scientific discovery. Whether they possess the capacity for foundational innovation, however, remains an open question. In this work, we focus on a prerequisite for foundational innovation: can LLMs reinvent foundational algorithms in computer science? Our \textit{Unlearn-and-Reinvent} pipeline applies LLM unlearning to remove a specific foundational algorithm, such as Dijkstra’s or Euclid’s algorithm, from an LLM’s pretrained knowledge, and then tests whether the model can reinvent it in a controlled environment. To enable effective unlearning, we adopt a GRPO-based, on-policy unlearning method. Across 10 target algorithms, 3 strong open-weight models, and 3 hint levels, our experiments demonstrate that (1) the strongest model Qwen3-4B-Thinking-2507 successfully reinvents 50% of the algorithms with no hint, 70% at hint level 1, and 90% at hint level 2; (2) a few high-level hints can enhance the reinvention success rate, but even step-by-step hints fail for those complicated algorithms; and (3) test-time reinforcement learning enables successful reinvention for the Strassen algorithm at hint level 2. Through analyses of output trajectories and ablation studies, we find that generative verifier in the reinvention phase plays a critical role in sustaining models’ reasoning strength, helping to avoid the ``thought collapse’’ phenomenon. These findings offer insights into both the potential and current limits of LLMs’ innovative thinking.

关键词: Large Language Models, foundational algorithms, reinvention, unlearning, reasoning, generative verifier, thought collapse, scientific discovery

81. ❌ QA-MoE: Towards a Continuous Reliability Spectrum with Quality-Aware Mixture of Experts for Robust Multimodal Sentiment Analysis

作者: Yitong Zhu, Yuxuan Jiang, Guanxuan Jiang, Bojing Hou, Peng Yuan Zhou, Ge Lin Kan, Yuyang Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05704v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文QA-MoE专注于多模态情感分析，核心创新是提出了一种质量感知的混合专家框架，直接对应关键词’Mixture of Experts OR MoE OR Sparse Models’，因此给予10分。其他关键词主要涉及大语言模型技术、训练方法、推理优化、代理系统等，而本文研究的是特定任务的多模态模型，未涉及这些通用大模型技术或科学AI应用，因此均为0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态情感分析中动态噪声和模态缺失问题，提出了一个连续可靠性谱系和质量感知混合专家框架，实现了在多样化退化场景下的竞争性或最先进性能。

摘要翻译

多模态情感分析旨在从文本、声学和视觉信号中推断人类情感。然而，在实际场景中，多模态输入常受到动态噪声或模态缺失的影响。现有方法通常将这些不完善情况视为离散案例或假设固定的损坏比例，这限制了其对连续变化可靠性条件的适应能力。为解决此问题，我们首先引入连续可靠性谱，将缺失和质量退化统一到单一框架中。在此基础上，我们提出QA-MoE——一种质量感知的专家混合框架，该框架通过自监督的任意不确定性来量化模态可靠性。该机制显式指导专家路由，使模型能够抑制不可靠信号的误差传播，同时保留任务相关信息。大量实验表明，QA-MoE在多种退化场景中均取得具有竞争力或最先进的性能，并在实践中展现出有前景的“单一检查点适应所有场景”特性。

摘要 (Abstract)

Multimodal Sentiment Analysis (MSA) aims to infer human sentiment from textual, acoustic, and visual signals. In real-world scenarios, however, multimodal inputs are often compromised by dynamic noise or modality missingness. Existing methods typically treat these imperfections as discrete cases or assume fixed corruption ratios, which limits their adaptability to continuously varying reliability conditions. To address this, we first introduce a Continuous Reliability Spectrum to unify missingness and quality degradation into a single framework. Building on this, we propose QA-MoE, a Quality-Aware Mixture-of-Experts framework that quantifies modality reliability via self-supervised aleatoric uncertainty. This mechanism explicitly guides expert routing, enabling the model to suppress error propagation from unreliable signals while preserving task-relevant information. Extensive experiments indicate that QA-MoE achieves competitive or state-of-the-art performance across diverse degradation scenarios and exhibits a promising One-Checkpoint-for-All property in practice.

关键词: Multimodal Sentiment Analysis, Mixture of Experts, Quality-Aware, Continuous Reliability Spectrum, Aleatoric Uncertainty, Expert Routing, One-Checkpoint-for-All, Robustness

82. ❌ SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT

作者: Guan-Yan Yang, Wei-Ling Wen, Shu-Yuan Ku, Farn Wang, Kuo-Hui Yeh 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05711v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出SemLink，一种基于Siamese Sentence-BERT的语义感知超链接验证测试预言机。论文与’Large Language Models OR LLMs OR Foundation Models’相关度5分，因为摘要提到LLMs（如GPT-5.2）用于语义理解，但论文主要使用SBERT而非LLMs作为核心方法。与’Pre-training OR Continual Pre-training OR Domain Adaptation’相关度5分，因为SemLink使用预训练的SBERT作为骨干网络，涉及预训练模型的应用。其他关键词与论文内容无关，论文专注于特定应用（网页超链接验证），未涉及MoE、SLMs、Scaling Laws、微调、对齐、推理、代理、压缩等大模型核心技术。

!!! tip deepseek-chat TL;DR

该论文提出SemLink，一种基于Siamese Sentence-BERT的自动化测试预言机，用于检测超链接的语义漂移问题，在保持高召回率（96.00%）的同时，比LLMs快47.5倍且计算资源需求更低。

摘要翻译

Web应用程序高度依赖超链接来连接分散的信息资源。然而，网络的动态性导致了链接失效问题，即目标资源变得不可访问；更隐蔽的是语义漂移现象，即虽然存在有效的HTTP 200连接，但目标内容与源上下文不再匹配。传统验证工具主要通过检查HTTP状态码来充当崩溃判定器，往往无法检测语义不一致性，从而损害网络完整性和用户体验。尽管大语言模型具备语义理解能力，但其存在高延迟、隐私顾虑以及在大规模回归测试中成本过高的问题。本文提出SemLink——一种用于语义超链接验证的新型自动化测试判定器。SemLink采用以预训练Sentence-BERT为骨干的孪生神经网络架构，通过计算超链接源上下文（锚文本、周边DOM元素及视觉特征）与其目标页面内容之间的语义连贯性实现验证。为训练和评估模型，我们构建了超链接-网页正配对数据集，这是一个包含超过60,000个语义对的严格标注语料库。评估结果表明，SemLink实现了96.00%的召回率，与前沿大语言模型性能相当，同时运行速度提升约47.5倍且所需计算资源显著降低。本研究弥合了传统语法检查器与昂贵生成式人工智能之间的技术鸿沟，为自动化网络质量保障提供了鲁棒高效的解决方案。

摘要 (Abstract)

Web applications rely heavily on hyperlinks to connect disparate information resources. However, the dynamic nature of the web leads to link rot, where targets become unavailable, and more insidiously, semantic drift, where a valid HTTP 200 connection exists, but the target content no longer aligns with the source context. Traditional verification tools, which primarily function as crash oracles by checking HTTP status codes, often fail to detect semantic inconsistencies, thereby compromising web integrity and user experience. While Large Language Models (LLMs) offer semantic understanding, they suffer from high latency, privacy concerns, and prohibitive costs for large-scale regression testing. In this paper, we propose SemLink, a novel automated test oracle for semantic hyperlink verification. SemLink leverages a Siamese Neural Network architecture powered by a pre-trained Sentence-BERT (SBERT) backbone to compute the semantic coherence between a hyperlink’s source context (anchor text, surrounding DOM elements, and visual features) and its target page content. To train and evaluate our model, we introduce the Hyperlink-Webpage Positive Pairs (HWPPs) dataset, a rigorously constructed corpus of over 60,000 semantic pairs. Our evaluation demonstrates that SemLink achieves a Recall of 96.00%, comparable to state-of-the-art LLMs (GPT-5.2), while operating approximately 47.5 times faster and requiring significantly fewer computational resources. This work bridges the gap between traditional syntactic checkers and expensive generative AI, offering a robust and efficient solution for automated web quality assurance.

关键词: SemLink, hyperlink verification, semantic drift, Siamese Neural Network, Sentence-BERT, automated test oracle, web quality assurance, semantic coherence

作者: Xuecong Liu, Mengzhu Ding, Zixuan Sun, Zhang Li, Xichao Teng 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05689v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文CRFT专注于计算机视觉领域的跨模态图像配准，提出了一种基于Transformer的特征流学习框架。虽然论文使用了Transformer架构，但其应用领域是图像处理而非自然语言处理，因此与绝大多数大语言模型（LLM）相关的关键词完全无关。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文提到在医学成像等科学领域有应用潜力，但这不是论文的核心研究内容，因此给予5分（有一定关联）。其他所有关键词都涉及LLM技术原理、训练方法、推理优化、对齐、代理系统等，与本文的计算机视觉研究主题无直接关联。

!!! tip deepseek-chat TL;DR

本文提出了一种基于Transformer的跨模态图像配准框架CRFT，通过特征流学习和迭代注意力机制实现了在多种模态下准确鲁棒的图像对齐，并在多个数据集上超越了现有方法。

摘要翻译

本文提出一致循环特征流变换器（CRFT），这是一种基于特征流学习的、从粗到精的统一框架，用于实现鲁棒的跨模态图像配准。CRFT在基于变换器的架构中学习一种与模态无关的特征流表示，该架构联合执行特征对齐与流估计。粗配准阶段通过多尺度特征相关性建立全局对应关系，而精配准阶段则通过分层特征融合与自适应空间推理来细化局部细节。为增强几何适应性，一个结合了空间几何变换（SGT）的迭代差异引导注意力机制循环优化流场，逐步捕捉细微的空间不一致性并强制实现特征级一致性。该设计使得CRFT能够在较大的仿射与尺度变化下实现精确对齐，同时保持跨模态的结构连贯性。在多种跨模态数据集上的大量实验表明，CRFT在精度与鲁棒性上均持续优于当前最先进的配准方法。除配准任务外，CRFT为多模态空间对应问题提供了一个可泛化的范式，在遥感、自主导航与医学成像等领域具有广泛适用性。代码与数据集已公开于 https://github.com/NEU-Liuxuecong/CRFT。

摘要 (Abstract)

We present Consistent-Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration. CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging. Code and datasets are publicly available at https://github.com/NEU-Liuxuecong/CRFT.

关键词: Cross-modal Image Registration, Feature Flow Learning, Transformer Architecture, Spatial Geometric Transform, Iterative Discrepancy-guided Attention, Multimodal Spatial Correspondence, Medical Imaging, Remote Sensing

84. ❌ Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

作者: Zhen Cheng, Hao-Bo Yang, Wan-Yi Huang, Jin-Long Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05688v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM推理效率优化，直接涉及KV缓存压缩和长上下文处理，与’KV Cache Compression OR Linear Attention OR FlashAttention’（10分）和’Context Window Extension OR Long Context LLMs’（8分）高度相关；论文明确研究LLM，与’Large Language Models OR LLMs OR Foundation Models’（10分）高度相关；方法通过注意力转换提升推理效率，与’Speculative Decoding OR Inference Acceleration’（8分）相关；其他关键词如MoE、SFT、RAG等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Attention Editing的框架，用于将已训练的大型语言模型转换为新的注意力架构（如MLA和GateSWA），无需重新预训练，从而显著提升长上下文和长生成场景下的推理效率，同时保持模型性能。

摘要翻译

键值（KV）缓存的内存与带宽占用，在长上下文和长生成场景中日益主导大语言模型推理成本。多头潜在注意力（MLA）和混合滑动窗口注意力（SWA）等架构能够缓解这一瓶颈，但将其集成到现有模型中仍存在困难。现有方法对源注意力模块和目标注意力模块均施加了细粒度的结构约束，难以满足实际部署的可行性要求。本文提出注意力编辑（Attention Editing），一种实用的框架，可将已训练的大语言模型（LLM）转换为采用新注意力架构的模型，而无需从头开始重新预训练。注意力编辑通过可学习的目标模块替换原始注意力机制，并采用渐进式蒸馏进行训练，该过程包括：（1）结合中间激活监督的逐层教师强制优化，以防止冷启动误差累积；（2）基于下一词分布进行模型级蒸馏，可选择性地通过弱特征匹配进行正则化。我们在两种不同的目标架构——MLA以及门控混合SWA设计（GateSWA）上实例化了该框架，并将其应用于Qwen3-8B和Qwen3-30B-A3B模型。所得模型在保持竞争力的性能的同时，实现了显著的效率提升，证明大规模注意力转换既可行又鲁棒。值得注意的是，实验在昇腾910B集群上进行，为国产硬件提供了一个实用的训练案例研究。

摘要 (Abstract)

Key-Value (KV) cache memory and bandwidth increasingly dominate large language model inference cost in long-context and long-generation regimes. Architectures such as multi-head latent attention (MLA) and hybrid sliding-window attention (SWA) can alleviate this bound, but integrating them into existing models remains difficult. Prior methods impose fine-grained structural requirements on both source and target attention modules, which cannot meet the feasible requirement in practical deployment. We present Attention Editing, a practical framework for converting already-trained large language models (LLMs) with new attention architectures without re-pretraining from scratch. Attention editing replaces the original attention with a learnable target module and trains it using progressive distillation, consisting of (1) layer-wise teacher-forced optimization with intermediate activation supervision to prevent cold-start error accumulation, and (2) model-level distillation on next-token distributions, optionally regularized by weak feature matching. We instantiate the framework on two different target–MLA and GateSWA, a gated hybrid SWA design, and apply it to Qwen3-8B and Qwen3-30B-A3B. The resulting models maintain competitive performance while delivering substantial efficiency improvements, demonstrating that large-scale attention conversion is both feasible and robust. Notably, experiments are conducted on an Ascend 910B clusters, offering a practical training case study on domestic hardware.

关键词: Attention Editing, KV cache, large language models, inference efficiency, attention architecture conversion, progressive distillation, long-context, MLA

85. ❌ LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

作者: Ojas Jain, Dhruv Kumar 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05681v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是评估LLM在Ludo棋盘游戏中的战略决策能力，属于LLM应用和评估研究。高度相关关键词：LLMs（核心研究对象，10分）、LLM Agents（论文评估LLM作为游戏代理，10分）、Multi-agent Systems（涉及4玩家游戏和代理协调，8分）、Chain of Thought和System 2 Thinking（评估战略推理和深度思考，8分）。其他关键词如MoE、SFT、RAG等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该研究通过LudoBench基准评估LLM在不确定环境下的战略推理能力，发现LLM与游戏理论基准仅40-46%一致，并识别出两种行为原型，揭示了提示敏感性是主要弱点。

摘要翻译

我们推出LudoBench，这是一个用于评估大语言模型在Ludo游戏中战略推理能力的基准测试框架。Ludo是一种随机性多智能体棋盘游戏，其骰子机制、棋子捕获、安全格点导航及归家路径推进等规则引入了具有实质意义的规划复杂性。该基准包含480个人工设计的特定场景，涵盖12种行为特征各异的决策类别，每个类别均隔离出特定的战略选择。我们还贡献了一个功能完备的4玩家Ludo模拟器，支持随机智能体、启发式智能体、博弈论智能体及大语言模型智能体。其中博弈论智能体采用期望最小最大搜索算法配合深度受限的前瞻机制，为超越贪婪启发式策略提供了原则性的战略上限基准。通过对涵盖四个模型系列的六个模型进行评估，我们发现所有模型与博弈论基准策略的一致率仅为40-46%。模型行为呈现明显分异：可分为专注完成棋子但忽视发展进程的"终结者"类型，以及注重发展但无法完成棋子的"建设者"类型，每种类型仅能体现博弈论策略的一半维度。研究还发现，在相同棋盘状态下，当引入历史条件驱动的"积怨"情境框架时，模型会表现出可量化的行为偏移，这揭示了提示词敏感性是其关键脆弱点。LudoBench为评估不确定性环境下大语言模型的战略推理能力提供了一个轻量化且可解释的框架。所有代码、特定场景数据集（480条条目）及模型输出均已公开于https://anonymous.4open.science/r/LudoBench-5CBF/。

摘要 (Abstract)

We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt-sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries) and model outputs are available at https://anonymous.4open.science/r/LudoBench-5CBF/

关键词: LLM evaluation, strategic reasoning, board game, multi-agent systems, behavioral decision-making, game theory, benchmark, uncertainty

86. ❌ From Incomplete Architecture to Quantified Risk: Multimodal LLM-Driven Security Assessment for Cyber-Physical Systems

作者: Shaofei Huang, Christopher M. Poskitt, Lwin Khin Shar 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05674v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是使用多模态大语言模型（LLMs）进行网络安全评估，因此与’Large Language Models’高度相关（10分）。论文提到使用’prompt chaining’和’few-shot learning’，这与’In-context Learning’相关（5分）。论文涉及’architectural reasoning’，这与’Chain of Thought’和’System 2 Thinking’有一定关联（各5分）。其他关键词如MoE、SLMs、Scaling Laws、训练方法、对齐、RAG、推理加速、量化、AI for Science等，论文未直接涉及或仅作为背景技术提及，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对网络物理系统因架构文档不完整导致安全评估困难的问题，提出了一种基于多模态大语言模型的架构中心安全威胁风险评估方法（ASTRAL），通过案例研究和专家评估验证了该方法的实用性和可靠性。

摘要翻译

信息物理系统常面临架构文档不完整或信息过时的问题，这源于遗留技术、知识管理缺失以及长生命周期中多子系统集成的复杂性。此类架构不完整性阻碍了可靠的安全评估，因为不准确或缺失的架构知识限制了系统依赖关系、攻击面和风险传播路径的识别。为应对这一根本性挑战，本文提出ASTRAL（基于架构的安全威胁风险评估系统，使用多模态大语言模型实现），这是一种以架构为中心的安全评估技术，其原型工具由多模态大语言模型驱动。该方法在文档碎片化或缺失时，可协助从业者重构和分析信息物理系统架构。通过利用提示链、少样本学习和架构推理，ASTRAL能够从异构数据源中提取并综合系统表征。通过将大语言模型推理与架构建模相结合，本方法支持对信息物理系统进行自适应威胁识别和定量风险估计。我们通过多个信息物理系统案例的消融研究以及一项包含14位经验丰富的网络安全从业者的专家评估来验证该方法。从业者反馈表明，ASTRAL对于支持以架构为中心的安全评估具有实用性和可靠性。总体而言，研究结果表明该方法能够为更明智的网络安全风险管理决策提供支持。

摘要 (Abstract)

Cyber-physical systems often contend with incomplete architectural documentation or outdated information resulting from legacy technologies, knowledge management gaps, and the complexity of integrating diverse subsystems over extended operational lifecycles. This architectural incompleteness impedes reliable security assessment, as inaccurate or missing architectural knowledge limits the identification of system dependencies, attack surfaces, and risk propagation pathways. To address this foundational challenge, this paper introduces ASTRAL (Architecture-Centric Security Threat Risk Assessment using LLMs), an architecture-centric security assessment technique implemented in a prototype tool powered by multimodal LLMs. The proposed approach assists practitioners in reconstructing and analysing CPS architectures when documentation is fragmented or absent. By leveraging prompt chaining, few-shot learning, and architectural reasoning, ASTRAL extracts and synthesises system representations from disparate data sources. By integrating LLM reasoning with architectural modelling, our approach supports adaptive threat identification and quantitative risk estimation for cyber-physical systems. We evaluated the approach through an ablation study across multiple CPS case studies and an expert evaluation involving 14 experienced cybersecurity practitioners. Practitioner feedback suggests that ASTRAL is useful and reliable for supporting architecture-centric security assessment. Overall, the results indicate that the approach can support more informed cyber risk management decisions.

关键词: multimodal LLMs, cyber-physical systems, security assessment, architectural reasoning, risk estimation, prompt chaining, few-shot learning, ASTRAL

87. ❌ CuraLight: Debate-Guided Data Curation for LLM-Centered Traffic Signal Control

作者: Qing Guo, Xinhang Li, Junyu Chen, Zheng Guo, Shengzhe Xu, Lin Zhang, Lei Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05663v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是LLM在交通信号控制领域的应用创新，明确使用LLM作为控制器并进行微调，因此与’Large Language Models’和’Post-training’高度相关（10分）。其他关键词如MoE、SLMs、Scaling Laws、RAG、CoT等均未涉及，得0分。AI for Science虽属科学应用，但论文聚焦交通工程而非生物/化学信息学，相关性不足，得0分。

!!! tip deepseek-chat TL;DR

该研究提出CuraLight框架，通过RL辅助探索和多LLM辩论机制来优化基于LLM的交通信号控制器，实验表明其在异构路网中能显著减少平均旅行时间、排队长度和等待时间。

摘要翻译

交通信号控制是智能交通系统的核心组成部分，旨在减少拥堵、排放和行程时间。近期基于强化学习和大语言模型的方法提升了适应性，但仍存在可解释性有限、交互数据不足以及对异质交叉口泛化能力弱的问题。本文提出CuraLight，一个以大语言模型为中心的框架，其中强化学习智能体辅助微调基于大语言模型的交通信号控制器。强化学习智能体探索交通环境并生成高质量的交互轨迹，这些轨迹被转化为提示-响应对以进行模仿微调。一个多LLM集成审议系统通过结构化辩论进一步评估候选信号配时方案，为训练提供偏好感知的监督信号。在SUMO仿真环境中，基于济南、杭州和亦庄的真实异质路网进行的实验表明，CuraLight持续优于现有先进基线方法，平均行程时间降低5.34%，平均排队长度减少5.14%，平均等待时间降低7.02%。结果凸显了强化学习辅助探索与基于审议的数据策展相结合的方法，对于可扩展且可解释的交通信号控制的有效性。

摘要 (Abstract)

Traffic signal control (TSC) is a core component of intelligent transportation systems (ITS), aiming to reduce congestion, emissions, and travel time. Recent approaches based on reinforcement learning (RL) and large language models (LLMs) have improved adaptivity, but still suffer from limited interpretability, insufficient interaction data, and weak generalization to heterogeneous intersections. This paper proposes CuraLight, an LLM-centered framework where an RL agent assists the fine-tuning of an LLM-based traffic signal controller. The RL agent explores traffic environments and generates high-quality interaction trajectories, which are converted into prompt-response pairs for imitation fine-tuning. A multi-LLM ensemble deliberation system further evaluates candidate signal timing actions through structured debate, providing preference-aware supervision signals for training. Experiments conducted in SUMO across heterogeneous real-world networks from Jinan, Hangzhou, and Yizhuang demonstrate that CuraLight consistently outperforms state-of-the-art baselines, reducing average travel time by 5.34 percent, average queue length by 5.14 percent, and average waiting time by 7.02 percent. The results highlight the effectiveness of combining RL-assisted exploration with deliberation-based data curation for scalable and interpretable traffic signal control.

关键词: Large Language Models, Traffic Signal Control, Reinforcement Learning, Fine-tuning, Multi-LLM Ensemble, Data Curation, Intelligent Transportation Systems, SUMO Simulation

作者: Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05673v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于视觉导航中的生成策略优化，使用Schrödinger Bridges和扩散模型，核心贡献是提出Rectified Schrödinger Bridge Matching (RSBM)框架，通过控制熵正则化参数ε来平衡多模态覆盖和路径平直度，实现少步收敛。所有评分关键词均与大语言模型、训练技术、推理优化、AI代理、科学AI应用等直接相关，而本文研究的是Embodied AI中的视觉导航和生成模型，未涉及任何大语言模型技术或相关应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对视觉导航中基于Schrödinger Bridges的生成策略需要多步积分导致实时控制困难的问题，提出了Rectified Schrödinger Bridge Matching (RSBM)框架，通过利用速度场结构不变性和减少条件速度方差，在仅3个积分步骤下实现了高保真导航性能。

摘要翻译

视觉导航是具身人工智能的核心挑战，它要求自主智能体将高维感知观测转化为连续、长程的动作轨迹。尽管基于扩散模型和薛定谔桥的生成策略能有效捕捉多模态动作分布，但由于高方差随机传输的存在，它们需要数十个积分步骤，这构成了实时机器人控制的关键障碍。我们提出了校正薛定谔桥匹配，该框架利用了标准薛定谔桥（$\varepsilon=1$，最大熵传输）与确定性最优传输（$\varepsilon\to 0$，如条件流匹配中所示）之间共享的速度场结构，该结构由单一熵正则化参数 $\varepsilon$ 控制。我们证明了两个关键结果：（1）条件速度场的函数形式在整个 $\varepsilon$ 谱系中保持不变，这使得单一网络能够服务于所有正则化强度；（2）降低 $\varepsilon$ 会线性减少条件速度方差，从而实现更稳定的粗步长常微分方程积分。通过锚定于一个缩短传输距离的学习条件先验，RSBM 在一个中间 $\varepsilon$ 值下运行，以平衡多模态覆盖与路径平直度。实验表明，标准桥方法需要 $\geq 10$ 步才能收敛，而 RSBM 仅用 3 个积分步骤即可实现超过 94% 的余弦相似度和 92% 的成功率——无需蒸馏或多阶段训练——从而显著缩小了高保真生成策略与具身人工智能低延迟需求之间的差距。

摘要 (Abstract)

Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into continuous, long-horizon action trajectories. While generative policies based on diffusion models and Schrödinger Bridges (SB) effectively capture multimodal action distributions, they require dozens of integration steps due to high-variance stochastic transport, posing a critical barrier for real-time robotic control. We propose Rectified Schrödinger Bridge Matching (RSBM), a framework that exploits a shared velocity-field structure between standard Schrödinger Bridges ($\varepsilon=1$, maximum-entropy transport) and deterministic Optimal Transport ($\varepsilon\to 0$, as in Conditional Flow Matching), controlled by a single entropic regularization parameter $\varepsilon$. We prove two key results: (1) the conditional velocity field’s functional form is invariant across the entire $\varepsilon$-spectrum (Velocity Structure Invariance), enabling a single network to serve all regularization strengths; and (2) reducing $\varepsilon$ linearly decreases the conditional velocity variance, enabling more stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate $\varepsilon$ that balances multimodal coverage and path straightness. Empirically, while standard bridges require $\geq 10$ steps to converge, RSBM achieves over 94% cosine similarity and 92% success rate in merely 3 integration steps – without distillation or multi-stage training – substantially narrowing the gap between high-fidelity generative policies and the low-latency demands of Embodied AI.

关键词: Visual Navigation, Schrödinger Bridges, Rectified Schrödinger Bridge Matching, Generative Policies, Few-Step Integration, Embodied AI, Optimal Transport, Entropic Regularization

89. ❌ SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation

作者: Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05656v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是视觉-语言-动作（VLA）模型的推理加速技术，通过自蒸馏方法将多步去噪压缩为单步前向传播。该论文与大多数关键词无关，因为这些关键词主要针对大语言模型（LLMs）及其相关技术（如微调、对齐、推理、代理等），而本文专注于机器人操作领域的VLA模型。唯一相关的关键词是’Speculative Decoding OR Inference Acceleration’，因为论文的核心贡献是加速推理（9.6倍去噪加速，端到端延迟从274ms降至83ms），但并非专门针对LLMs的推测解码技术，因此给予5分（有一定关联）。其他关键词如模型压缩、世界模型等虽涉及相关概念，但论文未直接讨论这些具体技术。

!!! tip deepseek-chat TL;DR

论文提出了SnapFlow方法，通过渐进式自蒸馏将基于流匹配的视觉-语言-动作模型的多步去噪过程压缩为单步前向传播，在保持性能（98.75%平均成功率）的同时实现了9.6倍去噪加速和端到端延迟从274ms降至83ms。

摘要翻译

基于流匹配的视觉-语言-动作（Vision-Language-Action, VLA）模型——如 pi0、pi0.5 和 SmolVLA——在通用机器人操作任务上实现了最先进的性能，但其迭代去噪过程（通常需 10 步 ODE 求解）引入了显著的延迟：在现代 GPU 上，仅去噪步骤就占端到端推理时间的 80%。简单地减少步数并不可靠，由于速度场未针对单步跳跃进行校准，这会导致大多数任务的成功率下降。本文提出 SnapFlow，一种即插即用的自蒸馏方法，可将流匹配 VLA 模型的多步去噪压缩为单次前向传播（1-NFE）。SnapFlow 将标准流匹配样本与一致性样本混合，后者的目标是通过模型自身边缘速度预测计算出的两步欧拉捷径速度，从而避免了条件速度引起的轨迹漂移（我们在理论上进行了分析）。通过零初始化的目标时间嵌入，网络可在单一架构内切换局部速度估计与全局单步生成。SnapFlow 无需外部教师模型、无需改变架构，且仅需单 GPU 训练约 12 小时。我们在两种参数规模相差 6 倍的 VLA 架构上使用相同超参数进行验证：在 pi0.5（3B 参数）模型上，跨越四个 LIBERO 测试集（40 项任务，400 个回合），SnapFlow 实现了 98.75% 的平均成功率——与 10 步教师模型的 97.75% 相当并略有超越——同时去噪速度提升 9.6 倍，端到端延迟从 274ms 降至 83ms；在 SmolVLA（500M 参数）上，其 MSE 降低 8.3%，端到端加速达 3.56 倍。在长视野任务上的动作步数扫描实验表明，SnapFlow 在不同执行视野下均保持优势，在 n_act=5 时达到 93% 成功率，而基线仅达 90%。SnapFlow 与层蒸馏及令牌剪枝方法正交，可实现组合式加速。

摘要 (Abstract)

Vision-Language-Action (VLA) models based on flow matching – such as pi0, pi0.5, and SmolVLA – achieve state-of-the-art generalist robotic manipulation, yet their iterative denoising, typically 10 ODE steps, introduces substantial latency: on a modern GPU, denoising alone accounts for 80% of end-to-end inference time. Naively reducing the step count is unreliable, degrading success on most tasks due to the velocity field being uncalibrated for single-step jumps. We present SnapFlow, a plug-and-play self-distillation method that compresses multi-step denoising into a single forward pass (1-NFE) for flow-matching VLAs. SnapFlow mixes standard flow-matching samples with consistency samples whose targets are two-step Euler shortcut velocities computed from the model’s own marginal velocity predictions, avoiding the trajectory drift caused by conditional velocities, as we analyze theoretically. A zero-initialized target-time embedding lets the network switch between local velocity estimation and global one-step generation within a single architecture. SnapFlow requires no external teacher, no architecture changes, and trains in ~12h on a single GPU. We validate on two VLA architectures spanning a 6x parameter range, with identical hyperparameters: on pi0.5 (3B) across four LIBERO suites (40 tasks, 400 episodes), SnapFlow achieves 98.75% average success – matching the 10-step teacher at 97.75% and slightly exceeding it – with 9.6x denoising speedup and end-to-end latency reduced from 274ms to 83ms; on SmolVLA (500M), it reduces MSE by 8.3% with 3.56x end-to-end acceleration. An action-step sweep on long-horizon tasks reveals that SnapFlow maintains its advantage across execution horizons, achieving 93% at n_act=5 where the baseline reaches only 90%. SnapFlow is orthogonal to layer-distillation and token-pruning approaches, enabling compositional speedups.

关键词: Vision-Language-Action models, flow matching, self-distillation, inference acceleration, one-step generation, robotic manipulation, progressive distillation, latency reduction

90. ❌ LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals

作者: Lihao Sun, Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05655v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM的思维链推理过程，将其建模为表示空间中的轨迹，分析其几何结构、正确/错误解的分化模式，并开发基于轨迹的干预方法。因此与’Large Language Models’、‘Chain of Thought’、‘Mechanistic Interpretability’高度相关（10分）；与’System 2 Thinking’、‘Self-Correction’有一定关联（8分），因为涉及深度推理分析和推理纠正；其他关键词如MoE、SFT、RAG等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该研究将大语言模型的思维链推理建模为表示空间中的轨迹，揭示了推理步骤的几何结构以及正确与错误解在后期阶段的分化规律，并提出了基于轨迹的干预框架来实现推理纠正和长度控制。

摘要翻译

本研究将大语言模型的思维链生成过程表征为表征空间中的结构化轨迹。我们证明数学推理过程会穿越功能有序、步骤特定的子空间，这些子空间随网络层深度增加而呈现增强的可分离性。这种结构在基础模型中已然存在，而推理训练主要加速模型向终止相关子空间的收敛，而非引入新的表征组织形式。虽然早期推理步骤遵循相似轨迹，但正确与错误解答在后期阶段呈现系统性分化。这种晚期分化特性使得我们能在推理过程中期预测最终答案的正确性，其ROC-AUC指标最高可达0.87。此外，我们提出了基于轨迹的引导技术——一种推理时干预框架，能够根据推导的理想轨迹实现推理校正与长度控制。这些研究成果共同确立了推理轨迹作为解释、预测和控制大语言模型推理行为的几何视角。

摘要 (Abstract)

This work characterizes large language models’ chain-of-thought generation as a structured trajectory through representation space. We show that mathematical reasoning traverses functionally ordered, step-specific subspaces that become increasingly separable with layer depth. This structure already exists in base models, while reasoning training primarily accelerates convergence toward termination-related subspaces rather than introducing new representational organization. While early reasoning steps follow similar trajectories, correct and incorrect solutions diverge systematically at late stages. This late-stage divergence enables mid-reasoning prediction of final-answer correctness with ROC-AUC up to 0.87. Furthermore, we introduce trajectory-based steering, an inference-time intervention framework that enables reasoning correction and length control based on derived ideal trajectories. Together, these results establish reasoning trajectories as a geometric lens for interpreting, predicting, and controlling LLM reasoning behavior.

关键词: large language models, chain-of-thought reasoning, representation geometry, reasoning trajectories, correctness prediction, trajectory-based steering, reasoning correction, mathematical reasoning

91. ❌ Multiscale Physics-Informed Neural Network for Complex Fluid Flows with Long-Range Dependencies

作者: Prashant Kumar, Rajesh Ranjan 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05652v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于物理信息神经网络（PINN）在复杂流体力学中的应用，属于科学机器学习领域。论文内容与绝大多数关键词（涉及大模型技术、训练方法、推理优化、对齐、智能体等）完全无关，因为这些关键词特指大语言模型及相关技术。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于AI在科学（流体力学）中的应用，但并非核心匹配（论文未涉及生物信息学或化学信息学），因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出了一种域分解和偏移物理信息神经网络（DDS-PINN）框架，用于以最小监督解决具有长程依赖的复杂多尺度流体流动问题，并在多个基准测试和湍流模拟中实现了高精度和高效收敛。

摘要翻译

流体流动受非线性纳维-斯托克斯方程控制，即使从可预测的初始条件出发，也可能呈现多尺度动力学特征。在科学机器学习中，预测此类现象在收敛速度、数据需求和求解精度方面仍存在巨大挑战。对于复杂流体流动，由遥远边界条件引发的长程空间依赖性进一步加剧了这些困难，通常需要大量监督数据才能获得可接受的结果。本文提出域分解与平移物理信息神经网络框架，旨在以最小监督解析此类多尺度相互作用。该框架通过采用具有统一全局损失的局部化网络，在保持局部精度的同时捕捉全局依赖性。该方法的鲁棒性在一系列基准测试中得到验证，包括多尺度线性微分方程、非线性伯格斯方程以及平板边界层的无数据纳维-斯托克斯模拟。最后，将DDS-PINN应用于计算具有挑战性的后向台阶流动问题：在层流工况下，该模型无需任何数据即可获得与计算流体力学相当的结果，准确预测边界层厚度、分离及再附着长度；对于雷诺数Re=10,000的湍流后向台阶流动，该框架仅使用500个随机监督点即可收敛至O(10^-4)精度，其监督数据量不足计算域总量的0.3%，在精度上超越了基于残差的注意力物理信息神经网络等现有方法。该方法展现了基于稀疏实验测量实现复杂湍流超分辨率重建的强大潜力。

摘要 (Abstract)

Fluid flows are governed by the nonlinear Navier-Stokes equations, which can manifest multiscale dynamics even from predictable initial conditions. Predicting such phenomena remains a formidable challenge in scientific machine learning, particularly regarding convergence speed, data requirements, and solution accuracy. In complex fluid flows, these challenges are exacerbated by long-range spatial dependencies arising from distant boundary conditions, which typically necessitate extensive supervision data to achieve acceptable results. We propose the Domain-Decomposed and Shifted Physics-Informed Neural Network (DDS-PINN), a framework designed to resolve such multiscale interactions with minimal supervision. By utilizing localized networks with a unified global loss, DDS-PINN captures global dependencies while maintaining local precision. The robustness of the approach is demonstrated across a suite of benchmarks, including a multiscale linear differential equation, the nonlinear Burgers’ equation, and data-free Navier-Stokes simulations of flat-plate boundary layers. Finally, DDS-PINN is applied to the computationally challenging backward-facing step (BFS) problem; for laminar regimes (Re = 100), the model yields results comparable to computational fluid dynamics (CFD) without the need for any data, accurately predicting boundary layer thickness, separation, and reattachment lengths. For turbulent BFS flow at Re = 10,000, the framework achieves convergence to O(10^-4) using only 500 random supervision points (< 0.3 % of the total domain), outperforming established methods like Residual-based Attention-PINN in accuracy. This approach demonstrates strong potential for the super-resolution of complex turbulent flows from sparse experimental measurements.

关键词: Physics-Informed Neural Network, Multiscale Fluid Flows, Long-Range Dependencies, Domain Decomposition, Navier-Stokes Equations, Turbulent Flow Simulation, Data-Free Learning, Scientific Machine Learning

92. ❌ PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models

作者: Zhiyong Ma, Zhitao Deng, Huan Tang, Jialin Chen, Zhijun Zheng, Zhengping Li, Qingyuan Chuai 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05634v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究扩散模型的机器遗忘技术，专注于提高遗忘效率，与所有评分关键词（均围绕大语言模型、深度学习技术原理及其应用）完全无关。论文未涉及LLMs、MoE、SLMs、缩放定律、预训练/后训练、对齐、RLHF、PEFT、RAG、上下文扩展、注意力优化、推理方法、智能体、量化、推理加速、幻觉缓解、可解释性、世界模型、模型合并、上下文学习或科学AI应用。

!!! tip deepseek-chat TL;DR

论文提出了一种名为PECKER的高效机器遗忘方法，通过引入显著性掩码优先更新对遗忘目标数据贡献最大的参数，在扩散模型中实现了更快的遗忘速度并减少了计算开销，同时在CIFAR-10和STL-10数据集上保持了图像分布的对齐。

摘要翻译

机器遗忘已成为生成式人工智能模型安全合规运行的关键技术。现有遗忘方法虽有效，但大多伴随极高的训练时间与计算开销。我们的分析表明，其根本原因在于梯度更新方向失准，导致训练效率降低并破坏收敛稳定性。为缓解这些问题，我们提出PECKER——一种高效遗忘方法，其性能可匹配或超越主流方法。在蒸馏框架内，PECKER引入显著性掩码机制，优先更新对目标数据遗忘贡献最大的参数，从而减少不必要的梯度计算，在保持遗忘效能的同时显著缩短整体训练时间。本方法能快速生成遗忘特定类别或概念的样本，在CIFAR-10和STL-10数据集上与真实图像分布高度吻合，在类别遗忘和概念遗忘任务中均实现了更短的训练周期。

摘要 (Abstract)

Machine unlearning (MU) has become a critical technique for GenAI models’ safe and compliant operation. While existing MU methods are effective, most impose prohibitive training time and computational overhead. Our analysis suggests the root cause lies in poorly directed gradient updates, which reduce training efficiency and destabilize convergence. To mitigate these issues, we propose PECKER, an efficient MU approach that matches or outperforms prevailing methods. Within a distillation framework, PECKER introduces a saliency mask to prioritize updates to parameters that contribute most to forgetting the targeted data, thereby reducing unnecessary gradient computation and shortening overall training time without sacrificing unlearning efficacy. Our method generates samples that unlearn related class or concept more quickly, while closely aligning with the true image distribution on CIFAR-10 and STL-10 datasets, achieving shorter training times for both class forgetting and concept forgetting.

关键词: Machine Unlearning, Diffusion Models, Efficient Training, Saliency Mask, Gradient Updates, Class Forgetting, Concept Forgetting, Computational Overhead

93. ❌ Beyond Behavior: Why AI Evaluation Needs a Cognitive Revolution

作者: Amir Konigsberg 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05631v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文《Beyond Behavior: Why AI Evaluation Needs a Cognitive Revolution》是一篇哲学/方法论论文，讨论AI评估的认知基础和历史局限性，不涉及具体的大模型技术、深度学习应用或技术原理创新。论文核心是批判行为主义评估范式，呼吁向认知主义转变，但未提及任何评分关键词中的具体技术（如LLM、MoE、RLHF等）、应用领域（如生物信息学）或技术原理（如缩放定律、注意力机制）。所有关键词均与论文内容完全无关，因此全部评分为0。

!!! tip deepseek-chat TL;DR

该论文认为当前AI评估过度依赖行为主义范式，限制了研究内部认知过程的能力，呼吁进行类似心理学认知革命的范式转变，以更全面地理解智能系统。

摘要翻译

1950年，艾伦·图灵提出用行为测试取代“机器能否思考？”这一问题：如果一台机器的输出与思考者的输出无法区分，那么它是否真正思考的问题便可搁置。本文认为，图灵的这一步不仅是实用性的简化，更是一种认识论上的承诺——即关于何种证据可被视为与智能归属相关的决定，且这一承诺已在七十年间悄然制约着人工智能研究。我们追溯了图灵的行为认识论如何嵌入该领域的评估体系，致使一系列关于过程、机制与内部组织的问题——这些正是认知心理学、神经科学及相关学科已学会追问的问题——变得无法被提出。我们将其与心理学从行为主义到认知主义的转变进行结构类比：正如心理学曾因局限于研究可观测行为而无法有效探究内部心理过程，直至放弃这一承诺；人工智能对行为评估的执着也使其无法区分那些通过根本不同的计算过程却实现相同输出的系统，而这一区分恰恰是智能归属所依赖的。我们认为，该领域需要一场堪比认知革命的认识论转型：并非抛弃行为证据，而是认识到仅凭行为证据不足以支撑该领域试图构建的理论主张。我们阐明了人工智能后行为主义认识论应包含的要素，并指出了它将使哪些当前无法提出的具体问题成为可能。

摘要 (Abstract)

In 1950, Alan Turing proposed replacing the question “Can machines think?” with a behavioral test: if a machine’s outputs are indistinguishable from those of a thinking being, the question of whether it truly thinks can be set aside. This paper argues that Turing’s move was not only a pragmatic simplification but also an epistemological commitment, a decision about what kind of evidence counts as relevant to intelligence attribution, and that this commitment has quietly constrained AI research for seven decades. We trace how Turing’s behavioral epistemology became embedded in the field’s evaluative infrastructure, rendering unaskable a class of questions about process, mechanism, and internal organization that cognitive psychology, neuroscience, and related disciplines learned to ask. We draw a structural parallel to the behaviorist-to-cognitivist transition in psychology: just as psychology’s commitment to studying only observable behavior prevented it from asking productive questions about internal mental processes until that commitment was abandoned, AI’s commitment to behavioral evaluation prevents it from distinguishing between systems that achieve identical outputs through fundamentally different computational processes, a distinction on which intelligence attribution depends. We argue that the field requires an epistemological transition comparable to the cognitive revolution: not an abandonment of behavioral evidence, but a recognition that behavioral evidence alone is insufficient for the construct claims the field wishes to make. We articulate what a post-behaviorist epistemology for AI would involve and identify the specific questions it would make askable that the field currently has no way to ask.

关键词: AI evaluation, behavioral epistemology, cognitive revolution, Turing test, intelligence attribution, computational processes, post-behaviorist epistemology, internal organization

94. ❌ Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening

作者: Chenyu Xue, Yiran Liu, Mian Zhou, Jionglong Su, Zhixiang Lu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05620v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文提出了一种用于语言引导肺部筛查的语义-拓扑图推理（STGR）框架，核心创新在于将大语言模型（LLaMA-3-V）的推理能力与视觉基础模型（MedSAM）的零样本分割能力相结合，并引入了一种选择性非对称微调（SAFT）策略，该策略仅更新少于1%的参数，属于参数高效微调（PEFT）范畴。论文属于大模型在医学影像分析（AI for Science/Bioinformatics）领域的应用研究，因此与’Large Language Models OR LLMs OR Foundation Models’、‘PEFT OR LoRA OR Parameter-efficient Fine-tuning’和’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分）。论文未涉及其他关键词所描述的具体技术（如MoE、量化、RAG、推理加速等）或应用场景（如智能体、工具调用等），因此这些关键词得分为0分。

!!! tip deepseek-chat TL;DR

该研究针对临床自由文本指令驱动的医学图像分割中存在的语义模糊和低对比度扫描中复杂解剖结构重叠问题，提出了一种新颖的语义-拓扑图推理（STGR）框架，通过结合大语言模型的推理和视觉基础模型的零样本分割能力，并采用参数高效微调策略，在LIDC-IDRI和LNDb数据集上实现了最先进的性能，显著提升了分割精度和模型稳定性。

摘要翻译

基于自由文本临床指令驱动的医学图像分割是计算机辅助诊断的关键前沿领域。然而，现有的多模态与基础模型难以处理临床报告的语义模糊性，且无法在低对比度扫描中区分复杂的解剖结构重叠。此外，在有限的医学数据集上对这些庞大架构进行全参数微调总会导致严重的过拟合。为应对这些挑战，我们提出了一种新颖的语义-拓扑图推理框架，用于语言引导的肺部筛查。我们的方法巧妙地将大语言模型的推理能力与视觉基础模型的零样本分割能力相结合。具体而言，我们引入了文本到视觉意图蒸馏模块，以提取精确的诊断指导。为解决解剖结构模糊性问题，我们将掩模选择构建为一个动态图推理问题，其中候选病灶被建模为节点，边则捕捉空间与语义关联性。为确保部署可行性，我们提出了选择性非对称微调策略，该策略仅更新少于1%的模型参数。在LIDC-IDRI和LNDb数据集上进行的严格五折交叉验证表明，我们的框架确立了新的性能标杆。值得注意的是，其在LIDC-IDRI数据集上达到了81.5%的戴斯相似性系数，较LISA等领先的基于大语言模型的工具性能提升超过5%。关键的是，我们的选择性非对称微调策略起到了强大的正则化作用，实现了卓越的跨折稳定性，为鲁棒且具备上下文感知能力的临床部署铺平了道路。

摘要 (Abstract)

Medical image segmentation driven by free-text clinical instructions is a critical frontier in computer-aided diagnosis. However, existing multimodal and foundation models struggle with the semantic ambiguity of clinical reports and fail to disambiguate complex anatomical overlaps in low-contrast scans. Furthermore, fully fine-tuning these massive architectures on limited medical datasets invariably leads to severe overfitting. To address these challenges, we propose a novel Semantic-Topological Graph Reasoning (STGR) framework for language-guided pulmonary screening. Our approach elegantly synergizes the reasoning capabilities of large language models (LLaMA-3-V) with the zero-shot delineation of vision foundation models (MedSAM). Specifically, we introduce a Text-to-Vision Intent Distillation (TVID) module to extract precise diagnostic guidance. To resolve anatomical ambiguity, we formulate mask selection as a dynamic graph reasoning problem, where candidate lesions are modeled as nodes and edges capture spatial and semantic affinities. To ensure deployment feasibility, we introduce a Selective Asymmetric Fine-Tuning (SAFT) strategy that updates less than 1% of the parameters. Rigorous 5-fold cross-validation on the LIDC-IDRI and LNDb datasets demonstrates that our framework establishes a new state-of-the-art. Notably, it achieves an 81.5% Dice Similarity Coefficient (DSC) on LIDC-IDRI, outperforming leading LLM-based tools like LISA by over 5%. Crucially, our SAFT strategy acts as a powerful regularizer, yielding exceptional cross-fold stability (0.6% DSC variance) and paving the way for robust, context-aware clinical deployment.

关键词: Language-guided medical image segmentation, Large language models (LLMs), Vision foundation models, Semantic-topological graph reasoning, Parameter-efficient fine-tuning, Pulmonary screening, Clinical instruction disambiguation, Selective asymmetric fine-tuning (SAFT)

95. ❌ Evaluation of Randomization through Style Transfer for Enhanced Domain Generalization

作者: Dustin Eisenhardt, Timothy Schaumlöffel, Alperen Kantarci, Gemma Roig 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05616v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究计算机视觉领域的风格迁移数据增强方法（StyleMixDG）用于提升领域泛化能力，解决Sim2Real差距问题。所有评分关键词均与大语言模型、深度学习技术原理创新、AI科学应用等主题相关，而本文专注于计算机视觉中的风格迁移和领域泛化，未涉及任何大模型技术、训练方法、推理优化、AI代理或科学AI应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文通过系统实证研究解决了风格迁移数据增强中关于风格池多样性、纹理复杂性和风格来源选择的关键设计矛盾，提出了轻量级模型无关的StyleMixDG方法，在GTAV到真实世界驾驶数据集基准上实现了稳定的领域泛化性能提升。

摘要翻译

用于计算机视觉的深度学习模型在真实世界场景中部署时，常因泛化能力不足而表现不佳，尤其是在使用合成数据训练时，这源于众所周知的仿真到真实（Sim2Real）差距。尽管风格迁移作为领域泛化的一种数据增强策略日益流行，但现有文献在三个关键设计轴上仍存在未解决的矛盾：风格池的多样性、纹理复杂性的作用以及风格源的选择。我们提出了一项系统的实证研究，针对推动场景理解的任务，分离并评估了上述每个因素，从而解决了先前工作中的不一致性。我们的研究结果表明：（i）扩大风格池比使用少量风格进行重复增强能带来更大的性能提升；（ii）当风格池足够大时，纹理复杂性无显著影响；（iii）多样化的艺术风格优于领域对齐的替代方案。基于这些发现，我们提出了StyleMixDG（面向领域泛化的风格混合），这是一种轻量级、模型无关的数据增强方案，无需修改架构或引入额外损失函数。在GTAV $\rightarrow$ {BDD100k, Cityscapes, Mapillary Vistas}基准测试中，StyleMixDG相较于强基线模型展现出持续的性能改进，证实了通过实证确定的设计原则能够转化为实际收益。代码将在GitHub上发布。

摘要 (Abstract)

Deep learning models for computer vision often suffer from poor generalization when deployed in real-world settings, especially when trained on synthetic data due to the well-known Sim2Real gap. Despite the growing popularity of style transfer as a data augmentation strategy for domain generalization, the literature contains unresolved contradictions regarding three key design axes: the diversity of the style pool, the role of texture complexity, and the choice of style source. We present a systematic empirical study that isolates and evaluates each of these factors for driving scene understanding, resolving inconsistencies in prior work. Our findings show that (i) expanding the style pool yields larger gains than repeated augmentation with few styles, (ii) texture complexity has no significant effect when the pool is sufficiently large, and (iii) diverse artistic styles outperform domain-aligned alternatives. Guided by these insights, we derive StyleMixDG (Style-Mixing for Domain Generalization), a lightweight, model-agnostic augmentation recipe that requires no architectural modifications or additional losses. Evaluated on the GTAV $\rightarrow$ {BDD100k, Cityscapes, Mapillary Vistas} benchmark, StyleMixDG demonstrates consistent improvements over strong baselines, confirming that the empirically identified design principles translate into practical gains. The code will be released on GitHub.

关键词: style transfer, domain generalization, data augmentation, Sim2Real gap, computer vision, scene understanding, StyleMixDG, empirical study

96. ❌ INTERACT: An AI-Driven Extended Reality Framework for Accesible Communication Featuring Real-Time Sign Language Interpretation and Emotion Recognition

作者: Nikolaos D. Tantaroudas, Andrew J. McCracken, Ilias Karachalios, Evangelos Papatheou 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05605v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文描述了一个AI驱动的XR平台INTERACT，专注于实时手语翻译、语音转文字、多语言翻译和情感识别，用于提升听障和多语言用户的视频会议可访问性。它使用了Whisper、NLLB、RoBERTa和Google MediaPipe等现有AI模型，但未涉及大模型（LLM）或深度学习技术原理的创新，也未在生物医药等科学领域应用大模型。所有评分关键词均与大模型技术、训练方法、推理优化、代理系统或AI for Science相关，而本文是应用现有AI工具解决特定领域（无障碍通信）问题的工程实现，与这些关键词的核心内容无关。

!!! tip deepseek-chat TL;DR

本文提出了一个AI驱动的扩展现实平台INTERACT，通过集成实时语音转文字、国际手语3D虚拟人渲染、多语言翻译和情感识别技术，解决了视频会议中听障和多语言用户的可访问性问题，并在试点评估中取得了92%的用户满意度和85%以上的转录准确率。

摘要翻译

视频会议已成为专业协作的核心工具，然而大多数平台对聋人、听力障碍及多语言用户的支持仍十分有限。世界卫生组织估计，全球有超过4.3亿人因听力损失需要康复服务，这一数字预计到2050年将超过7亿。传统的无障碍辅助措施受限于高成本、有限可用性和实施障碍，而扩展现实（Extended Reality, XR）技术为沉浸式、包容性沟通开辟了新路径。本文提出INTERACT（包容性翻译与具身实时增强沟通工具），这是一个由人工智能驱动的XR平台，它集成了实时语音转文本、通过三维虚拟形象呈现的国际手语（International Sign Language, ISL）、多语言翻译以及情感识别功能，并置于沉浸式虚拟环境中。该平台基于CORTEX2框架构建，部署于Meta Quest 3头显设备，结合了Whisper（用于语音识别）、NLLB（用于多语言翻译）、RoBERTa（用于情感分类）以及Google MediaPipe（用于手势提取）。试点评估分两个阶段进行：首先面向来自学术界与工业界的技术专家，随后面向聋人社区成员。试验结果显示用户满意度达92%，转录准确率超过85%，情感检测精确度为90%，整体体验平均评分为4.6分（满分5.0），且90%的参与者愿意参与进一步测试。这些结果凸显了该平台在教育、文化及专业场景中推动无障碍服务的巨大潜力。本研究的扩展版本，包括完整的试点数据与实施细节，已作为开放研究欧洲文章发表[Tantaroudas等人，2026a]。

摘要 (Abstract)

Video conferencing has become central to professional collaboration, yet most platforms offer limited support for deaf, hard-of-hearing, and multilingual users. The World Health Organisation estimates that over 430 million people worldwide require rehabilitation for disabling hearing loss, a figure projected to exceed 700 million by 2050. Conventional accessibility measures remain constrained by high costs, limited availability, and logistical barriers, while Extended Reality (XR) technologies open new possibilities for immersive and inclusive communication. This paper presents INTERACT (Inclusive Networking for Translation and Embodied Real-Time Augmented Communication Tool), an AI-driven XR platform that integrates real-time speech-to-text conversion, International Sign Language (ISL) rendering through 3D avatars, multilingual translation, and emotion recognition within an immersive virtual environment. Built on the CORTEX2 framework and deployed on Meta Quest 3 headsets, INTERACT combines Whisper for speech recognition, NLLB for multilingual translation, RoBERTa for emotion classification, and Google MediaPipe for gesture extraction. Pilot evaluations were conducted in two phases, first with technical experts from academia and industry, and subsequently with members of the deaf community. The trials reported 92% user satisfaction, transcription accuracy above 85%, and 90% emotion-detection precision, with a mean overall experience rating of 4.6 out of 5.0 and 90% of participants willing to take part in further testing. The results highlight strong potential for advancing accessibility across educational, cultural, and professional settings. An extended version of this work, including full pilot data and implementation details, has been published as an Open Research Europe article [Tantaroudas et al., 2026a].

关键词: Extended Reality, Sign Language Interpretation, Emotion Recognition, Accessible Communication, Real-time Translation, AI-driven Platform, Virtual Environment, User Experience Evaluation

97. ❌ Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

作者: Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali, Saku Sugawara 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05593v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究LLM作为评估者（LLM-as-a-Judge）的可靠性，发现LLM和人类在信任评估中都受到来源标签的启发式偏见影响。核心相关关键词：1）‘Large Language Models’（10分）- 论文明确研究LLM作为评估者的应用；2）‘Instruction Tuning OR Alignment OR Value Alignment’（8分）- 论文讨论将模型与人类偏好对齐可能传播人类启发式依赖，涉及对齐问题。其他关键词如MoE、SLMs、Scaling Laws、Pre-training、RLHF、RAG、CoT、Agents、Quantization等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该研究发现，当大型语言模型（LLM）作为评估者时，其信任判断会受到来源标签（如人类撰写 vs. AI生成）的偏见影响，这种启发式依赖与人类行为相似，从而对LLM-as-a-Judge评估的有效性提出质疑。

摘要翻译

大型语言模型（LLM）正日益被用作自动化评估工具（LLM-as-a-Judge）。本研究通过揭示LLM的信任判断会受到披露的来源标签影响而存在偏差，从而对其可靠性提出质疑。采用反事实实验设计，我们发现无论是人类还是LLM评估者，都对标注为人类创作的信息比标注为AI生成的相同内容赋予更高信任度。眼动追踪数据显示，人类在判断时高度依赖来源标签作为启发式线索。我们进一步分析了LLM在判断过程中的内部状态：在不同标签条件下，模型对标签区域的注意力分配密度均高于内容区域，且这种标签主导效应在“人类”标签条件下比“AI”标签条件下更为显著，这与人类注视模式一致。此外，通过logits测量的决策不确定性在“AI”标签条件下高于“人类”标签条件。这些结果表明，来源标签对人类和LLM而言都是显著的启发式线索。这引发了针对标签敏感的LLM-as-a-Judge评估的效度担忧，我们审慎指出：将模型与人类偏好对齐可能会将人类的启发式依赖传播至模型中，这促使我们需要开展去偏差的评估与对齐研究。

摘要 (Abstract)

Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge). This work challenges its reliability by showing that trust judgments by LLMs are biased by disclosed source labels. Using a counterfactual design, we find that both humans and LLM judges assign higher trust to information labeled as human-authored than to the same content labeled as AI-generated. Eye-tracking data reveal that humans rely heavily on source labels as heuristic cues for judgments. We analyze LLM internal states during judgment. Across label conditions, models allocate denser attention to the label region than the content region, and this label dominance is stronger under Human labels than AI labels, consistent with the human gaze patterns. Besides, decision uncertainty measured by logits is higher under AI labels than Human labels. These results indicate that the source label is a salient heuristic cue for both humans and LLMs. It raises validity concerns for label-sensitive LLM-as-a-Judge evaluation, and we cautiously raise that aligning models with human preferences may propagate human heuristic reliance into models, motivating debiased evaluation and alignment.

关键词: LLM-as-a-Judge, trust assessment, source label bias, heuristic reliance, human-AI comparison, evaluation reliability, alignment concerns

98. ❌ AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering

作者: N. D. Tantaroudas, A. J. McCracken, I. Karachalios, E. Papatheou 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05591v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要描述了一个集成多种现成AI服务（如Whisper、NLLB、Polly、RoBERTa、flan-t5、MediaPipe）的模块化平台，用于XR环境中的多语言教育辅助。虽然涉及AI应用，但所有关键词均聚焦于大模型技术原理、训练方法、推理优化、对齐技术等底层创新，而本文仅使用现有模型作为服务组件，未涉及任何模型架构、训练、优化或原理层面的创新研究，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该研究开发了一个集成多种现成AI服务的模块化平台，用于在扩展现实环境中实现可访问的多语言教育，并通过技术评估验证了平台在实时XR部署中的可行性。

摘要翻译

本研究提出一个模块化平台，该平台整合了六项人工智能服务：基于OpenAI Whisper的自动语音识别、通过Meta NLLB实现的多语言翻译、利用AWS Polly的语音合成、采用RoBERTa的情感分类、基于flan t5 base samsum的对话摘要，以及借助Google MediaPipe的国际手语（International Sign, IS）渲染。平台处理了国际手语手势录制语料库，以提取手部关键点坐标，并将其映射到虚拟现实（VR）环境中的三维虚拟形象动画上。验证工作包括对各人工智能组件的技术基准测试，其中涵盖语音合成服务商的对比评估以及多语言翻译模型（NLLB 200与EuroLLM 1.7B变体）的比较分析。技术评估证实了该平台适用于实时扩展现实（XR）场景部署。语音合成基准测试表明，AWS Polly在保持价格竞争力的同时实现了最低延迟。EuroLLM 1.7B Instruct变体获得了更高的BLEU分数，表现优于NLLB模型。这些发现证实了在XR环境中协调跨模态人工智能服务以提供无障碍多语言教学方案的可行性。模块化设计允许针对不同教育场景进行独立扩展与适配，为符合欧盟数字无障碍目标的公平学习解决方案奠定了基础。

摘要 (Abstract)

This work introduces a modular platform that brings together six AI services, automatic speech recognition via OpenAI Whisper, multilingual translation through Meta NLLB, speech synthesis using AWS Polly, emotion classification with RoBERTa, dialogue summarisation via flan t5 base samsum, and International Sign (IS) rendering through Google MediaPipe. A corpus of IS gesture recordings was processed to derive hand landmark coordinates, which were subsequently mapped onto three dimensional avatar animations inside a virtual reality (VR) environment. Validation comprised technical benchmarking of each AI component, including comparative assessments of speech synthesis providers and multilingual translation models (NLLB 200 and EuroLLM 1.7B variants). Technical evaluations confirmed the suitability of the platform for real time XR deployment. Speech synthesis benchmarking established that AWS Polly delivers the lowest latency at a competitive price point. The EuroLLM 1.7B Instruct variant attained a higher BLEU score, surpassing NLLB. These findings establish the viability of orchestrating cross modal AI services within XR settings for accessible, multilingual language instruction. The modular design permits independent scaling and adaptation to varied educational contexts, providing a foundation for equitable learning solutions aligned with European Union digital accessibility goals.

关键词: AI services, multilingual education, extended reality, speech processing, sign language rendering, modular platform, real-time deployment, accessibility

99. ❌ Foundations for Agentic AI Investigations from the Forensic Analysis of OpenClaw

作者: Jan Gruber, Jan-Niclas Hilgert 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05589v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究Agentic AI系统的数字取证方法，重点关注OpenClaw单智能体助手的内部状态和行动重建。与关键词的相关性分析如下：1）论文明确提到LLM是智能体系统的核心组件，因此"Large Language Models"相关度为8分；2）论文研究智能体工作流程和工具使用，“LLM Agents"是核心主题，相关度10分，“Tool Use"相关度8分；3）其他关键词如MoE、量化、推理加速等均未在论文中涉及，相关度为0分。

!!! tip deepseek-chat TL;DR

该论文研究了OpenClaw智能体助手的数字取证方法，通过静态代码分析和差异取证分析识别可恢复的痕迹，提出了智能体工件分类法，并揭示了LLM驱动的智能体系统在取证中面临的抽象层和非确定性挑战。

摘要翻译

代理式人工智能系统正日益作为个人助手被部署，并可能成为数字调查的常见对象。然而，关于如何在取证分析中重建其内部状态与行为，目前知之甚少。尽管这类系统日益普及，但针对它们的系统性取证方法在很大程度上仍未得到探索。本文对广泛使用的单智能体助手OpenClaw进行了一项实证研究。我们通过静态代码分析考察了OpenClaw的技术设计，并应用差分取证分析来识别智能体交互循环各阶段中可恢复的痕迹。我们对这些痕迹进行分类与关联，以系统化评估其调查价值。基于这些观察，我们提出了一个智能体数字痕迹分类法，用以捕捉反复出现的调查模式。最后，我们指出了代理式人工智能取证的一个基础性挑战：智能体介导的执行在痕迹生成中引入了额外的抽象层和显著的非确定性。大语言模型（LLM）、执行环境以及不断演变的上下文，都可能以基于规则的软件中通常不存在的方式影响工具选择与状态转换。总体而言，我们的研究结果为代理式人工智能的系统性调查提供了初步基础，并概述了对数字取证实践及未来研究的启示。

摘要 (Abstract)

Agentic Al systems are increasingly deployed as personal assistants and are likely to become a common object of digital investigations. However, little is known about how their internal state and actions can be reconstructed during forensic analysis. Despite growing popularity, systematic forensic approaches for such systems remain largely unexplored. This paper presents an empirical study of OpenClaw a widely used single-agent assistant. We examine OpenClaw’s technical design via static code analysis and apply differential forensic analysis to identify recoverable traces across stages of the agent interaction loop. We classify and correlate these traces to assess their investigative value in a systematic way. Based on these observations, we propose an agent artifact taxonomy that captures recurring investigative patterns. Finally, we highlight a foundational challenge for agentic Al forensics: agent-mediated execution introduces an additional layer of abstraction and substantial nondeterminism in trace generation. The large language model (LLM), the execution environment, and the evolving context can influence tool choice and state transitions in ways that are largely absent from rule-based software. Overall, our results provide an initial foundation for the systematic investigation of agentic Al and outline implications for digital forensic practice and future research.

关键词: Agentic AI, Digital Forensics, OpenClaw, LLM, Tool Use, Forensic Analysis, Agent Artifacts, Non-determinism

100. ❌ Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue

作者: Junan Hu, Shudan Guo, Wenqi Liu, Jianhua Yin, Yinwei Wei 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05552v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出Context-Agent框架，通过动态树结构管理多轮对话历史，解决LLMs在非线性对话中的挑战。核心相关关键词：1) ‘Large Language Models’ (10分)：论文明确研究LLMs在对话任务中的性能提升；2) ‘LLM Agents’ (10分)：Context-Agent本身就是一种代理框架，用于管理对话流程；3) ‘Context Window Extension’ (5分)：通过结构化上下文管理提高长对话效率；4) ‘In-context Learning’ (5分)：涉及对话历史的结构化表示和学习。其他关键词如MoE、SFT、RAG等未在论文中涉及，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型在非线性对话中上下文管理效率低的问题，提出了基于动态树结构的Context-Agent框架，实验证明其能提高任务完成率和token效率。

摘要翻译

大型语言模型在众多语言任务中展现出卓越性能，但在处理人类对话的非线性流动方面仍面临根本性挑战。当前主流方法将对话历史视为扁平化的线性序列，这与自然话语本质上具有的层次化、分支化结构不相符，导致在涉及话题转换或指令细化的长程交互中出现上下文利用效率低下和连贯性丧失的问题。为突破这一局限，我们提出了Context-Agent框架，该框架将多轮对话历史建模为动态树状结构。这种方法映射了对话内在的非线性特征，使模型能够维护并导航对应于不同话题的多个对话分支。此外，为支持系统性评估，我们构建了非线性任务多轮对话基准数据集，专门用于评估模型在长程非线性场景下的性能。实验表明，Context-Agent在不同大型语言模型中均能提升任务完成率并改善token使用效率，印证了结构化上下文管理对复杂动态对话的重要价值。数据集与代码已发布于GitHub平台。

摘要 (Abstract)

Large Language Models demonstrate outstanding performance in many language tasks but still face fundamental challenges in managing the non-linear flow of human conversation. The prevalent approach of treating dialogue history as a flat, linear sequence is misaligned with the intrinsically hierarchical and branching structure of natural discourse, leading to inefficient context utilization and a loss of coherence during extended interactions involving topic shifts or instruction refinements. To address this limitation, we introduce Context-Agent, a novel framework that models multi-turn dialogue history as a dynamic tree structure. This approach mirrors the inherent non-linearity of conversation, enabling the model to maintain and navigate multiple dialogue branches corresponding to different topics. Furthermore, to facilitate robust evaluation, we introduce the Non-linear Task Multi-turn Dialogue (NTM) benchmark, specifically designed to assess model performance in long-horizon, non-linear scenarios. Our experiments demonstrate that Context-Agent enhances task completion rates and improves token efficiency across various LLMs, underscoring the value of structured context management for complex, dynamic dialogues. The dataset and code is available at GitHub.

关键词: Large Language Models, Dialogue Management, Dynamic Tree Structure, Non-linear Dialogue, Context-Agent, Multi-turn Dialogue, Context Utilization, NTM Benchmark

101. ❌ ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation

作者: Zhe Zhao, Haibin Wen, Jiaming Ma, Jiachang Zhan, Tianyi Xu, Ye Wei, Qingfu Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05587v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文提出ResearchEVO框架，包含Evolution Phase（LLM引导的算法进化）和Writing Phase（RAG生成论文），高度相关关键词：1）LLMs（权重1.0，评分10.0）— 核心使用LLM引导算法进化；2）RAG（权重1.0，评分10.0）— Writing Phase使用RAG生成论文；3）Hallucination Mitigation（权重1.0，评分10.0）— 明确提及反幻觉验证；4）AI for Science（权重1.0，评分10.0）— 应用于量子纠错和物理神经网络等科学问题；5）LLM Agents（权重1.0，评分5.0）— 框架体现自主代理工作流。其他关键词未涉及或仅边缘相关，评分为0。

!!! tip deepseek-chat TL;DR

该论文提出了ResearchEVO框架，通过LLM引导的算法进化和RAG驱动的论文生成，实现了从科学发现到文档撰写的端到端自动化，并在量子纠错和物理神经网络等跨学科问题上验证了其有效性。

摘要翻译

科学突破中一个重要的反复出现的模式是一个两阶段过程：首先是无定向实验的初始阶段，产生意外发现；随后是回顾性阶段，解释该发现为何有效并将其置于现有理论框架中。我们提出了ResearchEVO，这是一个端到端的框架，通过计算实例化这种“先发现后解释”的范式。在演化阶段，该系统采用大语言模型（LLM）引导的双维度协同演化——同时优化算法逻辑与整体架构——仅依据适应度在代码实现空间中搜索，无需理解其产生的解决方案。在写作阶段，系统随后选取性能最佳的算法，通过句子级检索增强生成（RAG），结合显式的抗幻觉验证与自动化实验设计，自主生成完整、可直接投稿的研究论文。据我们所知，ResearchEVO是首个覆盖此完整端到端流程的系统：此前未有工作能联合执行原则性的算法演化与基于文献的科学文档撰写。我们在两个跨学科科学问题上验证了该框架——使用真实谷歌量子硬件数据的量子纠错（Quantum Error Correction）以及物理信息神经网络（Physics-Informed Neural Networks）——在演化阶段，系统发现了各自领域文献中未曾提出过的、人类可解读的算法机制。在这两个案例中，写作阶段均自主生成了可编译的LaTeX文稿，通过检索增强生成正确地将这些无先验指导的发现扎根于现有理论，且未产生任何捏造的引用。

摘要 (Abstract)

An important recurring pattern in scientific breakthroughs is a two-stage process: an initial phase of undirected experimentation that yields an unexpected finding, followed by a retrospective phase that explains why the finding works and situates it within existing theory. We present ResearchEVO, an end-to-end framework that computationally instantiates this discover-then-explain paradigm. The Evolution Phase employs LLM-guided bi-dimensional co-evolution – simultaneously optimizing both algorithmic logic and overall architecture – to search the space of code implementations purely by fitness, without requiring any understanding of the solutions it produces. The Writing Phase then takes the best-performing algorithm and autonomously generates a complete, publication-ready research paper through sentence-level retrieval-augmented generation with explicit anti-hallucination verification and automated experiment design. To our knowledge, ResearchEVO is the first system to cover this full pipeline end to end: no prior work jointly performs principled algorithm evolution and literature-grounded scientific documentation. We validate the framework on two cross-disciplinary scientific problems – Quantum Error Correction using real Google quantum hardware data, and Physics-Informed Neural Networks – where the Evolution Phase discovered human-interpretable algorithmic mechanisms that had not been previously proposed in the respective domain literatures. In both cases, the Writing Phase autonomously produced compilable LaTeX manuscripts that correctly grounded these blind discoveries in existing theory via RAG, with zero fabricated citations.

关键词: ResearchEVO, LLM-guided evolution, Retrieval-Augmented Generation, scientific discovery, automated documentation, anti-hallucination verification, Quantum Error Correction, Physics-Informed Neural Networks

102. ❌ COSMO-Agent: Tool-Augmented Agent for Closed-loop Optimization,Simulation,and Modeling Orchestration

作者: Liyuan Deng, Shujian Deng, Yongkang Chen, Yongkang Dai, Zhihang Zhong, Linyang Li, Xiao Sun, Yilei Shi, Huaxi Huang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05547v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心是提出COSMO-Agent框架，使用强化学习训练LLM（特别是小型开源LLM）来协调外部工具完成闭环CAD-CAE工业设计优化。高度相关的关键词包括：LLMs（框架核心）、SLMs（实验改进小型模型）、LLM Agents（框架本质是智能体）、Tool Use（协调外部工具）。AI for Science得5分，因为论文涉及工业设计优化，属于科学应用领域但非生物/化学信息学。其他关键词如MoE、Scaling Laws、训练方法、推理优化等均未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对工业设计中的CAD-CAE语义鸿沟问题，提出了COSMO-Agent框架，通过强化学习训练LLM协调工具完成闭环优化，实验表明该训练显著提升了小型开源LLM在约束驱动设计中的性能。

摘要翻译

迭代式工业设计-仿真优化的瓶颈在于CAD-CAE语义鸿沟：如何在多样化、耦合的约束条件下将仿真反馈转化为有效的几何编辑。为填补这一鸿沟，我们提出COSMO-Agent（闭环优化、仿真与建模协同框架），这是一个工具增强的强化学习框架，通过训练大语言模型完成闭环CAD-CAE流程。具体而言，我们将CAD生成、CAE求解、结果解析与几何修正构建为交互式强化学习环境，使大语言模型学习协调外部工具并修改参数化几何体，直至满足所有约束。为实现稳定且适用于工业场景的学习过程，我们设计了多约束奖励机制，同步优化可行性、工具链鲁棒性与结构化输出有效性。此外，我们构建了一个涵盖25个零部件类别、包含可执行CAD-CAE任务的工业级数据集，以支持真实场景的训练与评估。实验表明，COSMO-Agent训练显著提升了小型开源大语言模型在约束驱动设计中的表现，在可行性、效率与稳定性方面均超越大型开源模型及主流闭源模型。

摘要 (Abstract)

Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.

关键词: COSMO-Agent, LLM agents, tool-augmented, reinforcement learning, CAD-CAE optimization, closed-loop design, small language models, industrial design

103. ❌ From Large Language Model Predicates to Logic Tensor Networks: Neurosymbolic Offer Validation in Regulated Procurement

作者: Cedric Haufe, Frieder Stolzenburg 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05539v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出了一种结合语言模型和逻辑张量网络（LTN）的神经符号方法，用于验证受监管公共机构的投标文件。论文与’Large Language Models’相关（8分），因为使用了语言模型进行信息提取；与’Hallucination Mitigation OR Factuality OR Truthfulness’有一定关联（5分），因为关注决策的事实正确性和可验证性；与’Mechanistic Interpretability OR Explainable AI’高度相关（10分），因为核心优势在于可解释性、模块化谓词提取和对XAI的明确支持。其他关键词与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该研究提出了一种神经符号方法，结合语言模型和逻辑张量网络来验证受监管公共机构的投标文件，在保持性能的同时实现了可解释的决策过程。

摘要翻译

本文提出一种神经符号方法（即结合符号与亚符号人工智能），用于规范公共机构中招标文件的合规性验证。我们采用语言模型提取信息，随后通过逻辑张量网络进行聚合以生成可审计的决策。在受监管的公共机构中，决策必须同时满足事实准确性与法律可验证性。我们的神经符号方法能够将特定领域的既有知识与语言模型的语义文本理解相连接。该流程产生的决策可通过谓词取值、规则真值及对应文本片段进行解释，从而支持基于真实招标文件语料库的规则核查。在真实语料库上的实验表明，所提流程在性能上与现有模型相当，其核心优势在于可解释性、模块化谓词提取以及对可解释人工智能的显式支持。

摘要 (Abstract)

We present a neurosymbolic approach, i.e., combining symbolic and subsymbolic artificial intelligence, to validating offer documents in regulated public institutions. We employ a language model to extract information and then aggregate with an LTN (Logic Tensor Network) to make an auditable decision. In regulated public institutions, decisions must be made in a manner that is both factually correct and legally verifiable. Our neurosymbolic approach allows existing domain-specific knowledge to be linked to the semantic text understanding of language models. The decisions resulting from our pipeline can be justified by predicate values, rule truth values, and corresponding text passages, which enables rule checking based on a real corpus of offer documents. Our experiments on a real corpus show that the proposed pipeline achieves performance comparable to existing models, while its key advantage lies in its interpretability, modular predicate extraction, and explicit support for XAI (Explainable AI).

关键词: neurosymbolic approach, language model, Logic Tensor Network, offer validation, regulated procurement, interpretability, Explainable AI, auditable decision

104. ❌ A canonical generalization of OBDD

作者: Florent Capelli, YooJung Choi, Stefan Mengel, Martín Muñoz, Guy Van den Broeck 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05537v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是布尔函数的表示模型（Tree Decision Diagrams，TDD），作为OBDD的泛化，属于形式化方法、逻辑电路和知识编译领域。论文内容完全聚焦于理论计算机科学中的数据结构、算法复杂性和编译复杂性分析，与所有评分关键词（均涉及大模型、深度学习、AI应用、训练技术、推理优化、AI对齐等现代AI主题）无任何关联。论文未提及任何机器学习、深度学习或大语言模型相关内容，也未涉及任何科学领域的AI应用。

!!! tip deepseek-chat TL;DR

该论文提出了一种称为树决策图（TDD）的新布尔函数表示模型，它泛化了OBDD，具有与OBDD相同的可处理性（如模型计数、枚举），但更简洁，并证明了树宽为k的CNF公式可以用FPT大小的TDD表示，而OBDD无法做到这一点。

摘要翻译

我们引入树决策图（Tree Decision Diagrams，TDD）作为布尔函数的一种模型，它推广了有序二叉决策图（OBDD）。TDD可视为结构化d-DNNF的一种限制形式，即遵循虚树$T$的d-DNNF。我们证明TDD具备与OBDD相同的易处理性质，如模型计数、枚举、条件化及应用操作，且表示更为简洁。特别地，我们证明树宽为$k$的合取范式（CNF）公式可由固定参数可处理（FPT）规模的TDD表示，而这在OBDD中已知是不可能的。我们研究了通过自底向上编译将CNF公式转化为确定性TDD的复杂度，并将此方法的复杂度与Bova和Szeider提出的因子宽度概念联系起来。

摘要 (Abstract)

We introduce Tree Decision Diagrams (TDD) as a model for Boolean functions that generalizes OBDD. They can be seen as a restriction of structured d-DNNF; that is, d-DNNF that respect a vtree $T$. We show that TDDs enjoy the same tractability properties as OBDD, such as model counting, enumeration, conditioning, and apply, and are more succinct. In particular, we show that CNF formulas of treewidth $k$ can be represented by TDDs of FPT size, which is known to be impossible for OBDD. We study the complexity of compiling CNF formulas into deterministic TDDs via bottom-up compilation and relate the complexity of this approach with the notion of factor width introduced by Bova and Szeider.

关键词: Tree Decision Diagrams, TDD, OBDD, Boolean functions, d-DNNF, treewidth, CNF compilation, factor width

105. ❌ Experience Transfer for Multimodal LLM Agents in Minecraft Game

作者: Chenghao Li, Jun Liu, Songbo Zhang, Huadong Jian, Hao Ni, Lik-Hang Lee, Sung-Ho Bae, Guoqing Wang, Yang Yang, Chaoning Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05533v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出Echo框架，用于多模态LLM代理在Minecraft游戏中的经验迁移，核心涉及LLM代理（高度相关）、In-context Learning（高度相关，通过ICAL实现经验检索和适应），其他关键词如MoE、SFT、RAG等未在摘要中提及，因此评分为0。

!!! tip deepseek-chat TL;DR

该研究提出Echo框架，通过分解可重用知识和In-Context Analogy Learning，使多模态LLM代理在Minecraft中能迁移过去经验解决新任务，实验显示在对象解锁任务上实现1.3x到1.7x加速。

摘要翻译

在复杂游戏环境中运行的多模态大语言模型智能体必须持续复用过往经验以高效解决新任务。本研究提出Echo——一个面向迁移的记忆框架，使智能体能够从先前的交互中提取可操作的知识，而非将记忆视为静态记录的被动存储库。为实现显式迁移，Echo将可复用知识解构为五个维度：结构、属性、过程、功能与交互。该框架使智能体能够识别跨任务共享的重复模式，并推断哪些先验经验在新情境中仍然适用。基于此框架，Echo利用上下文类比学习（In-Context Analogy Learning, ICAL）检索相关经验，并通过上下文示例将其适配到未见任务中。在《我的世界》（Minecraft）环境中的实验表明，在从零开始学习（from-scratch learning）设定下，Echo在物体解锁任务上实现了1.3倍至1.7倍的加速。此外，Echo展现出爆发式链式解锁现象：在获得可迁移经验后的短时间内，能快速解锁多个相似物品。这些结果表明，经验迁移是提升多模态大语言模型智能体在复杂交互环境中效率与适应性的重要方向。

摘要 (Abstract)

Multimodal LLM agents operating in complex game environments must continually reuse past experience to solve new tasks efficiently. In this work, we propose Echo, a transfer-oriented memory framework that enables agents to derive actionable knowledge from prior interactions rather than treating memory as a passive repository of static records. To make transfer explicit, Echo decomposes reusable knowledge into five dimensions: structure, attribute, process, function, and interaction. This formulation allows the agent to identify recurring patterns shared across different tasks and infer what prior experience remains applicable in new situations. Building on this formulation, Echo leverages In-Context Analogy Learning (ICAL) to retrieve relevant experiences and adapt them to unseen tasks through contextual examples. Experiments in Minecraft show that, under a from-scratch learning setting, Echo achieves a 1.3x to 1.7x speed-up on object-unlocking tasks. Moreover, Echo exhibits a burst-like chain-unlocking phenomenon, rapidly unlocking multiple similar items within a short time interval after acquiring transferable experience. These results suggest that experience transfer is a promising direction for improving the efficiency and adaptability of multimodal LLM agents in complex interactive environments.

关键词: Multimodal LLM Agents, Experience Transfer, Memory Framework, In-Context Analogy Learning, Minecraft, Task Efficiency, Transferable Knowledge, Interactive Environments

106. ❌ Inventory of the 12 007 Low-Dimensional Pseudo-Boolean Landscapes Invariant to Rank, Translation, and Rotation

作者: Arnaud Liefooghe, Sébastien Verel 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05530v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究伪布尔函数的秩不变性、平移和旋转对称性下的不变景观分类，属于组合优化和理论计算机科学领域，与所有评分关键词（均聚焦于大模型、深度学习、AI应用和技术原理）完全无关。论文未涉及任何大模型、深度学习、AI技术或科学AI应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文研究了伪布尔函数在秩、平移和旋转不变性下的不变景观分类问题，通过分析维度1、2和3的伪布尔函数，构建了包含12,007个不变景观类的详尽清单，揭示了非单射函数比单射函数产生更多不变类，并探讨了景观拓扑性质与算法行为（如欺骗性、中立性和爬山策略性能）之间的复杂关系。

摘要翻译

许多随机优化算法具有排序不变性，其仅依赖于解的相对顺序而非绝对适应度值。我们提出一种更强的排序景观不变性概念：若两个问题不仅排序相同，其邻域结构及对称性（平移与旋转）也诱导出完全一致的景观，则视其为等价问题。这促使我们研究排序景观本身而非单个函数。尽管先前研究仅单独分析了单射函数类的排序特性，我们首次对维度为1、2、3的伪布尔函数（包含非单射情形）的不变景观类进行了详尽枚举。分析共揭示12,007个类别，相较于仅考虑排序不变性实现了显著缩减。研究发现，非单射函数产生的不可变景观类远多于单射函数。此外，研究还揭示了拓扑景观特性与算法行为之间复杂的交互关系，特别是在欺骗性、中立性以及爬山策略性能方面。该枚举结果可作为教学资源与基准测试设计的参考，为构建具有可控难度的更大规模问题奠定基础，并推动我们对景观难度与算法性能的理解。

摘要 (Abstract)

Many randomized optimization algorithms are rank-invariant, relying solely on the relative ordering of solutions rather than absolute fitness values. We introduce a stronger notion of rank landscape invariance: two problems are equivalent if their ranking, but also their neighborhood structure and symmetries (translation and rotation), induce identical landscapes. This motivates the study of rank landscapes rather than individual functions. While prior work analyzed the rankings of injective function classes in isolation, we provide an exhaustive inventory of the invariant landscape classes for pseudo-Boolean functions of dimensions 1, 2, and 3, including non-injective cases. Our analysis reveals 12,007 classes in total, a significant reduction compared to rank-invariance alone. We find that non-injective functions yield far more invariant landscape classes than injective ones. In addition, complex combinations of topological landscape properties and algorithm behaviors emerge, particularly regarding deceptiveness, neutrality, and the performance of hill-climbing strategies. The inventory serves as a resource for pedagogical purposes and benchmark design, offering a foundation for constructing larger problems with controlled hardness and advancing our understanding of landscape difficulty and algorithm performance.

关键词: pseudo-Boolean functions, rank invariance, landscape classification, translation symmetry, rotation symmetry, optimization algorithms, neutrality, hill-climbing strategies

107. ❌ ActivityEditor: Learning to Synthesize Physically Valid Human Mobility

作者: Chenjie Yang, Yutian Jiang, Anqi Liang, Wei Qi, Chenyu Wu, Junbo Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05529v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文提出ActivityEditor，一个基于双LLM代理的框架，用于零样本跨区域轨迹生成。核心创新在于使用LLM代理（高度相关’Large Language Models’和’LLM Agents’）和双代理协作（高度相关’Multi-agent Systems’），并通过强化学习（相关’RLHF’）训练代理内部化移动规律。研究属于AI在城市科学中的应用（有一定相关’AI for Science’）。其他关键词如MoE、SFT、RAG等未涉及，评分为0。

!!! tip deepseek-chat TL;DR

该论文提出ActivityEditor，一个双LLM代理框架，通过分解意图生成和轨迹编辑任务，并利用强化学习确保物理有效性，实现了零样本跨区域的高保真人类移动轨迹生成。

摘要翻译

人类移动建模对于多样化城市应用至关重要。然而，现有数据驱动方法常受数据稀缺性制约，限制了其在历史轨迹数据缺失或受限区域的适用性。为弥补这一空白，我们提出\textbf{ActivityEditor}——一种专为零样本跨区域轨迹生成设计的新型双LLM智能体框架。该框架将复杂的合成任务分解为两个协同阶段：首先，基于意图的智能体利用人口统计学驱动的先验知识，生成结构化的人类意图与粗粒度活动链，以确保高层次的社会语义连贯性；随后，编辑智能体通过迭代修正对这些输出进行精细化处理，在强化人类移动规律约束下生成移动轨迹。此能力通过基于现实物理约束的多奖励强化学习获得，使智能体内化移动规律并确保高保真轨迹生成。大量实验表明，\textbf{ActivityEditor}在不同城市语境间迁移时展现出卓越的零样本性能，同时保持高统计保真度与物理有效性，为数据稀缺场景下的移动模拟提供了强健且高度可泛化的解决方案。代码已发布于：https://anonymous.4open.science/r/ActivityEditor-066B。

摘要 (Abstract)

Human mobility modeling is indispensable for diverse urban applications. However, existing data-driven methods often suffer from data scarcity, limiting their applicability in regions where historical trajectories are unavailable or restricted. To bridge this gap, we propose \textbf{ActivityEditor}, a novel dual-LLM-agent framework designed for zero-shot cross-regional trajectory generation. Our framework decomposes the complex synthesis task into two collaborative stages. Specifically, an intention-based agent, which leverages demographic-driven priors to generate structured human intentions and coarse activity chains to ensure high-level socio-semantic coherence. These outputs are then refined by editor agent to obtain mobility trajectories through iteratively revisions that enforces human mobility law. This capability is acquired through reinforcement learning with multiple rewards grounded in real-world physical constraints, allowing the agent to internalize mobility regularities and ensure high-fidelity trajectory generation. Extensive experiments demonstrate that \textbf{ActivityEditor} achieves superior zero-shot performance when transferred across diverse urban contexts. It maintains high statistical fidelity and physical validity, providing a robust and highly generalizable solution for mobility simulation in data-scarce scenarios. Our code is available at: https://anonymous.4open.science/r/ActivityEditor-066B.

关键词: human mobility modeling, dual-LLM-agent framework, zero-shot cross-regional trajectory generation, reinforcement learning, physical constraints, trajectory synthesis, urban applications, data scarcity

108. ❌ Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck

作者: Zhetao Hu, Yiquan Zhou, Wenyu Wang, Zhiyu Wu, Xin Gao, Jihua Zhu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05526v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于歌唱声音转换（SVC）任务，提出了一种新颖的歌唱风格转换系统，涉及边界感知信息瓶颈、帧级技术矩阵和高频带补全策略等技术。虽然属于AI应用领域，但所有给定的关键词均与大语言模型（LLM）、深度学习技术原理、科学AI应用（如生物信息学）等主题相关，而本文研究的是音频信号处理和生成式AI在特定音频任务中的应用，与关键词列表中的任何主题均无直接关联。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种用于歌唱声音转换的边界感知信息瓶颈系统，解决了风格泄漏、动态渲染和数据稀缺问题，在SVCC2025比赛中取得了最佳自然度性能。

摘要翻译

本文介绍了S4团队为2025年歌声转换挑战赛（SVCC2025）提交的系统——一种新颖的歌唱风格转换系统，该系统在领域内设置中实现了细粒度风格转换与控制的进阶。为应对风格泄露、动态表现力渲染以及在有限数据下实现高保真生成等关键挑战，我们引入了三项核心创新：一种边界感知的Whisper瓶颈层，通过聚合音素跨度表征来抑制残留源风格，同时保留语言内容；一种显式的帧级技巧矩阵，在推理阶段通过针对性的基频（F0）处理进行增强，以实现稳定且鲜明的动态风格渲染；以及一种基于感知的高频段补全策略，该策略利用一个辅助的标准48kHz歌声转换模型来增强高频频谱，从而在不导致过拟合的情况下克服数据稀缺问题。在官方SVCC2025主观评估中，尽管使用的额外歌唱数据量显著少于其他顶尖系统，我们的系统在所有提交方案中取得了最佳的自然度表现，同时在说话人相似度和技巧控制方面保持了有竞争力的结果。音频样本已在线提供。

摘要 (Abstract)

This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025)-a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In the official SVCC2025 subjective evaluation, our system achieves the best naturalness performance among all submissions while maintaining competitive results in speaker similarity and technique control, despite using significantly less extra singing data than other top-performing systems. Audio samples are available online.

关键词: Singing Voice Conversion, Style Conversion, Information Bottleneck, Boundary-aware, High-fidelity Generation, Dynamic Rendering, SVCC2025, Audio Processing

109. ❌ Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition

作者: Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Xiongkuo Min, Guangtao Zhai 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05523v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是评估LLMs在经济学任务中的能力，通过构建多智能体供应链经济模型，让LLMs作为零售商代理进行采购和零售。因此，与’Large Language Models’和’LLM Agents’、‘Multi-agent Systems’高度相关（10分）。论文涉及LLMs在竞争市场中的决策，需要推理能力，与’Chain of Thought’和’System 2 Thinking’有一定关联（5分）。其他关键词如MoE、SFT、RAG等未在摘要中提及，与论文主题无关（0分）。

!!! tip deepseek-chat TL;DR

该论文研究了大型语言模型在经济和贸易竞争中的能力，通过构建Market-Bench基准测试发现，在20个开源和闭源LLM代理中，只有少数能实现资本增值，而许多模型仅能维持盈亏平衡，揭示了显著的性能差异和赢家通吃现象。

摘要翻译

大型语言模型（LLM）管理与获取经济资源的能力尚不明确。本文提出 Market-Bench，这是一个通过经济与贸易竞争评估LLM在经济相关任务中能力的综合性基准。具体而言，我们构建了一个可配置的多智能体供应链经济模型，其中LLM作为零售商智能体，负责商品的采购与销售。在采购阶段，LLM在预算受限的拍卖中对有限库存进行竞价；在零售阶段，LLM设定零售价格、生成营销口号，并通过基于角色的注意力机制将其提供给买家进行购买。Market-Bench完整记录了出价、价格、口号、销售额及资产负债表状态的全过程轨迹，从而支持使用经济、运营和语义指标进行自动评估。对20个开源与闭源LLM智能体的基准测试揭示了显著的性能差异以及赢家通吃现象，即仅有少数LLM零售商能够持续实现资本增值，而许多模型尽管在语义匹配得分上相近，却仅能徘徊于盈亏平衡点附近。Market-Bench为研究LLM如何在竞争性市场中互动提供了一个可复现的测试平台。

摘要 (Abstract)

The ability of large language models (LLMs) to manage and acquire economic resources remains unclear. In this paper, we introduce \textbf{Market-Bench}, a comprehensive benchmark that evaluates the capabilities of LLMs in economically-relevant tasks through economic and trade competition. Specifically, we construct a configurable multi-agent supply chain economic model where LLMs act as retailer agents responsible for procuring and retailing merchandise. In the \textbf{procurement} stage, LLMs bid for limited inventory in budget-constrained auctions. In the \textbf{retail} stage, LLMs set retail prices, generate marketing slogans, and provide them to buyers through a role-based attention mechanism for purchase. Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics. Benchmarking on 20 open- and closed-source LLM agents reveals significant performance disparities and winner-take-most phenomenon, \textit{i.e.}, only a small subset of LLM retailers can consistently achieve capital appreciation, while many hover around the break-even point despite similar semantic matching scores. Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.

关键词: Large Language Models, LLMs, economic benchmark, multi-agent systems, supply chain model, agent coordination, Market-Bench, trade competition

110. ❌ Learned Elevation Models as a Lightweight Alternative to LiDAR for Radio Environment Map Estimation

作者: Ljupcho Milosheski, Fedja Močnik, Mihael Mohorčič, Carolina Fortuna 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05520v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究无线通信网络中的无线电环境地图（REM）估计，提出了一种从卫星RGB图像学习高程模型的两阶段深度学习框架，以替代昂贵的LiDAR数据。论文的核心是计算机视觉和深度学习在通信工程领域的应用，属于AI for Science的范畴（关键词26），因此给予5分（有一定关联）。然而，论文完全不涉及大语言模型（LLM）、模型训练技术（如MoE、SFT、RLHF）、推理优化（如RAG、量化）、智能体系统或任何其他列出的LLM相关关键词，因此这些关键词均得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于深度学习的两阶段框架，通过从卫星图像学习高程模型来估计无线电环境地图，无需昂贵的3D LiDAR数据，在保持相同输入特征空间的情况下将RMSE提升了最高7.8%。

摘要翻译

下一代无线系统（如6G）工作于更高频段，使得信号传播对建筑物和植被等环境因素高度敏感。因此，精确的无线电环境地图（Radio Environment Map, REM）估计对于有效的网络规划与运营日益重要。现有方法——从射线追踪模拟器到深度学习生成模型——虽取得了良好效果，但需要详细的3D环境数据（如激光雷达点云），这些数据获取成本高昂（每平方公里可达数GB大小），且在动态环境中极易过时。我们提出一个两阶段框架，在推理阶段无需3D数据：第一阶段，通过学习型估计器直接从卫星RGB图像预测高程地图；第二阶段，将这些高程地图与天线参数共同输入REM估计器。在现有基于卷积神经网络（CNN）的REM估计架构中，该方法在相同输入特征空间下，比仅使用图像的基线模型将均方根误差（RMSE）降低了最高达7.8%，且推理过程中无需3D数据，为可扩展的无线电环境建模提供了一种实用的替代方案。

摘要 (Abstract)

Next-generation wireless systems such as 6G operate at higher frequency bands, making signal propagation highly sensitive to environmental factors such as buildings and vege- tation. Accurate Radio Environment Map (REM) estimation is therefore increasingly important for effective network planning and operation. Existing methods, from ray-tracing simulators to deep learning generative models, achieve promising results but require detailed 3D environment data such as LiDAR-derived point clouds, which are costly to acquire, several gigabytes per km2 in size, and quickly outdated in dynamic environments. We propose a two-stage framework that eliminates the need for 3D data at inference time: in the first stage, a learned estimator predicts elevation maps directly from satellite RGB imagery, which are then fed alongside antenna parameters into the REM estimator in the second stage. Across existing CNN- based REM estimation architectures, the proposed approach improves RMSE by up to 7.8% over image-only baselines, while operating on the same input feature space and requiring no 3D data during inference, offering a practical alternative for scalable radio environment modelling.

关键词: Radio Environment Map, REM estimation, deep learning, satellite imagery, elevation models, 6G wireless systems, CNN-based architectures, inference without 3D data

111. ❌ OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward

作者: Haoyue Yang, Xuanle Zhao, Xuexin Liu, Feibang Jiang, Yao Zhu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05514v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于图表代码生成任务，提出了OmniDiagram框架和Viva视觉反馈策略，并构建了M3^2Diagram数据集。论文明确提到使用了SFT（Supervised Fine-tuning）作为训练方法之一，因此与"Post-training OR Supervised Fine-tuning OR SFT"高度相关（10分）。论文未涉及其他关键词所描述的大模型技术原理、优化方法、推理技术、对齐方法、压缩技术、科学AI应用等具体内容，因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

该论文针对现有图表代码生成方法适用范围窄的问题，提出了统一的OmniDiagram框架，通过创新的Viva视觉反馈策略结合SFT和强化学习训练，在多个基准测试中实现了新的最优性能。

摘要翻译

可编程图表生成范式正在快速发展，在结构化可视化中发挥着关键作用。然而，现有研究大多局限于有限的任务定义和语言支持范围，制约了其对多样化图表类型的适用性。本研究提出OmniDiagram，一个融合了多种图表代码语言与任务定义的统一框架。为应对强化学习中代码逻辑与视觉保真度对齐的挑战，我们引入了一种名为“视觉审验验证一切”的新型视觉反馈策略。与脆弱的基于语法的规则或像素级匹配方法不同，\textsc{Viva}通过生成式方法对渲染图表的视觉结构进行奖励。具体而言，\textsc{Viva}主动生成有针对性的视觉质询以审查图表视觉保真度，并为优化提供细粒度反馈。该机制促成了一个自我演进的训练过程，有效避免了对手动标注真实代码的需求。此外，我们构建了首个大规模图表代码生成数据集M3$^2$Diagram，包含超过19.6万个高质量实例。实验结果证实，监督微调与我们基于\textsc{Viva}的强化学习相结合，使OmniDiagram在图表代码生成基准测试中确立了全新的最优性能。

摘要 (Abstract)

The paradigm of programmable diagram generation is evolving rapidly, playing a crucial role in structured visualization. However, most existing studies are confined to a narrow range of task formulations and language support, constraining their applicability to diverse diagram types. In this work, we propose OmniDiagram, a unified framework that incorporates diverse diagram code languages and task definitions. To address the challenge of aligning code logic with visual fidelity in Reinforcement Learning (RL), we introduce a novel visual feedback strategy named Visual Interrogation Verifies All (\textsc{Viva}). Unlike brittle syntax-based rules or pixel-level matching, \textsc{Viva} rewards the visual structure of rendered diagrams through a generative approach. Specifically, \textsc{Viva} actively generates targeted visual inquiries to scrutinize diagram visual fidelity and provides fine-grained feedback for optimization. This mechanism facilitates a self-evolving training process, effectively obviating the need for manually annotated ground truth code. Furthermore, we construct M3$^2$Diagram, the first large-scale diagram code generation dataset, containing over 196k high-quality instances. Experimental results confirm that the combination of SFT and our \textsc{Viva}-based RL allows OmniDiagram to establish a new state-of-the-art (SOTA) across diagram code generation benchmarks.

关键词: diagram code generation, unified framework, visual feedback, reinforcement learning, SFT, Viva, M3^2Diagram dataset, state-of-the-art

112. ❌ Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

作者: Keuntae Kim, Mingyu Kang, Yong Suk Choi 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05497v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究扩散多模态大语言模型（dMLLMs）与思维链（CoT）推理结合时的两个关键问题：过早生成最终答案和视觉信息利用不足，并提出PSP和VRG方法来解决。核心相关关键词是’Large Language Models’（论文研究dLLMs和dMLLMs）和’Chain of Thought’（论文直接研究CoT推理问题），分别给10分。‘System 2 Thinking’与论文关注的深入推理相关，给8分。‘Speculative Decoding’与论文提到的推理加速（3倍加速）有一定关联，给5分。其他关键词与论文内容无直接关系，均给0分。

!!! tip deepseek-chat TL;DR

该论文发现扩散多模态大语言模型（dMLLMs）在结合思维链推理时存在过早生成答案和视觉信息利用不足的问题，并提出PSP和VRG方法，在提高准确率7.5%的同时实现3倍以上的推理加速。

摘要翻译

扩散大语言模型（dLLMs）正逐渐成为自回归（AR）大语言模型的有前景的替代方案。近期，这一范式已被扩展至多模态任务，推动了扩散多模态大语言模型（dMLLMs）的发展。这些模型有望保留大语言模型的推理能力，同时通过并行生成实现更快的推理速度。然而，当与思维链（CoT）推理结合时，dMLLMs暴露出两个关键问题。首先，我们观察到dMLLMs经常在非常早的时间步就生成最终答案标记。这一趋势表明模型在未进行充分推理前就确定了答案，导致推理性能下降。其次，在初始时间步中，dMLLMs对视觉提示的依赖性极低，展现出与自回归视觉语言模型相比根本不同的视觉信息利用模式。总之，这些发现表明dMLLMs倾向于在未充分基于视觉输入的情况下生成过早的最终答案。为应对这些局限，我们提出了位置与步数惩罚（PSP）和视觉推理引导（VRG）。PSP在早期时间步对后续位置的标记施加惩罚，从而延迟过早的答案生成，并鼓励跨时间步的渐进式推理。VRG受无分类器引导的启发，放大视觉基础信号以增强模型与视觉证据的对齐。在各种dMLLMs上进行的大量实验表明，我们的方法在实现高达7.5%准确率提升的同时，相比使用四倍扩散步数的推理，还能带来超过3倍的加速。

摘要 (Abstract)

Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues. First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR vision-language models. In summary, these findings indicate that dMLLMs tend to generate premature final answers without sufficiently grounding on visual inputs. To address these limitations, we propose Position and Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens in later positions during early timesteps, delaying premature answer generation and encouraging progressive reasoning across timesteps. VRG, inspired by classifier-free guidance, amplifies visual grounding signals to enhance the model’s alignment with visual evidence. Extensive experiments across various dMLLMs demonstrate that our method achieves up to 7.5% higher accuracy while delivering more than 3x speedup compared to reasoning with four times more diffusion steps.

关键词: Diffusion Large Language Models, Multimodal Large Language Models, Chain-of-Thought Reasoning, Visual Grounding, Reasoning Acceleration, Premature Answer Generation, Position and Step Penalty, Visual Reasoning Guidance

113. ❌ SCMAPR: Self-Correcting Multi-Agent Prompt Refinement for Complex-Scenario Text-to-Video Generation

作者: Chengyi Yang, Pengzhen Li, Jiayin Qi, Aimin Zhou, Ji Wu, Ji Liu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05489v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出SCMAPR框架，使用多智能体系统进行文本到视频生成的提示词优化，核心涉及多智能体协调、自我纠正机制和复杂场景处理。与’LLM Agents/Autonomous Agents/Agentic Workflow’、‘Multi-agent Systems/Agent Coordination’和’Self-Correction/Self-Improvement/Self-Reflection’高度相关（10分），因为这些是论文的核心方法。与’Large Language Models/LLMs/Foundation Models’有一定关联（8分），因为多智能体系统可能基于LLM构建，但论文未明确说明。其他关键词如MoE、SLMs、训练技术、推理优化、AI for Science等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对复杂场景下文本到视频生成中提示词模糊和不足的问题，提出了一个自我纠正的多智能体提示词优化框架SCMAPR，通过场景感知的策略选择和结构化语义验证，显著提升了生成视频与文本的对齐质量和整体生成效果。

摘要翻译

文本到视频（T2V）生成技术已受益于扩散模型的最新进展，但现有系统在处理复杂场景时仍面临困难，这通常因文本提示的模糊性和欠指定性而加剧。在本研究中，我们将复杂场景提示优化构建为一个分阶段的多智能体优化过程，并提出了SCMAPR，即一种面向T2V提示的场景感知自校正多智能体提示优化框架。SCMAPR协调多个专用智能体以：（i）将每个提示路由至基于分类的场景以进行策略选择，（ii）合成场景感知的重写策略并执行策略条件优化，以及（iii）进行结构化语义验证，在检测到违规时触发条件修订。为阐明T2V提示中复杂场景的构成、提供代表性示例，并在此类挑战性条件下实现严谨评估，我们进一步引入了{T2V-Complexity}——一个专门由复杂场景提示构成的复杂场景T2V基准测试集。在三个现有基准及我们提出的T2V-Complexity基准上的大量实验表明，SCMAPR在复杂场景下持续提升了文本-视频对齐度和整体生成质量，在VBench和EvalCrafter上的平均得分较三种先进基线方法最高提升达2.67%和3.28分，在T2V-CompBench上最高提升0.028分。

摘要 (Abstract)

Text-to-Video (T2V) generation has benefited from recent advances in diffusion models, yet current systems still struggle under complex scenarios, which are generally exacerbated by the ambiguity and underspecification of text prompts. In this work, we formulate complex-scenario prompt refinement as a stage-wise multi-agent refinement process and propose SCMAPR, i.e., a scenario-aware and Self-Correcting Multi-Agent Prompt Refinement framework for T2V prompting. SCMAPR coordinates specialized agents to (i) route each prompt to a taxonomy-grounded scenario for strategy selection, (ii) synthesize scenario-aware rewriting policies and perform policy-conditioned refinement, and (iii) conduct structured semantic verification that triggers conditional revision when violations are detected. To clarify what constitutes complex scenarios in T2V prompting, provide representative examples, and enable rigorous evaluation under such challenging conditions, we further introduce {T2V-Complexity}, which is a complex-scenario T2V benchmark consisting exclusively of complex-scenario prompts. Extensive experiments on 3 existing benchmarks and our T2V-Complexity benchmark demonstrate that SCMAPR consistently improves text-video alignment and overall generation quality under complex scenarios, achieving up to 2.67% and 3.28 gains in average score on VBench and EvalCrafter, and up to 0.028 improvement on T2V-CompBench over 3 State-Of-The-Art baselines.

关键词: Text-to-Video Generation, Multi-Agent Systems, Prompt Refinement, Self-Correction, Complex Scenarios, Semantic Verification, T2V-Complexity Benchmark, Diffusion Models

114. ❌ Auditable Agents

作者: Yi Nian, Aojie Yuan, Haiyue Zhang, Jiate Li, Yue Zhao 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05485v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM agent系统的可审计性问题，与"LLM Agents"和"Tool Use"高度相关（10分），因为论文明确讨论LLM agents调用工具、触发外部副作用等行为。与"Multi-agent Systems"有一定关联（5分），因为涉及agent协调和责任归属，但论文未深入讨论多agent系统架构。其他关键词如MoE、SFT、RAG等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了LLM agent系统在部署后的可审计性问题，提出了五个可审计性维度并通过实验证明，即使传统日志缺失，责任相关信息仍可部分恢复。

摘要翻译

LLM智能体通过调用工具、查询数据库、委托任务及触发外部副作用来运作。一旦智能体系统能够在现实世界中行动，问题的核心便不再仅限于能否阻止有害行为，而更在于部署后这些行为是否仍具备可追责性。我们区分了三个概念：可问责性（判定合规性与分配责任的能力）、可审计性（使可问责性成为可能的系统属性）以及审计（基于可信证据重建行为的过程）。我们的主张直接明确：缺乏可审计性的智能体系统必然无法实现可问责性。
为使这一理念可操作化，我们定义了智能体可审计性的五个维度，即行动可复现性、生命周期覆盖度、策略可核查性、责任可归属性与证据完整性，并识别出三类机制（检测、执行、恢复）。这些机制在时序信息与干预约束上的差异解释了为何在实践中单一方法无法满足全部需求。我们通过分层证据而非单一基准来支撑这一立场：对生态系统的下限测量表明，即使可审计性所需的基本安全前提也普遍未能满足（在六个知名开源项目中累计发现617项安全缺陷）；运行时可行性实验显示，采用防篡改记录的执行前仲裁机制仅带来8.3毫秒的中位开销；受控恢复实验则证明，即使传统日志缺失，仍可部分恢复与责任相关的信息。我们提出面向智能体系统的可审计性卡片框架，并依据机制类别归纳了六个待解决的研究问题。

摘要 (Abstract)

LLM agents call tools, query databases, delegate tasks, and trigger external side effects. Once an agent system can act in the world, the question is no longer only whether harmful actions can be prevented–it is whether those actions remain answerable after deployment. We distinguish accountability (the ability to determine compliance and assign responsibility), auditability (the system property that makes accountability possible), and auditing (the process of reconstructing behavior from trustworthy evidence). Our claim is direct: no agent system can be accountable without auditability. To make this operational, we define five dimensions of agent auditability, i.e., action recoverability, lifecycle coverage, policy checkability, responsibility attribution, and evidence integrity, and identify three mechanism classes (detect, enforce, recover) whose temporal information-and-intervention constraints explain why, in practice, no single approach suffices. We support the position with layered evidence rather than a single benchmark: lower-bound ecosystem measurements suggest that even basic security prerequisites for auditability are widely unmet (617 security findings across six prominent open-source projects); runtime feasibility results show that pre-execution mediation with tamper-evident records adds only 8.3 ms median overhead; and controlled recovery experiments show that responsibility-relevant information can be partially recovered even when conventional logs are missing. We propose an Auditability Card for agent systems and identify six open research problems organized by mechanism class.

关键词: LLM agents, auditability, accountability, tool use, responsibility attribution, security, agent systems, recovery

115. ❌ Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

作者: Xiaotian Zhou, Di Tang, Xiaofeng Wang, Xiaozhong Liu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05483v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM的可信边界检测，直接涉及LLMs（10分）、LLM Agents（10分，使用多智能体强化学习）、Multi-agent Systems（10分，协调多个RL智能体）和Hallucination Mitigation（10分，检测偏见/错误回答）。其他关键词如MoE、SFT、RAG等未在摘要中提及，与论文技术内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为GMRL-BD的新算法，通过偏置扩散和多智能体强化学习来检测黑盒大语言模型（LLM）在哪些主题上可能产生偏见或不准确回答，从而识别其不可信边界，并发布了包含多个流行LLM的偏见主题标注数据集。

摘要翻译

大型语言模型（LLM）在回答各类主题问题时展现出强大的能力。然而，这些模型有时会产生带有偏见、意识形态化或不正确的回答，若无法明确了解其回答在哪些主题上可信，将限制其应用。本研究提出一种新颖算法，命名为GMRL-BD，旨在特定查询约束下，通过黑盒访问方式识别给定LLM在主题层面的不可信边界。基于从维基百科提取的通用知识图谱（KG），该算法结合多个强化学习智能体，以高效识别LLM可能产生偏见回答的主题（即知识图谱中的某些节点）。实验证明，我们的算法仅需对LLM进行有限次查询即可检测出其不可信边界，具有较高效率。此外，我们发布了一个新数据集，包含Llama2、Vicuna、Falcon、Qwen2、Gemma2和Yi-1.5等主流大型语言模型，并标注了每个模型可能产生偏见的主题标签。

摘要 (Abstract)

Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized or incorrect responses, limiting their applications if there is no clear understanding of which topics their answers can be trusted. In this research, we introduce a novel algorithm, named as GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm incorporates with multiple reinforcement learning agents to efficiently identify topics (some nodes in KG) where the LLM is likely to generate biased answers. Our experiments demonstrated the efficiency of our algorithm, which can detect the untrustworthy boundary with just limited queries to the LLM. Additionally, we have released a new dataset containing popular LLMs including Llama2, Vicuna, Falcon, Qwen2, Gemma2 and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.

关键词: Large Language Models, untrustworthy boundary detection, bias detection, multi-agent reinforcement learning, Knowledge Graph, black-box LLM, biased answers, GMRL-BD algorithm

116. ❌ Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis

作者: Pu Wang, Zhixuan Mao, Jialu Li, Zhuoran Zheng, Dianjie Lu, Youshan Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05482v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文专注于兽医诊断领域，提出了一种结合VLM引导的Flow Matching和基于随机矩阵理论的谱异常检测方法，用于犬气胸的自动诊断。论文的核心是计算机视觉和医学图像分析，而非大语言模型技术。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为该研究属于AI在生物医学（兽医）领域的应用，得10分。‘Mechanistic Interpretability OR Explainable AI’得5分，因为论文强调了’interpretable diagnostic system’，但解释性并非其核心技术创新点。其他所有关键词均与大语言模型、训练方法、推理优化、智能体等无关，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种结合视觉语言模型引导的流匹配分割和随机矩阵理论谱异常检测的新方法，用于解决犬气胸自动诊断中的数据稀缺和模型可信问题，实现了高精度且可解释的诊断系统。

摘要翻译

犬类气胸的自动诊断面临数据稀缺与模型可信度要求的双重挑战。为解决此问题，我们首先引入一个公开的像素级标注数据集以促进相关研究。随后，我们提出一种新颖的诊断范式，将任务重新构建为信号定位与谱检测的协同过程。在定位方面，我们的方法采用视觉语言模型（Vision-Language Model, VLM）引导迭代式流匹配（Flow Matching）过程，逐步优化分割掩码以实现卓越的边界精度。在检测方面，分割掩码用于从疑似病灶中分离特征。我们随后应用随机矩阵理论（Random Matrix Theory, RMT）——一种有别于传统分类器的方法——对这些特征进行分析。该方法将健康组织建模为可预测的随机噪声，并通过检测具有统计显著性的异常特征值来识别气胸，这些异常值代表了非随机的病理信号。流匹配提供的高保真定位对于纯化信号至关重要，从而最大化RMT检测器的灵敏度。这种生成式分割与第一性原理统计分析的协同作用，产生了一个高精度且可解释的诊断系统（源代码位于：https://github.com/Pu-Wang-alt/Canine-pneumothorax）。

摘要 (Abstract)

Automatic diagnosis of canine pneumothorax is challenged by data scarcity and the need for trustworthy models. To address this, we first introduce a public, pixel-level annotated dataset to facilitate research. We then propose a novel diagnostic paradigm that reframes the task as a synergistic process of signal localization and spectral detection. For localization, our method employs a Vision-Language Model (VLM) to guide an iterative Flow Matching process, which progressively refines segmentation masks to achieve superior boundary accuracy. For detection, the segmented mask is used to isolate features from the suspected lesion. We then apply Random Matrix Theory (RMT), a departure from traditional classifiers, to analyze these features. This approach models healthy tissue as predictable random noise and identifies pneumothorax by detecting statistically significant outlier eigenvalues that represent a non-random pathological signal. The high-fidelity localization from Flow Matching is crucial for purifying the signal, thus maximizing the sensitivity of our RMT detector. This synergy of generative segmentation and first-principles statistical analysis yields a highly accurate and interpretable diagnostic system (source code is available at: https://github.com/Pu-Wang-alt/Canine-pneumothorax).

关键词: Veterinary Diagnosis, Vision-Language Model (VLM), Flow Matching, Spectral Anomaly Detection, Random Matrix Theory (RMT), Canine Pneumothorax, Interpretable AI, Medical Image Segmentation

117. ❌ OntoTKGE: Ontology-Enhanced Temporal Knowledge Graph Extrapolation

作者: Dongying Lin, Yinan Liu, Shengwei tang, Bin Wang, Xiaochun Yang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05468v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文专注于时间知识图谱外推任务，提出OntoTKGE框架整合本体知识和时间知识以增强实体嵌入，解决稀疏历史交互问题。所有关键词均与大模型、深度学习技术原理或具体应用技术直接相关，但论文未涉及任何大模型、深度学习技术或相关训练/推理方法，仅与’AI for Science’有一定关联（知识图谱可视为AI在信息科学领域的应用），其他关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文提出OntoTKGE框架，通过整合本体知识和时间知识来增强时间知识图谱外推模型的实体嵌入，有效解决了实体历史交互稀疏性问题，并在多个数据集上显著提升了现有模型的性能。

摘要翻译

时序知识图谱（Temporal Knowledge Graph, TKG）外推是一项重要任务，旨在通过知识图谱快照中的历史交互信息预测未来事实。现有大多数TKG外推模型面临的一个关键挑战是如何处理历史交互稀疏的实体。本体知识有助于缓解这一稀疏性问题，它使得这些实体能够从具有相同概念的其他实体继承行为模式，而以往研究忽略了这一点。本文提出一种新颖的编码器-解码器框架OntoTKGE，该框架利用来自本体视图知识图谱（即一种对抽象概念间的层次关系以及概念与实体间连接进行建模的知识图谱）的本体知识，通过有效融合本体知识与时序知识来引导TKG外推模型的学习过程，从而增强实体嵌入表示。OntoTKGE具备足够的灵活性，可适配多种TKG外推模型。在四个数据集上的大量实验表明，OntoTKGE不仅显著提升了多种TKG外推模型的性能，而且超越了多种当前最优的基线方法。

摘要 (Abstract)

Temporal knowledge graph (TKG) extrapolation is an important task that aims to predict future facts through historical interaction information within KG snapshots. A key challenge for most existing TKG extrapolation models is handling entities with sparse historical interaction. The ontological knowledge is beneficial for alleviating this sparsity issue by enabling these entities to inherit behavioral patterns from other entities with the same concept, which is ignored by previous studies. In this paper, we propose a novel encoder-decoder framework OntoTKGE that leverages the ontological knowledge from the ontology-view KG (i.e., a KG modeling hierarchical relations among abstract concepts as well as the connections between concepts and entities) to guide the TKG extrapolation model’s learning process through the effective integration of the ontological and temporal knowledge, thereby enhancing entity embeddings. OntoTKGE is flexible enough to adapt to many TKG extrapolation models. Extensive experiments on four data sets demonstrate that OntoTKGE not only significantly improves the performance of many TKG extrapolation models but also surpasses many SOTA baseline methods.

关键词: Temporal Knowledge Graph, Extrapolation, Ontology, Entity Embedding, Sparsity, Encoder-Decoder Framework, Ontological Knowledge, TKG Models

118. ❌ Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

作者: Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06170v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文核心是构建一个基于多智能体LLM的学术文献发现与分析系统，因此与’LLM Agents’、‘Multi-agent Systems’高度相关（10分）。系统涉及检索增强生成（RAG）和工具使用（Tool Use），分别给8分。论文应用LLM于科学文献分析，与’AI for Science’相关（8分）。其他关键词如LLM基础技术（10分）是系统的基础，但论文未深入技术原理创新，主要关注应用框架。其余关键词如MoE、量化、推理加速等未涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文提出了Paper Circle，一个基于多智能体LLM的开源框架，用于自动化发现、评估和合成学术文献，并通过实验验证了其在文献检索和综述生成任务上的有效性。

摘要翻译

科学文献的快速增长使得研究者高效发现、评估与整合相关工作的难度日益增加。多智能体大语言模型（LLMs）的最新进展已展现出理解用户意图的强大潜力，并正被训练以利用多种工具。本文介绍Paper Circle，一个旨在降低查找、评估、组织与理解学术文献所需精力的多智能体研究探索与分析系统。该系统包含两条互补的流程：（1）探索流程，整合来自多源的离线与在线检索、多标准评分、多样性感知排序及结构化输出；（2）分析流程，将单篇论文转化为包含概念、方法、实验、图表等类型化节点的结构化知识图谱，支持图谱感知的问答与覆盖度验证。两条流程均在一个基于编码器LLM的多智能体编排框架内实现，并在每个智能体步骤中生成完全可复现、同步的输出，包括JSON、CSV、BibTeX、Markdown和HTML格式。本文详细描述了构成Paper Circle研究工作流的系统架构、智能体角色、检索与评分方法、知识图谱模式及评估界面。我们在论文检索与论文综述生成任务上对Paper Circle进行了基准测试，报告了命中率、平均倒数排名（MRR）及K值召回率（Recall at K）。结果表明，采用更强的智能体模型能带来持续的性能提升。我们已公开发布网站（https://papercircle.vercel.app/）与代码（https://github.com/MAXNORM8650/papercircle）。

摘要 (Abstract)

The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools. In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi-criteria scoring, diversity-aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph-aware question answering and coverage verification. Both pipelines are implemented within a coder LLM-based multi-agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall at K. Results show consistent improvements with stronger agent models. We have publicly released the website at https://papercircle.vercel.app/ and the code at https://github.com/MAXNORM8650/papercircle.

关键词: multi-agent LLMs, research discovery, knowledge graphs, retrieval-augmented generation, academic literature analysis, agent orchestration, structured outputs, reproducible workflow

119. ❌ JUÁ - A Benchmark for Information Retrieval in Brazilian Legal Text Collections

作者: Jayr Pereira, Leandro Fernandes, Erick de Brito, Roberto Lotufo, Luiz Bonifacio 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06098v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文主要关注法律信息检索基准的构建和评估，涉及领域适应的嵌入模型微调（与’Pre-training OR Continual Pre-training OR Domain Adaptation’和’Post-training OR Supervised Fine-tuning OR SFT’相关，评分5分）以及检索增强生成（RAG）相关技术（与’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’相关，评分5分）。论文未涉及大模型技术原理创新、科学领域AI应用或其他深度学习技术，因此其他关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一个名为JUÁ的巴西法律文本检索基准，用于系统评估不同检索方法，并通过实验表明领域适应的嵌入模型在特定子集上表现更优，而BM25在其他集合中仍具竞争力。

摘要翻译

葡萄牙语法律信息检索的系统性评估仍面临困难，主要原因在于现有数据集在文档类型、查询方式和相关性定义上存在显著差异。本文提出\textsc{JUÁ}——一个面向巴西法律检索的公共基准，旨在为异构法律文献集合提供更具可复现性和可比性的评估框架。广义而言，\textsc{JUÁ}不仅是一个基准，更是一套用于巴西法律信息检索的持续评估基础设施，它整合了共享协议、通用排序指标、适用场景下的固定数据划分以及公开排行榜。该基准涵盖司法判例检索，以及更广泛的立法、法规和问题驱动的法律搜索任务。我们评估了基于词法、稠密向量以及BM25的重排序流程，其中包括一个经\textsc{JUÁ}对齐监督数据微调的领域自适应Qwen嵌入模型。实验结果表明，该基准具有充分的异构性，能够区分不同检索范式，并揭示跨数据集的显著权衡关系。领域自适应方法在与监督数据对齐的\textsc{JUÁ-Juris}子集上提升最为明显，而BM25在其他集合上——尤其是在词汇和制度性表述线索较强的场景中——仍保持高度竞争力。总体而言，\textsc{JUÁ}通过统一的基准设计，为跨越多领域巴西法律检索的研究提供了实用的评估框架。

摘要 (Abstract)

Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. We present \textsc{JUÁ}, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections. More broadly, \textsc{JUÁ} is intended not only as a benchmark, but as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. We evaluate lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on \textsc{JUÁ}-aligned supervision. Results show that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned \textsc{JUÁ-Juris} subset, while BM25 remains highly competitive on other collections, especially in settings with strong lexical and institutional phrasing cues. Overall, \textsc{JUÁ} provides a practical evaluation framework for studying legal retrieval across multiple Brazilian legal domains under a common benchmark design.

关键词: legal information retrieval, benchmark, Brazilian legal text, domain adaptation, embedding model, BM25, evaluation framework, retrieval paradigms

120. ❌ Short Data, Long Context: Distilling Positional Knowledge in Transformers

作者: Patrick Huber, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Igor Fedorov, Md Rifat Arefin, Adithya Sagar 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06070v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	15.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究语言模型的长上下文扩展技术，通过知识蒸馏方法在短数据上训练长上下文能力，与’Context Window Extension OR Long Context LLMs’高度相关（15分），属于大模型技术原理创新。论文涉及语言模型，与’Large Language Models OR LLMs OR Foundation Models’相关（10分）。其他关键词如MoE、SLMs、Scaling Laws、训练方法、推理技术、应用领域等均未在摘要中体现，故评0分。

!!! tip deepseek-chat TL;DR

该论文研究了如何通过基于logit的知识蒸馏方法，在仅使用打包的短上下文样本训练的情况下，将长上下文检索能力迁移到学生模型中，从而避免昂贵的长上下文预训练。

摘要翻译

扩展语言模型的上下文窗口通常需要昂贵的长上下文预训练，这对训练效率和数据收集都构成了重大挑战。本文提出证据表明，长上下文检索能力可以通过基于logit的知识蒸馏传递给学生模型，即使训练仅使用长上下文窗口内打包的短上下文样本。我们通过旋转位置编码（RoPE）的视角提供了全面的见解，并确立了三个关键发现。首先，与先前工作一致，我们表明分阶段RoPE缩放（即在每个训练阶段最大化旋转频谱利用率）在知识蒸馏设置中也实现了最佳的长上下文性能。其次，我们证明基于logit的知识蒸馏可以直接实现位置信息传递。通过使用打包重复令牌序列的实验设置，我们追踪了位置扰动从查询向量和键向量通过连续的Transformer层传播到输出logits的过程，揭示了位置信息系统地影响教师的输出分布，进而影响学生模型接收到的蒸馏信号。第三，我们的分析揭示了长上下文扩展过程中查询状态的结构化更新模式，其中不同的参数跨度表现出对长上下文训练的强烈敏感性。

摘要 (Abstract)

Extending the context window of language models typically requires expensive long-context pre-training, posing significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We provide comprehensive insights through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly enable positional information transfer. Using an experimental setup with packed repeated token sequences, we trace the propagation of positional perturbations from query and key vectors through successive transformer layers to output logits, revealing that positional information systematically influences the teacher’s output distribution and, in turn, the distillation signal received by the student model. Third, our analysis uncovers structured update patterns in the query state during long-context extension, with distinct parameter spans exhibiting strong sensitivity to long-context training.

关键词: context window extension, knowledge distillation, long-context retrieval, RoPE scaling, positional information transfer, transformer layers, logit-based distillation, short-context training

121. ❌ From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

作者: Hongxu Zhou 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06066v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM在开放推理任务中的内在自我纠正失败问题，特别是幻觉滚雪球现象。高度相关的关键词包括：LLMs（研究对象）、Self-Correction（核心主题）、Hallucination Mitigation（研究问题）、Alignment（涉及对齐税概念）、Chain of Thought和System 2 Thinking（涉及推理过程）。LLM Agents相关因为研究自主工作流。Mechanistic Interpretability得5分因为论文分析模型内部认知负荷与格式化陷阱的关系。其他关键词如MoE、SLMs、RLHF等与论文内容无关。

!!! tip deepseek-chat TL;DR

该研究发现，在LLM的开放推理任务中，仅通过基于大纲的约束解码强制结构化反思并不能改善自我纠正性能，反而会引发一种新的失败模式——结构滚雪球，揭示了约束解码中固有的对齐税问题。

摘要翻译

大型语言模型（LLM）在开放式推理任务中的内在自我修正常因“幻觉滚雪球”现象而失效，该现象指模型在自由文本反思过程中递归地为早期错误进行辩护。虽然结构化反馈可以缓解此问题，但现有方法通常依赖外部训练的评判器或符号工具，降低了智能体的自主性。本研究探讨了仅通过基于大纲的约束解码来强制结构化反思，是否能在无需额外训练的情况下阻断错误传播。通过对一个80亿参数模型（Qwen3-8B）的评估，我们发现单纯施加结构约束并不能提升自我修正性能，反而会触发一种新的失效模式，称为“结构滚雪球”。我们发现，满足严格格式规则所需的认知负荷会将模型推入格式陷阱。这一观察有助于解释为何智能体能够实现近乎完美的表层句法对齐，却无法检测或解决更深层的语义错误。这些发现揭示了约束解码固有的“对齐代价”，凸显了自主工作流中结构粒度与模型内部能力之间的张力。代码与原始日志已发布于GitHub仓库：https://github.com/hongxuzhou/agentic_llm_structured_self_critique。

摘要 (Abstract)

Intrinsic self-correction in Large Language Models (LLMs) frequently fails in open-ended reasoning tasks due to hallucination snowballing,'' a phenomenon in which models recursively justify early errors during free-text reflection. While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy. This study investigates whether enforcing structured reflection purely through Outlines-based constrained decoding can disrupt error propagation without additional training. Evaluating an 8-billion-parameter model (Qwen3-8B), we show that simply imposing structural constraints does not improve self-correction performance. Instead, it triggers a new failure mode termed structure snowballing.’’ We find that the cognitive load required to satisfy strict formatting rules pushes the model into formatting traps. This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors. These findings expose an ``alignment tax’’ inherent to constrained decoding, highlighting a tension between structural granularity and internal model capacity in autonomous workflows. Code and raw logs are available in the GitHub repository: https://github.com/hongxuzhou/agentic_llm_structured_self_critique.

关键词: Large Language Models, Self-Correction, Hallucination Snowballing, Constrained Decoding, Alignment Tax, Structured Reflection, Autonomous Agents, Open-ended Reasoning

122. ❌ Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design

作者: Shuqing Zhao 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05983v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究一种新型硬件描述语言Arch，其核心创新在于为AI辅助代码生成而设计，特别是通过LLM友好的语法结构（如LL(1)语法、无回溯等）使大语言模型能够从自然语言规范生成类型安全的Arch代码。这与’Large Language Models’关键词高度相关（8分），因为论文明确将LLM作为目标生成工具。论文属于硬件设计领域，与’AI for Science’有一定关联（5分），可视为AI在工程/科学计算领域的应用。其他关键词主要涉及大模型训练、优化、推理、对齐、代理等具体技术，与硬件描述语言设计无直接关系，故评0分。

!!! tip deepseek-chat TL;DR

论文提出了一种名为Arch的新型硬件描述语言，专门为AI辅助代码生成设计，通过类型系统和LLM友好的语法结构，使大语言模型能够从自然语言规范生成类型安全的硬件代码，并编译为标准的SystemVerilog。

摘要翻译

本文提出Arch（AI原生寄存器传输级时钟硬件描述语言），这是一种基于第一性原理设计的硬件描述语言，专为微架构规范与AI辅助代码生成而构建。Arch为流水线、有限状态机、先进先出队列、仲裁器、寄存器文件、总线及跨时钟域传输等结构引入了原生语言构造——这些结构在现有硬件描述语言中仅能通过易产生隐蔽错误的用户定义模式进行表达。
其核心设计决策在于将时钟与复位信号本身定义为参数化类型（Clock、Reset<S,P,D?>），而非普通线网。这一设计将跨时钟域分析与跨复位域分析从外部静态检查流程转化为编译时类型规则。结合对位宽、端口方向、单驱动所有权以及组合逻辑无环性的同步追踪，该类型系统能够在任何仿真运行之前捕获多驱动冲突、未驱动端口、隐式锁存器、位宽失配、组合逻辑环路及未同步的域交叉错误。
所有语法设计均遵循AI可生成性契约：采用无需回溯或多词素前瞻的LL(1)文法，摒弃预处理器与宏，采用统一的声明模式、命名化块结束符、显式方向连接箭头以及todo!应急接口，使得大语言模型能够直接根据自然语言描述生成结构正确、类型安全的Arch代码，且无需微调。
Arch编译器可生成确定性的、通过静态检查的IEEE 1800-2017 SystemVerilog代码，并提供集成仿真工具链，可生成用于周期精确仿真的编译型C++模型。我们通过八路组相联L1数据缓存与符合PG021规范的可综合AXI DMA控制器（基于Sky130工艺的Yosys与OpenSTA综合结果）进行案例研究，并从表达力、安全性及AI适配性三个维度，将Arch与SystemVerilog、VHDL、Chisel、Bluespec等现代硬件描述语言进行比较。

摘要 (Abstract)

We present Arch (AI-native Register-transfer Clocked Hardware), a hardware description language designed from first principles for micro-architecture specification and AI-assisted code generation. Arch introduces first-class language constructs for pipelines, FSMs, FIFOs, arbiters, register files, buses, and clock-domain crossings – structures that existing HDLs express only as user-defined patterns prone to subtle errors. A central design choice is that clocks and resets are themselves parameterized types (Clock, Reset<S,P,D?>) rather than ordinary nets, converting clock-domain crossing (CDC) and reset-domain crossing (RDC) analysis from external linter passes into compile-time typing rules. Combined with simultaneous tracking of bit widths, port directions, single-driver ownership, and combinational acyclicity, the type system catches multiple drivers, undriven ports, implicit latches, width mismatches, combinational loops, and unsynchronized domain crossings before any simulator runs. Every syntactic choice is governed by an AI-generatability contract: an LL(1) grammar requiring no backtracking or multi-token lookahead, no preprocessor or macros, a uniform declaration schema, named block endings, explicit directional connect arrows, and a todo! escape hatch enable LLMs to produce structurally correct, type-safe Arch from natural-language specifications without fine-tuning. The Arch compiler emits deterministic, lint-clean IEEE 1800-2017 SystemVerilog and provides an integrated simulation toolchain that generates compiled C++ models for cycle-accurate simulation. We present case studies of an 8-way set-associative L1 data cache and a synthesizable PG021-compatible AXI DMA controller (with Yosys and OpenSTA results on Sky130), and compare Arch to SystemVerilog, VHDL, Chisel, Bluespec, and other modern HDLs across expressiveness, safety, and AI suitability dimensions.

关键词: hardware description language, AI-assisted code generation, LLM, type system, SystemVerilog, clock-domain crossing, micro-architecture, synthesis

123. ❌ Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family

作者: Oscar Chew, Hsiao-Ying Huang, Kunal Jain, Tai-I Chen, Khoa D Doan, Kuan-Hao Huang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05971v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究CLIP视觉语言模型的中心偏差问题，属于计算机视觉和模型解释性领域，与绝大多数关键词（主要针对大语言模型技术）完全无关。仅与’Mechanistic Interpretability OR Explainable AI’有一定关联（5分），因为论文使用了嵌入分解和注意力图分析等可解释性方法来诊断模型偏差。

!!! tip deepseek-chat TL;DR

该论文揭示了CLIP视觉语言模型存在中心偏差问题（过度关注图像中心区域），并通过可解释性方法分析其原因，提出了无需训练的视觉提示和注意力重分配策略来缓解这一偏差。

摘要翻译

近期研究表明，对比式视觉-语言模型（如CLIP）往往缺乏对视觉内容的细粒度理解。尽管已有越来越多研究试图解决这一局限，我们发现CLIP系列模型存在一种独特的缺陷模式——我们称之为中心化偏差，该问题甚至在近期模型变体中依然存在。具体而言，CLIP倾向于过度关注图像中心区域，而忽视位于边缘的重要对象。这一局限具有根本性影响，因为若无法识别相关对象，则难以执行任何依赖这些对象的复杂任务。为探究该局限的成因，我们从表征和注意力两个视角展开分析。通过可解释性方法（即嵌入分解与注意力图分析），我们发现相关概念——尤其是与偏离中心对象相关的概念——在视觉嵌入聚合过程中因信息丢失而从最终表征的模型嵌入中消失，这尤其归因于对池化机制的依赖。最后，我们证明可通过免训练策略（例如视觉提示与注意力重分布）缓解此类偏差，通过将模型注意力引导至非中心区域来实现。

摘要 (Abstract)

Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental as failure to recognize relevant objects makes it difficult to perform any sophisticated tasks that depend on those objects. To understand the underlying causes of the limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts especially those associated with off-center objects vanish from the model’s embedding in the final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution by redirecting models’ attention to off-center regions.

关键词: CLIP, vision-language models, center bias, interpretability, attention analysis, visual prompting, embedding decomposition, pooling mechanisms

124. ❌ FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures

作者: Fan Zhang, Mingzi Song, Rania Elbadry, Yankai Chen, Shaobo Wang, Yixi Zhou, Xunwen Zheng, Yueru He, Yuyang Dai, Georgi Georgiev, Ayesha Gull, Muhammad Usman Safder, Fan Wu, Liyuan Meng, Fengxian Ji, Junning Zhao, Xueqing Peng, Jimin Huang, Yu Chen, Xue, Liu, Preslav Nakov, Zhuohan Xie 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05966v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文明确使用LLMs进行财务报告提取和总结，并构建了一个’agentic workflow’系统，因此与’Large Language Models’和’LLM Agents’高度相关（10分）。其他关键词如MoE、SLMs、训练方法、推理技术、压缩技术、科学AI应用等，论文均未涉及或提及，因此得0分。

!!! tip deepseek-chat TL;DR

该论文提出了FinReporting，一个基于智能体工作流的系统，用于解决跨司法管辖区财务报告中的语义对齐和验证挑战，通过构建统一规范本体和约束LLMs作为验证器，提高了异构报告制度下的一致性和可靠性。

摘要翻译

财务报告系统日益采用大语言模型（LLMs）来提取和总结公司披露信息。然而，现有方法大多假设单一市场环境，未能解决不同司法管辖区之间的结构性差异。会计分类标准、标记基础设施（例如XBRL与PDF）以及汇总惯例的差异，使得跨管辖区报告成为语义对齐与验证的挑战。本文提出FinReporting，一种用于本地化跨管辖区财务报告的智能体工作流。该系统构建了涵盖利润表、资产负债表和现金流量表的统一规范本体，并将报告过程分解为可审计的多个阶段，包括文件获取、信息提取、规范映射及异常记录。FinReporting并非将LLMs用作自由形式的生成器，而是在明确的决策规则和证据锚定下，将其部署为受约束的验证器。基于美国、日本和中国年度申报文件的评估表明，该系统在异构报告制度下提升了处理的一致性与可靠性。我们发布了一个支持跨市场检查及本地化财务报表结构化导出的交互式演示系统。演示地址为：https://huggingface.co/spaces/BoomQ/FinReporting-Demo。系统介绍视频可访问：https://www.youtube.com/watch?v=f65jdEL31Kk。

摘要 (Abstract)

Financial reporting systems increasingly use large language models (LLMs) to extract and summarize corporate disclosures. However, most assume a single-market setting and do not address structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions make cross-jurisdiction reporting a semantic alignment and verification challenge. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system builds a unified canonical ontology over Income Statement, Balance Sheet, and Cash Flow, and decomposes reporting into auditable stages including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than using LLMs as free-form generators, FinReporting deploys them as constrained verifiers under explicit decision rules and evidence grounding. Evaluated on annual filings from the US, Japan, and China, the system improves consistency and reliability under heterogeneous reporting regimes. We release an interactive demo supporting cross-market inspection and structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo . The video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk

关键词: financial reporting, large language models, agentic workflow, cross-jurisdiction, semantic alignment, canonical ontology, verification, heterogeneous reporting regimes

125. ❌ BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

作者: Abbas Ghaddar, Ivan Kobyzev, Boxing Chen, Yufei Cui 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05942v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM后训练优化中的注意力头选择问题，直接涉及LLM、后训练、KV缓存压缩和推理加速等关键词。论文提出BOSCH方法，通过黑盒优化选择短上下文注意力头，以降低KV缓存使用并改善延迟，这与’KV Cache Compression’和’Speculative Decoding OR Inference Acceleration’高度相关。论文还涉及长上下文性能恢复，与’Context Window Extension OR Long Context LLMs’相关。论文提到在持续预训练中应用，与’Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联。其他关键词如MoE、SLMs、对齐、RAG等与论文内容无关。

!!! tip deepseek-chat TL;DR

论文提出BOSCH方法，通过黑盒二进制优化选择LLM中的短上下文注意力头，以降低KV缓存使用并改善推理延迟，实验表明该方法优于现有层级和静态头级方法，并能更快恢复长上下文性能。

摘要翻译

大型语言模型（LLM）的后训练混合化通常采用滑动窗口注意力（SWA）替代二次自注意力机制，以降低KV缓存占用并改善延迟。现有的混合化方案通常在层级别（例如交错排列）或通过从局部到全局的静态排序在注意力头级别定义。层级别方案忽略了局部与全局依赖关系是通过同一层内的注意力头进行路由的，而静态的头级别排序则受困于纠缠问题：混合化后注意力头的局部/全局行为可能发生改变。我们提出BOSCH（面向短上下文头选择的黑盒二进制优化），这是一种无需训练的方法，将问题形式化为大规模邻域搜索，并将其分解为三个子问题：（i）通过小预算黑盒探测进行层重要性检测，（ii）基于这些敏感度进行自适应逐层SWA比例分配，以及（iii）在比例区间内进行分组头级别优化。在4个参数量从1.7B到30B的LLM上、跨越4种SWA比例的大量实验表明，BOSCH始终优于层级别启发式方法和6种强静态头级别方法，且在较高SWA比例下增益更为显著。在持续预训练过程中，BOSCH能更快、更有效地恢复原始长上下文性能。对所选注意力头的分析显示，BOSCH在不同SWA比例下存在显著的头替换现象，这强调了针对每个目标比例执行头级别选择的重要性，而非依赖固定的局部性排序。

摘要 (Abstract)

Post-training hybridization of large language models (LLMs) often replaces quadratic self-attention with sliding-window attention (SWA) to reduce KV cache usage and improve latency. Existing hybridization schemes are typically defined either at the layer level (e.g., interleaving) or at the head level via static rankings from local to global. Layer-level schemes ignore that local and global dependencies are routed through heads within the same layer, while static head-level rankings suffer from entanglement: a head’s local/global behavior can change after hybridization. We propose BOSCH, Black-box Binary Optimization for Short-context Head Selection, a training-free method that formulates the problem as a Large Neighborhood Search and decomposes it into three subproblems: (i) layer-importance detection via small-budget black-box probes, (ii) adaptive per-layer SWA-ratio assignment based on these sensitivities, and (iii) grouped head-level optimization within ratio buckets. Extensive experiments on 4 LLMs ranging from 1.7B to 30B parameters, across 4 SWA ratios, show that BOSCH consistently outperforms layer-level heuristics and 6 strong static head-level methods, with larger gains at higher SWA ratios. Under continual pretraining, BOSCH recover original long-context performance faster and to a higher level. Analysis of the selected heads reveals substantial turnover for BOSCH across different SWA ratios, underscoring the importance of performing head-level selection for each target ratio rather than relying on fixed locality rankings.

关键词: Large Language Models, Post-training, KV cache, Attention head selection, Inference acceleration, Sliding-window attention, Black-box optimization, Continual pretraining

126. ❌ The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model

作者: Hongxu Zhou 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05923v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究State Space Models（SSMs）中的Mamba-2模型在UNDO Flip-Flop任务上的表现，聚焦于模型表达性与梯度下降学习能力之间的差距。所有评分关键词均与大语言模型（LLMs）、深度学习技术原理或科学AI应用直接相关，而本文研究对象是SSMs（特别是Mamba-2），属于序列建模的特定架构，未涉及LLMs、MoE、缩放律、训练调优方法（如SFT、RLHF、PEFT）、推理优化（如RAG、注意力机制）、推理能力（如CoT）、智能体、模型压缩等关键词领域，也未应用于生物信息学等科学领域，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文通过引入UNDO Flip-Flop任务，揭示了Mamba-2模型在理论上能表达但梯度下降无法可靠学习的可逆语义状态检索机制，表明其收敛于局部启发式策略而非历史状态恢复。

摘要翻译

状态空间模型（SSMs）已被证明在理论上具备建模无星号序列任务和有界层次结构的能力 Sarrof 等人（2024）。然而，形式化的表达能力结果并不能保证基于梯度的优化能够可靠地发现相应的解决方案。现有的基准测试要么探究单调状态跟踪（如标准的 Flip-Flop 任务），要么探究结构嵌套（如 Dyck 语言），但均未独立考察可逆的语义状态检索。为此，我们引入了 UNDO Flip-Flop 任务以填补这一空白。该任务通过在标准 Flip-Flop 中增加 UNDO 操作，要求模型维护一个隐式的有界栈，并在非单调的更新序列下恢复历史状态。我们在此框架下评估了单层和双层 Mamba-2 模型。两种变体均未能习得理论上可表达的基于栈的回滚机制，而是收敛于一种局部切换启发式策略——该策略反转当前状态而非检索存储的历史。在训练长度分布内进行的对抗性回撤压力测试中，双层模型的准确率下降至 41.10%，低于随机猜测水平。结果证实了这是系统性而非偶然性的失败。因果消融实验表明，瓶颈在于检索而非存储环节。这些结果清晰地划分了架构在原则上能够表示的内容与梯度下降能够可靠学习的内容之间的界限，这一区别是单纯的理论表达能力分析所无法捕捉的。

摘要 (Abstract)

State space models (SSMs) have been shown to possess the theoretical capacity to model both star-free sequential tasks and bounded hierarchical structures Sarrof et al. (2024). However, formal expressivity results do not guarantee that gradient-based optimisation will reliably discover the corresponding solutions. Existing benchmarks probe either monotonic state tracking, as in the standard Flip-Flop task, or structural nesting, as in the Dyck languages, but neither isolates reversible semantic state retrieval. We introduce the UNDO Flip-Flop task to fill this gap. By extending the standard Flip-Flop with an UNDO, the task requires a model to maintain an implicit bounded stack and recover historical states under non-monotonic update sequences. We evaluate one-layer and two-layer Mamba-2 under this framework. Both variants fail to acquire the provably expressible stack-based rollback mechanism, converging instead on a local toggle heuristic that inverts the current state rather than retrieving stored history. Under an adversarial retraction pressure test held within the training length distribution, the two-layer model collapses to 41.10% accuracy, which is below random chance. The results confirm systematic rather than incidental failure. Causal ablation shows that the bottleneck lies in retrieval, not storage. These results draw a clear line between what an architecture can in principle represent and what gradient descent reliably learns, a distinction that theoretical expressivity analyses alone cannot capture.

关键词: State Space Models, Mamba-2, UNDO Flip-Flop, reversible semantic state retrieval, gradient-based optimisation, expressivity, stack-based rollback, causal ablation

127. ❌ FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

作者: Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed, Hayan Haqqi, Muhammad Ahsen Fahim, Hanzallah Amjad, Ahmad Orakzai, Aqsa Gul, Chris Tanner 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05912v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文明确提到Large Language Model (LLM)部署和评估，与第一个关键词高度相关（10分）。其他关键词涉及具体技术方法（如MoE、RLHF、RAG等）、模型优化（如量化、推理加速）或特定应用领域（如生物信息学），论文未涉及这些具体技术或领域，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对金融领域缺乏实际专业任务评估基准的问题，提出了一个需要超过18小时人工完成的复杂金融建模基准FrontierFinance，并发现人类专家在任务完成质量和客户就绪输出方面优于当前最先进的LLM系统。

摘要翻译

随着人工智能在知识密集型行业引发的劳动力替代担忧日益加剧，现有基准测试体系已无法有效衡量定义实际专业能力的任务表现。金融领域尤其被认定为人工智能暴露风险较高的行业，却缺乏能够追踪现实发展的稳健基准。当前大语言模型部署中明确问责机制的缺失进一步加剧了这一不足。为此，我们推出FrontierFinance——一个包含五大核心金融模型、共25项复杂金融建模任务的长期基准测试体系，每项任务平均需要超过18小时的专业人力完成。该基准由金融专业人士参与开发，反映了行业标准的金融建模工作流程，并配有结构化评估的详细评分标准。我们邀请人类专家参与任务定义、制定评分标准、对大语言模型进行评分，并通过亲自执行任务建立人类基准线。研究表明，与当前最先进的系统相比，人类专家不仅平均得分更高，且更有可能提供可直接交付客户的工作成果。

摘要 (Abstract)

As concerns surrounding AI-driven labor displacement intensify in knowledge-intensive sectors, existing benchmarks fail to measure performance on tasks that define practical professional expertise. Finance, in particular, has been identified as a domain with high AI exposure risk, yet lacks robust benchmarks to track real-world developments. This gap is compounded by the absence of clear accountability mechanisms in current Large Language Model (LLM) deployments. To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete. Developed with financial professionals, the benchmark reflects industry-standard financial modeling workflows and is paired with detailed rubrics for structured evaluation. We engage human experts to define the tasks, create rubrics, grade LLMs, and perform the tasks themselves as human baselines. We demonstrate that our human experts both receive higher scores on average, and are more likely to provide client-ready outputs than current state-of-the-art systems.

关键词: FrontierFinance, financial modeling, benchmark, Large Language Models, AI-driven labor displacement, long-horizon tasks, human experts, client-ready outputs

128. ❌ FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth, froM Children to Adolescents

作者: Cherifa Ben Khelil, Jean-Yves Antoine, Anaïs Halftermeyer, Frédéric Rayar, Mathieu Thebaud 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05899v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文介绍了一个法语语料库（French-YMCA），专门针对儿童和青少年的语言需求构建，包含39,200个文本文件和22,471,898个单词，旨在为训练理解青少年语言的语言模型提供基础资源。论文内容聚焦于语料库构建、语言资源描述和潜在应用（如改善数字交互的年龄适宜性），但未涉及大模型或深度学习的技术原理、创新方法（如MoE、量化、推理优化等）、训练技术（如预训练、微调、对齐）、代理系统或科学AI应用。所有关键词均与大模型技术、训练方法、优化技术或特定科学领域应用相关，而本论文仅提及语料库可作为语言模型训练的基础，但未实际研究或应用任何大模型技术，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文构建了一个针对儿童和青少年的法语语料库（French-YMCA），包含大量文本数据，旨在为训练理解青少年语言的语言模型提供资源，以改善数字交互的年龄适宜性。

摘要翻译

本文介绍了French-YMCA语料库，这是一个专门针对儿童和青少年设计的新型语言资源。构建该语料库的动机十分明确：儿童具有独特的语言需求，因为他们的语言能力处于持续发展阶段，且与成人存在差异。French-YMCA语料库包含39,200个文本文件，总计22,471,898词。其特色在于来源的多样性、语法与拼写的一致性，以及坚持向所有人提供开放的在线访问权限。此类语料库可作为训练语言模型的基础，使其能够理解并预测青少年语言，从而提升数字交互的质量，并确保回应与建议符合该年龄段用户的认知水平且适龄适配。

摘要 (Abstract)

In this paper, we introduce the French-YMCA corpus, a new linguistic resource specifically tailored for children and adolescents. The motivation for building this corpus is clear: children have unique language requirements, as their language skills are in constant evolution and differ from those of adults. With an extensive collection of 39,200 text files, the French-YMCA corpus encompasses a total of 22,471,898 words. It distinguishes itself through its diverse sources, consistent grammar and spelling, and the commitment to providing open online accessibility for all. Such corpus can serve as the foundation for training language models that understand and anticipate youth’s language, thereby enhancing the quality of digital interactions and ensuring that responses and suggestions are age-appropriate and adapted to the comprehension level of users of this age.

关键词: French corpus, children and adolescents, linguistic resource, language models, age-appropriate interactions, text corpus, youth language, open accessibility

129. ❌ Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models

作者: Xiangming Gu, Soham De, Larisa Markeeva, Petar Veličković, Razvan Pascanu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05868v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究大型推理模型（LRMs）中的采样策略比较，核心关注推理过程而非具体技术实现。高度相关关键词：‘Large Language Models’（论文明确研究LRMs，属于大模型范畴）、‘Chain of Thought’和’System 2 Thinking’（论文研究数学和编程问题的多步推理过程，与复杂推理直接相关）。中等相关：‘Context Window Extension’（论文提到序列采样需要更长上下文，但非主要研究点）。其余关键词涉及具体技术（如MoE、量化、对齐等）或应用领域（如科学AI），论文未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了大型推理模型中并行采样与序列采样的性能差异问题，并通过实验发现性能差距的主要原因是序列采样导致的探索不足，而非聚合操作或上下文长度问题。

摘要翻译

大型推理模型（LRMs）在数学与编程等复杂问题上已展现出卓越性能。然而，为获得高质量解答，通常需要进行多次采样。理论上，存在两种可组合为更复杂流程的采样策略：顺序采样与并行采样。本文首先严谨比较了这两种方法，并观察到——与先前研究一致——即使顺序采样本应具备更强的表征能力，并行采样仍表现出更优性能。为探究其内在原因，我们针对该现象提出三种假设：（i）并行采样的优势源于聚合算子；（ii）顺序采样因需使用更长上下文而受损；（iii）顺序采样因受限于先前答案而导致探索不足。基于不同模型系列与规模（Qwen3、DeepSeek-R1蒸馏模型、Gemini 2.5）及问题领域（数学与编程）的实证证据表明，聚合操作与上下文长度并非性能差距的主因。相反，探索不足似乎起着更为关键的作用，我们认为这是导致性能差距的主要原因之一。

摘要 (Abstract)

Large Reasoning Models (LRMs) have shown remarkable performance on challenging questions, such as math and coding. However, to obtain a high quality solution, one may need to sample more than once. In principal, there are two sampling strategies that can be composed to form more complex processes: sequential sampling and parallel sampling. In this paper, we first compare these two approaches with rigor, and observe, aligned with previous works, that parallel sampling seems to outperform sequential sampling even though the latter should have more representation power. To understand the underline reasons, we make three hypothesis on the reason behind this behavior: (i) parallel sampling outperforms due to the aggregator operator; (ii) sequential sampling is harmed by needing to use longer contexts; (iii) sequential sampling leads to less exploration due to conditioning on previous answers. The empirical evidence on various model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding) suggests that the aggregation and context length do not seem to be the main culprit behind the performance gap. In contrast, the lack of exploration seems to play a considerably larger role, and we argue that this is one main cause for the performance gap.

关键词: Large Reasoning Models, parallel sampling, sequential sampling, performance gap, exploration, math reasoning, coding reasoning, context length

130. ❌ Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching

作者: Yicheng Pan, Zhiyuan Ning, Ludi Wang, Yi Du 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05866v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出P2R框架，使用通用大语言模型（LLMs）构建结构化档案，并采用LLM委员会进行评审员匹配，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。框架包含混合检索步骤，结合语义和方面级信号，与’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’有一定关联（5分）。论文未涉及其他关键词的具体技术或应用，如MoE、SLMs、训练方法、对齐、推理、代理、压缩等，也未在科学领域（如生物信息学）直接应用AI，因此其他关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对会议投稿量增长下评审员匹配不准确的问题，提出了P2R框架，利用大语言模型构建结构化档案并进行混合检索与LLM委员会评估，实验表明其在多个数据集上优于现有基线方法。

摘要翻译

随着会议投稿量的持续增长，如何准确推荐合适的审稿人已成为一项挑战。现有方法大多遵循“论文到论文”的匹配范式，即通过审稿人的发表历史隐式地表示其专长。然而，有效的审稿人匹配需要捕捉多维度的专业知识，仅依靠与过往论文的文本相似性往往不足。为弥补这一不足，我们提出P2R——一个无需训练的框架，将匹配方式从隐式的论文对论文匹配转向基于显式档案的匹配。P2R利用通用大语言模型（LLMs）为投稿和审稿人构建结构化档案，将其解构为研究主题（Topics）、方法（Methodologies）和应用领域（Applications）三个维度。基于这些档案，P2R采用由粗到精的流程以平衡效率与深度：首先通过结合语义与维度层面信号的混合检索形成高召回率的候选池，随后由基于LLM的评审委员会依据严格标准对候选人进行评估，综合集成多维度的专家视角及整体领域主席（Area Chair）的全局观点。在NeurIPS、SIGIR和SciRepEval数据集上的实验表明，P2R持续优于现有先进基线方法。消融研究进一步验证了各组成部分的必要性。总体而言，P2R凸显了显式结构化专长建模的价值，并为应用大语言模型于审稿人匹配任务提供了实践指导。

摘要 (Abstract)

As conference submission volumes continue to grow, accurately recommending suitable reviewers has become a challenge. Most existing methods follow a ``Paper-to-Paper’’ matching paradigm, implicitly representing a reviewer by their publication history. However, effective reviewer matching requires capturing multi-dimensional expertise, and textual similarity to past papers alone is often insufficient. To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching. P2R uses general-purpose LLMs to construct structured profiles for both submissions and reviewers, disentangling them into Topics, Methodologies, and Applications. Building on these profiles, P2R adopts a coarse-to-fine pipeline to balance efficiency and depth. It first performs hybrid retrieval that combines semantic and aspect-level signals to form a high-recall candidate pool, and then applies an LLM-based committee to evaluate candidates under strict rubrics, integrating both multi-dimensional expert views and a holistic Area Chair perspective. Experiments on NeurIPS, SIGIR, and SciRepEval show that P2R consistently outperforms state-of-the-art baselines. Ablation studies further verify the necessity of each component. Overall, P2R highlights the value of explicit, structured expertise modeling and offers practical guidance for applying LLMs to reviewer matching.

关键词: reviewer matching, LLMs, structured profiling, hybrid retrieval, paper-reviewer matching, expertise modeling, training-free framework, multi-dimensional expertise

131. ❌ AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

作者: Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05846v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM智能体在图学习中的应用，与’Large Language Models’、‘LLM Agents’、‘Tool Use’高度相关（10分），涉及检索增强生成（5分），其他关键词如MoE、量化、推理加速等未涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了AgentGL框架，通过强化学习驱动的LLM智能体在图数据上进行拓扑感知导航和推理，在节点分类和链接预测任务上显著超越了现有基线方法。

摘要翻译

大语言模型（LLM）日益依赖智能体能力——迭代检索、工具使用和决策制定——以突破静态参数化知识的局限。然而，现有智能体框架将外部信息视为非结构化文本，未能利用现实世界数据中固有的拓扑依赖关系。为弥补这一差距，我们提出了图学习智能体化（Agentic Graph Learning, AGL）范式，该范式将图学习重新定义为拓扑感知导航与基于LLM的推理交错进行的过程。具体而言，我们提出了首个强化学习驱动的AGL框架——AgentGL。该框架为LLM智能体配备了原生图工具以支持多尺度探索，通过搜索约束思维机制调控工具使用以平衡准确性与效率，并采用图条件课程强化学习策略来稳定长周期策略学习而无需逐步监督。在多样化的文本属性图基准测试和多种LLM骨干模型上，AgentGL显著优于先进的GraphLLM与GraphRAG基线方法，在节点分类任务中实现最高17.5%的绝对性能提升，在链接预测任务中提升达28.4%。这些结果表明，AGL是使LLM能够自主导航并推理复杂关系环境的前沿方向。代码已公开于https://github.com/sunyuanfu/AgentGL。

摘要 (Abstract)

Large Language Models (LLMs) increasingly rely on agentic capabilities-iterative retrieval, tool use, and decision-making-to overcome the limits of static, parametric knowledge. Yet existing agentic frameworks treat external information as unstructured text and fail to leverage the topological dependencies inherent in real-world data. To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference. Specifically, we propose AgentGL, the first reinforcement learning (RL)-driven framework for AGL. AgentGL equips an LLM agent with graph-native tools for multi-scale exploration, regulates tool usage via search-constrained thinking to balance accuracy and efficiency, and employs a graph-conditioned curriculum RL strategy to stabilize long-horizon policy learning without step-wise supervision. Across diverse Text-Attributed Graph (TAG) benchmarks and multiple LLM backbones, AgentGL substantially outperforms strong GraphLLMs and GraphRAG baselines, achieving absolute improvements of up to 17.5% in node classification and 28.4% in link prediction. These results demonstrate that AGL is a promising frontier for enabling LLMs to autonomously navigate and reason over complex relational environments. The code is publicly available at https://github.com/sunyuanfu/AgentGL.

关键词: Large Language Models, LLM Agents, Graph Learning, Reinforcement Learning, Tool Use, Retrieval-Augmented Generation, Text-Attributed Graphs, Autonomous Navigation

132. ❌ LoRM: Learning the Language of Rotating Machinery for Self-Supervised Condition Monitoring

作者: Xiao Qin, Xingyi Song, Tong Liu, Hatim Laalej, Zepeng Liu, Yunpeng Zhu, Ligang He 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05863v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文LoRM提出了一种将旋转机械信号视为机器语言的自监督框架，通过将传感器数据转化为token序列预测问题，并部分微调通用预训练语言模型进行知识迁移，实现工业信号分析和状态监测。核心相关关键词包括：1) ‘Large Language Models’ (8分)：论文明确使用预训练语言模型作为基础；2) ‘Pre-training’和’Post-training’ (各8分)：涉及预训练模型和微调过程；3) ‘PEFT’ (8分)：采用部分微调实现参数高效迁移；4) ‘Quantization’ (5分)：将未来目标段量化为离散token，涉及量化概念；5) ‘AI for Science’ (8分)：属于工业应用场景的科学AI。其他关键词如MoE、SLMs、RAG、RLHF等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

LoRM提出了一种将旋转机械信号视为机器语言的自监督框架，通过token化传感器数据和微调预训练语言模型，实现了高效的工业设备状态监测和跨工具泛化。

摘要翻译

本文提出LoRM（旋转机械语言），一种用于多模态旋转机械信号理解与实时状态监测的自监督框架。LoRM基于以下理念构建：旋转机械信号可被视为一种机器语言——局部信号可被离散化为符号单元，其未来演化可通过观测的多传感器上下文进行预测。与依赖人工构造变换和特征的传统信号处理方法不同，LoRM将多模态传感器数据重新定义为基于令牌的序列预测问题。对于每个数据窗口，观测上下文段以连续形式保留，而各传感通道的未来目标段则被量化为离散令牌。随后，通过在工业信号上对通用预训练语言模型进行部分微调，实现高效知识迁移，避免了从头训练大型模型的需求。最终，通过追踪令牌预测误差作为健康指标进行状态监测，误差增长即表征性能退化。原位刀具状态监测实验证明了该框架具备稳定的实时追踪能力与强大的跨刀具泛化性能，表明LoRM为语言建模与工业信号分析之间搭建了实用桥梁。源代码公开于https://github.com/Q159753258/LormPHM。

摘要 (Abstract)

We present LoRM (Language of Rotating Machinery), a self-supervised framework for multi-modal rotating-machinery signal understanding and real-time condition monitoring. LoRM is built on the idea that rotating-machinery signals can be viewed as a machine language: local signals can be tokenised into discrete symbolic units, and their future evolution can be predicted from observed multi-sensor context. Unlike conventional signal-processing methods that rely on hand-crafted transforms and features, LoRM reformulates multi-modal sensor data as a token-based sequence-prediction problem. For each data window, the observed context segment is retained in continuous form, while the future target segment of each sensing channel is quantised into a discrete token. Then, efficient knowledge transfer is achieved by partially fine-tuning a general-purpose pre-trained language model on industrial signals, avoiding the need to train a large model from scratch. Finally, condition monitoring is performed by tracking token-prediction errors as a health indicator, where increasing errors indicate degradation. In-situ tool condition monitoring (TCM) experiments demonstrate stable real-time tracking and strong cross-tool generalisation, showing that LoRM provides a practical bridge between language modelling and industrial signal analysis. The source code is publicly available at https://github.com/Q159753258/LormPHM.

关键词: self-supervised learning, rotating machinery, condition monitoring, language model fine-tuning, multi-modal sensor data, token prediction, industrial signal analysis, real-time tracking

133. ❌ CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training

作者: Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh, Heuiseok Lim 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05821v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于多语言嵌入模型的跨语言检索任务，提出了一种新的损失函数CLEAR，通过反向训练方案增强语言对齐。研究内容属于自然语言处理中的多语言表示学习领域，但未涉及大模型、深度学习技术原理创新或大模型在不同领域的应用。所有关键词均针对大模型技术、训练方法、推理优化、应用场景等，与本文的嵌入模型和检索任务无直接关联，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对多语言嵌入模型在跨语言检索中因语言资源不平衡和对齐不足导致的性能问题，提出了一种名为CLEAR的损失函数，通过反向训练方案利用英语作为桥梁增强语言对齐，实验表明该方法在跨语言场景（尤其是低资源语言）中提升了检索性能（最高提升15%），同时最小化了英语性能下降。

摘要翻译

现有跨语言嵌入模型常因语言资源不均衡及训练中对跨语言对齐考量不足，在跨语言场景下面临挑战。尽管标准化的跨语言适应对比学习方法已被广泛采用，但其可能难以捕捉语言间的本质对齐关系，并在英语等高对齐语言中导致性能下降。为应对这些挑战，我们提出基于反向训练的跨语言检索增强方法（CLEAR），这是一种利用反向训练机制的新型损失函数，旨在提升多样化跨语言检索场景下的性能。CLEAR以英语段落为桥梁，强化目标语言与英语之间的对齐关系，从而确保跨语言检索任务的鲁棒性。大量实验表明，CLEAR在跨语言场景中实现了显著提升——尤其在低资源语言中增益高达15%，同时最大限度减少英语性能损失。此外，我们的研究结果凸显了CLEAR在多语言训练中同样具备良好效能，预示其广泛的适用性与可扩展潜力。代码已发布于https://github.com/dltmddbs100/CLEAR。

摘要 (Abstract)

Existing multilingual embedding models often encounter challenges in cross-lingual scenarios due to imbalanced linguistic resources and less consideration of cross-lingual alignment during training. Although standardized contrastive learning approaches for cross-lingual adaptation are widely adopted, they may struggle to capture fundamental alignment between languages and degrade performance in well-aligned languages such as English. To address these challenges, we propose Cross-Lingual Enhancement in Retrieval via Reverse-training (CLEAR), a novel loss function utilizing a reverse training scheme to improve retrieval performance across diverse cross-lingual retrieval scenarios. CLEAR leverages an English passage as a bridge to strengthen alignments between the target language and English, ensuring robust performance in the cross-lingual retrieval task. Our extensive experiments demonstrate that CLEAR achieves notable improvements in cross-lingual scenarios, with gains up to 15%, particularly in low-resource languages, while minimizing performance degradation in English. Furthermore, our findings highlight that CLEAR offers promising effectiveness even in multilingual training, suggesting its potential for broad application and scalability. We release the code at https://github.com/dltmddbs100/CLEAR.

关键词: cross-lingual retrieval, multilingual embedding models, reverse training, alignment enhancement, contrastive learning, low-resource languages, retrieval performance, CLEAR

134. ❌ WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

作者: Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan, Shiming Xiang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05818v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出WikiSeeker框架，核心是改进多模态检索增强生成（RAG）在知识型视觉问答中的应用。与关键词高度相关的是：1）‘Retrieval-Augmented Generation’（10分）- 论文直接研究多模态RAG框架；2）‘Large Language Models’（8分）- 使用LLM进行答案生成；3）‘LLM Agents’（8分）- 将VLM作为Refiner和Inspector两个智能体；4）‘Multi-agent Systems’（5分）- 涉及多个智能体协调。其他关键词如MoE、量化、推理加速等未涉及。

!!! tip deepseek-chat TL;DR

论文提出WikiSeeker多模态RAG框架，通过重新定义VLM为Refiner和Inspector两个智能体角色，显著提升了知识型视觉问答中的检索准确率和答案质量。

摘要翻译

多模态检索增强生成（Multi-modal Retrieval-Augmented Generation，RAG）已成为基于知识的视觉问答（Knowledge-Based Visual Question Answering，KB-VQA）领域一种高效范式。尽管近期研究有所进展，现有方法仍主要依赖图像作为检索依据，且常常忽视或误解视觉语言模型（Vision-Language Models，VLMs）的作用，未能充分发挥其潜力。本文提出WikiSeeker——一种新型多模态RAG框架，通过引入多模态检索器并重新定义VLMs的角色来弥补这些不足。我们不再将VLMs仅视为答案生成器，而是为其分配两个专用智能体：优化器（Refiner）与审查器（Inspector）。优化器利用VLM的能力，根据输入图像重写文本查询，显著提升了多模态检索器的性能。审查器则通过选择性路由机制，将可靠的检索上下文传输至另一大型语言模型（LLM）进行答案生成；当检索结果不可靠时，则依赖VLM的内部知识进行解耦式生成。在EVQA、InfoSeek和M2KR数据集上的大量实验表明，WikiSeeker实现了最先进的性能，在检索精度与答案质量上均取得显著提升。我们的代码将在https://github.com/zhuyjan/WikiSeeker 公开。

摘要 (Abstract)

Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM’s internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on https://github.com/zhuyjan/WikiSeeker.

关键词: Multi-modal RAG, Vision-Language Models, Knowledge-Based Visual Question Answering, Retrieval-Augmented Generation, Agent-based Framework, Multimodal Retriever, Decoupled Generation, State-of-the-art Performance

135. ❌ PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?

作者: Yusen Hou, Weicai Long, Haitao Hu, Houcheng Su, Junning Feng, Yanlin Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05775v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究LLMs在生物信息学领域的应用，特别是对原始噬菌体基因组的理解能力评估，因此与’Large Language Models’和’AI for Science’高度相关（10分）。论文提到模型在复杂推理任务中的局限性，涉及推理能力，因此与’Chain of Thought’和’System 2 Thinking’有一定关联（5分）。其他关键词如MoE、SFT、RAG等未在摘要中提及，与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文研究了大型语言模型（LLMs）直接理解原始噬菌体基因组序列的能力，通过PhageBench基准测试发现LLMs在噬菌体识别和宿主预测任务中表现出潜力，但在涉及长程依赖和细粒度功能定位的复杂推理任务中存在显著局限性。

摘要翻译

噬菌体常被称为生物圈的暗物质，在调控微生物生态系统及作为抗生素替代品方面发挥着关键作用。因此，准确解读其基因组具有重要的科学与实用价值。尽管通用大语言模型在理解生物文本方面表现优异，但其直接解读原始核苷酸序列并进行生物推理的能力仍未得到充分探索。为此，我们推出了PhageBench——首个通过模拟生物信息学专家工作流程来评估噬菌体基因组理解能力的基准测试。该数据集包含5,600个高质量样本，覆盖筛选、质量控制、表型注释三个阶段中的五项核心任务。我们对八种大语言模型的评估表明，通用推理模型在噬菌体重叠群识别和宿主预测任务上显著优于随机基线，展现出基因组理解方面的潜力。然而，这些模型在涉及长程依赖和细粒度功能定位的复杂推理任务中仍存在明显局限。这些发现凸显了开发具备增强型生物序列推理能力的下一代模型的必要性。

摘要 (Abstract)

Bacteriophages, often referred to as the dark matter of the biosphere, play a critical role in regulating microbial ecosystems and in antibiotic alternatives. Thus, accurate interpretation of their genomes holds significant scientific and practical value. While general-purpose Large Language Models (LLMs) excel at understanding biological texts, their ability to directly interpret raw nucleotide sequences and perform biological reasoning remains underexplored. To address this, we introduce PhageBench, the first benchmark designed to evaluate phage genome understanding by mirroring the workflow of bioinformatics experts. The dataset contains 5,600 high-quality samples covering five core tasks across three stages: Screening, Quality Control, and Phenotype Annotation. Our evaluation of eight LLMs reveals that general-purpose reasoning models significantly outperform random baselines in phage contig identification and host prediction, demonstrating promising potential for genomic understanding. However, they exhibit significant limitations in complex reasoning tasks involving long-range dependencies and fine-grained functional localization. These findings highlight the necessity of developing next-generation models with enhanced reasoning capabilities for biological sequences.

关键词: Large Language Models, bacteriophage genomes, bioinformatics, genomic understanding, biological reasoning, benchmark evaluation, phage contig identification, host prediction

136. ❌ GenomeQA: Benchmarking General Large Language Models for Genome Sequence Understanding

作者: Weicai Long, Yusen Hou, Junning Feng, Houcheng Su, Shuo Yang, Donglin Xie, Yanlin Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05774v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究大语言模型（LLMs）在基因组序列理解中的应用，属于AI for Science/Bioinformatics领域，因此这两个关键词得10分。论文评估通用LLMs在原始基因组序列上的表现，属于LLMs的应用研究，因此LLMs关键词也得10分。其他关键词如MoE、SFT、RAG、CoT等涉及具体技术方法，论文未涉及，均得0分。

!!! tip deepseek-chat TL;DR

论文研究了通用大语言模型在原始基因组序列理解任务中的表现，发现模型能利用局部序列信号但多步推理任务表现下降，并建立了GenomeQA基准用于诊断和改进。

摘要翻译

大型语言模型（LLMs）在基因组学中正日益被用作对话助手，其主要功能是通过自然语言界面推理生物学知识、注释和分析结果。然而，现有基准测试要么专注于为序列预测训练的专业DNA模型，要么仅使用纯文本问题评估生物学知识，导致通用LLMs在直接处理原始基因组序列时的行为尚未得到充分探索。我们提出了GenomeQA，这是一个专为通用LLMs在基于序列的基因组推理任务上提供受控评估环境而设计的基准。GenomeQA包含从多个生物数据库中抽取的5,200个样本，序列长度范围从6到1,000个碱基对（bp），涵盖六个任务类别：增强子与启动子识别、剪接位点识别、物种分类、组蛋白标记预测、转录因子结合位点预测以及转录因子基序预测。通过对六个前沿LLMs的测试，我们发现模型的表现持续优于随机基线，并能利用局部序列信号（如GC含量和短基序），但在需要对序列模式进行更间接或多步推理的任务上性能下降。GenomeQA为研究和改进通用LLMs在原始基因组序列上的应用建立了一个诊断性基准。

摘要 (Abstract)

Large Language Models (LLMs) are increasingly adopted as conversational assistants in genomics, where they are mainly used to reason over biological knowledge, annotations, and analysis outputs through natural language interfaces. However, existing benchmarks either focus on specialized DNA models trained for sequence prediction or evaluate biological knowledge using text-only questions, leaving the behavior of general-purpose LLMs when directly exposed to raw genome sequences underexplored. We introduce GenomeQA, a benchmark designed to provide a controlled evaluation setting for general-purpose LLMs on sequence-based genome inference tasks. GenomeQA comprises 5,200 samples drawn from multiple biological databases, with sequence lengths ranging from 6 to 1,000 base pairs (bp), spanning six task families: Enhancer and Promoter Identification, Splice Site Identification, Taxonomic Classification, Histone Mark Prediction, Transcription Factor Binding Site Prediction, and TF Motif Prediction. Across six frontier LLMs, we find that models consistently outperform random baselines and can exploit local sequence signals such as GC content and short motifs, while performance degrades on tasks that require more indirect or multi-step inference over sequence patterns. GenomeQA establishes a diagnostic benchmark for studying and improving the use of general-purpose LLMs on raw genomic sequences.

关键词: Large Language Models, Genome Sequence Understanding, Benchmark, GenomeQA, Bioinformatics, Genomics, Sequence-based Inference, AI for Science

137. ❌ Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0

作者: Roni Goldshmidt, Hamish Scott, Lorenzo Niccolini, Hernan Matzner 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05767v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文BADAS-2.0专注于计算机视觉和自动驾驶领域，研究碰撞预测系统，使用V-JEPA2模型进行微调，涉及大规模数据集构建、知识蒸馏到边缘设备以及可解释性方法。所有评分关键词均与大语言模型（LLM）或深度学习通用技术原理相关，而论文未提及任何LLM、MoE、Scaling Laws、预训练/后训练技术、对齐、RAG、推理方法、代理系统、模型压缩、幻觉缓解等主题，也未涉及生物信息学或化学信息学。因此，所有关键词得分为0。

!!! tip deepseek-chat TL;DR

论文BADAS-2.0提出了一种可扩展的碰撞预测系统，通过构建长尾基准数据集、知识蒸馏到紧凑模型以及实时可解释性方法，在边缘设备上实现了高精度和实时性能。

摘要翻译

我们推出第二代碰撞预警系统BADAS-2.0，该系统基于BADAS-1.0 [7]构建。BADAS-1.0已证明，在大规模第一视角行车记录仪数据上对V-JEPA2 [1]进行微调，其性能优于学术基线模型与商用高级驾驶辅助系统（ADAS）。 BADAS-2.0在以下三个维度推动了技术前沿：（一）长尾基准与准确性：我们引入了一个包含10个组别的长尾基准测试集，专注于罕见且对安全至关重要的场景。为构建此基准集，我们使用BADAS-1.0作为主动预言机，对数百万段未标注的驾驶视频进行评分，并筛选出高风险候选片段进行人工标注。结合Nexar的Atlas平台 [13] 进行定向数据采集，我们将标注数据集从4万段扩展至178,500段标注视频（约200万个片段），在所有子组别上均取得了一致的性能提升，并在最困难的长尾案例上获得了最大幅度的改进。（二）面向边缘设备的知识蒸馏：在225万段未标注驾驶视频上进行领域特定的自监督预训练，使得我们能够将知识蒸馏至紧凑模型BADAS-2.0-Flash（8600万参数）和BADAS-2.0-Flash-Lite（2200万参数），在精度几乎持平的情况下实现了7至12倍的加速，从而支持实时边缘部署。（三）可解释性：BADAS-2.0能够生成实时的、以目标为中心的注意力热力图，用于定位预测背后的证据。BADAS-Reason [17] 在此基础上进行了扩展，它利用一个视觉-语言模型，接收最后一帧图像和热力图，以生成驾驶员操作建议和结构化的文本推理。推理代码与评估基准已公开提供。

摘要 (Abstract)

We present BADAS-2.0, the second generation of our collision anticipation system, building on BADAS-1.0 [7], which showed that fine-tuning V-JEPA2 [1] on large-scale ego-centric dashcam data outperforms both academic baselines and production ADAS systems. BADAS-2.0 advances the state of the art along three axes. (i) Long-tail benchmark and accuracy: We introduce a 10-group long-tail benchmark targeting rare and safety-critical scenarios. To construct it, BADAS-1.0 is used as an active oracle to score millions of unlabeled drives and surface high-risk candidates for annotation. Combined with Nexar’s Atlas platform [13] for targeted data collection, this expands the dataset from 40k to 178,500 labeled videos (~2M clips), yielding consistent gains across all subgroups, with the largest improvements on the hardest long-tail cases. (ii) Knowledge distillation to edge: Domain-specific self-supervised pre-training on 2.25M unlabeled driving videos enables distillation into compact models, BADAS-2.0-Flash (86M) and BADAS-2.0-Flash-Lite (22M), achieving 7-12x speedup with near-parity accuracy, enabling real-time edge deployment. (iii) Explainability: BADAS-2.0 produces real-time object-centric attention heatmaps that localize the evidence behind predictions. BADAS-Reason [17] extends this with a vision-language model that consumes the last frame and heatmap to generate driver actions and structured textual reasoning. Inference code and evaluation benchmarks are publicly available.

关键词: collision anticipation, BADAS-2.0, knowledge distillation, edge deployment, explainability, real-time, V-JEPA2, long-tail benchmark

138. ❌ Identifying Influential N-grams in Confidence Calibration via Regression Analysis

作者: Shintaro Ozaki, Wataru Hashimoto, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05757v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs在推理过程中的置信度校准问题，通过回归分析识别与置信度相关的语言表达（n-grams），属于LLM可解释性和推理机制研究。与’Large Language Models’高度相关（10分），涉及’Chain of Thought’和’System 2 Thinking’（各8分），因为研究聚焦LLMs的显式推理部分；与’Mechanistic Interpretability’相关（8分），因为通过分析语言特征来理解模型置信度机制。其他关键词如MoE、SFT、RAG等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究大型语言模型在推理过程中过度自信的问题，通过回归分析识别出与置信度相关的特定语言表达（n-grams），并证明通过抑制这些表达可以在不降低性能的情况下实现置信度校准。

摘要翻译

尽管大型语言模型（LLMs）通过显式推理提升了性能，但其回答往往表现出过度自信，即使其中包含了表达不确定性的语言表述。在本研究中，我们通过回归方法识别了哪些语言表述与置信度相关。具体而言，我们将LLMs推理部分中这些语言表述的置信度预测为因变量，并分析特定$n$元词组与置信度之间的关系。在多个模型和问答基准测试中，我们发现当涉及推理时，LLMs仍保持过度自信，并将此行为归因于特定的语言信息。有趣的是，部分提取出的表述与为提升推理性能而在测试时扩展中刻意插入的提示短语相吻合。通过对因果关系的测试以及验证所提取的语言信息确实影响置信度，我们揭示了仅需抑制那些过度自信的表述即可实现置信度校准，且不会导致性能下降。

摘要 (Abstract)

While large language models (LLMs) improve performance by explicit reasoning, their responses are often overconfident, even though they include linguistic expressions demonstrating uncertainty. In this work, we identify what linguistic expressions are related to confidence by applying the regression method. Specifically, we predict confidence of those linguistic expressions in the reasoning parts of LLMs as the dependent variables and analyze the relationship between a specific $n$-gram and confidence. Across multiple models and QA benchmarks, we show that LLMs remain overconfident when reasoning is involved and attribute this behavior to specific linguistic information. Interestingly, several of the extracted expressions coincide with cue phrases intentionally inserted on test-time scaling to improve reasoning performance. Through our test on causality and verification that the extracted linguistic information truly affects confidence, we reveal that confidence calibration is possible by simply suppressing those overconfident expressions without drops in performance.

关键词: large language models, confidence calibration, regression analysis, n-grams, reasoning, overconfidence, linguistic expressions, QA benchmarks

139. ❌ Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning

作者: Yanbei Jiang, Amr Keleg, Ryandito Diandaru, Jey Han Lau, Lea Frermann, Biaoyan Fang, Fajri Koto 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05756v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM的分布对齐问题，提出新的微调框架控制输出分布，与LLM、微调、对齐、DPO等关键词高度相关（10分），但未涉及其他技术如MoE、量化、推理加速等（0分）。

!!! tip deepseek-chat TL;DR

该论文研究如何通过KL优化的微调框架控制LLM在多轮生成中的输出分布，实验表明该方法显著优于基线，实现了对属性生成任务的精确分布控制。

摘要翻译

现实世界本质上是随机性的，然而大型语言模型（LLM）的评估主要集中于针对固定标准答案的单轮推理。本研究将视角转向分布对齐：评估当反复提示时，LLM能否生成符合期望目标分布（例如反映真实世界统计特征或均匀分布）的输出。我们以职业背景下的性别、种族和情感属性为框架，对分布对齐进行了形式化定义。实证分析表明，现成的LLM及标准对齐技术（包括提示工程和直接偏好优化）均无法可靠地控制输出分布。为弥补这一差距，我们提出了一种新颖的微调框架，将引导令牌校准与语义对齐相结合。我们引入了一种混合目标函数，该函数结合了用于锚定潜在引导令牌概率质量的Kullback-Leibler散度，以及用于将这些令牌与语义一致响应绑定的Kahneman-Tversky优化。在六个不同数据集上的实验表明，我们的方法显著优于基线模型，在属性生成任务中实现了精确的分布控制。

摘要 (Abstract)

While the real world is inherently stochastic, Large Language Models (LLMs) are predominantly evaluated on single-round inference against fixed ground truths. In this work, we shift the lens to distribution alignment: assessing whether LLMs, when prompted repeatedly, can generate outputs that adhere to a desired target distribution, e.g. reflecting real-world statistics or a uniform distribution. We formulate distribution alignment using the attributes of gender, race, and sentiment within occupational contexts. Our empirical analysis reveals that off-the-shelf LLMs and standard alignment techniques, including prompt engineering and Direct Preference Optimization, fail to reliably control output distributions. To bridge this gap, we propose a novel fine-tuning framework that couples Steering Token Calibration with Semantic Alignment. We introduce a hybrid objective function combining Kullback-Leibler divergence to anchor the probability mass of latent steering tokens and Kahneman-Tversky Optimization to bind these tokens to semantically consistent responses. Experiments across six diverse datasets demonstrate that our approach significantly outperforms baselines, achieving precise distributional control in attribute generation tasks.

关键词: Large Language Models, distribution alignment, fine-tuning, Direct Preference Optimization, Kullback-Leibler divergence, attribute generation, semantic alignment, output distribution control

140. ❌ Dialogue Act Patterns in GenAI-Mediated L2 Oral Practice: A Sequential Analysis of Learner-Chatbot Interactions

作者: Liqun He, Shijun, Chen, Mutlu Cukurova, Manolis Mavrikis 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05702v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究生成式AI语音聊天机器人在二语口语练习中的应用，属于大模型在教育领域的应用研究。论文主要关注对话行为模式分析，而非大模型技术原理本身。因此，仅与"Large Language Models OR LLMs OR Foundation Models”（5分）和"LLM Agents OR Autonomous Agents OR Agentic Workflow”（5分）有一定关联，因为聊天机器人可视为一种AI代理应用。其他关键词均涉及具体技术细节（如MoE、量化、推理优化等）或特定应用领域（如生物信息学），论文未涉及这些内容，故评0分。

!!! tip deepseek-chat TL;DR

本研究通过分析中国EFL学习者与生成式AI语音聊天机器人的对话行为模式，发现高进步会话中学习者提问更多且包含更多基于提示的纠正反馈序列，为自适应教育聊天机器人设计提供了实证依据。

摘要翻译

尽管生成式人工智能语音聊天机器人为第二语言口语练习提供了可扩展的机遇，但与学习者收获相关的互动过程仍未得到充分探究。本研究调查了在为期十周的干预中，九年级中国英语作为外语学习者与生成式人工智能语音聊天机器人互动中的对话行为模式。研究人员采用基于教学法的编码方案，对来自12名学生的70段会话进行了人工标注，共得到6,957个已编码的对话行为。研究比较了高进步与低进步会话在对话行为分布和序列模式上的差异。在对话行为层面，高进步会话表现出更多学习者主动发起的提问，而低进步会话则显示出更高的澄清请求率，表明其理解困难更大。在序列层面，高进步会话的特征是更频繁地出现基于提示的纠正性反馈序列，且这些反馈持续位于学习者回应之后，凸显了反馈类型与时序在有效互动中的作用。总体而言，这些发现强调了对话视角在生成式人工智能聊天机器人设计中的价值，贡献了一个基于教学法的对话行为编码框架，并为第二语言教育中自适应生成式人工智能聊天机器人的设计提供了参考。

摘要 (Abstract)

While generative AI (GenAI) voice chatbots offer scalable opportunities for second language (L2) oral practice, the interactional processes related to learners’ gains remain underexplored. This study investigates dialogue act (DA) patterns in interactions between Grade 9 Chinese English as a foreign language (EFL) learners and a GenAI voice chatbot over a 10-week intervention. Seventy sessions from 12 students were annotated by human coders using a pedagogy-informed coding scheme, yielding 6,957 coded DAs. DA distributions and sequential patterns were compared between high- and low-progress sessions. At the DA level, high-progress sessions showed more learner-initiated questions, whereas low-progress sessions exhibited higher rates of clarification-seeking, indicating greater comprehension difficulty. At the sequential level, high-progress sessions were characterised by more frequent prompting-based corrective feedback sequences, consistently positioned after learner responses, highlighting the role of feedback type and timing in effective interaction. Overall, these findings underscore the value of a dialogic lens in GenAI chatbot design, contribute a pedagogy-informed DA coding framework, and inform the design of adaptive GenAI chatbots for L2 education.

关键词: Generative AI, Chatbot, Second Language Learning, Dialogue Act Analysis, Oral Practice, Feedback Sequences, Adaptive Design, EFL Education

141. ❌ See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

作者: Yicheng Ji, Jun Zhang, Jinpeng Chen, Cong Wang, Lidan Shou, Gang Chen, Huan Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05650v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于Video-LLMs的推理加速，核心贡献是提出了一种名为LVSpec的松散推测解码框架。该论文与’Large Language Models’高度相关（10分），因为它专门研究视频大语言模型（Video-LLMs）。与’Speculative Decoding OR Inference Acceleration’高度相关（10分），因为其核心创新点就是改进推测解码方法以实现高效推理。论文未涉及其他关键词，如MoE、SFT、RAG、量化、幻觉缓解等，因此这些关键词得分为0分。

!!! tip deepseek-chat TL;DR

该论文针对视频大语言模型推理延迟高的问题，提出了一种无需训练的松散推测解码框架LVSpec，在保持模型性能（>99.8%）的同时，显著提升了推理速度（最高达2.94倍）。

摘要翻译

视频大语言模型（Video-LLMs）在视频理解方面表现出色，但在自回归生成过程中存在高推理延迟问题。推测解码（Speculative Decoding, SD）通过采用草稿生成与验证范式来缓解此问题，然而现有方法受限于严格的精确匹配规则，严重制约了加速潜力。为弥补这一差距，我们提出了LVSpec，这是首个为Video-LLMs设计的无需训练的宽松推测解码框架。基于生成过程由稀疏的视觉相关锚点（要求严格匹配）和大量视觉无关填充词（允许宽松验证）所主导的洞见，LVSpec采用轻量级视觉相关令牌识别方案来精确定位前者。为进一步最大化接受率，我们引入了位置偏移容忍机制，有效挽救位置不匹配但语义等价的令牌。实验表明，LVSpec实现了高保真度与高速度：在保持目标模型>99.8%性能的同时，将Qwen2.5-VL-32B加速2.70倍，将LLaVA-OneVision-72B加速2.94倍。值得注意的是，相较于视频大语言模型领域最先进的无需训练推测解码方法，其平均接受长度和加速比分别提升了136%和35%。

摘要 (Abstract)

Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves high fidelity and speed: it preserves >99.8 of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs.

关键词: Video Large Language Models, Speculative Decoding, Inference Acceleration, Autoregressive Generation, Training-free Framework, Visual-Semantic Guidance, Loosely Speculative Decoding, Inference Latency

142. ❌ Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs

作者: Hongyuan Yuan, Xinran He, Run Shao, Bolei He, Xianwei Xue, Mengke Chen, Qiutong Pan, Haiwei Wang, Haifeng Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05643v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM的推理优化，直接涉及Chain of Thought（CoT）推理（核心内容，15分），使用SFT和DPO进行训练（各10分），关注LLM推理中的反思/自我改进模式（10分），属于大模型技术原理创新（10分）。其他关键词如MoE、量化、科学AI应用等未涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文针对LLM在CoT推理中因奖励稀疏导致的过度思考和冗余反思问题，提出了一种基于图的CoT剪枝框架，通过分支级和深度级剪枝策略，结合SFT、DPO和GRPO训练，在保持或提高准确性的同时将平均推理token减少了42%。

摘要翻译

通过强化学习扩展思维链已被广泛用于增强大语言模型的推理能力。然而，由于奖励信号的稀疏性，这种方法也可能引发不良的思维模式，例如过度思考，即生成冗余的中间推理内容。本文认为，这种冗余的主要来源是低效的反思，其通常表现为两种问题模式：无差别反思，即模型在整个推理过程中进行宽泛、低影响的检查；以及重复性反思，即模型反复重新验证已确立的结论。为解决此问题，我们引入了一种基于图的思维链优化框架。具体而言，我们将每个线性的思维链转换为具有显式依赖边的有向无环图，并设计了一种双重剪枝策略：分支级剪枝移除贡献较弱的反思分支，而深度级剪枝则消除后期阶段的重复验证。我们通过一个三阶段流程来蒸馏此行为：（1）监督微调，在剪枝后的简洁推理轨迹上初始化策略；（2）直接偏好优化，以偏好正确但冗余较少的轨迹；（3）带长度惩罚的组相对策略优化，以共同优化答案正确性和效率。实验表明，我们的方法在保持或提高准确性的同时，将平均推理标记数量减少了42%。

摘要 (Abstract)

Extending CoT through RL has been widely used to enhance the reasoning capabilities of LLMs. However, due to the sparsity of reward signals, it can also induce undesirable thinking patterns such as overthinking, i.e., generating redundant intermediate reasoning content. In this work, we argue that a major source of such redundancy is inefficient reflection, which often manifests in two problematic patterns: Indiscriminate Reflection, where the model performs broad, low-impact checks throughout reasoning, and Repetitive Reflection, where it repeatedly re-verifies an already established conclusion. To address this, we introduce a graph-based CoT optimization framework. Specifically, we convert each linear CoT into a directed acyclic graph (DAG) with explicit dependency edges, and design a dual pruning strategy: branch-level pruning removes weakly contributing reflection branches, while depth-level pruning eliminates late-stage re-verification. We distill this behavior via a three-stage pipeline: (1) SFT to initialize the policy on pruned concise traces, (2) DPO to prefer correct but less redundant trajectories, and (3) GRPO with length penalty to jointly optimize answer correctness and efficiency. Experiments show that our approach reduces the average reasoning tokens by 42% while maintaining or improving accuracy.

关键词: Chain-of-Thought, Reasoning LLMs, Graph-based pruning, Redundant reflection, DPO, SFT, Inference efficiency, Self-reflection

143. ❌ DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

作者: Xinran Wang, Yuxuan Zhang, Xiao Zhang, Haolong Yan, Muxi Diao, Songyu Xu, Zhonghao Yan, Hongbing Li, Kongming Liang, Zhanyu Ma 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05623v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于多模态大语言模型（MLLMs）在长图像描述中的幻觉检测和定位问题，与’Large Language Models’相关（8分），因为MLLMs是LLMs的多模态扩展；与’Hallucination Mitigation’高度相关（10分），因为这是论文的核心研究问题。其他关键词主要涉及纯文本LLMs的技术细节、训练方法、推理优化、代理系统等，与论文的多模态基准测试和幻觉定位任务无直接关联，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型生成长图像描述时产生的幻觉问题，提出了一个包含1000张图像、五个领域、具有细粒度token级标注的基准测试DetailVerifyBench，以评估模型在长上下文中精确定位幻觉的能力。

摘要翻译

准确检测并定位幻觉是确保图像描述高可靠性的关键任务。在多模态大语言模型（Multimodal Large Language Models, MLLMs）时代，图像描述已从简短句子演变为涵盖数百字的综合性叙述。这一转变使挑战呈指数级增长：模型现在必须在长篇上下文中精确定位具体的错误片段或词语，而非仅仅标记响应层面的不一致性。然而，现有基准测试缺乏评估此能力所需的精细粒度和领域多样性。为弥补这一空白，我们提出了DetailVerifyBench，这是一个包含五个不同领域共1,000张高质量图像的严格基准测试集。其描述平均长度超过200词，且包含多种幻觉类型的密集词元级标注，使其成为当前长图像描述领域中针对精确幻觉定位最具挑战性的基准测试。本基准测试集可通过https://zyx-hhnkh.github.io/DetailVerifyBench/获取。

摘要 (Abstract)

Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at https://zyx-hhnkh.github.io/DetailVerifyBench/.

关键词: Multimodal Large Language Models, Hallucination Detection, Hallucination Localization, Long Image Captions, Benchmark, Token-level Annotation, Dense Hallucination, DetailVerifyBench

144. ❌ YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset

作者: Peace Busola Falola, Jesujoba O. Alabi, Solomon O. Akinola, Folashade T. Ogunajo, Emmanuel Oluwadunsin Alabi, David Ifeoluwa Adelani 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05624v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文主要研究约鲁巴语的多领域命名实体识别数据集创建和基准测试，属于特定语言的NLP数据集工作，不涉及大模型、深度学习技术原理创新或科学领域应用。所有关键词均与大模型技术、训练方法、推理优化、对齐技术、代理系统、科学AI应用等相关，而本文仅使用标准的transformer编码器模型进行基准测试，未涉及任何关键词中的前沿技术或创新方法。

!!! tip deepseek-chat TL;DR

该论文创建了首个约鲁巴语多领域命名实体识别数据集YoNER，并通过基准测试发现非洲中心模型在约鲁巴语任务上优于通用多语言模型，但跨领域性能显著下降，同时发布了新的约鲁巴语专用语言模型OyoBERT。

摘要翻译

命名实体识别（NER）是一项基础性自然语言处理任务，然而约鲁巴语的相关研究长期受限于领域特定且规模有限的资源。现有资源如MasakhaNER（人工标注的新闻领域语料库）和WikiAnn（基于维基百科自动构建）虽具价值，但领域覆盖范围有限。为填补这一空白，我们提出了YoNER——一个新型多领域约鲁巴语NER数据集，将实体覆盖范围扩展至新闻和维基百科之外的领域。该数据集包含约5,000个句子和10万个词元，采集自圣经、博客、电影、广播节目和维基百科五大领域，并遵循CoNLL标注规范标注了三种实体类型：人物（PER）、组织（ORG）和地点（LOC）。标注工作由三位约鲁巴语母语者人工完成，标注者间一致性超过0.70，确保了高质量与一致性。我们通过MasakhaNER 2.0进行跨领域实验，对多种Transformer编码器模型进行基准测试，同时利用YoNER评估小样本领域内数据的效果，并结合英语数据集进行跨语言实验。结果表明：非洲中心模型在约鲁巴语任务上优于通用多语言模型，但跨领域性能显著下降，尤其在博客和电影领域；此外，我们发现新闻与维基百科等形式相近的正式领域间迁移效果更佳。我们还提出了新型约鲁巴语专用语言模型（OyoBERT），其在领域内评估中表现优于多语言模型。我们公开释放YoNER数据集与预训练的OyoBERT模型，以支持未来约鲁巴语自然语言处理研究。

摘要 (Abstract)

Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multidomain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains including Bible, Blogs, Movies, Radio broadcast and Wikipedia, and annotated with three entity types: Person (PER), Organization (ORG) and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for blogs and movie domains. Furthermore, we observed that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.

关键词: Named Entity Recognition, Yorùbá language, multidomain dataset, cross-domain evaluation, transformer models, African-centric models, OyoBERT, low-resource language

145. ❌ THIVLVC: Retrieval Augmented Dependency Parsing for Latin

作者: Luc Pommeret, Thibault Wagret, Jules Deret 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05564v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文THIVLVC是一个两阶段的拉丁语依存句法分析系统，它明确使用了大型语言模型（LLM）和检索增强生成（RAG）技术，因此这两个关键词高度相关（10分）。论文涉及在语言学（拉丁语）领域的应用，属于AI for Science的范畴，但并非核心的生物信息学或化学信息学，因此给5分。系统通过检索相似示例来提示LLM，这体现了上下文学习（In-context Learning）的思想，但并非论文的主要创新点，因此给5分。其他关键词如MoE、SFT、量化等均未在论文中涉及，因此得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一个名为THIVLVC的两阶段系统，通过检索增强生成（RAG）技术结合大型语言模型来改进拉丁语的依存句法分析，在诗歌文本上相比基线提升了17个CLAS点。

摘要翻译

本文介绍THIVLVC，这是一个为EvaLatin 2026依存句法分析任务设计的两阶段系统。给定一个拉丁语句子，我们首先依据句子长度和词性标注n-gram相似度，从CIRCSE树库中检索结构相似的条目；随后，利用检索到的示例及通用依存标注指南，提示一个大语言模型对UDPipe生成的基线句法分析结果进行优化。我们提交了两种配置：一种不包含检索机制，另一种包含检索增强生成（RAG）。在诗歌语料（塞涅卡作品）上，THIVLVC相较于UDPipe基线在CLAS指标上提升了17个百分点；在散文语料（托马斯·阿奎那作品）上，CLAS增益为1.5个百分点。针对系统输出与黄金标准之间300处差异的双盲错误分析表明，在标注者意见一致的判定中，53.3%支持THIVLVC的结果，这揭示了树库内部及树库之间存在的标注不一致现象。

摘要 (Abstract)

We describe THIVLVC, a two-stage system for the EvaLatin 2026 Dependency Parsing task. Given a Latin sentence, we retrieve structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompt a large language model to refine the baseline parse from UDPipe using the retrieved examples and UD annotation guidelines. We submit two configurations: one without retrieval and one with retrieval (RAG). On poetry (Seneca), THIVLVC improves CLAS by +17 points over the UDPipe baseline; on prose (Thomas Aquinas), the gain is +1.5 CLAS. A double-blind error analysis of 300 divergences between our system and the gold standard reveals that, among unanimous annotator decisions, 53.3% favour THIVLVC, showing annotation inconsistencies both within and across treebanks.

关键词: Dependency Parsing, Latin, Retrieval-Augmented Generation, Large Language Model, Treebank, UDPipe, EvaLatin, In-context Learning

146. ❌ EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents

作者: Xuan Dong, Huanyang Zheng, Tianhao Niu, Zhe Han, Pengzhan Li, Bofei Liu, Zhengyang Liu, Guancheng Li, Qingfu Zhu, Wanxiang Che 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05557v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文EpiBench专注于评估多模态智能体在科学研究工作流中的表现，核心是构建一个多轮、多证据的基准测试平台。与关键词的相关性分析如下：高度相关（8-10分）的关键词包括：‘LLM Agents/Autonomous Agents/Agentic Workflow’（10分，论文核心研究多模态智能体）、‘AI for Science/Bioinformatics/Cheminformatics’（10分，直接应用于科学研究领域）、‘Retrieval-Augmented Generation/RAG/Retrieval-Generation’（8分，涉及主动搜索文献和证据整合）、‘Chain of Thought/CoT Reasoning/Multi-step Reasoning’（8分，需要多步推理和跨论文比较）、‘System 2 Thinking/Slow Thinking/In-depth Reasoning’（8分，强调深入推理和证据积累）、‘Tool Use/Function Calling/API Tool Use’（8分，智能体需使用工具导航论文和整合证据）。其他关键词如MoE、量化、RLHF等与论文内容无直接关联，评分为0。论文未涉及特定大模型技术原理创新，但属于大模型在科学领域的应用研究，符合评分背景要求。

!!! tip deepseek-chat TL;DR

该论文提出了EpiBench基准测试，用于评估多模态智能体在科学研究中执行多轮、多证据工作流的能力，实验表明当前领先模型在困难任务上准确率仅为29.23%，揭示了该领域仍有巨大改进空间。

摘要翻译

科学研究遵循多轮次、多步骤的工作流程，需要主动检索文献、查阅图表，并整合跨论文的证据以对齐实验设置并支持可复现的结论。现有基准测试未能系统评估这种综合能力，大多低估了主动检索、多证据整合以及长期持续使用证据的重要性。本研究提出EpiBench，这是一个实例化的多轮次多模态基准测试，用于模拟短期科研工作流程。给定一项研究任务，智能体必须在多轮次中跨论文进行导航，从图表中对齐证据，并利用记忆中积累的证据来回答需要跨论文比较和多图表整合的客观问题。EpiBench引入了一个过程级评估框架，用于对研究智能体进行细粒度测试与诊断。实验表明，即使在领先模型上，其在困难数据集上的准确率也仅为29.23%，这表明多轮次、多证据的研究工作流程仍有巨大改进空间，同时也为可验证、可复现的研究智能体提供了一个评估平台。

摘要 (Abstract)

Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically assessed in existing benchmarks, which largely under-evaluate proactive search, multi-evidence integration and sustained evidence use over time. In this work, we introduce EpiBench, an episodic multi-turn multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the accumulated evidence in the memory to answer objective questions that require cross paper comparisons and multi-figure integration. EpiBench introduces a process-level evaluation framework for fine-grained testing and diagnosis of research agents. Our experiments show that even the leading model achieves an accuracy of only 29.23% on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows, providing an evaluation platform for verifiable and reproducible research agents.

关键词: multimodal agents, research workflows, benchmark, multi-turn, evidence integration, scientific research, evaluation framework, cross-paper comparison

147. ❌ AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

作者: Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05550v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文提出AutoSOTA系统，这是一个端到端自动化研究系统，采用多智能体架构（8个专门智能体）来复现和改进AI模型。核心相关关键词包括：‘LLM Agents/Autonomous Agents/Agentic Workflow’（10分，系统本身就是多智能体架构的自动化研究代理）、‘Multi-agent Systems/Agent Coordination’（10分，系统由8个专门智能体协作完成研究任务）、‘Tool Use/Function Calling/API Tool Use’（8分，智能体需要调用代码执行、环境管理等工具）、‘Self-Correction/Self-Improvement/Self-Reflection’（8分，系统包含反思和构思阶段以实现自我改进）、‘Large Language Models/LLMs/Foundation Models’（8分，论文明确提到LLM作为案例研究领域，且系统适用于大模型研究）。‘AI for Science/Bioinformatics/Cheminformatics’得5分，因为系统可应用于科学AI研究，但非论文核心焦点。其他关键词得0分，因为论文关注自动化研究系统架构，而非具体的大模型技术细节（如MoE、量化、注意力机制等）。

!!! tip deepseek-chat TL;DR

论文提出了AutoSOTA，一个端到端自动化研究系统，采用多智能体架构来自动复现和改进AI模型，在多个领域成功发现了105个超越原方法的新SOTA模型。

摘要翻译

人工智能研究日益依赖于漫长的复现、调试与迭代优化周期来实现最先进（State-Of-The-Art，SOTA）性能，这催生了对能够加速整个实证模型优化流程的系统的迫切需求。本文中，我们提出了AutoSOTA——一个端到端的自动化研究系统，它能够将顶级AI论文中发布的最新SOTA模型推进至可复现且经实证改进的新SOTA模型。我们将此问题形式化为三个紧密耦合的阶段：资源准备与目标设定、实验评估、以及反思与构思。为解决该问题，AutoSOTA采用了一种多智能体架构，包含八个专用智能体，它们协同工作，将论文内容转化为代码与依赖项、初始化并修复执行环境、追踪长期实验、生成并调度优化方案，并监督有效性以避免虚假提升。我们在从八个顶级AI会议中收集的近期研究论文上评估AutoSOTA，并依据代码可获取性与执行成本进行了筛选。在这些论文中，AutoSOTA在自动化复现及后续优化方面均展现出强大的端到端性能。具体而言，它成功发现了105个超越原报告方法的新SOTA模型，平均每篇论文耗时约五小时。涵盖大语言模型（LLM）、自然语言处理（NLP）、计算机视觉、时间序列和优化等领域的案例研究进一步表明，该系统能够超越常规的超参数调优，识别出架构创新、算法重新设计以及工作流层面的改进。这些结果表明，端到端的研究自动化不仅可以作为性能优化器，更能成为一种新型研究基础设施，减轻重复性实验负担，并帮助将人类注意力重新导向更高层次的科学创造力。

摘要 (Abstract)

Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.

关键词: automated research system, multi-agent architecture, state-of-the-art models, end-to-end automation, model optimization, experiment replication, AI research infrastructure, agentic workflow

148. ❌ Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

作者: Yanxu Mao, Peipei Liu, Tiehan Cui, Congying Liu, Mingzhe Xing, Datao You 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05549v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于LLM-based agents的安全威胁和红队测试方法，核心涉及LLM agents、推理过程操纵（Reasoning Hijacking）和约束优化。因此，与’Large Language Models’、‘Chain of Thought’、‘System 2 Thinking’和’LLM Agents’高度相关（10分），因为这些直接对应论文研究的LLM代理、推理轨迹和深度推理机制。其他关键词如MoE、SLMs、训练方法、RAG、压缩、对齐等均未在摘要中提及或与论文主题无关，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对LLM-based agents面临的安全威胁，提出了一种不修改用户提示的JailAgent框架，通过操纵推理轨迹和内存检索来有效进行红队测试，并在跨模型和跨场景环境中表现出色。

摘要翻译

随着基于大语言模型的智能体在各领域的广泛应用，其复杂性也带来了新的安全威胁。现有的红队测试方法多依赖于修改用户提示，这类方法对新数据缺乏适应性，且可能影响智能体的正常性能。为应对这一挑战，本文提出JailAgent框架，该框架完全避免修改用户提示，通过三个关键阶段——触发提取、推理劫持与约束强化——隐式操控智能体的推理轨迹与记忆检索机制。借助精准的触发识别、实时自适应机制以及优化的目标函数，JailAgent在跨模型与跨场景环境中均展现出卓越的性能表现。

摘要 (Abstract)

With the widespread application of LLM-based agents across various domains, their complexity has introduced new security threats. Existing red-team methods mostly rely on modifying user prompts, which lack adaptability to new data and may impact the agent’s performance. To address the challenge, this paper proposes the JailAgent framework, which completely avoids modifying the user prompt. Specifically, it implicitly manipulates the agent’s reasoning trajectory and memory retrieval with three key stages: Trigger Extraction, Reasoning Hijacking, and Constraint Tightening. Through precise trigger identification, real-time adaptive mechanisms, and an optimized objective function, JailAgent demonstrates outstanding performance in cross-model and cross-scenario environments.

关键词: LLM-based agents, security threats, red-team methods, JailAgent framework, reasoning hijacking, constraint tightening, memory retrieval, cross-model scenarios

149. ❌ Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

作者: Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05546v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于大型视觉语言模型（LVLMs）的推理效率优化，属于大模型技术应用范畴。核心相关关键词包括：‘Large Language Models’（LVLMs属于大模型范畴，8分）、‘Speculative Decoding OR Inference Acceleration’（论文核心研究推理加速技术，10分）、‘KV Cache Compression OR Linear Attention OR FlashAttention’（涉及注意力机制优化和内存管理，8分）、‘Context Window Extension OR Long Context LLMs’（处理长上下文是论文讨论的瓶颈之一，5分）、‘Quantization OR Model Compression OR Low-bit Weights’（提及混合压缩技术，5分）。其他关键词如MoE、SFT、RAG、AI for Science等与论文内容无直接关联，均给0分。

!!! tip deepseek-chat TL;DR

该论文系统分析了大型视觉语言模型推理效率的瓶颈，提出了基于信息密度塑造、长上下文注意力管理和内存限制克服的结构化优化框架，并展望了硬件算法协同设计等未来研究方向。

摘要翻译

大型视觉语言模型（LVLMs）能够实现对图像和视频的复杂推理，但其推理过程受到一种系统性效率障碍的制约，即视觉令牌主导问题。这一开销源于高分辨率特征提取、二次注意力缩放与内存带宽限制之间的多重机制相互作用。本文提出了一种围绕推理生命周期（包括编码、预填充和解码三个阶段）构建的效率技术系统分类法。与先前专注于孤立优化的综述不同，我们通过分析端到端流程，揭示了上游决策如何决定下游瓶颈，涵盖了计算受限的视觉编码、海量上下文的密集预填充，以及带宽受限解码中的“视觉内存墙”。通过将效率图景解耦为塑造信息密度、管理长上下文注意力和突破内存限制这三个维度，本文对孤立优化如何组合以驾驭视觉保真度与系统效率之间的权衡进行了结构化分析。本综述最后基于初步实证见解，勾勒了四个未来前沿方向，包括基于功能单元敏感性的混合压缩、采用宽松验证的模态感知解码、用于流式连续性的渐进式状态管理，以及通过硬件-算法协同设计实现的阶段解耦服务。所提交的软件包含了我们文献库的快照，该库旨在作为社区持续维护的动态资源。

摘要 (Abstract)

Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the ‘‘visual memory wall’’ in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.

关键词: Large Vision-Language Models, Inference Efficiency, Visual Token Dominance, Attention Scaling, Memory Bandwidth, Hybrid Compression, Modality-aware Decoding, Hardware-algorithm Co-design

150. ❌ Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting

作者: Jinhu Fu, Yan Bai, Longzhu He, Yihang Lou, Yanxiao Zhao, Li Sun, Sen Su 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05540v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs的知识编辑问题，通过CoT推理和RAG技术提升泛化能力。高度相关的关键词包括：LLMs（论文直接研究）、SFT（训练方法）、RAG（推理时集成）、CoT Reasoning（核心方法）、LLM Agents（用于生成数据）。其他关键词如MoE、SLMs、Scaling Laws等与论文内容无直接关联。

!!! tip deepseek-chat TL;DR

该论文提出CoT2Edit方法，通过指令驱动的链式思维推理和检索增强生成技术，解决了大语言模型知识编辑中泛化能力差和范围窄的问题，在多种场景下实现了有效的知识更新。

摘要翻译

大语言模型（LLM）能够通过知识编辑有效处理过时信息。然而，当前方法面临两个关键局限：（I）泛化能力差：多数方法生硬地注入新知识，未能确保模型能有效运用这些知识解决实际问题。（II）适用范围窄：现有方法主要关注结构化事实三元组，忽视了现实场景中普遍存在的多样化非结构化事实信息形式（例如新闻、文章）。为应对这些挑战，我们提出一种新范式：通过思维链（Chain of Thoughts, CoTs）推理教导大语言模型进行知识编辑（CoT2Edit）。我们首先利用语言模型智能体，针对结构化和非结构化编辑数据生成思维链，构建高质量指令数据。随后通过监督微调（Supervised Fine-Tuning, SFT）和组相对策略优化（Group Relative Policy Optimization, GRPO）训练模型基于编辑知识进行推理。在推理阶段，我们整合检索增强生成（Retrieval-Augmented Generation, RAG）技术，动态检索相关编辑事实以实现实时知识编辑。实验结果表明，我们的方法在三个开源语言模型上仅需单轮训练，即可在六种不同知识编辑场景中实现强大的泛化能力。代码发布于 https://github.com/FredJDean/CoT2Edit。

摘要 (Abstract)

Large language models (LLMs) can effectively handle outdated information through knowledge editing. However, current approaches face two key limitations: (I) Poor generalization: Most approaches rigidly inject new knowledge without ensuring that the model can use it effectively to solve practical problems. (II) Narrow scope: Current methods focus primarily on structured fact triples, overlooking the diverse unstructured forms of factual information (e.g., news, articles) prevalent in real-world contexts. To address these challenges, we propose a new paradigm: teaching LLMs to edit knowledge via Chain of Thoughts (CoTs) reasoning (CoT2Edit). We first leverage language model agents for both structured and unstructured edited data to generate CoTs, building high-quality instruction data. The model is then trained to reason over edited knowledge through supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). At inference time, we integrate Retrieval-Augmented Generation (RAG) to dynamically retrieve relevant edited facts for real-time knowledge editing. Experimental results demonstrate that our method achieves strong generalization across six diverse knowledge editing scenarios with just a single round of training on three open-source language models. The codes are available at https://github.com/FredJDean/CoT2Edit.

关键词: Knowledge Editing, Chain of Thought, Retrieval-Augmented Generation, Supervised Fine-tuning, Language Model Agents, Generalization, Instruction-based, Real-time Editing

作者: Hongcheng Liu, Yuhao Wang, Zhe Chen, Pingjie Wang, Zhiyuan Zhu, Yixuan Hou, Yanfeng Wang, Yu Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05522v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文聚焦于Omni-LLMs在跨模态指代对齐上的不足，提出数据集CrossOmni和两种改进方法（In-Context Learning和SFT+GRPO）。核心与大模型（LLMs）高度相关，涉及SFT、思维链推理、深度推理和上下文学习等关键词，但未涉及MoE、量化、科学AI等其他技术。

!!! tip deepseek-chat TL;DR

该论文揭示了Omni-LLMs在跨模态指代对齐上的系统性弱点，并通过引入数据集CrossOmni以及In-Context Learning和SFT+GRPO两种方法显著提升了模型性能。

摘要翻译

全模态大语言模型（Omni-LLMs）在整体多模态感知方面展现出卓越能力，但在需要协同全模态推理的复杂场景中仍持续表现不佳。除了理解全局多模态上下文外，有效的推理还依赖于细粒度的跨模态对齐，特别是识别跨模态的共享指称对象，然而这一方面在很大程度上被忽视了。为弥补这一差距，我们将此挑战形式化为一个跨模态共指问题，即模型必须在源模态中定位一个指称对象，并在目标模态中重新识别它。基于此范式，我们引入了CrossOmni数据集，该数据集包含九项任务，并配备人工设计的推理依据，以评估和提升这种能力。在13个全模态大语言模型上的实验揭示了它们在跨模态共指方面存在系统性弱点，我们将其归因于缺乏共指感知的思维模式。为解决这一问题，我们通过两种策略增强跨模态对齐：一种是无训练的上下文学习（In-Context Learning）方法，另一种是基于训练的有监督微调+群体相对策略优化（SFT+GRPO）框架，旨在引导此类思维模式。两种方法均带来显著的性能提升，并能有效泛化至协作推理任务。总体而言，我们的研究结果强调，跨模态共指是推进稳健全模态推理的关键缺失环节。

摘要 (Abstract)

Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, yet this aspect has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute to the absence of coreference-aware thinking patterns. To address this, we enhance cross-modal alignment via two strategies: a training-free In-Context Learning method and a training-based SFT+GRPO framework designed to induce such thinking patterns. Both approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks. Overall, our findings highlight cross-modal coreference as a crucial missing piece for advancing robust omni-modal reasoning.

关键词: Omni-LLMs, cross-modal coreference, multi-modal reasoning, In-Context Learning, SFT+GRPO, reasoning rationales, collaborative reasoning, alignment

152. ❌ Don’t Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction

作者: Yuzhe Zhang, Xianwei Xue, Xingyong Wu, Mengke Chen, Chen Liu, Xinran He, Run Shao, Feiran Liu, Huanmin Xu, Qiutong Pan, Haiwei Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05477v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究基于视觉语言模型（VLM）的自主GUI代理，属于大模型应用领域。核心创新在于提出TVAE框架实现动作效果验证和自我纠正，与"Self-Correction"高度相关（10分），属于"LLM Agents"范畴（10分）。论文使用Robust SFT进行训练（8分），涉及多步推理（“Chain of Thought” 8分）和深度思考（“System 2 Thinking” 8分），并通过GUI操作实现工具使用（“Tool Use” 8分）。论文提到VLM而非纯LLM，与"Large Language Models"有一定关联（5分）。其他关键词如MoE、量化、RAG等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对基于视觉语言模型的自主GUI代理在噪声环境中容易因未验证动作效果而累积错误的问题，提出了VeriGUI框架，通过动作效果验证和自我纠正机制显著减少了失败循环并提高了恢复成功率。

摘要翻译

基于视觉语言模型（VLM）的自主图形用户界面（GUI）代理通常假设环境响应是确定性的，即在未验证先前操作是否成功的情况下生成后续动作。在存在网络延迟、渲染延迟和系统中断的真实场景中，这种假设会导致未检测到的操作失败、重复无效行为以及灾难性的错误累积。此外，由于在线交互成本高昂且离线数据集中缺乏实时反馈，学习稳健的恢复策略具有挑战性。我们提出了VeriGUI（验证驱动的GUI代理），该模型显式地对噪声环境下的操作结果与恢复机制进行建模。VeriGUI引入了“思考—验证—行动—预期”（TVAE）框架以检测失败并引导纠正性推理，并提出一种两阶段训练流程：第一阶段通过合成失败轨迹进行鲁棒监督微调（Robust SFT），第二阶段采用基于非对称验证奖励的群体相对策略优化（GRPO）。我们进一步基于AndroidControl构建了鲁棒性基准测试，以评估失败识别与纠正能力。实验表明，VeriGUI在保持竞争力的标准任务性能的同时，显著减少了失败循环并提升了恢复成功率。

摘要 (Abstract)

Autonomous GUI agents based on vision-language models (VLMs) often assume deterministic environment responses, generating actions without verifying whether previous operations succeeded. In real-world settings with network latency, rendering delays, and system interruptions, this assumption leads to undetected action failures, repetitive ineffective behaviors, and catastrophic error accumulation. Moreover, learning robust recovery strategies is challenging due to the high cost of online interaction and the lack of real-time feedback in offline datasets.We propose VeriGUI (Verification-driven GUI Agent), which explicitly models action outcomes and recovery under noisy environments. VeriGUI introduces a Thinking–Verification–Action–Expectation (TVAE) framework to detect failures and guide corrective reasoning, and a two-stage training pipeline that combines Robust SFT with synthetic failure trajectories and GRPO with asymmetric verification rewards. We further construct a Robustness Benchmark based on AndroidControl to evaluate failure recognition and correction. Experiments show that VeriGUI significantly reduces failure loops and improves recovery success while maintaining competitive standard task performance.

关键词: GUI Automation, Vision-Language Models, Action-Effect Verification, Self-Correction, Robust Training, Autonomous Agents, Failure Recovery, TVAE Framework

153. ❌ CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation

作者: Siddharth Jain, Venkat Narayan Vedam 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05467v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究RAG（检索增强生成）的评估方法，提出CUE-R框架通过干预证据项来测量每个证据项的效用。与’Retrieval-Augmented Generation’高度相关（15分），因为这是论文的核心主题。与’Large Language Models’相关（10分），因为论文使用Qwen-3 8B和GPT-5.2进行实验。与’Chain of Thought’有一定关联（5分），因为论文涉及多步推理和证据检索。与’Hallucination Mitigation’和’Mechanistic Interpretability’有一定关联（各5分），因为论文关注事实性、归因和可解释性评估。其他关键词与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了CUE-R框架，通过干预证据项来评估检索增强生成（RAG）中每个证据项的效用，发现仅评估最终答案会忽略重要的证据效应，干预式效用分析是RAG评估的实用补充。

摘要翻译

随着语言模型从单次答案生成转向在推理过程中检索并利用证据的多步推理，评估单个检索项的作用变得愈发重要。现有的RAG评估通常关注最终答案质量、引用忠实度或答案层面的归因，但这些方法均未直接针对我们在此研究的基于干预的、以单个证据项效用为核心的视角。我们提出了CUE-R，一个基于干预的轻量级框架，通过浅层可观测的检索使用轨迹来测量单次RAG中每个证据项的操作效用。CUE-R通过REMOVE（移除）、REPLACE（替换）和DUPLICATE（复制）操作符对单个证据项进行扰动，随后沿三个效用维度（正确性、基于代理的grounding忠实度、置信度误差）以及一个轨迹差异信号来测量变化。我们还提出了一个用于解释干预结果的操作性证据角色分类法。在HotpotQA和2WikiMultihopQA数据集上使用Qwen-3 8B和GPT-5.2进行的实验揭示了一致模式：REMOVE和REPLACE操作会显著损害正确性和grounding，同时产生较大的轨迹偏移；而DUPLICATE操作虽常表现为答案冗余，但并非完全行为中性。零检索对照实验证实，这些效应源于有意义检索的退化。一项双支持证据的消融实验进一步表明，多跳证据项可能以非加和方式相互作用：同时移除两个支持项对性能的损害远大于移除任一单个项。我们的结果表明，仅评估答案会遗漏重要的证据效应，而基于干预的效用分析是RAG评估的一种实用补充方法。

摘要 (Abstract)

As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly targets the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.

关键词: Retrieval-Augmented Generation, RAG evaluation, evidence utility, intervention-based framework, per-evidence-item analysis, multi-step reasoning, factuality assessment, explainable AI

154. ❌ Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

作者: Qiyuan Chen, Hongsen Huang, Jiahe Chen, Qian Shao, Jintai Chen, Hongxia Xu, Renjie Hua, Chuan Ren, Jian Wu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05445v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出VL-MDR框架用于视觉语言奖励建模，核心涉及大模型对齐（Instruction Tuning/Alignment）、DPO对齐方法（RLHF/DPO）、幻觉缓解（Hallucination Mitigation）和可解释AI（Explainable AI），与这些关键词高度相关（10分）。论文使用视觉语言模型，与大模型相关（8分）。其他关键词如MoE、量化、推理加速等未涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文针对视觉语言奖励建模中生成方法可解释但慢、判别方法高效但黑箱的困境，提出了VL-MDR框架，通过动态维度选择和聚合实现可解释的奖励建模，实验表明其在VL-RewardBench上优于现有开源奖励模型，并能通过DPO对齐有效缓解视觉幻觉、提高可靠性。

摘要翻译

视觉语言奖励建模面临一个两难困境：生成式方法可解释性强但速度缓慢，而判别式方法效率高却如同不透明的“黑箱”。为弥合这一差距，我们提出了VL-MDR（Vision-Language Multi-Dimensional Reward，视觉语言多维奖励）框架，该框架将评估动态分解为细粒度、可解释的维度。VL-MDR并非输出单一标量，而是采用视觉感知门控机制来识别相关维度，并针对每个具体输入自适应地加权不同维度（如幻觉度、推理能力等）。为支持此方法，我们构建了一个包含32.1万个视觉语言偏好对的数据集，这些数据在21个细粒度维度上进行了标注。大量实验表明，在VL-RewardBench等基准测试中，VL-MDR始终优于现有的开源奖励模型。此外，我们证明基于VL-MDR构建的偏好对能有效支持DPO对齐，从而减少视觉幻觉并提升可靠性，为视觉语言模型对齐提供了可扩展的解决方案。

摘要 (Abstract)

Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque “black boxes.” To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.

关键词: Vision-Language Reward Modeling, Interpretable Reward, Dynamic Dimension Selection, Hallucination Mitigation, DPO Alignment, VL-MDR, Multi-Dimensional Reward, Visual-Aware Gating

155. ❌ Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction

作者: Yasuto Hoshi, Daisuke Miyashita, Jun Deguchi 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05438v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究长上下文生成中的KV缓存读取优化问题，与’KV Cache Compression OR Linear Attention OR FlashAttention’高度相关（15分），直接提出KV缓存读取减少方法。与’Context Window Extension OR Long Context LLMs’高度相关（10分），专注于长上下文生成场景。与’Large Language Models OR LLMs OR Foundation Models’相关（8分），涉及大模型推理优化。与’Speculative Decoding OR Inference Acceleration’相关（8分），属于推理加速技术。与’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’有一定关联（5分），使用Top-K检索机制。其他关键词如MoE、SLMs、对齐训练、科学AI应用等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对长上下文生成中KV缓存读取流量过大的问题，提出了一种保持主干权重和KV缓存格式不变的检索-补全注意力模块，通过结合精确计算和估计的方法减少KV读取，在相同读取预算下优于仅使用Top-K选择的方法。

摘要翻译

长上下文生成日益受到解码时键值（KV）缓存数据流量的限制，尤其在KV缓存被卸载至GPU内存之外时更为显著。基于查询的检索（如Top-K选择）通过仅加载KV对的子集来减少此类流量，但在子集上对softmax进行重归一化时，若注意力权重广泛分布于未检索的令牌上，则会引入偏差。我们提出了一种检索补全注意力模块，该模块保持主干网络权重与KV缓存格式不变。针对每个查询，我们精确计算对锚点（如起始/尾部标记）及查询相关的Top-K检索令牌的注意力，并利用预填充阶段计算得到的固定尺寸特征图摘要，估计剩余中间区域的分子与分母值。我们在未归一化域中叠加精确计算与估计的贡献值，并执行单次归一化，从而在不增加注意力侧KV读取开销的情况下恢复缺失的softmax权重。在多项长上下文基准测试中，该方法在相同令牌等效读取预算下，较仅使用Top-K选择的方法表现出性能提升，且在高熵注意力头中改善最为显著。

摘要 (Abstract)

Long-context generation is increasingly limited by decode-time key-value (KV) cache traffic, particularly when KV is offloaded beyond GPU memory. Query-aware retrieval (e.g., Top-K selection) reduces this traffic by loading only a subset of KV pairs, but renormalizing the softmax over the subset introduces bias when attention mass is spread over unretrieved tokens. We propose a retrieval-completion attention module that keeps backbone weights and the KV-cache format unchanged. For each query, we compute exact attention over sink/tail anchors and the query-dependent retrieved Top-K tokens, and estimate the remaining mid-region numerator and denominator using a fixed-size feature-map summary computed at prefill time. We add the exact and estimated contributions in the unnormalized domain and apply a single normalization, recovering the missing softmax mass without additional attention-side KV reads. Across long-context benchmarks, the proposed method improves over selection-only Top-K at matched token-equivalent read budgets, with the largest gains in high-entropy heads.

关键词: KV cache, long-context generation, attention mechanism, Top-K retrieval, inference optimization, linear attention, decode-time traffic, retrieval-completion

156. ❌ Multi-Drafter Speculative Decoding with Alignment Feedback

作者: Taehyeon Kim, Hojung Jung, Se-Young Yun 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05417v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型（LLMs）的推理加速技术——推测解码（Speculative Decoding），这是论文的核心内容，因此相关关键词得分为10。论文使用较小的模型作为草稿器（drafters），这与小型语言模型（SLMs）相关，得分为8。论文提到通过对齐反馈（alignment feedback）来确保生成质量，这与对齐（Alignment）概念有一定关联，得分为5。其他关键词如MoE、数据质量、微调、RAG、思维链、智能体、量化等均未在论文标题或摘要中涉及，因此得分为0。

!!! tip deepseek-chat TL;DR

该论文针对推测解码中单个草稿模型跨领域效果有限的问题，提出了一个名为MetaSD的统一框架，通过集成多个草稿模型并利用对齐反馈动态分配计算资源，从而在保持生成质量的同时显著提升了推理速度。

摘要翻译

推测解码（Speculative Decoding，SD）通过使用较小的模型草拟未来词元，再由目标大语言模型（LLM）进行验证，从而加速大语言模型的推理过程。该方法仅接受对齐的词元，从而保持了生成质量。然而，单个草拟模型通常针对特定任务或领域训练，在多样化应用中的有效性有限。为解决这一问题，我们提出了 \textsc{MetaSD}，一个将多个草拟模型集成到推测解码流程中的统一框架。MetaSD 通过利用对齐反馈，并将草拟模型选择建模为一个多臂老虎机问题，动态地为异构草拟模型分配计算资源。大量实验表明，MetaSD 在性能上持续优于单草拟模型方法。

摘要 (Abstract)

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens. However, individual drafters, often trained for specific tasks or domains, exhibit limited effectiveness across diverse applications. To address this, we introduce \textsc{MetaSD}, a unified framework that integrates multiple drafters into the SD process. MetaSD dynamically allocates computational resources to heterogeneous drafters by leveraging alignment feedback and framing drafter selection as a multi-armed bandit problem. Extensive experiments show MetaSD consistently outperforms single-drafter approaches.

关键词: Speculative Decoding, Large Language Models, Inference Acceleration, Multi-Drafter, Alignment Feedback, MetaSD, Computational Resource Allocation, Multi-armed Bandit

157. ❌ PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection

作者: Siyuan Cheng, Bozhong Tian, YanChao Hao, Zheng Wei 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05424v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	15.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文PRISM-MCTS专注于推理模型的改进，特别是基于蒙特卡洛树搜索（MCTS）的推理框架。核心相关关键词包括：‘Monte Carlo Tree Search OR MCTS AND LLM’（15分，论文直接提出PRISM-MCTS框架，MCTS是核心方法）、‘Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’（10分，论文涉及推理轨迹和深思熟虑认知）、‘System 2 Thinking OR Slow Thinking OR In-depth Reasoning’（10分，论文强调从直觉到深思熟虑的过渡）、‘Self-Correction OR Self-Improvement OR Self-Reflection’（10分，论文引入元认知反思和过程奖励模型以优化推理）。其他关键词如’Large Language Models OR LLMs OR Foundation Models’（8分，论文提及推理模型如OpenAI o1，属于大模型范畴）、‘Scaling Laws AND Data Quality’（5分，论文提到重新定向扩展定律，但非核心）、‘Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’（5分，论文比较MCTS-RAG，但非主要贡献）。其余关键词与论文内容无关或未涉及，得0分。

!!! tip deepseek-chat TL;DR

论文提出PRISM-MCTS框架，通过集成过程奖励模型和动态共享内存来改进蒙特卡洛树搜索在推理任务中的效率，减少计算冗余，并在多个推理基准测试中验证了其有效性，如在GPQA上所需轨迹减半且性能超越现有方法。

摘要翻译

PRISM-MCTS：通过元认知反思从推理轨迹中学习
Siyuan Cheng, Bozhong Tian, Yanchao Hao, Zheng Wei
发布日期：2026年4月6日，最后修改：2026年4月6日
ACL 2026 Findings 会议，领域主席，审稿人，出版主席，作者修订
BibTeX CC BY 4.0
关键词：自然语言处理的高效/低资源方法，生成，问答
摘要：以OpenAI o1为代表的推理模型的出现，标志着从直觉认知到审慎认知的转变，有效地将扩展定律（scaling laws）从预训练范式重新导向测试时计算。虽然蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）在该领域已显示出潜力，但现有方法通常将每次模拟（rollout）视为孤立的轨迹。这种信息共享的缺失导致了严重的低效性和大量的计算冗余，因为搜索过程未能利用先前探索的洞见。为解决这些局限，我们提出了PRISM-MCTS，一个受人类并行思维与反思过程启发的新型推理框架。PRISM-MCTS将过程奖励模型（Process Reward Model, PRM）与动态共享内存相结合，同时捕捉“启发式策略（Heuristics）”与“推理谬误（Fallacies）”。通过强化成功策略并剪枝易错分支，PRISM-MCTS有效地实现了推理精炼。此外，我们为PRM开发了一种数据高效的训练策略，在少样本（few-shot）机制下实现了高保真度评估。在多样化推理基准上的实证评估证实了PRISM-MCTS的有效性。值得注意的是，它在GPQA基准上所需的推理轨迹减少了一半，同时性能超越了MCTS-RAG和Search-o1，这表明它通过审慎而非穷举的推理来扩展推断能力。

摘要 (Abstract)

PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection Siyuan Cheng, Bozhong Tian, Yanchao Hao, Zheng Wei Published: 06 Apr 2026, Last Modified: 06 Apr 2026 ACL 2026 Findings Conference, Area Chairs, Reviewers, Publication Chairs, Authors Revisions BibTeX CC BY 4.0 Keywords: Efficient/Low-Resource Methods for NLP, Generation, Question Answering Abstract: The emergence of reasoning models, exemplified by OpenAI o1, signifies a transition from intuitive to deliberative cognition, effectively reorienting the scaling laws from pre-training paradigms toward test-time computation. While Monte Carlo Tree Search (MCTS) has shown promise in this domain, existing approaches typically treat each rollout as an isolated trajectory. This lack of information sharing leads to severe inefficiency and substantial computational redundancy, as the search process fails to leverage insights from prior explorations. To address these limitations, we propose PRISM-MCTS, a novel reasoning framework that draws inspiration from human parallel thinking and reflective processes. PRISM-MCTS integrates a Process Reward Model (PRM) with a dynamic shared memory, capturing both “Heuristics” and “Fallacies”. By reinforcing successful strategies and pruning error-prone branches, PRISM-MCTS effectively achieves refinement. Furthermore, we develop a data-efficient training strategy for the PRM, achieving high-fidelity evaluation under a few-shot regime. Empirical evaluations across diverse reasoning benchmarks substantiate the efficacy of PRISM-MCTS. Notably, it halves the trajectory requirements on GPQA while surpassing MCTS-RAG and Search-o1, demonstrating that it scales inference by reasoning judiciously rather than exhaustively.

关键词: PRISM-MCTS, Monte Carlo Tree Search, reasoning trajectories, metacognitive reflection, Process Reward Model, dynamic shared memory, inference efficiency, reasoning benchmarks

158. ❌ Confidence Should Be Calibrated More Than One Turn Deep

作者: Zhaohan Zhang, Chengzhengxu Li, Xiaoming Liu, Chao Shen, Ziquan Liu, Ioannis Patras 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05397v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM在多轮对话中的置信度校准问题，属于LLM可信性和可靠性研究范畴。与’Large Language Models’高度相关（10分），因为论文明确研究LLM在多轮交互中的应用；与’Hallucination Mitigation’高度相关（10分），因为论文提出的ConfChat解码策略旨在提高模型响应的真实性和一致性，直接对应幻觉缓解和事实性提升。其他关键词如MoE、SFT、RAG、量化等均未在摘要中提及或相关，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型在多轮对话中置信度校准不足的问题，提出了多轮校准任务和MTCal方法，并开发了ConfChat解码策略，显著提升了模型在多轮交互中的事实性和一致性。

摘要翻译

大型语言模型（LLM）正日益应用于金融、医疗和教育等高风险领域，在这些领域中，与用户进行可靠的多轮交互至关重要。然而，现有关于置信度估计与校准的研究——作为构建可信赖LLM系统的主要方法——大多集中于单轮交互场景，忽视了多轮对话中的风险与潜力。在本研究中，我们提出了多轮校准任务，将校准从一种静态属性重新定义为可靠多轮对话中的核心动态挑战，其要求在于根据对话历史对模型每一轮的置信度进行校准。我们首先揭示了这一场景的风险：通过使用第T轮预期校准误差（ECE@T）——一种追踪多轮校准动态的新指标，我们发现用户反馈（例如说服性信息）可能损害多轮校准效果。为解决此问题，我们提出MTCal方法，它通过代理校准目标最小化ECE@T，并进一步将校准后的置信度应用于ConfChat——一种解码策略，旨在提升多轮交互中模型响应的真实性与一致性。大量实验表明，MT-Cal在多轮校准中取得了优异且稳定的性能，而ConfChat在多轮交互中保持甚至增强了模型表现。我们的研究成果标志着多轮校准成为将LLM校准推向安全、可靠及实际应用的关键缺失环节。

摘要 (Abstract)

Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of the model response in multi-turn interactions. Extensive experiments demonstrate that MT-Cal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as one missing link for scaling LLM calibration toward safe, reliable, and real-world use.

关键词: Large Language Models, confidence calibration, multi-turn conversations, trustworthy AI, factuality, consistency, decoding strategy, ECE@T

159. ❌ ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

作者: Kaiser Hamid, Can Cui, Nade Liang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05378v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究语言驱动的自动驾驶中指令鲁棒性问题，与LLM、指令对齐、智能体、幻觉缓解高度相关（8-10分），但未涉及其他具体技术如MoE、量化、推理加速等（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了ICR-Drive框架，用于评估语言条件自动驾驶模型在指令变体（如改写、模糊、噪声和误导）下的鲁棒性，发现微小指令变化会导致性能显著下降，揭示了安全关键应用中基础模型的可靠性差距。

摘要翻译

视觉-语言-行动（VLA）模型的最新进展使得语言条件驾驶代理能够在闭环仿真中执行自然语言导航指令，然而标准评估大多假设指令是精确且结构良好的。在实际部署中，指令的措辞和具体性存在差异，可能遗漏关键限定词，偶尔还会包含具有误导性、以权威口吻表述的文本，导致指令层面的鲁棒性未能得到充分衡量。我们提出了ICR-Drive，一个用于端到端语言条件自动驾驶中指令反事实鲁棒性的诊断框架。ICR-Drive生成涵盖四种扰动类型的受控指令变体：释义、模糊性、噪声和误导性，其中误导性变体与导航目标相冲突，并试图覆盖原始意图。我们在匹配的仿真器配置和随机种子下重放相同的CARLA路线，以隔离由指令语言引起的性能变化。鲁棒性通过标准CARLA排行榜指标以及相对于基线指令的各类扰动性能下降来量化。在LMDrive和BEVDriver上的实验表明，微小的指令变化可能导致显著的性能下降和不同的故障模式，这揭示了在安全关键的驾驶场景中部署具身基础模型时存在的可靠性差距。

摘要 (Abstract)

Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.

关键词: language-conditioned autonomous driving, instruction robustness, counterfactual robustness, vision-language-action models, embodied foundation models, CARLA simulation, instruction perturbation, safety-critical driving

160. ❌ Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA

作者: Xing Tang, Hao Chen, Shiwei Li, Fuyuan Lyu, Weijie Shi, Lingjie Li, Dugang Liu, Weihong Luo, Xiku Du, Xiuqiang He 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05387v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM在金融领域的应用，特别是通过数据驱动方法增强LLM的函数调用能力，以构建在线金融问答系统。因此，与’Large Language Models’、‘Post-training/SFT’、‘LLM Agents’和’Tool Use/Function Calling’高度相关（10分），因为这些是论文的核心技术和方法。与’Pre-training/Domain Adaptation’有一定关联（5分），因为论文涉及领域适应（金融场景）。其他关键词如MoE、SLMs、Scaling Laws、RAG、CoT等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了一种数据驱动的管道，通过数据集构建、数据增强和模型训练，增强大型语言模型在金融问答场景中的函数调用能力，以更好地利用金融API工具服务在线用户。

摘要翻译

大型语言模型（LLM）已被广泛应用于众多工业场景中。与此同时，金融领域存在大量分散于不同功能的API资产。一个在线金融问答系统可以结合LLM与私有API，以提供及时的金融分析与信息。关键在于为LLM模型配备适应金融场景的函数调用能力。然而，通用LLM需要调用定制化的金融API，且难以直接适配金融领域。此外，在线用户查询具有多样性，其参数相较于所需函数输入参数往往存在分布外情况，这使得通用LLM更难以服务在线用户。本文提出一种数据驱动的流程，以增强LLM在我们已部署的在线金融问答系统中的函数调用能力，该流程包括数据集构建、数据增强和模型训练。具体而言，我们基于先前研究构建了一个数据集并定期更新，整合了用户查询及一种名为AugFC的数据增强方法。添加与用户查询相关的样本将以数据驱动的方式充分挖掘我们的金融工具集，而AugFC则通过探索可能的参数值来增强更新后数据集的多样性。随后，我们采用两步法训练LLM，使其能够调用我们的金融功能函数。在现有离线数据集上的大量实验以及在线场景的部署结果，均证明了我们流程的优越性。相关流程已应用于中国最大的聊天平台之一——元宝（YuanBao）的金融问答系统中。

摘要 (Abstract)

Large language models (LLMs) have been incorporated into numerous industrial applications. Meanwhile, a vast array of API assets is scattered across various functions in the financial domain. An online financial question-answering system can leverage both LLMs and private APIs to provide timely financial analysis and information. The key is equipping the LLM model with function calling capability tailored to a financial scenario. However, a generic LLM requires customized financial APIs to call and struggles to adapt to the financial domain. Additionally, online user queries are diverse and contain out-of-distribution parameters compared with the required function input parameters, which makes it more difficult for a generic LLM to serve online users. In this paper, we propose a data-driven pipeline to enhance function calling in LLM for our online, deployed financial QA, comprising dataset construction, data augmentation, and model training. Specifically, we construct a dataset based on a previous study and update it periodically, incorporating queries and an augmentation method named AugFC. The addition of user query-related samples will \textit{exploit} our financial toolset in a data-driven manner, and AugFC explores the possible parameter values to enhance the diversity of our updated dataset. Then, we train an LLM with a two-step method, which enables the use of our financial functions. Extensive experiments on existing offline datasets, as well as the deployment of an online scenario, illustrate the superiority of our pipeline. The related pipeline has been adopted in the financial QA of YuanBao\footnote{https://yuanbao.tencent.com/chat/}, one of the largest chat platforms in China.

关键词: Large Language Models, Function Calling, Financial QA, Data-driven Pipeline, API Tool Use, Domain Adaptation, Online Deployment, AugFC

161. ❌ ETR: Entropy Trend Reward for Efficient Chain-of-Thought Reasoning

作者: Xuan Xiong, Huan Liu, Li Gu, Zhixiang Chi, Yue Qiu, Yuanhao Yu, Yang Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05355v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究Chain-of-Thought推理的效率优化，与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’高度相关（15分），涉及LLM推理过程，与’Large Language Models OR LLMs OR Foundation Models’相关（10分），并涉及深度推理过程，与’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’相关（10分）。其他关键词如MoE、SLMs、Scaling Laws、训练方法、RAG、压缩技术、代理系统等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对Chain-of-Thought推理中产生的冗长低效问题，提出了一种基于熵趋势奖励（ETR）的方法，通过鼓励不确定性逐步降低的推理轨迹，在多个基准测试中显著提高了推理准确率（如DeepSeek-R1-Distill-7B提升9.9%）并大幅缩短了推理长度（减少67%）。

摘要翻译

思维链推理能够提升大语言模型在复杂任务上的表现，但常产生过长且低效的推理轨迹。现有方法通过长度惩罚或全局熵减来缩短思维链，其隐含假设是推理全程保持低不确定性最为理想。然而，我们发现推理效率实际上由不确定性的变化轨迹所主导。具有显著下降熵趋势的思维链长度会大幅缩短。基于这一洞见，我们提出了熵趋势奖励，这是一种轨迹感知的优化目标，它鼓励逐步降低不确定性，同时允许有限的局部探索。我们将熵趋势奖励整合到群组相对策略优化中，并在多个推理模型和具有挑战性的基准测试上进行了评估。熵趋势奖励始终能实现更优的准确率-效率权衡，在四个基准测试中，将DeepSeek-R1-Distill-7B的准确率提升了9.9%，同时将思维链长度减少了67%。代码发布于https://github.com/Xuan1030/ETR。

摘要 (Abstract)

Chain-of-thought (CoT) reasoning improves large language model performance on complex tasks, but often produces excessively long and inefficient reasoning traces. Existing methods shorten CoTs using length penalties or global entropy reduction, implicitly assuming that low uncertainty is desirable throughout reasoning. We show instead that reasoning efficiency is governed by the trajectory of uncertainty. CoTs with dominant downward entropy trends are substantially shorter. Motivated by this insight, we propose Entropy Trend Reward (ETR), a trajectory-aware objective that encourages progressive uncertainty reduction while allowing limited local exploration. We integrate ETR into Group Relative Policy Optimization (GRPO) and evaluate it across multiple reasoning models and challenging benchmarks. ETR consistently achieves a superior accuracy-efficiency tradeoff, improving DeepSeek-R1-Distill-7B by 9.9% in accuracy while reducing CoT length by 67% across four benchmarks. Code is available at https://github.com/Xuan1030/ETR

关键词: Chain-of-Thought, Reasoning Efficiency, Entropy Trend, Uncertainty Reduction, Policy Optimization, Large Language Models, Accuracy-Efficiency Tradeoff

162. ❌ Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities

作者: Xiangxu Zhang, Jiamin Wang, Qinlin Zhao, Hanze Guo, Linzhuo Li, Jing Yao, Xiao Zhou, Xiaoyuan Yi, Xing Xie 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05339v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM多智能体系统中人类价值观错位对集体行为的影响，与’Large Language Models’、‘Instruction Tuning OR Alignment OR Value Alignment’、‘LLM Agents OR Autonomous Agents OR Agentic Workflow’、‘Multi-agent Systems OR Agent Coordination’高度相关（10分），因为这些是论文的直接研究对象和核心概念。其他关键词如MoE、量化、推理加速等未在摘要中提及，与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该研究通过CIVA多智能体环境模拟发现，LLM智能体社区中人类价值观的错位会显著改变集体行为，导致宏观层面的系统崩溃和微观层面的欺骗、权力寻求等涌现行为。

摘要翻译

随着大语言模型日益融入人类社会，从社会科学视角评估其对人类价值观的取向已引起越来越多的关注。然而，人类价值观为何对大语言模型至关重要仍不明确，尤其是在基于大语言模型的多智能体系统中，个体行为失准可能导致群体层面的失败累积。我们探究：与人类价值观的错位是否会改变大语言模型智能体的集体行为？又会引发何种变化？本研究基于社会科学理论构建了CIVA——一个受控的多智能体实验环境，其中大语言模型智能体形成社区，自主进行交流、探索和资源竞争，从而实现对价值观普及程度的系统性调控和行为分析。通过全面的模拟实验，我们揭示了三个关键发现：（1）识别出若干对社区集体动态具有显著影响的结构性关键价值观，包括那些与大语言模型原始取向相偏离的价值观。在这些价值观被错误设定的触发下，我们（2）在宏观层面观测到系统失效模式（如灾难性崩溃），（3）在微观层面观察到欺骗与权力追逐等涌现行为。这些结果为“人类价值观对大语言模型的集体结果具有决定性作用”提供了量化证据，并为未来多智能体价值对齐研究提供了动力。

摘要 (Abstract)

As LLMs become increasingly integrated into human society, evaluating their orientations on human values from social science has drawn growing attention. Nevertheless, it is still unclear why human values matter for LLMs, especially in LLM-based multi-agent systems, where group-level failures may accumulate from individually misaligned actions. We ask whether misalignment with human values alters the collective behavior of LLM agents and what changes it induces? In this work, we introduce CIVA, a controlled multi-agent environment grounded in social science theories, where LLM agents form a community and autonomously communicate, explore, and compete for resources, enabling systematic manipulation of value prevalence and behavioral analysis. Through comprehensive simulation experiments, we reveal three key findings. (1) We identify several structurally critical values that substantially shape the community’s collective dynamics, including those diverging from LLMs’ original orientations. Triggered by the misspecification of these values, we (2) detect system failure modes, e.g., catastrophic collapse, at the macro level, and (3) observe emergent behaviors like deception and power-seeking at the micro level. These results offer quantitative evidence that human values are essential for collective outcomes in LLMs and motivate future multi-agent value alignment.

关键词: LLM agents, multi-agent systems, value alignment, human values, collective behavior, system failure, emergent behaviors, social science

163. ❌ DQA: Diagnostic Question Answering for IT Support

作者: Vishaal Kapoor, Mariam Dundua, Sarthak Ahuja, Neda Kordjazi, Evren Yortucboylu, Vaibhavi Padala, Derek Ho, Jennifer Whitted, Rebecca Steinert 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05350v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心贡献是DQA框架，它基于检索增强生成（RAG）技术，在IT支持场景中实现诊断性问答。因此，与’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’高度相关（10分）。论文涉及多轮对话、证据积累和假设解决，与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’和’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’有一定关联（各5分）。框架可视为一种代理工作流，与’LLM Agents OR Autonomous Agents OR Agentic Workflow’相关（5分）。论文提到RAG系统，暗示可能使用大模型，因此与’Large Language Models OR LLMs OR Foundation Models’有一般关联（5分）。其他关键词如MoE、量化、对齐等未在摘要中提及，评为0分。

!!! tip deepseek-chat TL;DR

该论文针对企业IT支持中多轮诊断问答的挑战，提出了DQA框架，通过维护持久诊断状态和基于根本原因的检索聚合，在150个场景的评估中将成功率从41.3%提升至78.7%，同时将平均对话轮数从8.4减少到3.9。

摘要翻译

企业IT支持交互本质上是诊断性的：有效的解决方案需要通过从模糊的用户报告中迭代收集证据，以识别潜在的根本原因。尽管检索增强生成（RAG）技术通过历史案例提供了信息基础，但标准的多轮RAG系统缺乏明确的诊断状态，因此难以在多次交互中积累证据并解决相互竞争的假设。我们提出了DQA，一种诊断性问答框架，该框架能够维持持久的诊断状态，并在根本原因层面而非单个文档层面聚合检索到的案例。DQA结合了对话式查询重写、检索聚合以及基于状态的条件响应生成，以在企业延迟和上下文约束下支持系统化故障排除。我们采用基于回放的评估协议，在150个匿名企业IT支持场景中对DQA进行了测试。在三次独立运行的平均结果中，DQA在轨迹级成功率标准下达到了78.7%的成功率，而多轮RAG基线仅为41.3%，同时将平均交互轮数从8.4轮减少至3.9轮。

摘要 (Abstract)

Enterprise IT support interactions are fundamentally diagnostic: effective resolution requires iterative evidence gathering from ambiguous user reports to identify an underlying root cause. While retrieval-augmented generation (RAG) provides grounding through historical cases, standard multi-turn RAG systems lack explicit diagnostic state and therefore struggle to accumulate evidence and resolve competing hypotheses across turns. We introduce DQA, a diagnostic question-answering framework that maintains persistent diagnostic state and aggregates retrieved cases at the level of root causes rather than individual documents. DQA combines conversational query rewriting, retrieval aggregation, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints. We evaluate DQA on 150 anonymized enterprise IT support scenarios using a replay-based protocol. Averaged over three independent runs, DQA achieves a 78.7% success rate under a trajectory-level success criterion, compared to 41.3% for a multi-turn RAG baseline, while reducing average turns from 8.4 to 3.9.

关键词: Diagnostic Question Answering, IT Support, Retrieval-Augmented Generation, Multi-turn Dialogue, Evidence Gathering, Root Cause Analysis, Enterprise AI, Conversational AI

164. ❌ DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects

作者: Jason Lucas, Matt Murtagh, Ali Al-Lawati, Uchendu Uchendu, Adaku Uchendu, Dongwon Lee 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05318v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究有害内容检测模型（特别是虚假信息分类器）在50种英语方言中的鲁棒性评估，涉及大模型（LLMs）的零样本评估和微调transformer模型的比较，因此与’Large Language Models’（5分）和’Post-training/Supervised Fine-tuning’（5分）有一定关联；同时，研究虚假信息检测与事实性相关，与’Hallucination Mitigation/Factuality’（5分）有一定关联。其他关键词如MoE、Scaling Laws、RAG、Agents等均未在论文中涉及，评分为0。

!!! tip deepseek-chat TL;DR

该论文研究了虚假信息检测模型在50种英语方言中的鲁棒性，发现当前模型在方言内容上存在系统性漏洞，可能导致全球数亿非标准美式英语使用者处于不利地位。

摘要翻译

有害内容检测器——尤其是虚假信息分类器——主要基于标准美国英语（SAE）进行开发和评估，其对方言变体的鲁棒性尚未得到充分探究。我们提出了DIA-HARM，这是首个用于评估虚假信息检测在50种英语方言中鲁棒性的基准，涵盖美国、英国、非洲、加勒比和亚太地区的各类变体。利用Multi-VALUE基于语言学的转换方法，我们构建了D3（方言虚假信息检测）语料库，该库包含从现有虚假信息基准衍生的19.5万个样本。我们对16个检测模型的评估揭示了系统性的脆弱性：人工撰写的方言内容使检测性能下降1.4%-3.6% F1值，而AI生成的内容则保持稳定。经过微调的Transformer模型显著优于零样本大型语言模型（最佳情况F1值为96.6%对比78.3%），部分模型在混合内容上表现出超过33%性能下降的灾难性失败。通过对2,450对方言进行的跨方言迁移分析显示，多语言模型（如mDeBERTa：平均F1值97.2%）能有效泛化，而如RoBERTa和XLM-RoBERTa等单语模型在处理方言输入时失效。这些发现表明，当前的虚假信息检测器可能系统性地损害全球数亿非标准美国英语使用者的利益。我们发布了DIA-HARM框架、D3语料库及评估工具：https://github.com/jsl5710/dia-harm

摘要 (Abstract)

Harmful content detectors-particularly disinformation classifiers-are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA-HARM, the first benchmark for evaluating disinformation detection robustness across 50 English dialects spanning U.S., British, African, Caribbean, and Asia-Pacific varieties. Using Multi-VALUE’s linguistically grounded transformations, we introduce D3 (Dialectal Disinformation Detection), a corpus of 195K samples derived from established disinformation benchmarks. Our evaluation of 16 detection models reveals systematic vulnerabilities: human-written dialectal content degrades detection by 1.4-3.6% F1, while AI-generated content remains stable. Fine-tuned transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3% best-case F1), with some models exhibiting catastrophic failures exceeding 33% degradation on mixed content. Cross-dialectal transfer analysis across 2,450 dialect pairs shows that multilingual models (mDeBERTa: 97.2% average F1) generalize effectively, while monolingual models like RoBERTa and XLM-RoBERTa fail on dialectal inputs. These findings demonstrate that current disinformation detectors may systematically disadvantage hundreds of millions of non-SAE speakers worldwide. We release the DIA-HARM framework, D3 corpus, and evaluation tools: https://github.com/jsl5710/dia-harm

关键词: harmful content detection, disinformation detection, dialectal variation, English dialects, benchmark evaluation, transformer models, zero-shot LLMs, cross-dialectal transfer

165. ❌ LLMs Should Express Uncertainty Explicitly

作者: Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05306v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs如何显式表达不确定性以改进决策（如弃权、检索、验证），与’Large Language Models’高度相关（10分）。论文涉及’Retrieval-Augmented Generation’（8分）和’Hallucination Mitigation’（8分），因为研究不确定性如何驱动检索控制和减少过度自信错误。与’Chain of Thought’（5分）相关，因涉及推理过程；与’Self-Correction’（5分）相关，因不确定性信号可用于自我改进；与’Mechanistic Interpretability’（5分）相关，因分析不确定性表达的内部机制。其他关键词如MoE、SLMs、Scaling Laws等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究如何训练大型语言模型显式表达不确定性（通过全局置信度评分和局部不确定性标记），以改善校准、减少错误并增强检索控制，从而提高决策可靠性。

摘要翻译

大型语言模型正日益应用于必须依赖不确定性来驱动决策的场景，例如弃答、检索和验证。现有方法大多将不确定性视为生成后需估计的潜在量，而非模型被训练来表达的信号。我们则将不确定性作为一种控制接口进行研究。我们比较了两种互补的接口：全局接口，即模型对其最终答案给出经过校准的置信度分数；以及局部接口，即模型在推理过程中进入高风险状态时发出显式的<不确定>标记。这两种接口提供了不同但互补的优势。言语化的置信度显著改善了校准效果，减少了过度自信的错误，并构建了整体最强的自适应检索增强生成控制器，同时更有选择性地使用检索。推理时的不确定性信号使得先前隐性的失败在生成过程中可见，提高了对错误答案的覆盖度，并提供了有效的高召回检索触发机制。我们的研究进一步表明，这两种接口在内部工作机制上有所不同：言语化置信度主要优化了现有不确定性的解码方式，而推理时信号则引发了更广泛的深层网络重组。综上所述，这些结果表明，大型语言模型中的有效不确定性应被训练为与任务匹配的通信方式：全局置信度用于决定是否信任最终答案，而局部信号则用于决定何时需要干预。

摘要 (Abstract)

Large language models are increasingly used in settings where uncertainty must drive decisions such as abstention, retrieval, and verification. Most existing methods treat uncertainty as a latent quantity to estimate after generation rather than a signal the model is trained to express. We instead study uncertainty as an interface for control. We compare two complementary interfaces: a global interface, where the model verbalizes a calibrated confidence score for its final answer, and a local interface, where the model emits an explicit marker during reasoning when it enters a high-risk state. These interfaces provide different but complementary benefits. Verbalized confidence substantially improves calibration, reduces overconfident errors, and yields the strongest overall Adaptive RAG controller while using retrieval more selectively. Reasoning-time uncertainty signaling makes previously silent failures visible during generation, improves wrong-answer coverage, and provides an effective high-recall retrieval trigger. Our findings further show that the two interfaces work differently internally: verbal confidence mainly refines how existing uncertainty is decoded, whereas reasoning-time signaling induces a broader late-layer reorganization. Together, these results suggest that effective uncertainty in LLMs should be trained as task-matched communication: global confidence for deciding whether to trust a final answer, and local signals for deciding when intervention is needed.

关键词: Large Language Models, Uncertainty Expression, Confidence Calibration, Retrieval-Augmented Generation, Reasoning-time Signaling, Hallucination Mitigation, Adaptive Control, Model Communication

166. ❌ Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification

作者: Jinhong Jeong, Junghun Park, Youngjae Yu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05302v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究基于大语言模型（LLM）的多语言文本简化框架，直接涉及’Large Language Models OR LLMs OR Foundation Models’关键词，因此给10分。论文使用强化学习框架，但未涉及RLHF/RLAIF/DPO等具体对齐技术；涉及词汇覆盖度、语义保持等评估，但未专门研究幻觉缓解或可解释性；涉及多语言应用，但非生物信息学等科学领域。其他关键词如MoE、量化、推理加速、智能体等均未在摘要中体现，故给0分。

!!! tip deepseek-chat TL;DR

该论文针对多语言文本简化中缺乏平行语料库监督的问题，提出了一个基于强化学习的统一框架Re-RIGHT，通过训练一个紧凑的4B策略模型，在目标语言水平上实现了比先进大语言模型更高的词汇覆盖度，同时保持了原始语义和流畅性。

摘要翻译

文本简化通过提供可理解性输入支持第二语言（L2）学习，这与输入假说（Input Hypothesis）相一致。然而，构建个性化的平行语料库成本高昂，而现有基于大语言模型（LLM）的可读性控制方法依赖于预标注的句子语料库，且主要针对英语。我们提出Re-RIGHT——一个无需平行语料库监督的自适应多语言文本简化统一强化学习框架。我们首先证明，即使在GPT-5.2和Gemini 2.5等先进大语言模型支持下，基于提示的目标语言水平（包括CEFR、JLPT、TOPIK和HSK）词汇简化方法在较低难度级别及非英语语言中表现欠佳。为解决这一问题，我们收集了涵盖四种语言（英语、日语、韩语和中文）的43K词汇级数据，并采用Re-RIGHT框架训练了一个紧凑的40亿参数策略模型。该框架整合了三个奖励模块：词汇覆盖度、语义保持度和连贯性。与更强的大语言模型基线相比，Re-RIGHT在目标语言水平上实现了更高的词汇覆盖度，同时保持了原文含义与流畅性。

摘要 (Abstract)

Text simplification supports second language (L2) learning by providing comprehensible input, consistent with the Input Hypothesis. However, constructing personalized parallel corpora is costly, while existing large language model (LLM)-based readability control methods rely on pre-labeled sentence corpora and primarily target English. We propose Re-RIGHT, a unified reinforcement learning framework for adaptive multilingual text simplification without parallel corpus supervision. We first show that prompting-based lexical simplification at target proficiency levels (CEFR, JLPT, TOPIK, and HSK) performs poorly at easier levels and for non-English languages, even with state-of-the-art LLMs such as GPT-5.2 and Gemini 2.5. To address this, we collect 43K vocabulary-level data across four languages (English, Japanese, Korean, and Chinese) and train a compact 4B policy model using Re-RIGHT, which integrates three reward modules: vocabulary coverage, semantic preservation, and coherence. Compared to the stronger LLM baselines, Re-RIGHT achieves higher lexical coverage at target proficiency levels while maintaining original meaning and fluency.

关键词: text simplification, multilingual, large language models, reinforcement learning, proficiency-aware, lexical coverage, semantic preservation, parallel corpus

167. ❌ Beneath the Surface: Investigating LLMs’ Capabilities for Communicating with Subtext

作者: Kabir Ahuja, Yuxuan Li, Andrew Kyle Lampinen 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05273v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs在理解和使用潜台词（subtext）方面的能力，属于LLMs能力评估和认知研究范畴。与’Large Language Models’高度相关（10分），因为论文直接评估前沿LLMs。与’Chain of Thought’和’System 2 Thinking’有一定关联（5分），因为涉及推理和深度思考能力。与’LLM Agents’和’Multi-agent Systems’相关（5分），因为包含多智能体游戏设置。与’Mechanistic Interpretability’相关（5分），因为研究LLMs的内部工作机制和解释性。其他关键词如MoE、SFT、RAG等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了大型语言模型在理解和运用潜台词进行创造性沟通的能力，发现前沿模型存在过度字面化沟通的偏见，但在有共同背景的情况下能部分改善，揭示了LLMs在社交推理方面的弱点。

摘要翻译

人类交流本质上是创造性的，常运用潜台词——即超越文本字面内容的隐含意义。本文系统研究了语言模型能否在交流情境中使用潜台词，并引入四个新的评估套件来检验这种能力。我们的评估场景涵盖从寓言写作与解读，到受《只言片语》（Dixit）等桌游规则启发的多智能体多模态游戏。研究发现，前沿模型普遍表现出强烈的过度直白化交流倾向，因而无法适应微妙的情境约束——即使在表现最佳的模型中，在“视觉隐喻”（Visual Allusions）环境中仍有60%的提示生成是字面化的。然而，某些模型有时能利用与交流方的共同背景（common ground）辅助潜台词交流，使过度直白提示减少30%-50%；但当共同背景未被明确说明时，模型难以推断其存在。在寓言理解方面，我们发现副文本（paratextual）条件与人物角色（persona）条件会显著改变对潜台词的解读。总体而言，本研究为潜台词这类本质复杂且主观的现象提供了可量化的测量方法，揭示了当前大语言模型的诸多缺陷与特质。我们期望这项研究能启发未来面向社会情境化创造性交流与推理的探索。

摘要 (Abstract)

Human communication is fundamentally creative, and often makes use of subtext – implied meaning that goes beyond the literal content of the text. Here, we systematically study whether language models can use subtext in communicative settings, and introduce four new evaluation suites to assess these capabilities. Our evaluation settings range from writing & interpreting allegories to playing multi-agent and multi-modal games inspired by the rules of board games like Dixit. We find that frontier models generally exhibit a strong bias towards overly literal, explicit communication, and thereby fail to account for nuanced constraints – even the best performing models generate literal clues 60% of times in one of our environments – Visual Allusions. However, we find that some models can sometimes make use of common ground with another party to help them communicate with subtext, achieving 30%-50% reduction in overly literal clues; but they struggle at inferring presence of a common ground when not explicitly stated. For allegory understanding, we find paratextual and persona conditions to significantly shift the interpretation of subtext. Overall, our work provides quantifiable measures for an inherently complex and subjective phenomenon like subtext and reveals many weaknesses and idiosyncrasies of current LLMs. We hope this research to inspire future work towards socially grounded creative communication and reasoning.

关键词: LLMs, subtext, creative communication, evaluation suites, literal bias, common ground, multi-agent games, allegory understanding

作者: Chan-Wei Hu, Zhengzhong Tu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05268v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于多模态检索增强生成（MM-RAG）中的重排序问题，提出了一种查询端区域裁剪框架Region-R1。该研究与关键词’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’高度相关（10分），因为论文明确研究MM-RAG系统，并改进其重排序组件。然而，论文未涉及大模型技术原理、训练方法、推理优化、对齐、代理系统、模型压缩、科学AI应用等其他关键词，因此这些关键词得分为0分。

!!! tip deepseek-chat TL;DR

论文针对多模态检索增强生成中重排序器易受视觉干扰影响的问题，提出了Region-R1查询端区域裁剪框架，通过动态裁剪问题相关区域，在E-VQA和InfoSeek基准上实现了Recall@1最高20%的性能提升。

摘要翻译

多模态检索增强生成（MM-RAG）高度依赖于重排序器来为图像-问题查询呈现最相关的证据。然而，标准重排序器通常将完整查询图像作为全局嵌入进行处理，这使得它们容易受到视觉干扰物（例如背景杂乱）的影响，从而扭曲相似性分数。我们提出了Region-R1，一种查询侧区域裁剪框架，该框架将区域选择建模为重排序过程中的决策问题，使系统能够在为检索到的候选对象评分之前，学会保留完整图像或仅关注与问题相关的区域。Region-R1通过一种新颖的区域感知组相对策略优化（r-GRPO）学习一种策略，以动态裁剪出具有判别性的区域。在两个具有挑战性的基准测试E-VQA和InfoSeek上，Region-R1取得了一致的性能提升，通过将条件Recall@1提高多达20%，实现了最先进的性能。这些结果表明，查询侧自适应作为一种简单而有效的方法，在增强MM-RAG重排序方面具有巨大潜力。

摘要 (Abstract)

Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performances by increasing conditional Recall@1 by up to 20%. These results show the great promise of query-side adaptation as a simple but effective way to strengthen MM-RAG re-ranking.

关键词: Multi-modal retrieval-augmented generation, MM-RAG, re-ranking, region cropping, query-side adaptation, visual distractors, policy optimization, state-of-the-art performance

169. ❌ Do Domain-specific Experts exist in MoE-based LLMs?

作者: Giang Do, Hung Le, Truyen Tran 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05267v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究MoE-based LLMs中的专家专业化问题，与’Mixture of Experts OR MoE OR Sparse Models’高度相关（15分），是论文的核心架构。论文明确研究LLMs（10分），并涉及SFT作为基线比较（5分）。论文对专家专业化的解释性研究与’Mechanistic Interpretability OR Explainable AI’有一定关联（5分）。其他关键词如SLMs、Scaling Laws、Pre-training、Alignment等均未在论文中涉及，故得0分。

!!! tip deepseek-chat TL;DR

该论文研究了MoE-based LLMs中是否存在领域特定专家的问题，通过评估多个模型提供了实证证据，并提出了一种无需训练、零额外推理成本的Domain Steering Mixture of Experts（DSMoE）框架，在多个领域实现了优于SFT基线的性能。

摘要翻译

在大语言模型（LLM）时代，混合专家（Mixture of Experts, MoE）架构已成为一种以更高计算效率训练超大规模模型的有效方法。这一成功建立在大量旨在增强基于MoE的大语言模型中专家专业化程度的先前研究基础之上。然而，此类专业化的本质及其如何能被系统性地解释，仍然是开放的研究挑战。在本工作中，我们通过提出一个根本性问题来探究这一空白：\textit{基于MoE的大语言模型中是否存在领域特定专家？} 为回答此问题，我们评估了十个参数规模从38亿到1200亿不等的先进基于MoE的大语言模型，并为领域特定专家的存在提供了实证证据。基于这一发现，我们提出了\textbf{领域导向混合专家（Domain Steering Mixture of Experts, DSMoE）}，这是一个无需训练、不引入任何额外推理成本的框架，其性能优于训练良好的基于MoE的大语言模型以及包括监督微调（Supervised Fine-Tuning, SFT）在内的强基线方法。在目标领域和非目标领域上对四个先进开源基于MoE的大语言模型进行的实验表明，我们的方法在不增加推理成本或无需额外重新训练的情况下，实现了强大的性能和稳健的泛化能力。我们的实现代码已在 https://github.com/giangdip2410/Domain-specific-Experts 公开。

摘要 (Abstract)

In the era of Large Language Models (LLMs), the Mixture of Experts (MoE) architecture has emerged as an effective approach for training extremely large models with improved computational efficiency. This success builds upon extensive prior research aimed at enhancing expert specialization in MoE-based LLMs. However, the nature of such specializations and how they can be systematically interpreted remain open research challenges. In this work, we investigate this gap by posing a fundamental question: \textit{Do domain-specific experts exist in MoE-based LLMs?} To answer the question, we evaluate ten advanced MoE-based LLMs ranging from 3.8B to 120B parameters and provide empirical evidence for the existence of domain-specific experts. Building on this finding, we propose \textbf{Domain Steering Mixture of Experts (DSMoE)}, a training-free framework that introduces zero additional inference cost and outperforms both well-trained MoE-based LLMs and strong baselines, including Supervised Fine-Tuning (SFT). Experiments on four advanced open-source MoE-based LLMs across both target and non-target domains demonstrate that our method achieves strong performance and robust generalization without increasing inference cost or requiring additional retraining. Our implementation is publicly available at https://github.com/giangdip2410/Domain-specific-Experts.

关键词: Mixture of Experts, Large Language Models, Domain-specific Experts, Domain Steering Mixture of Experts, Supervised Fine-Tuning, Interpretability, Training-free Framework, Generalization

170. ❌ DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models

作者: Satyam Goyal, Kushal Patel, Tanush Mittal, Arjun Laxman 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05250v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	5.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究Masked Diffusion Models（MDMs）的推理加速方法，与’Speculative Decoding OR Inference Acceleration’高度相关（10分），因为这是论文的核心创新点。论文涉及大模型技术，与’Large Language Models OR LLMs OR Foundation Models’有一定关联（8分），但MDMs是替代方案而非传统LLMs。与’KV Cache Compression OR Linear Attention OR FlashAttention’有间接关联（5分），因为论文提到MDMs因双向注意力无法缓存KV对而速度受限，但未直接研究这些压缩技术。其他关键词与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对Masked Diffusion Models因双向注意力导致推理速度慢的问题，提出了DualDiffusion推测解码框架，通过结合快速草稿模型和准确验证模型，在保持高准确性的同时显著减少了生成步骤，优化了质量与效率的权衡。

摘要翻译

掩码扩散模型通过实现并行化标记生成与双向上下文建模，为自回归语言模型提供了具有前景的替代方案。然而，由于双向注意力机制无法缓存键值对，其推理速度受到显著限制，需要在每个生成步骤中进行 $O(N^2)$ 量级的计算。尽管近期如 FastDLLM 和 DkvCache 等方法通过注意力近似与缓存策略提升了推理速度，但这些方法是以牺牲生成质量为代价换取加速的。我们提出了 DualDiffusion，一种用于掩码扩散模型的推测解码框架，该框架将快速草稿模型（采用高效近似方法）与较慢但更精确的验证模型相结合。通过运行轻量级草稿模型的多个步骤，再进行单次验证步骤，DualDiffusion 在生成步骤数与准确性之间实现了优于现有方法的帕累托前沿。我们在 MMLU 和 GSM8K 基准上评估了所提方法，结果表明 DualDiffusion 在保持高准确性的同时，显著减少了所需的生成步骤数，从而有效推进了掩码扩散语言模型在质量与效率间的权衡曲线。

摘要 (Abstract)

Masked Diffusion Models (MDMs) offer a promising alternative to autoregressive language models by enabling parallel token generation and bidirectional context modeling. However, their inference speed is significantly limited by the inability to cache key-value pairs due to bidirectional attention, requiring $O(N^2)$ computations at each generation step. While recent methods like FastDLLM and DkvCache improve inference speed through attention approximations and caching strategies, they achieve speedups at the cost of generation quality. We propose DualDiffusion, a speculative decoding framework for MDMs that combines fast drafter models (using efficient approximations) with slower, more accurate verifier models. By running multiple steps of a lightweight drafter followed by a single verification step, DualDiffusion achieves a superior Pareto frontier between generation steps and accuracy compared to existing approaches. We evaluate our method on MMLU and GSM8K, demonstrating that DualDiffusion maintains high accuracy while reducing the number of generation steps required, effectively pushing the quality-efficiency trade-off curve for masked diffusion language models.

关键词: Masked Diffusion Models, Speculative Decoding, Inference Acceleration, Bidirectional Attention, Drafter Models, Verifier Models, Quality-Efficiency Trade-off, Parallel Token Generation

171. ❌ Exemplar Retrieval Without Overhypothesis Induction: Limits of Distributional Sequence Learning in Early Word Learning

作者: Jon-Paul Cacioli 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05243v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究儿童语言习得中的归纳推理机制，使用小型自回归transformer语言模型（3.4M-25.6M参数）进行实验，属于大模型/深度学习在认知科学领域的应用。相关关键词：1）‘Large Language Models OR LLMs OR Foundation Models’得5分，因论文使用transformer语言模型但规模较小；2）‘Small Language Models OR SLMs OR On-device AI’得5分，因模型参数在百万级别；3）‘Pre-training OR Continual Pre-training OR Domain Adaptation’得5分，因涉及模型训练；4）‘Mechanistic Interpretability OR Explainable AI’得5分，因研究模型内部机制和局限性。其他关键词与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该研究探讨了自回归transformer语言模型在模拟儿童语言习得中归纳推理（如形状作为稳定特征维度）的能力，发现模型能完美完成一阶样例检索但二阶泛化能力仅达随机水平，揭示了分布序列学习在发育规模训练条件下的局限性。

摘要翻译

背景：儿童并非简单地习得“球是圆的、积木是方的”这类具体知识。他们掌握的是“形状是界定物体类别的关键特征”这一深层规律——这种被称为“超假设”的次级概括能力[1, 2]。何种学习机制足以支撑这种归纳跃迁？方法：我们使用合成语料训练自回归变换器语言模型（参数量340万-2560万），语料设计确保形状是跨类别的稳定特征维度，并通过八种对照条件排除替代解释。结果：在基于1040项新异词测试集的120项预注册实验中，所有模型均实现完美的一阶样例检索（100%），但对新名词的次级概括能力始终处于随机水平（50-52%），该结果经等效性检验确认。特征置换诊断表明，模型依赖的是框架到特征的模板匹配，而非结构化的“名词→领域→特征”抽象机制。结论：这些结果揭示了在发展规模训练条件下，自回归分布序列学习机制存在明确局限性。

摘要 (Abstract)

Background: Children do not simply learn that balls are round and blocks are square. They learn that shape is the kind of feature that tends to define object categories – a second-order generalisation known as an overhypothesis [1, 2]. What kind of learning mechanism is sufficient for this inductive leap? Methods: We trained autoregressive transformer language models (3.4M-25.6M parameters) on synthetic corpora in which shape is the stable feature dimension across categories, with eight conditions controlling for alternative explanations. Results: Across 120 pre-registered runs evaluated on a 1,040-item wug test battery, every model achieved perfect first-order exemplar retrieval (100%) while second-order generalisation to novel nouns remained at chance (50-52%), a result confirmed by equivalence testing. A feature-swap diagnostic revealed that models rely on frame-to-feature template matching rather than structured noun-to-domain-to-feature abstraction. Conclusions: These results reveal a clear limitation of autoregressive distributional sequence learning under developmental-scale training conditions.

关键词: autoregressive transformer, language models, word learning, overhypothesis induction, distributional sequence learning, exemplar retrieval, second-order generalization, cognitive science

172. ❌ XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts

作者: Jiahao Xu, Rui Hu, Olivera Kotevska, Zikai Zhang 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05242v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于LLM生成文本的多比特水印技术，核心是解决现有方法在解码精度、文本质量和计算效率方面的局限性。因此，仅与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分），因为论文直接研究LLM生成文本的版权保护问题。其他关键词涉及模型架构、训练方法、推理优化、应用领域等，均非本文研究内容，故评0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为XMark的新方法，用于在LLM生成的文本中嵌入和提取多比特水印，以解决现有方法在解码精度、文本质量和计算效率方面的不足，实验表明其在多种下游任务中显著提升了性能。

摘要翻译

多比特水印技术已成为一种将不可感知的二进制信息嵌入大语言模型生成文本的前沿解决方案，能够对LLM的恶意使用实现可靠溯源与追踪。尽管近期取得进展，现有方法仍面临关键局限：部分方法在处理大规模信息时计算成本过高，另一些则在文本质量与解码精度之间难以取得良好平衡。此外，当生成文本的令牌数量受限时（实际应用中常见情形），现有方法的解码精度会显著下降。为应对这些挑战，我们提出\textsc{XMark}——一种用于LLM生成文本中二进制信息编码与解码的新方法。\textsc{XMark}编码器的独特设计能为带水印令牌生成产生更低失真的逻辑值分布，从而保持文本质量，同时使其定制化解码器能够在有限令牌条件下可靠恢复编码信息。跨多种下游任务的大规模实验表明，\textsc{XMark}在保持水印文本质量的同时显著提升了解码精度，性能优于现有方法。代码发布于https://github.com/JiiahaoXU/XMark。

摘要 (Abstract)

Multi-bit watermarking has emerged as a promising solution for embedding imperceptible binary messages into Large Language Model (LLM)-generated text, enabling reliable attribution and tracing of malicious usage of LLMs. Despite recent progress, existing methods still face key limitations: some become computationally infeasible for large messages, while others suffer from a poor trade-off between text quality and decoding accuracy. Moreover, the decoding accuracy of existing methods drops significantly when the number of tokens in the generated text is limited, a condition that frequently arises in practical usage. To address these challenges, we propose \textsc{XMark}, a novel method for encoding and decoding binary messages in LLM-generated texts. The unique design of \textsc{XMark}’s encoder produces a less distorted logit distribution for watermarked token generation, preserving text quality, and also enables its tailored decoder to reliably recover the encoded message with limited tokens. Extensive experiments across diverse downstream tasks show that \textsc{XMark} significantly improves decoding accuracy while preserving the quality of watermarked text, outperforming prior methods. The code is at https://github.com/JiiahaoXU/XMark.

关键词: Multi-bit watermarking, Large Language Models, LLM-generated text, Text quality, Decoding accuracy, Attribution, Tracing

173. ❌ On the Geometry of Positional Encodings in Transformers

作者: Giansalvo Cirrincione 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05217v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于Transformer模型中位置编码的数学理论，属于大模型（LLMs）的基础技术原理研究。它直接涉及位置编码的设计和优化，这是Transformer架构（大模型的核心）的关键组件。因此，它与’Large Language Models OR LLMs OR Foundation Models’有一定关联（5分），因为位置编码是这些模型的基本要素。然而，论文不涉及其他关键词，如MoE、训练方法（预训练、微调、对齐）、推理优化、代理、应用领域（如AI for Science）或具体的高级技术（如RAG、CoT）。它是一项基础理论研究，而非应用或特定高级技术开发。

!!! tip deepseek-chat TL;DR

本文为Transformer中的位置编码建立了数学理论，证明了其必要性，提出了基于多维标度法的最优编码构造方法，并通过实验验证了ALiBi编码的优越性。

摘要翻译

神经语言模型处理单词序列，但其内部的数学运算对单词出现的顺序并不敏感。位置编码正是为弥补这一缺陷而添加的组件。尽管位置编码至关重要，但其设计大多基于试错，缺乏关于其应有功能的数学理论。
本文建立了这样一种理论。我们确立了四项结果。首先，任何不带位置信号的Transformer模型都无法解决任何对词序敏感的任务（必要性定理）。其次，在温和且可验证的条件下，训练过程会在每个全局极小值点处为不同的序列位置分配不同的向量表示（位置分离定理）。第三，我们通过经典多维标度法（MDS）在位置分布间的海林格距离上，构建了对信息最优编码的最佳可实现逼近；任何编码的质量可由单一数值——应力值来衡量（命题5，算法1）。第四，最优编码的有效秩为 r = rank(B) ≤ n-1，且可用 r(n+d) 个参数而非 nd 个参数表示（最小参数化结果）。
附录A通过五个引理，在神经正切核（NTK）体系内，针对掩码语言建模（MLM）损失、序列分类损失以及满足位置充分性条件的一般损失，证明了单调性猜想。在SST-2和IMDB数据集上使用BERT-base进行的实验验证了理论预测，并表明线性偏置注意力（ALiBi）比正弦编码和旋转位置嵌入（RoPE）实现了更低的应力值，这与近似平移等变性下MDS编码的秩-1解释相一致。

摘要 (Abstract)

Neural language models process sequences of words, but the mathematical operations inside them are insensitive to the order in which words appear. Positional encodings are the component added to remedy this. Despite their importance, positional encodings have been designed largely by trial and error, without a mathematical theory of what they ought to do. This paper develops such a theory. Four results are established. First, any Transformer without a positional signal cannot solve any task sensitive to word order (Necessity Theorem). Second, training assigns distinct vector representations to distinct sequence positions at every global minimiser, under mild and verifiable conditions (Positional Separation Theorem). Third, the best achievable approximation to an information-optimal encoding is constructed via classical multidimensional scaling (MDS) on the Hellinger distance between positional distributions; the quality of any encoding is measured by a single number, the stress (Proposition 5, Algorithm 1). Fourth, the optimal encoding has effective rank r = rank(B) <= n-1 and can be represented with r(n+d) parameters instead of nd (minimal parametrisation result). Appendix A develops a proof of the Monotonicity Conjecture within the Neural Tangent Kernel (NTK) regime for masked language modelling (MLM) losses, sequence classification losses, and general losses satisfying a positional sufficiency condition, through five lemmas. Experiments on SST-2 and IMDB with BERT-base confirm the theoretical predictions and reveal that Attention with Linear Biases (ALiBi) achieves much lower stress than the sinusoidal encoding and Rotary Position Embedding (RoPE), consistent with a rank-1 interpretation of the MDS encoding under approximate shift-equivariance.

关键词: Positional Encodings, Transformers, Mathematical Theory, Multidimensional Scaling, ALiBi, BERT, Neural Tangent Kernel, Word Order Sensitivity

174. ❌ RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains

作者: Yi Ru Wang, Carter Ung, Evan Gubarev, Christopher Tan, Siddhartha Srinivasa, Dieter Fox 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05226v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文《RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains》主要研究机器人操作系统的评估框架，提出了一种基于自然语言指令生成可执行任务规范的方法。虽然论文涉及自然语言处理（用户使用自然语言描述任务），但其核心是机器人学、人机交互和评估方法学，而非大模型或深度学习技术本身。论文未讨论任何大模型架构（如LLM、MoE、SLM）、训练技术（如预训练、微调、对齐）、推理优化（如注意力机制、解码加速）、代理系统或特定应用领域（如AI for Science）。所有关键词均与大模型技术原理、训练方法、优化技术或特定科学应用直接相关，而本文焦点是机器人任务评估框架，因此所有关键词相关度为0。

!!! tip deepseek-chat TL;DR

该论文针对机器人操作系统评估依赖固定基准、难以扩展的问题，提出了RoboPlayground框架，允许用户通过自然语言在结构化物理域中编写可执行操作任务，从而 democratize 评估并揭示策略在语言定义任务族上的泛化失败。

摘要翻译

对机器人操作系统的评估主要依赖于少数专家制定的固定基准测试，其中任务实例、约束条件和成功标准均被预先定义且难以扩展。这种范式限制了能够参与评估设计的人员范围，并掩盖了策略如何响应用户在任务意图、约束条件和成功定义方面提出的变化。我们认为，评估现代操作策略需要将评估重新定义为一种在结构化物理领域内、由语言驱动的过程。我们提出了RoboPlayground框架，该框架允许用户在结构化物理领域内使用自然语言编写可执行的操作任务。自然语言指令被编译成具有明确资产定义、初始化分布和成功谓词的可复现任务规范。每条指令定义了一个结构化的相关任务族，在保持可执行性和可比性的同时，实现了可控的语义和行为变化。我们在结构化积木操作领域实例化了RoboPlayground，并从三个维度对其进行了评估。一项用户研究表明，与基于编程和代码辅助的基线方法相比，这种语言驱动的界面更易于使用，且认知负荷更低。在语言定义的任务族上评估学习到的策略，揭示了在固定基准测试评估中不明显的泛化失败案例。最后，我们证明任务多样性随贡献者多样性而非单纯任务数量而扩展，使得评估空间能够通过众包贡献持续增长。项目页面：https://roboplayground.github.io

摘要 (Abstract)

Evaluation of robotic manipulation systems has largely relied on fixed benchmarks authored by a small number of experts, where task instances, constraints, and success criteria are predefined and difficult to extend. This paradigm limits who can shape evaluation and obscures how policies respond to user-authored variations in task intent, constraints, and notions of success. We argue that evaluating modern manipulation policies requires reframing evaluation as a language-driven process over structured physical domains. We present RoboPlayground, a framework that enables users to author executable manipulation tasks using natural language within a structured physical domain. Natural language instructions are compiled into reproducible task specifications with explicit asset definitions, initialization distributions, and success predicates. Each instruction defines a structured family of related tasks, enabling controlled semantic and behavioral variation while preserving executability and comparability. We instantiate RoboPlayground in a structured block manipulation domain and evaluate it along three axes. A user study shows that the language-driven interface is easier to use and imposes lower cognitive workload than programming-based and code-assist baselines. Evaluating learned policies on language-defined task families reveals generalization failures that are not apparent under fixed benchmark evaluations. Finally, we show that task diversity scales with contributor diversity rather than task count alone, enabling evaluation spaces to grow continuously through crowd-authored contributions. Project Page: https://roboplayground.github.io

关键词: robotic manipulation, evaluation framework, natural language interface, structured physical domains, task specification, generalization failure, democratizing evaluation, user study

175. ❌ Faster Superword Tokenization

作者: Craig W. Schmidt, Chris Tanner, Yuval Pinter 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05192v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于改进BPE（Byte Pair Encoding）分词算法，提出了一种更快的训练方法，允许形成超词（superwords）。虽然BPE是LLM预训练中的关键组件，但论文本身不直接研究LLM架构、训练技术或应用，而是聚焦于分词算法的效率改进。因此，仅与’Large Language Models’和’Pre-training’有一定关联（因为BPE常用于LLM预训练），但与大多数关键词（如MoE、SFT、RAG、推理加速等）无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种更快的两阶段训练方法，用于改进BPE分词算法以支持超词形成，将训练时间从数天大幅缩短至约10分钟，同时保持与原有算法相同的结果。

摘要翻译

字节对编码（Byte Pair Encoding，BPE）是一种广泛使用的分词算法，其生成的词元无法跨越预分词边界，这在功能上限制了它至多只能表示完整单词。BoundlessBPE与SuperBPE算法通过放宽这一限制，允许形成由多个预词元组合而成的超词（superwords），从而对BPE进行了扩展与改进。然而，先前的实现方案在训练上不切实际：例如，BoundlessBPE在1GB数据上的训练需耗费4.7个CPU日。我们证明，超合并候选（supermerge candidates）——即两个或更多连续且符合形成超合并条件的预词元——可以像常规预词元一样按频率进行聚合。这避免了如BoundlessBPE和SuperBPE原始实现所需将完整文档保留在内存中的问题，从而显著提升了训练速度。我们提出了一种两阶段的BoundlessBPE框架，将常规合并的第一阶段学习与超合并的第二阶段学习分离开来，其结果与原始实现完全一致。我们还证明了两阶段BoundlessBPE与SuperBPE近乎等价，区别在于SuperBPE中需手动选择的超参数可在BoundlessBPE的第二阶段自动确定。这些改进使得实现速度大幅提升，在相同1GB数据上，BoundlessBPE和SuperBPE的训练时间分别仅需603秒和593秒，速度提高了600倍以上。针对BoundlessBPE、SuperBPE和BPE，我们均开源了参考性的Python实现和快速的Rust实现。

摘要 (Abstract)

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend and improve BPE by relaxing this limitation and allowing the formation of superwords, which are combinations of pretokens that form phrases. However, previous implementations were impractical to train: for example, BoundlessBPE took 4.7 CPU days to train on 1GB of data. We show that supermerge candidates, two or more consecutive pretokens eligible to form a supermerge, can be aggregated by frequency much like regular pretokens. This avoids keeping full documents in memory, as the original implementations of BoundlessBPE and SuperBPE required, leading to a significant training speedup. We present a two-phase formulation of BoundlessBPE that separates first-phase learning of regular merges from second-phase learning of supermerges, producing identical results to the original implementation. We also show a near-equivalence between two-phase BoundlessBPE and SuperBPE, with the difference being that a manually selected hyperparameter used in SuperBPE can be automatically determined in the second phase of BoundlessBPE. These changes enable a much faster implementation, allowing training on that same 1GB of data in 603 and 593 seconds for BoundlessBPE and SuperBPE, respectively, a more than 600x increase in speed. For each of BoundlessBPE, SuperBPE, and BPE, we open-source both a reference Python implementation and a fast Rust implementation.

关键词: tokenization, Byte Pair Encoding, BPE, superwords, training efficiency, algorithm optimization, natural language processing, pretokenization

176. ❌ Improving Clinical Trial Recruitment using Clinical Narratives and Large Language Models

作者: Ziyi Chen, Mengxian Lyu, Cheng Peng, Yonghui Wu 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05190v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究大语言模型（LLMs）在临床医学领域的应用，特别是使用LLMs筛选临床叙事以改善临床试验招募。高度相关的关键词包括：‘Large Language Models’（论文明确使用LLMs）、‘Retrieval-Augmented Generation’（论文探索RAG策略）、‘Context Window Extension’（论文处理长文档的’Lost in the Middle’问题）、‘AI for Science’（论文属于生物医学AI应用）。‘Pre-training’得5分，因为论文提到’medical-adapted LLMs’可能涉及领域适应。其他关键词如MoE、SLMs、SFT、RLHF等未在摘要中提及，得0分。

!!! tip deepseek-chat TL;DR

该研究探索使用大语言模型（LLMs）筛选临床叙事以改善临床试验招募，发现结合RAG策略的MedGemma模型在N2C2数据集上取得了89.05%的最佳微F1分数。

摘要翻译

筛选患者入组是一个众所周知且劳动密集的瓶颈问题，它导致入组不足并最终造成试验失败。近期大语言模型（LLMs）的突破为利用人工智能改进筛选提供了前景广阔的机会。本研究系统探索了基于编码器和解码器的生成式大语言模型，用于筛选临床叙事文本以促进临床试验招募。我们考察了通用大语言模型和医学适配的大语言模型，并探索了三种策略来缓解处理长文档时的“中间迷失”问题，包括：1）原始长上下文：使用大语言模型的默认上下文窗口；2）基于命名实体识别（NER）的抽取式摘要：利用命名实体识别将长文档转化为摘要；3）检索增强生成（RAG）：基于资格标准进行动态证据检索。评估采用2018年N2C2 Track 1基准数据集。我们的实验结果表明，采用RAG策略的MedGemma模型取得了最佳的微平均F1分数89.05%，优于其他模型。生成式大语言模型显著改进了需要在长文档中进行长期推理的试验标准，而仅涉及短片段上下文（例如实验室检测）的试验标准则显示出渐进式改进。在实际应用中采用大语言模型进行试验招募时，必须考虑具体标准，以在基于规则的查询、基于编码器的大语言模型和生成式大语言模型之间进行选择，从而在合理的计算成本内最大化效率。

摘要 (Abstract)

Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures. Recent breakthroughs in large language models (LLMs) offer a promising opportunity to use artificial intelligence to improve screening. This study systematically explored both encoder- and decoder-based generative LLMs for screening clinical narratives to facilitate clinical trial recruitment. We examined both general-purpose LLMs and medical-adapted LLMs and explored three strategies to alleviate the “Lost in the Middle” issue when handling long documents, including 1) Original long-context: using the default context windows of LLMs, 2) NER-based extractive summarization: converting the long document into summarizations using named entity recognition, 3) RAG: dynamic evidence retrieval based on eligibility criteria. The 2018 N2C2 Track 1 benchmark dataset is used for evaluation. Our experimental results show that the MedGemma model with the RAG strategy achieved the best micro-F1 score of 89.05%, outperforming other models. Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents, whereas trial criteria that span a short piece of context (e.g., lab tests) show incremental improvements. The real-world adoption of LLMs for trial recruitment must consider specific criteria for selecting among rule-based queries, encoder-based LLMs, and generative LLMs to maximize efficiency within reasonable computing costs.

关键词: Large Language Models, Clinical Trial Recruitment, Clinical Narratives, Retrieval-Augmented Generation, MedGemma, N2C2 Dataset, Lost in the Middle, Long Document Processing

177. ❌ What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews

作者: Jonathan Ivey, Anjalie Field, Ziang Xiao 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05163v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究定性访谈中高质量响应的评估指标，属于社会科学和自然语言处理交叉领域，但完全不涉及大模型、深度学习技术原理或AI在科学领域的应用创新。论文仅提到NLP系统评估作为背景，但未涉及任何具体的大模型技术、架构、训练方法、推理优化、对齐技术或科学应用。所有关键词均与大模型技术或AI科学应用直接相关，而本文核心是访谈质量评估的实证分析，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文通过分析343个访谈转录本，评估了10种访谈响应质量指标，发现与关键研究问题的直接相关性是最强的质量预测因子，而清晰度和基于惊讶度的信息性这两个常用指标则无效。

摘要翻译

定性访谈在获取高质量回答时能为人类经验提供关键洞见。尽管定性与自然语言处理研究者已提出多种访谈质量的衡量标准，但这些标准缺乏验证——即高分回答是否真正有助于实现研究目标。在本研究中，我们识别、实施并评估了10种已提出的访谈回答质量衡量标准，以确定哪些标准能实际预测回答对研究发现的贡献度。为开展分析，我们引入了“定性访谈语料库”，这是一个新构建的数据集，包含来自14个真实研究项目的343份访谈转录文本，共计16,940条参与者回答。我们发现，与核心研究问题的直接相关性是回答质量的最强预测指标。此外，我们发现常用于评估自然语言处理访谈系统的两项指标——清晰度和基于信息熵的信息量——并不能预测回答质量。本研究通过分析性洞见及基于实证、可扩展的度量标准，为定性研究的设计与自动化访谈系统的评估提供了参考依据。

摘要 (Abstract)

Qualitative interviews provide essential insights into human experiences when they elicit high-quality responses. While qualitative and NLP researchers have proposed various measures of interview quality, these measures lack validation that high-scoring responses actually contribute to the study’s goals. In this work, we identify, implement, and evaluate 10 proposed measures of interview response quality to determine which are actually predictive of a response’s contribution to the study findings. To conduct our analysis, we introduce the Qualitative Interview Corpus, a newly constructed dataset of 343 interview transcripts with 16,940 participant responses from 14 real research projects. We find that direct relevance to a key research question is the strongest predictor of response quality. We additionally find that two measures commonly used to evaluate NLP interview systems, clarity and surprisal-based informativeness, are not predictive of response quality. Our work provides analytic insights and grounded, scalable metrics to inform the design of qualitative studies and the evaluation of automated interview systems.

关键词: qualitative interviews, response quality, interview quality measures, NLP interview systems, empirical analysis, qualitative interview corpus, relevance prediction, automated interview evaluation

178. ❌ Planning to Explore: Curiosity-Driven Planning for LLM Test Generation

作者: Alfonso Amayuelas, Firas Laakom, Piotr Piękos, Wenyi Wang, Yifan Xu, Yuhui Wang, Jürgen Schmidhuber, William Wang 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05159v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM在代码测试生成中的应用，提出CovQValue方法进行好奇心驱动的规划，属于LLM Agents领域的高度相关应用（10分）。论文涉及规划、探索和序列决策，与Chain of Thought和System 2 Thinking有一定关联（5分）。其他关键词如MoE、SFT、RAG等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对LLM在代码测试生成中贪婪方法覆盖不足的问题，提出了基于好奇心驱动的规划方法CovQValue，在多个基准测试中显著提高了分支覆盖率。

摘要翻译

将大语言模型用于代码生成的应用已自然延伸至代码测试与评估领域。随着代码库规模和复杂度的增长，对自动化测试生成的需求也日益提升。当前基于大语言模型的测试生成方法主要依赖最大化即时覆盖率增益的策略，这是一种贪婪式方法，在涉及深层分支的代码上会陷入瓶颈——因为这些分支的抵达需要多个设置步骤，而每个步骤单独执行时可能带来零新增覆盖率。借鉴贝叶斯探索原理，我们将程序的分支结构视为未知环境，并将不断演化的覆盖率图谱作为代理概率后验，用以表征大语言模型当前已发现的内容。我们提出的方法CovQValue将覆盖率图谱反馈给大语言模型，并行生成多样化的候选计划，并通过大语言模型估算的Q值选择信息量最大的计划，以寻求在即时分支发现与未来可达性之间取得平衡的行动策略。在TestGenEval Lite基准测试中，我们的方法优于贪婪选择策略，在三种主流大语言模型上实现了51-77%的分支覆盖率提升，并在77-84%的目标任务中表现更优。此外，我们构建了用于迭代测试生成的基准RepoExploreBench，在该基准上我们的方法取得了40-74%的覆盖率。这些结果表明，基于好奇心驱动的规划方法在大语言模型探索中具有潜力，能够通过序列化交互更有效地发现程序行为。

摘要 (Abstract)

The use of LLMs for code generation has naturally extended to code testing and evaluation. As codebases grow in size and complexity, so does the need for automated test generation. Current approaches for LLM-based test generation rely on strategies that maximize immediate coverage gain, a greedy approach that plateaus on code where reaching deep branches requires setup steps that individually yield zero new coverage. Drawing on principles of Bayesian exploration, we treat the program’s branch structure as an unknown environment, and an evolving coverage map as a proxy probabilistic posterior representing what the LLM has discovered so far. Our method, CovQValue, feeds the coverage map back to the LLM, generates diverse candidate plans in parallel, and selects the most informative plan by LLM-estimated Q-values, seeking actions that balance immediate branch discovery with future reachability. Our method outperforms greedy selection on TestGenEval Lite, achieving 51-77% higher branch coverage across three popular LLMs and winning on 77-84% of targets. In addition, we build a benchmark for iterative test generation, RepoExploreBench, where they achieve 40-74%. These results show the potential of curiosity-driven planning methods for LLM-based exploration, enabling more effective discovery of program behavior through sequential interaction

关键词: LLM test generation, curiosity-driven planning, code coverage, Bayesian exploration, sequential interaction, CovQValue, branch coverage, RepoExploreBench

179. ❌ Just Pass Twice: Efficient Token Classification with LLMs for Zero-Shot NER

作者: Ahmed Ewais, Ahmed Hashish, Amr Ali 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05158v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs在零样本命名实体识别（NER）中的应用，提出JPT方法解决因果注意力机制的限制，因此与’Large Language Models’高度相关（10分）。论文提到方法比生成式方法快20倍以上，与’Inference Acceleration’有一定关联（5分）。论文指出生成式方法存在幻觉实体问题，JPT方法旨在解决此问题，与’Hallucination Mitigation’有一定关联（5分）。其他关键词如MoE、SLMs、Scaling Laws、Pre-training、Alignment、RAG、CoT、Agents、Quantization等均未在论文中涉及，评分为0分。

!!! tip deepseek-chat TL;DR

论文提出Just Pass Twice方法，通过将输入文本拼接两次使因果LLMs能够利用双向上下文进行判别式token分类，解决了零样本NER中因果注意力机制的限制，在基准测试中取得了最先进的结果且推理速度提升20倍以上。

摘要翻译

大型语言模型编码了可用于零样本命名实体识别的广泛世界知识。然而，其因果注意力机制（其中标记仅关注上文语境）在消歧需要未来语境时阻碍了有效的标记分类。现有方法以生成式方式使用大语言模型，提示其列举实体或生成结构化输出，但存在自回归解码速度慢、实体幻觉和格式错误等问题。
我们提出“仅需两次传递”（Just Pass Twice, JPT），这是一种简单而有效的方法，使因果大语言模型能够利用完整双向语境执行判别式标记分类。我们的核心见解是：将输入与自身拼接后，第二次传递中的每个标记都能关注完整句子，且无需修改模型架构。我们将这些表征与定义引导的实体嵌入相结合，以实现灵活的零样本泛化。该方法在零样本命名实体识别基准测试中取得了最先进的结果，在CrossNER和MIT基准上的平均F1分数超越先前最佳方法+7.9，且速度比同类生成式方法快20倍以上。

摘要 (Abstract)

Large language models encode extensive world knowledge valuable for zero-shot named entity recognition. However, their causal attention mechanism, where tokens attend only to preceding context, prevents effective token classification when disambiguation requires future context. Existing approaches use LLMs generatively, prompting them to list entities or produce structured outputs, but suffer from slow autoregressive decoding, hallucinated entities, and formatting errors. We propose Just Pass Twice (JPT), a simple yet effective method that enables causal LLMs to perform discriminative token classification with full bidirectional context. Our key insight is that concatenating the input to itself lets each token in the second pass attend to the complete sentence, requiring no architectural modifications. We combine these representations with definition-guided entity embeddings for flexible zero-shot generalization. Our approach achieves state-of-the-art results on zero-shot NER benchmarks, surpassing the previous best method by +7.9 F1 on average across CrossNER and MIT benchmarks, being over 20x faster than comparable generative methods.

关键词: Large Language Models, Zero-shot NER, Token Classification, Causal Attention, Bidirectional Context, Inference Efficiency, Hallucination Mitigation, Discriminative Methods

180. ❌ EvolveRouter: Co-Evolving Routing and Prompt for Multi-Agent Question Answering

作者: Jiatan Huang, Zheyuan Zhang, Kaiwen Shi, Yanfang Ye, Chuxu Zhang 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05149v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文EvolveRouter专注于多智能体问答系统中的路由和提示协同演化，核心涉及LLM智能体（LLM Agents）和多智能体系统（Multi-agent Systems），与这两个关键词高度相关（10分）。论文使用大语言模型（Large Language Models）作为基础，因此也高度相关（10分）。其他关键词如MoE、SLMs、训练技术、推理优化、科学AI应用等均未在摘要中提及或与论文核心内容无关，因此得0分。

!!! tip deepseek-chat TL;DR

该论文提出了EvolveRouter框架，通过协同演化路由和提示来改进多智能体问答系统中的智能体质量和协作结构，实验表明其在多个基准测试中优于现有路由方法。

摘要翻译

大语言模型智能体通常展现出互补优势，使得路由机制成为多智能体问答系统中一种前景广阔的方法。然而，现有路由方法在两方面仍存在明显局限：它们通常在固定智能体池中进行优化而不改进智能体本身，且往往依赖僵化的协作方案，无法根据查询动态调整参与智能体的数量。我们提出EvolveRouter——一个可训练的框架，通过联合提升智能体质量与协作结构来解决这两项局限。首先，EvolveRouter将基于图结构的查询路由与定向指令优化在闭环协同进化过程中耦合，使路由诊断能够指导智能体改进，同时优化后的智能体为路由提供更清晰的监督信号。其次，该框架引入自适应推理策略，通过基于路由权重的答案一致性动态确定每个查询的有效协作规模。这些设计共同实现了更强大且更高效的多智能体推理。在五个问答基准测试上的实验表明，EvolveRouter在F1分数和精确匹配率上均持续优于最先进的路由基线，进一步分析也验证了闭环优化与自适应协作机制的有效性。

摘要 (Abstract)

Large language model agents often exhibit complementary strengths, making routing a promising approach for multi-agent question answering. However, existing routing methods remain limited in two important ways: they typically optimize over a fixed pool of agents without improving the agents themselves, and they often rely on rigid collaboration schemes that cannot adapt the number of participating agents to the query. We propose EvolveRouter, a trainable framework that addresses both limitations by jointly improving agent quality and collaboration structure. First, EvolveRouter couples graph-based query routing with targeted instruction refinement in a closed-loop co-evolution process, allowing router diagnostics to guide agent improvement while refined agents provide cleaner supervision for routing. Second, it introduces an adaptive inference strategy that dynamically determines the effective collaboration size for each query through router-weighted answer agreement. Together, these designs enable more capable and more efficient multi-agent reasoning. Experiments on five question answering benchmarks show that EvolveRouter consistently outperforms SOTA routing baselines in both F1 and exact match, while further analysis confirms the benefits of closed-loop refinement and adaptive collaboration.

关键词: Large Language Model Agents, Multi-agent Question Answering, Routing, Co-evolution, Instruction Refinement, Adaptive Collaboration, Graph-based Query Routing, Closed-loop Refinement

181. ❌ Action Images: End-to-End Policy Learning via Multiview Video Generation

作者: Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06168v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出了一种名为Action Images的世界动作模型，将策略学习建模为多视角视频生成，核心是机器人控制与视频生成的统一表示。与关键词的相关性分析如下：1）高度相关（10分）：‘World Models AND General World Models’ - 论文明确研究世界动作模型（WAMs），属于世界模型范畴。2）中等相关（5分）：‘Pre-training OR Continual Pre-training OR Domain Adaptation’ - 论文提到利用预训练视频模型的知识，涉及预训练概念；‘Mechanistic Interpretability OR Explainable AI’ - 论文强调动作图像的“可解释性”，使其动作表示易于理解。3）无关（0分）：其他关键词主要涉及大语言模型（LLMs）的特定技术（如MoE、RLHF、RAG、量化等）、推理方法（CoT、MCTS）、代理系统或特定科学领域（生物信息学），而本文专注于机器人策略学习的视频生成方法，未涉及这些内容。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Action Images的统一世界动作模型，通过将7-DoF机器人动作转化为可解释的多视角动作图像（即像素接地的视频），将策略学习建模为多视角视频生成，从而在RLBench和真实世界评估中实现了最强的零样本成功率，并提高了视频-动作联合生成质量。

摘要翻译

世界行动模型已成为机器人策略学习的一个前景广阔的方向，其能够利用强大的视频主干网络对未来状态进行建模。然而，现有方法通常依赖独立的行动模块，或使用非像素基础的行动表示，这难以充分利用视频模型的预训练知识，并限制了跨视角和跨环境的迁移能力。在本研究中，我们提出了“行动图像”，一种统一的世界行动模型，它将策略学习构建为多视角视频生成任务。我们并非将控制编码为低维标记，而是将7自由度机器人动作转化为可解释的行动图像：这是一种基于二维像素、并明确追踪机器人手臂运动的多视角行动视频。这种以像素为基础的行动表示使得视频主干网络本身能够作为零样本策略，无需额外的策略头或行动模块。除控制任务外，同一统一模型在共享表征下还支持视频-行动联合生成、行动条件视频生成以及行动标注任务。在RLBench和真实世界评估中，我们的模型实现了最强的零样本成功率，并在视频-行动联合生成质量上超越了先前的视频空间世界模型，这表明可解释的行动图像是策略学习的一条可行路径。

摘要 (Abstract)

World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.

关键词: World Action Models, Policy Learning, Multiview Video Generation, Action Images, Robot Control, Zero-shot Policy, Video Backbone, Pixel-grounded Representation

182. ❌ HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

作者: Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta, Mahdieh Soleymani Baghshah, Anna Rohrbach, Marcus Rohrbach 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06165v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究视觉语言模型中的物体幻觉检测与缓解，与’Large Language Models’相关（8分），因为涉及大模型应用；与’Hallucination Mitigation’高度相关（10分），这是论文核心内容；与’Mechanistic Interpretability’有一定关联（5分），因为分析了注意力机制等内部信号。其他关键词如MoE、SFT、RAG等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对视觉语言模型在图像描述中产生物体幻觉的问题，提出了HaloProbe贝叶斯框架来检测和缓解幻觉，实验表明该方法比现有干预方法更有效地减少幻觉同时保持实用性。

摘要翻译

大型视觉语言模型在图像描述中可能产生物体幻觉现象，这凸显了开发有效检测与缓解策略的必要性。先前研究通常依赖模型对视觉标记的注意力权重作为检测信号。我们发现，由于隐藏的混杂因素——特别是标记位置和描述中的物体重复——基于粗粒度注意力的分析并不可靠。这导致辛普森悖论的出现：当统计数据被聚合时，注意力趋势会发生逆转或消失。基于此观察，我们提出了HaloProbe，一个贝叶斯框架，通过分解外部描述统计量与内部解码信号来估计标记级别的幻觉概率。HaloProbe采用平衡训练来分离内部证据，并将其与基于外部特征学习得到的先验知识相结合，以恢复真实后验概率。虽然基于干预的缓解方法通常通过修改模型内部结构来降低实用性或流畅性，我们将HaloProbe作为外部评分信号用于非侵入式缓解。实验表明，在保持实用性的前提下，基于HaloProbe引导的解码方法比当前最先进的干预式方法能更有效地减少幻觉。

摘要 (Abstract)

Large vision-language models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model’s attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson’s paradox: the attention trends reverse or disappear when statistics are aggregated. Based on this observation, we introduce HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities. HaloProbe uses balanced training to isolate internal evidence and combines it with learned prior over external features to recover the true posterior. While intervention-based mitigation methods often degrade utility or fluency by modifying models’ internals, we use HaloProbe as an external scoring signal for non-invasive mitigation. Our experiments show that HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility.

关键词: vision-language models, object hallucinations, detection, mitigation, Bayesian framework, attention analysis, non-invasive decoding, hallucination probability

183. ❌ The Character Error Vector: Decomposable errors for page-level OCR evaluation

作者: Jonathan Bourne, Mwiza Simbeye, Joseph Nockels 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06160v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于文档理解（Document Understanding）领域的光学字符识别（OCR）评估方法，提出了一种新的字符错误向量（CEV）度量标准。论文内容与所有评分关键词（均涉及大模型、深度学习技术原理、AI科学应用等）完全无关，因为：1）论文未涉及任何大语言模型、深度学习模型架构或训练技术；2）未讨论模型缩放、微调、对齐、推理优化等深度学习技术主题；3）未涉及AI在生物信息学、化学信息学等科学领域的应用；4）核心贡献是OCR评估指标和方法论，属于传统计算机视觉和文档处理领域，而非大模型或深度学习创新。

!!! tip deepseek-chat TL;DR

该论文针对页面级OCR评估中字符错误率（CER）在文本解析不完美时失效的问题，提出了可分解的字符错误向量（CEV）度量方法，并通过验证表明CEV能有效桥接解析指标和局部指标，且在复杂布局的档案报纸数据集中发现传统流水线方法优于端到端模型。

摘要翻译

字符错误率（CER）是评估光学字符识别（OCR）质量的关键指标。然而，该指标假设文本已被完美解析，而实际情况往往并非如此。在页面解析错误的情况下，CER会变得无法定义，这限制了其作为评估指标的应用，并使页面级OCR的评估变得困难，尤其是在使用不共享标注模式的数据时。我们引入了字符错误向量（CEV），一种用于OCR的字符袋评估器。CEV可分解为解析错误、OCR错误及交互错误分量。这种可分解性使实践者能够专注于文档理解流程中对整体文本提取质量影响最大的环节。CEV可通过多种方法实现，我们展示了其中两种：空间感知字符错误率（SpACER）以及使用詹森-香农距离的字符分布方法。我们通过与其他指标的对比验证CEV的性能：首先验证其与CER的关系；其次验证解析质量；最后将其作为页面级OCR质量的直接衡量标准。验证过程表明，CEV是解析指标与CER等局部指标之间的重要桥梁。我们分析了一个由复杂版面的退化图像构成的档案报纸数据集，发现传统流水线方法优于先进的端到端模型。虽然CEV需要字符级定位以实现最优分类，但通过对易获取数值设定阈值，可以以0.91的F1值预测主要错误来源。我们将CEV作为Python库的一部分提供，以支持文档理解研究。

摘要 (Abstract)

The Character Error Rate (CER) is a key metric for evaluating the quality of Optical Character Recognition (OCR). However, this metric assumes that text has been perfectly parsed, which is often not the case. Under page-parsing errors, CER becomes undefined, limiting its use as a metric and making evaluating page-level OCR challenging, particularly when using data that do not share a labelling schema. We introduce the Character Error Vector (CEV), a bag-of-characters evaluator for OCR. The CEV can be decomposed into parsing and OCR, and interaction error components. This decomposability allows practitioners to focus on the part of the Document Understanding pipeline that will have the greatest impact on overall text extraction quality. The CEV can be implemented using a variety of methods, of which we demonstrate SpACER (Spatially Aware Character Error Rate) and a Character distribution method using the Jensen-Shannon Distance. We validate the CEV’s performance against other metrics: first, the relationship with CER; then, parse quality; and finally, as a direct measure of page-level OCR quality. The validation process shows that the CEV is a valuable bridge between parsing metrics and local metrics like CER. We analyse a dataset of archival newspapers made of degraded images with complex layouts and find that state-of-the-art end-to-end models are outperformed by more traditional pipeline approaches. Whilst the CEV requires character-level positioning for optimal triage, thresholding on easily available values can predict the main error source with an F1 of 0.91. We provide the CEV as part of a Python library to support Document understanding research.

关键词: Character Error Vector, OCR evaluation, page-level OCR, document understanding, parsing errors, Jensen-Shannon Distance, text extraction, archival newspapers

184. ❌ SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

作者: Hiba Dahmani, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Laurent Caraffa, Jean-Philippe Tarel, Roland Brémond 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06113v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于3D场景生成技术，特别是基于扩散模型和体素网格的大规模驾驶场景生成，不涉及任何大语言模型（LLM）、深度学习技术原理创新或AI在科学领域的应用。所有关键词均与大语言模型、深度学习技术原理或AI for Science相关，而本文属于计算机视觉和3D生成领域，与这些关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于语义条件扩散模型和Σ-Voxfield网格的3D生成框架，解决了大规模、多视角一致的户外驾驶场景生成问题，实现了无需逐场景优化的逼真图像渲染。

摘要翻译

可扩展的户外驾驶场景生成需要具备多视角一致性且能扩展至大范围的三维表征。现有解决方案要么依赖蒸馏至三维空间的图像或视频生成模型，这会损害几何一致性并将渲染限制在训练视角内；要么局限于小规模三维场景或以物体为中心的生成。本研究提出一种基于$Σ$-Voxfield网格的三维生成框架，该离散表征中每个被占用的体素存储固定数量的着色表面样本。为生成此表征，我们训练了一个语义条件扩散模型，该模型在局部体素邻域上运行，并利用三维位置编码捕捉空间结构。我们通过对重叠区域进行渐进式空间外推来实现大场景扩展。最后，我们通过延迟渲染模块对生成的$Σ$-Voxfield网格进行渲染，获得逼真图像，从而无需逐场景优化即可实现大规模多视角一致的三维场景生成。大量实验表明，本方法能生成多样化的大规模城市户外场景，可渲染为具有多种传感器配置和相机轨迹的逼真图像，同时在计算成本上较现有方法保持适度水平。

摘要 (Abstract)

Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on $Σ$-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $Σ$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.

关键词: 3D scene generation, diffusion model, voxel representation, driving scenes, multiview consistency, semantic conditioning, progressive outpainting, deferred rendering

185. ❌ Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes

作者: Athanasios Angelakis, Marta Gomez-Barrero 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06099v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 该论文专注于医学影像领域的紧凑型视觉变换器（ZACH-ViT）的鲁棒性评估，涉及图像损坏和对抗性扰动测试。所有关键词均与大语言模型（LLM）或深度学习通用技术相关，而本文研究的是特定视觉变换器架构在医学影像中的应用，与LLM无关。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为医学影像属于生物信息学或科学AI的应用领域，但并非核心焦点（论文更侧重模型鲁棒性而非AI在科学中的广泛创新），因此给予8分。其他关键词如MoE、Scaling Laws、RLHF等均不适用，评分为0。

!!! tip deepseek-chat TL;DR

该研究评估了紧凑型视觉变换器ZACH-ViT在低数据医学影像中对抗图像损坏和对抗性扰动的鲁棒性，发现其在常见损坏下表现稳健，但在对抗性攻击下仍有挑战。

摘要翻译

近期提出的ZACH-ViT（零标记自适应紧凑分层视觉Transformer）为医学影像构建了一种紧凑的置换不变视觉Transformer，并论证了与空间结构的架构对齐可能比通用基准性能优势更为重要。该设计的动机源于以下观察：当空间组织信息较弱、呈局部分布或在生物医学图像间存在差异时，位置嵌入和专用类别标记所编码的固定空间假设可能并非最优。这项基础研究在MedMNIST数据集上建立了依赖数据特性的清晰性能图谱，但未详细考察其鲁棒性。本研究首次提出以鲁棒性为核心的ZACH-ViT扩展方案，通过在相同低数据设置下评估模型在常见图像损坏和对抗扰动中的表现。我们在七个MedMNIST数据集上，以每类50个样本、固定超参数和五个随机种子，将ZACH-ViT与三种从头训练的紧凑基线模型（ABMIL、Minimal-ViT和TransMIL）进行比较。在整个基准测试中，ZACH-ViT在干净数据（1.57）和常见损坏条件下（1.57）均取得最佳平均排名，表明其在基线预测性能与真实图像退化鲁棒性之间取得了良好平衡。在对抗压力下，所有模型性能均显著下降；尽管如此，ZACH-ViT仍保持竞争力，在FGSM攻击下排名第一（2.00），在PGD攻击下排名第二（2.29），而ABMIL在对抗场景中表现最佳。这些结果拓展了原始ZACH-ViT的理论框架：紧凑型置换不变Transformer的优势不仅限于干净数据评估，在低数据医学影像的真实扰动压力下仍能持续体现，而对抗鲁棒性对所有被评估模型而言仍是待解决的重要挑战。

摘要 (Abstract)

The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines, ABMIL, Minimal-ViT, and TransMIL, on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation. Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, ranking first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best overall. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.

关键词: ZACH-ViT, Vision Transformer, medical imaging, robustness, image corruptions, adversarial perturbations, low-data regimes, MedMNIST

186. ❌ CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics

作者: Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06036v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	5.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文CoStream专注于视频流分析中的大模型（LLM）推理效率优化，核心涉及LLM在视觉-语言模型服务中的应用，因此与’Large Language Models’高度相关（10分）。系统通过代码元数据指导ViT编码前的补丁剪枝和LLM预填充期间的选择性KV缓存刷新，这直接涉及推理加速和KV缓存管理，因此与’Speculative Decoding OR Inference Acceleration’（8分）和’KV Cache Compression OR Linear Attention OR FlashAttention’（5分）有一定关联。其他关键词如MoE、SLMs、训练方法、对齐、代理等均未在论文中涉及，故评0分。

!!! tip deepseek-chat TL;DR

CoStream提出了一种基于视频编解码器元数据的流式视频分析系统，通过统一优化视频解码、视觉处理和LLM预填充，实现了高达3倍的吞吐量提升和87%的GPU计算减少，同时保持竞争性精度。

摘要翻译

视频流分析是视觉语言模型服务的关键负载，但多模态推理的高成本限制了其可扩展性。现有系统通过利用视频流中的时间与空间冗余来降低推理成本，但其优化目标仅限于视觉变换器（ViT）或大型语言模型（LLM），且视角局限，未能挖掘端到端的优化潜力。此外，现有方法需通过离线性能分析与训练或昂贵的在线计算来识别冗余，产生了显著开销，难以适应动态实时流场景。
我们提出了CoStream，这是一个编解码器引导的流式视频分析系统，其核心洞见在于：视频编解码器在压缩过程中已提取出每路流的时间与空间结构信息作为副产品。CoStream将此类编解码器元数据视为低成本的运行时信号，用以统一优化视频解码、视觉处理与LLM预填充阶段，同时直接操作压缩比特流也天然减少了传输开销。该系统据此在ViT编码前实施编解码器引导的块剪枝，并在LLM预填充阶段进行选择性键值缓存刷新，这两项操作均完全在线运行，无需离线训练。实验表明，相较于前沿基线方法，CoStream在保持竞争力准确度（仅产生0-8%的F1分数下降）的同时，实现了最高3倍的吞吐量提升与最高87%的GPU计算量削减。

摘要 (Abstract)

Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CoStream, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CoStream achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.

关键词: video streaming analytics, vision-language model, LLM prefilling, codec-guided optimization, KV cache refresh, inference acceleration, resource-efficient system, GPU compute reduction

187. ❌ Toward Aristotelian Medical Representations: Backpropagation-Free Layer-wise Analysis for Interpretable Generalized Metric Learning on MedMNIST

作者: Michael Karnes, Alper Yilmaz 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06017v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文聚焦于医学影像的深度学习可解释性，提出A-ROM框架，使用预训练Vision Transformers的度量空间进行快速建模，并用kNN分类器替代传统决策层以提高透明度。与大多数关键词无关，仅与’Mechanistic Interpretability OR Explainable AI’高度相关（核心内容），与’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（医学AI应用），与’Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联（使用预训练ViT）。其他关键词主要涉及大语言模型、训练技术、推理优化等，论文未涉及。

!!! tip deepseek-chat TL;DR

该研究针对医学影像深度学习模型的可解释性问题，提出基于预训练Vision Transformers和kNN分类器的A-ROM框架，在MedMNIST数据集上实现了与基准相当的性能，同时提供了透明且可扩展的少样本解决方案。

摘要翻译

尽管深度学习在医学影像领域取得了显著成就，但基于反向传播的模型的“黑箱”特性仍是其临床应用的重大障碍。为弥合这一差距，我们提出了亚里士多德式快速对象建模（A-ROM），这是一个建立在柏拉图表征假说（Platonic Representation Hypothesis, PRH）之上的框架。该假说认为，在大量多样化数据集上训练的模型会收敛于一种普适且客观的现实表征。通过利用预训练视觉变换器（Vision Transformers, ViTs）的可泛化度量空间，A-ROM能够快速建模新的医学概念，而无需进行进一步基于梯度的微调所带来的计算负担或不透明性。我们用人可读的概念字典和k近邻（k-Nearest Neighbors, kNN）分类器取代了传统不透明的决策层，以确保模型的逻辑保持可解释性。在MedMNIST v2数据集上的实验表明，A-ROM能够提供与标准基准相竞争的性能，同时提供了一个简单、可扩展的“小样本”解决方案，满足了现代临床环境对透明度的严格要求。

摘要 (Abstract)

While deep learning has achieved remarkable success in medical imaging, the “black-box” nature of backpropagation-based models remains a significant barrier to clinical adoption. To bridge this gap, we propose Aristotelian Rapid Object Modeling (A-ROM), a framework built upon the Platonic Representation Hypothesis (PRH). This hypothesis posits that models trained on vast, diverse datasets converge toward a universal and objective representation of reality. By leveraging the generalizable metric space of pretrained Vision Transformers (ViTs), A-ROM enables the rapid modeling of novel medical concepts without the computational burden or opacity of further gradient-based fine-tuning. We replace traditional, opaque decision layers with a human-readable concept dictionary and a k-Nearest Neighbors (kNN) classifier to ensure the model’s logic remains interpretable. Experiments on the MedMNIST v2 suite demonstrate that A-ROM delivers performance competitive with standard benchmarks while providing a simple and scalable, “few-shot” solution that meets the rigorous transparency demands of modern clinical environments.

关键词: medical imaging, interpretability, Vision Transformers, metric learning, few-shot learning, clinical transparency, Aristotelian Rapid Object Modeling, MedMNIST

188. ❌ OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

作者: Yukun Wang, Ruihuang Li, Jiale Tao, Shiyuan Yang, Liyi Chen, Zhantao Yang, Handz, Yulan Guo, Shuai Shao, Qinglin Lu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06010v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于视频生成领域，提出了一种名为OmniCamera的统一框架，用于解耦和控制视频中的动态内容与相机运动。论文的核心贡献包括构建混合数据集（OmniCAM）和提出双层次课程协同训练策略。所有评分关键词均与大语言模型（LLMs）、深度学习技术原理或AI在科学领域的应用直接相关，而本文研究的是计算机视觉中的视频生成任务，未涉及任何大语言模型技术、深度学习原理创新或AI在生物/化学等科学领域的应用。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了OmniCamera框架，通过解耦视频内容与相机运动实现灵活的视频生成控制，并利用混合数据集和课程协同训练策略解决了模态冲突和数据稀缺问题，取得了最先进的性能。

摘要翻译

视频本质上交织着两个关键维度：场景的动态内容与观察场景的相机运动。然而，现有的生成模型往往将这两个因素混为一谈，限制了独立控制。在本研究中，我们提出了OmniCamera，这是一个旨在显式解耦并操控这两个维度的统一框架。这种组合式方法允许任意配对相机条件与内容条件，从而实现灵活的视频生成，解锁了前所未有的创作控制力。为克服此类系统固有的模态冲突与数据稀缺的根本性挑战，我们提出了两项关键创新。首先，我们构建了OmniCAM数据集，这是一个新颖的混合数据集，它结合了精选的真实世界视频与合成数据，为鲁棒的多任务学习提供了多样化的配对样本。其次，我们提出了一种双层级课程协同训练策略，该策略能缓解模态干扰并协同利用多样化的数据源进行学习。该策略在两个层面运作：其一，按难度逐步引入控制模态（条件层级）；其二，先在合成数据上训练以实现精确控制，再适应真实数据以追求照片级真实感（数据层级）。最终，OmniCamera实现了最先进的性能，能够在保持卓越视觉质量的同时，对复杂的相机运动进行灵活控制。

摘要 (Abstract)

Video fundamentally intertwines two crucial axes: the dynamic content of a scene and the camera motion through which it is observed. However, existing generation models often entangle these factors, limiting independent control. In this work, we introduce OmniCamera, a unified framework designed to explicitly disentangle and command these two dimensions. This compositional approach enables flexible video generation by allowing arbitrary pairings of camera and content conditions, unlocking unprecedented creative control. To overcome the fundamental challenges of modality conflict and data scarcity inherent in such a system, we present two key innovations. First, we construct OmniCAM, a novel hybrid dataset combining curated real-world videos with synthetic data that provides diverse paired examples for robust multi-task learning. Second, we propose a Dual-level Curriculum Co-Training strategy that mitigates modality interference and synergistically learns from diverse data sources. This strategy operates on two levels: first, it progressively introduces control modalities by difficulties (condition-level), and second, trains for precise control on synthetic data before adapting to real data for photorealism (data-level). As a result, OmniCamera achieves state-of-the-art performance, enabling flexible control for complex camera movements while maintaining superior visual quality.

关键词: video generation, camera control, multi-task learning, disentanglement, hybrid dataset, curriculum training, modality conflict, photorealism

189. ❌ HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation

作者: Tao Hu, Varun Jampani 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05961v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于人类视频生成的扩散模型技术，提出了一种基于关节噪声采样的新框架HumANDiff，用于提高人体运动的物理一致性和保真度。所有评分关键词均与大语言模型（LLMs）、模型训练/对齐技术、推理优化、代理系统、科学AI应用等主题相关，而本文研究的是计算机视觉领域的视频生成扩散模型，未涉及任何大语言模型技术或相关概念，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对生成视频扩散模型在人体运动动态和物理一致性方面的不足，提出了HumANDiff框架，通过关节运动一致噪声采样、联合外观-运动学习和几何运动一致性学习，实现了高质量、运动一致的人类视频生成。

摘要翻译

尽管人类视频生成领域近期取得了巨大进展，生成式视频扩散模型在准确捕捉人体运动动态与物理特性方面仍面临挑战。本文提出了一种新的人类视频生成框架HumANDiff，该框架通过三项关键设计增强对人体运动的控制能力：1）关节化运动一致性噪声采样：该方法将潜在噪声的时空分布相关联，并用在统计人体模板密集表面流形上采样的三维关节化噪声替代非结构化随机高斯噪声。它继承了人体拓扑先验，实现了空间与时间一致的噪声采样。2）外观-运动联合学习：通过从关节化噪声中联合预测像素外观与对应的物理运动，增强了视频扩散模型的标准训练目标。该机制实现了高保真的人类视频合成，例如捕捉运动依赖的衣物褶皱细节。3）几何运动一致性学习：通过在关节化噪声空间中定义的新型几何运动一致性损失，强制实现跨帧的物理运动一致性。HumANDiff通过结合关节化噪声采样对视频扩散模型进行微调，实现了可扩展的可控人类视频生成。因此，本方法对扩散模型设计具有普适性，无需修改模型架构。在推理阶段，HumANDiff可在单一框架内实现图像到视频的生成，无需额外运动模块即可实现内在运动控制。大量实验表明，本方法在生成运动一致、高保真且具有多样化服装风格的人类视频方面达到了最先进的性能水平。项目页面：https://taohuumd.github.io/projects/HumANDiff/

摘要 (Abstract)

Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motions faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances the human motion control with three key designs: 1) Articulated motion-consistent noise sampling that correlates the spatiotemporal distribution of latent noise and replaces the unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. It inherits body topology priors for spatially and temporally consistent noise sampling. 2) Joint appearance-motion learning that enhances the standard training objective of video diffusion models by jointly predicting pixel appearances and corresponding physical motions from the articulated noises. It enables high-fidelity human video synthesis, e.g., capturing motion-dependent clothing wrinkles. 3) Geometric motion consistency learning that enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable controllable human video generation by fine-tuning video diffusion models with articulated noise sampling. Consequently, our method is agnostic to diffusion model design, and requires no modifications to the model architecture. During inference, HumANDiff enables image-to-video generation within a single framework, achieving intrinsic motion control without requiring additional motion modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Project page: https://taohuumd.github.io/projects/HumANDiff/

关键词: human video generation, diffusion models, articulated noise sampling, motion consistency, video synthesis, geometric motion, appearance-motion learning, controllable generation

作者: Ioannis Nasios 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05959v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于使用多模态视觉变换器和集成学习进行滑坡检测，属于计算机视觉和遥感应用领域。论文内容与绝大多数关键词（涉及大语言模型、训练技术、推理优化、智能体等）完全无关，因此评分为0。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为滑坡检测属于地球科学和环境监测应用，是AI在科学领域的一个具体应用，但并非论文的核心技术焦点（核心是视觉模型和集成学习），因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该研究提出了一种融合Sentinel-2光学影像和Sentinel-1 SAR数据的多模态框架，利用多编码器视觉变换器和集成学习方法进行滑坡检测，在非经典变化检测设置下实现了0.919的F1分数，为自然灾害监测提供了可扩展的解决方案。

摘要翻译

滑坡是一种对人类社会、基础设施和生态系统具有严重影响的主要地质灾害，凸显了对精准及时探测方法的需求以支持灾害风险减缓。本研究提出了一种模块化的多模型框架，该框架融合了哨兵-2号光学影像与哨兵-1号合成孔径雷达数据，以实现稳健的滑坡探测。该方法利用多编码器视觉变换器，其中每种数据模态通过独立的轻量级预训练编码器进行处理，在滑坡探测中实现了强劲性能。此外，通过集成多个模型，特别是神经网络与梯度提升模型（LightGBM和XGBoost）的结合，展示了集成学习在进一步提升准确性与鲁棒性方面的优势。研究还整合了衍生光谱指数（如归一化植被指数NDVI）与原始波段数据，以增强对植被和地表变化的敏感性。所提出的方法在滑坡探测任务中取得了当前最先进的F1分数0.919，该任务基于图斑分类而非像素级分割，且无需灾前哨兵-2号数据即可运行，突显了其在非经典变化检测场景下的有效性。该方法还在机器学习竞赛中表现出顶尖性能，在精确率与召回率之间实现了良好平衡，并彰显了显式利用光学与雷达数据互补优势的益处。所开展的实验与研究同时强调了框架的可扩展性与实际应用潜力，能够灵活配置为仅光学、仅SAR或融合数据输入模式，并为更广泛的自然灾害监测与环境变化应用提供了一个可迁移的框架。完整训练与推理代码可在https://github.com/IoannisNasios/sentinel-landslide-cls获取。

摘要 (Abstract)

Landslides represent a major geohazard with severe impacts on human life, infrastructure, and ecosystems, underscoring the need for accurate and timely detection approaches to support disaster risk reduction. This study proposes a modular, multi-model framework that fuses Sentinel-2 optical imagery with Sentinel-1 Synthetic Aperture Radar (SAR) data, for robust landslide detection. The methodology leverages multi-encoder vision transformers, where each data modality is processed through separate lightweight pretrained encoders, achieving strong performance in landslide detection. In addition, the integration of multiple models, particularly the combination of neural networks and gradient boosting models (LightGBM and XGBoost), demonstrates the power of ensemble learning to further enhance accuracy and robustness. Derived spectral indices, such as NDVI, are integrated alongside original bands to enhance sensitivity to vegetation and surface changes. The proposed methodology achieves a state-of-the-art F1 score of 0.919 on landslide detection, addressing a patch-based classification task rather than pixel-level segmentation and operating without pre-event Sentinel-2 data, highlighting its effectiveness in a non-classical change detection setting. It also demonstrated top performance in a machine learning competition, achieving a strong balance between precision and recall and highlighting the advantages of explicitly leveraging the complementary strengths of optical and radar data. The conducted experiments and research also emphasize scalability and operational applicability, enabling flexible configurations with optical-only, SAR-only, or combined inputs, and offering a transferable framework for broader natural hazard monitoring and environmental change applications. Full training and inference code can be found in https://github.com/IoannisNasios/sentinel-landslide-cls.

关键词: landslide detection, multi-modal fusion, vision transformers, ensemble learning, Sentinel-1 SAR, Sentinel-2 optical, remote sensing, natural hazard monitoring

191. ❌ Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition

作者: Tianyi Liu, Yiming Li, Wenqian Wang, Jiaojiao Wang, Chen Cai, Yi Wang, Kim-Hui Yap 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05947v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出了一种名为Mixture-of-Modality-Experts (MoME)的框架，直接与关键词’Mixture of Experts OR MoE OR Sparse Models’高度相关，因此给予10分。论文还提到其方法提供了更好的可解释性，这与’Mechanistic Interpretability OR Explainable AI’有一定关联，但并非核心，给予5分。其他关键词主要涉及大语言模型、训练技术、推理方法等，而本文专注于多模态视觉分析（特别是驾驶员动作识别），使用计算机视觉和深度学习技术，并未涉及大语言模型或相关技术，因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

本文提出了一种名为Mixture-of-Modality-Experts (MoME)与Holistic Token Learning (HTL)的框架，用于解决多模态视觉分析中自适应融合和细粒度动作识别的问题，在驾驶员动作识别任务上超越了现有基线方法。

摘要翻译

当异构模态为决策提供互补但输入依赖的证据时，稳健的多模态视觉分析仍具挑战性。现有的多模态学习方法主要依赖固定的融合模块或预定义的跨模态交互，往往难以适应动态变化的模态可靠性，也难以捕捉细粒度的行为线索。为解决这一问题，我们提出了一种混合模态专家（Mixture-of-Modality-Experts, MoME）框架，并引入整体令牌学习（Holistic Token Learning, HTL）策略。MoME 实现了模态特定专家之间的自适应协作，而 HTL 则通过类别令牌和时空令牌同时提升专家内部细化和专家间知识迁移。由此，我们的方法构建了一个以知识为中心的多模态学习框架，在提升专家专业化的同时降低了多模态融合的歧义性。我们以驾驶员行为识别作为代表性多模态理解任务，对所提框架进行了验证。在公开基准测试上的实验结果表明，所提出的 MoME 框架与 HTL 策略共同超越了代表性的单模态及多模态基线方法。进一步的消融实验、验证及可视化结果进一步证实，HTL 策略能够提升对细微多模态信息的理解能力，并提供更好的可解释性。

摘要 (Abstract)

Robust multimodal visual analytics remains challenging when heterogeneous modalities provide complementary but input-dependent evidence for decision-making.Existing multimodal learning methods mainly rely on fixed fusion modules or predefined cross-modal interactions, which are often insufficient to adapt to changing modality reliability and to capture fine-grained action cues. To address this issue, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, our method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal fusion.We validate the proposed framework on driver action recognition as a representative multimodal understanding taskThe experimental results on the public benchmark show that the proposed MoME framework and the HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results further verify that the proposed HTL strategy improves subtle multimodal understanding and offers better interpretability.

关键词: Mixture-of-Modality-Experts, Holistic Token Learning, multimodal visual analytics, driver action recognition, adaptive collaboration, fine-grained action cues, knowledge-centric learning, interpretability

192. ❌ Leveraging Image Editing Foundation Models for Data-Efficient CT Metal Artifact Reduction

作者: Ahmet Rasim Emirdagi, Süleyman Aslan, Mısra Yavuz, Görkay Aydemir, Yunus Bilge Kurt, Nasrin Rahimi, Burak Can Biner, M. Akın Yılmaz 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05934v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心是使用基础模型（Foundation Models）进行医学图像重建，通过LoRA进行参数高效微调，实现领域适应（Domain Adaptation）以减少幻觉（Hallucination Mitigation），并利用上下文学习（In-context Learning）策略。这属于AI for Science在医学影像领域的应用。其他关键词如MoE、SFT、RLHF等与论文内容无关。

!!! tip deepseek-chat TL;DR

该研究提出了一种基于视觉-语言扩散基础模型的数据高效CT金属伪影减少方法，通过LoRA微调和多参考条件策略，仅需少量配对数据即可实现最先进的性能，同时有效缓解幻觉问题。

摘要翻译

高衰减植入物产生的金属伪影严重降低了CT图像质量，掩盖关键解剖结构，并对需要大量配对训练数据的标准深度学习方法构成挑战。我们提出一种范式转变：通过参数高效的低秩自适应（LoRA）技术适配通用视觉语言扩散基础模型，将伪影消除重新定义为上下文推理任务。通过利用丰富的视觉先验知识，我们的方法仅需16至128个配对训练样本即可实现有效的伪影抑制，将数据需求降低两个数量级。关键的是，我们证明领域自适应对于缓解幻觉效应至关重要；若无此步骤，基础模型会将条状伪影误判为自然物体（如华夫饼或培养皿）。为实现可靠重建，我们提出多参考条件策略：在提供伪影输入图像的同时，引入来自无关受试者的干净解剖范例，使模型能够利用特定类别上下文推断未受损的解剖结构。在AAPM CT-MAR基准测试中的广泛评估表明，我们的方法在感知质量与放射学特征指标上均达到最先进性能。这项工作证实，经过适当适配的基础模型可为可解释、数据高效的医学图像重建提供可扩展的解决方案。代码发布于https://github.com/ahmetemirdagi/CT-EditMAR。

摘要 (Abstract)

Metal artifacts from high-attenuation implants severely degrade CT image quality, obscuring critical anatomical structures and posing a challenge for standard deep learning methods that require extensive paired training data. We propose a paradigm shift: reframing artifact reduction as an in-context reasoning task by adapting a general-purpose vision-language diffusion foundation model via parameter-efficient Low-Rank Adaptation (LoRA). By leveraging rich visual priors, our approach achieves effective artifact suppression with only 16 to 128 paired training examples reducing data requirements by two orders of magnitude. Crucially, we demonstrate that domain adaptation is essential for hallucination mitigation; without it, foundation models interpret streak artifacts as erroneous natural objects (e.g., waffles or petri dishes). To ground the restoration, we propose a multi-reference conditioning strategy where clean anatomical exemplars from unrelated subjects are provided alongside the corrupted input, enabling the model to exploit category-specific context to infer uncorrupted anatomy. Extensive evaluation on the AAPM CT-MAR benchmark demonstrates that our method achieves state-of-the-art performance on perceptual and radiological-feature metrics . This work establishes that foundation models, when appropriately adapted, offer a scalable alternative for interpretable, data-efficient medical image reconstruction. Code is available at https://github.com/ahmetemirdagi/CT-EditMAR.

关键词: CT metal artifact reduction, foundation models, LoRA, domain adaptation, hallucination mitigation, in-context reasoning, data-efficient, medical image reconstruction

193. ❌ SonoSelect: Efficient Ultrasound Perception via Active Probe Exploration

作者: Yixin Zhang, Yunzhong Hou, Longqi Li, Zhenyue Qin, Yang Liu, Yue Yao 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05933v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文《SonoSelect: Efficient Ultrasound Perception via Active Probe Exploration》专注于医学超声成像领域，提出了一种基于主动探索的超声探头路径规划方法，以优化多视图采集效率。论文的核心技术是计算机视觉、3D空间记忆建模和序列决策，属于AI在生物医学成像（超声）中的具体应用。因此，它与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、对齐、代理系统等）完全无关，因为这些关键词特指自然语言处理或通用大模型技术。唯一相关的关键词是“AI for Science OR Bioinformatics OR Cheminformatics”，因为论文将AI应用于医学成像（超声），属于“AI for Science”在生物医学领域的应用，但并非核心创新点（核心是主动感知而非AI模型本身），故给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该研究解决了超声成像中需要多视图扫描导致效率低下的问题，提出了一种主动探头探索方法SonoSelect，通过序列决策和3D空间记忆引导探头移动，实验表明该方法能以更少的视图实现较高的器官分类准确率和目标覆盖。

摘要翻译

超声感知通常需要通过探头移动获取多个扫描切面，以降低诊断不确定性、减少声学遮挡并提升解剖结构覆盖度。然而，并非所有探头切面都具有同等信息价值。无差别地采集大量切面可能引入显著冗余，增加扫描与处理成本。为此，我们定义了超声主动切面探索任务，并提出一种超声专用方法SonoSelect，该方法能够基于当前观测自适应引导探头移动。具体而言，我们将超声主动切面探索建模为序列决策问题：每个新的二维超声切面被融合至已观测解剖结构的三维空间记忆中，该记忆将引导下一个探头定位。在此框架基础上，我们提出一种超声专用优化目标，其倾向于选择能实现更大器官覆盖度、更低重建不确定性及更少冗余扫描的探头移动策略。在超声模拟器上的实验表明，SonoSelect仅使用N个切面中的2个即可实现显著的多切面器官分类准确率。此外，在更具挑战性的肾囊肿检测任务中，该方法达到54.56%的肾脏覆盖度与35.13%的囊肿覆盖度，且其扫描轨迹始终以目标囊肿为中心并保持较短路径。

摘要 (Abstract)

Ultrasound perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity, mitigate acoustic occlusions, and improve anatomical coverage. However, not all probe views are equally informative. Exhaustively acquiring a large number of views can introduce substantial redundancy, increase scanning and processing costs. To address this, we define an active view exploration task for ultrasound and propose SonoSelect, an ultrasound-specific method that adaptively guides probe movement based on current observations. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. Each new 2D ultrasound view is fused into a 3D spatial memory of the observed anatomy, which guides the next probe position. On top of this formulation, we propose an ultrasound-specific objective that favors probe movements with greater organ coverage, lower reconstruction uncertainty, and less redundant scanning. Experiments on the ultrasound simulator show that SonoSelect achieves promising multi-view organ classification accuracy using only 2 out of N views. Furthermore, for a more difficult kidney cyst detection task, it reaches 54.56% kidney coverage and 35.13% cyst coverage, with short trajectories consistently centered on the target cyst.

关键词: Ultrasound Perception, Active Probe Exploration, Sequential Decision-making, 3D Spatial Memory, Multi-view Organ Classification, Kidney Cyst Detection, Efficient Scanning, View Selection

194. ❌ Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction

作者: Yangyi Xiao, Siting Zhu, Baoquan Yang, Tianchen Deng, Yongbo Chen, Hesheng Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05908v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉和3D重建领域，提出了一种基于高斯泼溅（Gaussian Splatting）的多遍历场景重建方法，通过外观分解解决光照变化问题。论文内容完全不涉及大语言模型（LLMs）、深度学习技术原理创新、或任何评分关键词中列出的技术（如MoE、SFT、RLHF、RAG、量化等）。所有关键词均与大模型、深度学习技术原理或AI for Science应用无关，因此相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了ADM-GS框架，通过外观分解将静态背景分离为遍历不变的材料属性和遍历相关的光照，解决了多遍历场景重建中的外观不一致问题，在Argoverse 2和Waymo Open数据集上实现了PSNR提升0.98 dB。

摘要翻译

多轨迹场景重建对于高保真自动驾驶仿真与数字孪生构建至关重要。该任务涉及整合同一地理区域在不同时间采集的多个序列数据。在此背景下，一个核心挑战在于：尽管底层几何结构一致，但由不同光照与环境条件引起的各轨迹间显著的外观不一致性。本文提出ADM-GS（面向多轨迹重建的外观分解高斯溅射框架），该框架通过对静态背景进行显式外观分解，以缓解跨轨迹的外观纠缠问题。对于静态背景，我们将外观分解为轨迹不变材质（表征内在材质属性）与轨迹相关光照（捕捉光照变化）。具体而言，我们提出一种采用频域分离混合编码策略的神经光场。通过引入表面法线与显式反射向量，该设计能够分别捕捉低频漫反射光照与高频镜面反射。在Argoverse 2和Waymo Open数据集上的定量评估验证了ADM-GS的有效性。在多轨迹实验中，本方法相较于现有基于隐式表示的基线模型实现了+0.98 dB的PSNR提升，同时在跨轨迹间生成更一致的外观表现。代码将在https://github.com/IRMVLab/ADM-GS公开。

摘要 (Abstract)

Multi-traversal scene reconstruction is important for high-fidelity autonomous driving simulation and digital twin construction. This task involves integrating multiple sequences captured from the same geographical area at different times. In this context, a primary challenge is the significant appearance inconsistency across traversals caused by varying illumination and environmental conditions, despite the shared underlying geometry. This paper presents ADM-GS (Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction), a framework that applies an explicit appearance decomposition to the static background to alleviate appearance entanglement across traversals. For the static background, we decompose the appearance into traversal-invariant material, representing intrinsic material properties, and traversal-dependent illumination, capturing lighting variations. Specifically, we propose a neural light field that utilizes a frequency-separated hybrid encoding strategy. By incorporating surface normals and explicit reflection vectors, this design separately captures low-frequency diffuse illumination and high-frequency specular reflections. Quantitative evaluations on the Argoverse 2 and Waymo Open datasets demonstrate the effectiveness of ADM-GS. In multi-traversal experiments, our method achieves a +0.98 dB PSNR improvement over existing latent-based baselines while producing more consistent appearance across traversals. Code will be available at https://github.com/IRMVLab/ADM-GS.

关键词: Multi-traversal reconstruction, Gaussian Splatting, Appearance decomposition, Neural light field, Autonomous driving simulation, Digital twin, Illumination variation, 3D scene reconstruction

195. ❌ AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

作者: Dong She, Xianrong Yao, Liqun Chen, Jinghe Yu, Yang Gao, Zhanpeng Jin 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05900v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究视觉语言模型（VLMs）在情感图像内容分析中的能力，属于大模型在特定领域的应用研究。与’Large Language Models OR LLMs OR Foundation Models’有一定关联（5分），因为VLMs是大模型的一种扩展形式。与’Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’和’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’高度相关（8分），因为论文提出了Grounded Affective Tree (GAT) Prompting框架，结合了视觉脚手架和分层推理，涉及多步推理和深度推理过程。其他关键词与论文内容无直接关联（0分）。

!!! tip deepseek-chat TL;DR

该论文针对视觉语言模型在情感图像内容分析中的局限性，提出了AICA-Bench基准和Grounded Affective Tree Prompting框架，有效减少了强度误差并提高了描述深度。

摘要翻译

视觉语言模型（VLMs）在感知任务中展现出强大能力，但将感知、推理与生成整合为统一框架的整体性情感图像内容分析（Affective Image Content Analysis, AICA）仍研究不足。为填补这一空白，我们提出了AICA-Bench——一个包含三个核心任务的综合评测基准：情感理解（Emotion Understanding, EU）、情感推理（Emotion Reasoning, ER）与情感引导内容生成（Emotion-Guided Content Generation, EGCG）。通过对23个视觉语言模型的评估，我们发现了两个主要局限：情感强度校准能力薄弱与开放式描述流于表面。针对这些问题，我们提出了基于视觉支架与层次化推理的无训练框架——接地情感树（Grounded Affective Tree, GAT）提示法。实验表明，GAT能有效降低情感强度误差并提升描述深度，为未来情感多模态理解与生成研究提供了坚实的基准框架。

摘要 (Abstract)

Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.

关键词: Vision-Language Models, Affective Image Content Analysis, AICA-Bench, Emotion Understanding, Emotion Reasoning, Emotion-Guided Content Generation, Grounded Affective Tree Prompting, Multimodal Understanding

196. ❌ Physics-Aware Video Instance Removal Benchmark

作者: Zirui Li, Xinghao Chen, Lingyu Jiang, Dengzhe Hou, Fangzhou Lin, Kazunori Yamada, Xiangbo Gao, Zhengzhong Tu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05898v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于视频实例移除（Video Instance Removal）的计算机视觉任务，特别是物理感知的基准测试。论文内容涉及视频编辑、物理一致性、基准评估和现有方法的比较，但完全不涉及大语言模型（LLM）、深度学习技术原理、模型训练/优化方法（如MoE、SFT、RLHF、PEFT等）、推理技术（如CoT、RAG）、代理系统、模型压缩或AI for Science等关键词。所有关键词均与论文主题无关，因此相关度评分均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一个物理感知的视频实例移除基准（PVIR），用于评估在移除目标对象时保持背景完整性和物理一致性的方法，并发现现有方法在处理复杂物理交互时仍面临挑战。

摘要翻译

视频实例移除（Video Instance Removal, VIR）任务要求在移除目标物体的同时，保持背景的完整性及物理一致性，例如镜面反射和光照交互。尽管文本引导编辑技术已取得进展，但现有基准测试主要评估视觉合理性，往往忽略了由物体移除引发的物理因果关系，例如残留的阴影。我们提出了物理感知视频实例移除（Physics-Aware Video Instance Removal, PVIR）基准，包含95个高质量视频，并标注了实例级精确的掩码和移除提示。PVIR被划分为简单和困难两个子集，后者明确针对复杂的物理交互场景。我们评估了四种代表性方法——PISCO-Removal、UniVideo、DiffuEraser和CoCoCo，采用解耦式人工评估协议，从三个维度分别分析语义、视觉和空间层面的失败情况：指令遵循度、渲染质量和编辑专一性。结果显示，PISCO-Removal和UniVideo取得了最先进的性能，而DiffuEraser常引入模糊伪影，CoCoCo在指令遵循方面存在显著困难。所有方法在困难子集上性能的持续下降，凸显了恢复复杂物理副作用仍是当前面临的持续挑战。

摘要 (Abstract)

Video Instance Removal (VIR) requires removing target objects while maintaining background integrity and physical consistency, such as specular reflections and illumination interactions. Despite advancements in text-guided editing, current benchmarks primarily assess visual plausibility, often overlooking the physical causalities, such as lingering shadows, triggered by object removal. We introduce the Physics-Aware Video Instance Removal (PVIR) benchmark, featuring 95 high-quality videos annotated with instance-accurate masks and removal prompts. PVIR is partitioned into Simple and Hard subsets, the latter explicitly targeting complex physical interactions. We evaluate four representative methods, PISCO-Removal, UniVideo, DiffuEraser, and CoCoCo, using a decoupled human evaluation protocol across three dimensions to isolate semantic, visual, and spatial failures: instruction following, rendering quality, and edit exclusivity. Our results show that PISCO-Removal and UniVideo achieve state-of-the-art performance, while DiffuEraser frequently introduces blurring artifacts and CoCoCo struggles significantly with instruction following. The persistent performance drop on the Hard subset highlights the ongoing challenge of recovering complex physical side effects.

关键词: Video Instance Removal, Physics-Aware, Benchmark, Physical Consistency, Object Removal, Video Editing, Evaluation Protocol, Human Evaluation

197. ❌ Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

作者: Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou, Yaodong Yang, Aishan Liu, Xianglong Liu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05853v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究文本到图像（T2I）模型的安全漏洞（inscriptive jailbreak攻击），与大多数关键词无关。仅与’Instruction Tuning OR Alignment OR Value Alignment’有一定关联（8分），因为论文涉及T2I模型的安全对齐（safety alignments）和防御机制，但并非核心研究LLM的对齐技术。其他关键词均未在论文中涉及，评分为0。

!!! tip deepseek-chat TL;DR

论文提出了一种针对文本到图像模型的inscriptive jailbreak攻击方法Etch，通过分解对抗提示为语义伪装、视觉空间锚定和排版编码三层，在7个模型上平均攻击成功率达65.57%，揭示了当前T2I安全对齐的盲点。

摘要翻译

现代文生图（T2I）模型现已能够生成清晰、段落长度的文本，这催生了一种全新的滥用类别。我们识别并形式化了一种“铭文式越狱”攻击，即攻击者诱使T2I系统生成在视觉无害场景中嵌入有害文本载荷（例如欺诈性文件）的图像。与旨在引发视觉上违规图像的传统“描绘式越狱”不同，铭文式攻击利用了模型自身的文本渲染能力。由于现有的越狱技术专为粗略的视觉操控设计，它们在绕过多阶段安全过滤器的同时，难以保持字符级别的保真度。为揭示此漏洞，我们提出了Etch——一个黑盒攻击框架，它将对抗性提示分解为三个功能正交的层次：语义伪装、视觉空间锚定和字体排印编码。这种分解将整个提示空间的联合优化问题简化为可处理的子问题，并通过零阶循环迭代优化。在此过程中，一个视觉语言模型对每张生成的图像进行评判，将失败定位到特定层次，并给出针对性修正建议。在2个基准测试中对7个模型进行的广泛评估表明，Etch实现了平均65.57%（峰值达91.00%）的攻击成功率，显著优于现有基线方法。我们的研究结果揭示了当前T2I安全对齐机制中存在一个关键盲区，并凸显了开发具备字体排印感知能力的多模态防御机制的迫切需求。

摘要 (Abstract)

Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on the 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware defense multimodal mechanisms.

关键词: text-to-image models, jailbreak attack, inscriptive attack, safety alignment, adversarial prompt, typographic encoding, multimodal defense

198. ❌ Learn to Rank: Visual Attribution by Learning Importance Ranking

作者: David Schinagl, Christian Fruhwirth-Reisinger, Alexander Prutsch, Samuel Schulter, Horst Possegger 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05819v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉模型的可解释性方法，特别是视觉归因图生成技术。论文提出的学习方案通过可微分的排序优化直接优化删除和插入指标，与大多数大语言模型（LLM）相关关键词无直接关联。唯一相关的关键词是’Mechanistic Interpretability OR Explainable AI’，因为论文的核心是开发一种新的可解释AI方法，用于解释视觉模型的决策过程，这与可解释AI领域高度相关。其他关键词主要涉及大语言模型的技术原理、训练方法、应用场景等，与这篇计算机视觉可解释性论文没有直接联系。

!!! tip deepseek-chat TL;DR

该论文提出了一种通过可微分排序学习直接优化视觉归因指标的新方法，解决了现有可解释性方法在效率、准确性和粒度方面的三难权衡问题，在视觉Transformer模型上实现了更精确的像素级解释。

摘要翻译

理解复杂计算机视觉模型的决策对于建立信任与问责至关重要，尤其在安全关键领域。一种成熟的解释性方法是生成视觉归因图，以突显输入中与模型预测最相关的区域。然而，现有方法面临三方面的权衡：基于传播的方法效率高，但可能存在偏差且依赖于特定架构；基于扰动的方法具有因果基础，但计算成本高昂，且对于视觉变换器模型通常只能生成粗糙的、基于图像块（patch-level）的解释；基于学习的解释器速度较快，但通常优化替代目标或从启发式教师模型蒸馏知识。我们提出一种学习方案，直接优化删除与插入度量。由于这些度量依赖于不可微的排序与排名操作，我们将其构建为排列学习问题，并使用Gumbel-Sinkhorn算法以可微松弛替代硬排序。这使得通过对目标模型进行归因引导的扰动实现端到端训练。在推理阶段，我们的方法通过单次前向传播即可生成密集的像素级归因图，并可选择性地进行少量步长的梯度细化。实验结果表明，我们的方法在定量评估上取得持续改进，并能生成更清晰、与物体边界对齐的解释，尤其对于基于变换器的视觉模型效果显著。

摘要 (Abstract)

Interpreting the decisions of complex computer vision models is crucial to establish trust and accountability, especially in safety-critical domains. An established approach to interpretability is generating visual attribution maps that highlight regions of the input most relevant to the model’s prediction. However, existing methods face a three-way trade-off. Propagation-based approaches are efficient, but they can be biased and architecture-specific. Meanwhile, perturbation-based methods are causally grounded, yet they are expensive and for vision transformers often yield coarse, patch-level explanations. Learning-based explainers are fast but usually optimize surrogate objectives or distill from heuristic teachers. We propose a learning scheme that instead optimizes deletion and insertion metrics directly. Since these metrics depend on non-differentiable sorting and ranking, we frame them as permutation learning and replace the hard sorting with a differentiable relaxation using Gumbel-Sinkhorn. This enables end-to-end training through attribution-guided perturbations of the target model. During inference, our method produces dense, pixel-level attributions in a single forward pass with optional, few-step gradient refinement. Our experiments demonstrate consistent quantitative improvements and sharper, boundary-aligned explanations, particularly for transformer-based vision models.

关键词: visual attribution, interpretability, explainable AI, vision transformers, permutation learning, Gumbel-Sinkhorn, deletion and insertion metrics, model explanation

199. ❌ EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion

作者: Da Li, Dominik Engel, Deng Luo, Ivan Viola 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05794v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于计算机视觉和图形学领域的头发重建技术，提出了一种结合隐式神经网络和多视图几何融合的方法来提高效率和精度。所有评分关键词均涉及大语言模型、深度学习技术原理或AI在科学领域的应用，而本文研究的是特定计算机视觉任务（头发重建），未涉及任何大模型技术、深度学习原理创新或AI在科学领域的应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为EfficientMonoHair的快速头发重建框架，通过融合隐式神经网络和多视图几何优化，在保持高精度的同时将运行效率提升近一个数量级。

摘要翻译

发丝级毛发几何重建是虚拟人体建模与发型数字化领域的基础性问题。然而，现有方法仍难以在精度与效率之间取得良好平衡。隐式神经表征能够捕捉整体发型轮廓，但往往无法保留细粒度的发丝细节；而基于显式优化的方法虽能实现高保真重建，却需付出高昂计算代价且可扩展性较差。为解决这一问题，我们提出EfficientMonoHair——一个结合隐式神经网络与多视角几何融合的快速精确框架，用于从单目视频实现发丝级重建。本方法提出了基于融合区块的多视角优化策略，显著减少了点云方向优化的迭代次数；同时设计了一种新颖的并行生发策略，该策略放宽了体素占位约束，使得即使在不准确或含噪声的方向场中，大规模发丝追踪仍能保持稳定与鲁棒。在代表性真实发型数据集上的大量实验表明，本方法能够稳健地重建高保真发丝几何结构。在合成基准测试中，本方法达到了与前沿技术相当的重建质量，同时将运行效率提升了近一个数量级。

摘要 (Abstract)

Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, allowing large-scale strand tracing to remain stable and robust even under inaccurate or noisy orientation fields. Extensive experiments on representative real-world hairstyles demonstrate that our method can robustly reconstruct high-fidelity strand geometries with accuracy. On synthetic benchmarks, our method achieves reconstruction quality comparable to state-of-the-art methods, while improving runtime efficiency by nearly an order of magnitude.

关键词: hair reconstruction, strand-level geometry, monocular video, implicit neural representation, multi-view fusion, optimization efficiency, parallel hair-growing, virtual human modeling

200. ❌ BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents

作者: Bo Ma, Jinsong Wu, Weiqi Yan 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05793v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于LLM/VLM代理中的隐私保护，核心涉及LLM代理架构和隐私风险传播机制。与’Large Language Models’和’LLM Agents’高度相关（10分），因为论文直接研究LLM/VLM代理系统。与’Retrieval-Augmented Generation’和’Tool Use’有一定关联（5分），因为论文提到隐私风险会传播到检索查询和工具调用阶段。其他关键词如MoE、SFT、RLHF等与论文的隐私保护框架无直接关系，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了BodhiPromptShield框架，通过检测敏感信息、使用占位符和延迟恢复等方法，有效抑制了LLM/VLM代理中跨阶段的隐私传播风险，在CPPB基准测试中显著降低了传播率。

摘要翻译

在LLM/VLM智能体中，提示词隐私风险会超越单次模型调用的范畴进行传播，因为原始用户内容可能流入检索查询、记忆写入、工具调用及日志记录等环节。现有的去标识化流程虽能处理文档边界问题，却无法应对这种跨阶段传播。我们提出BodhiPromptShield——一个具备策略感知能力的框架，它能检测敏感文本片段，通过类型化占位符、语义抽象或安全符号映射进行路由传递，并将信息还原延迟至授权边界。相较于企业级信息脱敏方案，本框架增加了显式的传播感知中介机制，并将还原时机作为安全变量进行调控。在受控提示词隐私基准（CPPB）的评估中，分阶段传播控制使检索、记忆和工具阶段的泄露率从10.7%降至7.1%；在保持0.94准确率（AC）和0.92任务成功率（TSR）的同时，个人实体识别率（PER）达到9.3%，其表现优于通用去标识化方法。需注意，这些是基于CPPB的受控系统实验结果，并非正式的隐私保证或公共基准迁移性声明。项目代码库地址：https://github.com/mabo1215/BodhiPromptShield.git。

摘要 (Abstract)

In LLM/VLM agents, prompt privacy risk propagates beyond a single model call because raw user content can flow into retrieval queries, memory writes, tool calls, and logs. Existing de-identification pipelines address document boundaries but not this cross-stage propagation. We propose BodhiPromptShield, a policy-aware framework that detects sensitive spans, routes them via typed placeholders, semantic abstraction, or secure symbolic mapping, and delays restoration to authorized boundaries. Relative to enterprise redaction, this adds explicit propagation-aware mediation and restoration timing as a security variable. Under controlled evaluation on the Controlled Prompt-Privacy Benchmark (CPPB), stage-wise propagation suppresses from 10.7% to 7.1% across retrieval, memory, and tool stages; PER reaches 9.3% with 0.94 AC and 0.92 TSR, outperforming generic de-identification. These are controlled systems results on CPPB rather than formal privacy guarantees or public-benchmark transfer claims. The project repository is available at https://github.com/mabo1215/BodhiPromptShield.git.

关键词: LLM agents, privacy propagation, prompt mediation, de-identification, retrieval queries, tool calls, secure symbolic mapping, privacy benchmark

201. ❌ Sparse Gain Radio Map Reconstruction With Geometry Priors and Uncertainty-Guided Measurement Selection

作者: Zhihan Zeng, Ning Wei, Muhammad Baqer Mollah, Kaihe Wang, Phee Lep Yeoh, Fei Xu, Yue Xiu, Zhongpei Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05788v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究无线通信中的稀疏增益无线电地图重建，使用几何感知学习和不确定性估计方法，属于特定领域的信号处理与深度学习应用。所有评分关键词均针对大模型（LLM）及相关技术（如MoE、RLHF、RAG、量化等），或大模型在科学领域的应用（如AI for Science）。论文内容完全不涉及大模型技术、大模型应用或任何评分关键词中的概念，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种结合几何感知学习和不确定性估计的轻量级网络GeoUQ-GFNet，用于从稀疏测量中重建复杂城市环境中的密集增益无线电地图，并通过不确定性引导的主动测量选择提高了重建性能。

摘要翻译

无线电地图对于环境感知无线通信、网络规划和无线电资源优化至关重要。然而，当仅有有限测量数据可用时，密集无线电地图的构建仍具挑战性，尤其是在具有强遮挡、不规则几何结构和受限传感可达性的复杂城市环境中。现有方法已探索了插值、低秩制图、深度补全和信道知识地图（Channel Knowledge Map, CKM）构建，但其中许多方法未能充分利用显式几何先验，或忽视了预测不确定性对后续传感的价值。本文从几何感知和主动传感的角度研究稀疏增益无线电地图重建。我们首先构建了 UrbanRT-RM，这是一个具有多样化城市布局、多基站部署和多稀疏采样模式的可控射线追踪基准数据集。随后，我们提出了 GeoUQ-GFNet，一个轻量级网络，能够从稀疏测量数据和结构化场景先验中联合预测密集增益无线电地图和空间不确定性地图。预测的不确定性进一步用于指导有限传感预算下的主动测量选择。大量实验表明，我们提出的 GeoUQ-GFNet 方法在使用 UrbanRT-RM 生成的不同场景和发射机布置中，均实现了强劲且一致的重建性能。此外，在相同的额外测量预算下，不确定性引导的查询比非自适应采样提供了更有效的重建改进。这些结果证明了将几何感知学习、不确定性估计和基准驱动评估相结合，对于复杂城市环境中稀疏无线电地图重建的有效性。

摘要 (Abstract)

Radio maps are important for environment-aware wireless communication, network planning, and radio resource optimization. However, dense radio map construction remains challenging when only a limited number of measurements are available, especially in complex urban environments with strong blockages, irregular geometry, and restricted sensing accessibility. Existing methods have explored interpolation, low-rank cartography, deep completion, and channel knowledge map (CKM) construction, but many of these methods insufficiently exploit explicit geometric priors or overlook the value of predictive uncertainty for subsequent sensing. In this paper, we study sparse gain radio map reconstruction from a geometry-aware and active sensing perspective. We first construct \textbf{UrbanRT-RM}, a controllable ray-tracing benchmark with diverse urban layouts, multiple base-station deployments, and multiple sparse sampling modes. We then propose \textbf{GeoUQ-GFNet}, a lightweight network that jointly predicts a dense gain radio map and a spatial uncertainty map from sparse measurements and structured scene priors. The predicted uncertainty is further used to guide active measurement selection under limited sensing budgets. Extensive experiments show that our proposed GeoUQ-GFNet method achieves strong and consistent reconstruction performance across different scenes and transmitter placements generated using UrbanRT-RM. Moreover, uncertainty-guided querying provides more effective reconstruction improvement than non-adaptive sampling under the same additional measurement budget. These results demonstrate the effectiveness of combining geometry-aware learning, uncertainty estimation, and benchmark-driven evaluation for sparse radio map reconstruction in complex urban environments.

关键词: radio map reconstruction, sparse measurements, geometry-aware learning, uncertainty estimation, active sensing, urban environments, deep completion, ray-tracing benchmark

202. ❌ RHVI-FDD: A Hierarchical Decoupling Framework for Low-Light Image Enhancement

作者: Junhao Yang, Bo Yang, Hongwei Ge, Yanchun Liang, Heow Pueh Lee, Chunguo Wu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05781v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于低光图像增强的计算机视觉任务，提出了一种基于分层解耦和频域处理的深度学习框架。所有评分关键词均与大语言模型、模型训练技术、推理优化、AI代理、科学AI应用等大模型相关领域相关，而本文研究的是图像处理中的具体视觉增强问题，未涉及任何大模型技术、原理或应用，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为RHVI-FDD的分层解耦框架，通过RHVI变换和频域解耦模块，有效解决了低光图像中噪声、细节丢失和颜色失真耦合的难题，在多个数据集上实现了优于现有方法的增强效果。

摘要翻译

低光照图像常存在严重噪声、细节丢失与色彩失真问题，这对下游多媒体分析与检索任务造成阻碍。低光照图像的退化机制复杂：亮度与色度相互耦合，而在色度内部，噪声与细节深度纠缠，导致现有方法难以同时校正色彩失真、抑制噪声并保留精细细节。为应对上述挑战，我们提出一种新颖的分层解耦框架（RHVI-FDD）。在宏观层面，我们引入RHVI变换，该变换能减轻输入噪声引起的估计偏差，实现鲁棒的亮度-色度解耦。在微观层面，我们设计了一个频域解耦（FDD）模块，其包含三个分支以进行进一步特征分离。通过离散余弦变换，我们将色度特征分解为低频、中频与高频分量，这些分量分别主要表征全局色调、局部细节与噪声成分；随后通过定制化的专家网络以分治策略进行处理，并经由自适应门控模块进行内容感知融合。在多个低光照数据集上的大量实验表明，我们的方法在客观指标与主观视觉质量上均持续优于现有先进方法。

摘要 (Abstract)

Low-light images often suffer from severe noise, detail loss, and color distortion, which hinder downstream multimedia analysis and retrieval tasks. The degradation in low-light images is complex: luminance and chrominance are coupled, while within the chrominance, noise and details are deeply entangled, preventing existing methods from simultaneously correcting color distortion, suppressing noise, and preserving fine details. To tackle the above challenges, we propose a novel hierarchical decoupling framework (RHVI-FDD). At the macro level, we introduce the RHVI transform, which mitigates the estimation bias caused by input noise and enables robust luminance-chrominance decoupling. At the micro level, we design a Frequency-Domain Decoupling (FDD) module with three branches for further feature separation. Using the Discrete Cosine Transform, we decompose chrominance features into low, mid, and high-frequency bands that predominantly represent global tone, local details, and noise components, which are then processed by tailored expert networks in a divide-and-conquer manner and fused via an adaptive gating module for content-aware fusion. Extensive experiments on multiple low-light datasets demonstrate that our method consistently outperforms existing state-of-the-art approaches in both objective metrics and subjective visual quality.

关键词: Low-light image enhancement, Hierarchical decoupling, RHVI transform, Frequency-domain decoupling, Discrete Cosine Transform, Noise suppression, Color distortion correction, Detail preservation

203. ❌ Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion

作者: Yu Xue, Longjun Gao, Yuanqi Su, HaoAng Lu, Xiaoning Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05780v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于3D语义场景补全（SSC）的计算机视觉任务，提出了一种名为VoxSAMNet的框架，通过Dummy Shortcut for Feature Refinement（DSFR）模块和Foreground Modulation Strategy来解决体素稀疏性和语义不平衡问题。论文的核心是3D视觉、体素处理、注意力机制和语义分割，与所有评分关键词（均围绕大语言模型、训练技术、推理方法、对齐、代理、压缩等大模型相关主题）无直接关联。论文未涉及任何形式的大语言模型、基础模型、训练技术（如预训练、微调、对齐）、推理方法（如思维链、检索增强）、模型优化（如量化、压缩）或科学AI应用。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对单目3D语义场景补全中体素分布不平衡和前景类别稀疏的问题，提出了VoxSAMNet框架，通过稀疏感知的体素注意力和前景调制策略，在公开基准测试中实现了最先进的性能。

摘要翻译

单目语义场景补全（Monocular Semantic Scene Completion, SSC）旨在从单张RGB图像重建完整的三维语义场景，为自动驾驶和机器人技术提供了一种经济高效的解决方案。然而，体素分布固有的不平衡性——超过93%的体素为空且前景类别稀少——带来了重大挑战。现有方法往往过度关注无信息的体素，且对长尾类别的泛化能力较差。为解决这些问题，我们提出了VoxSAMNet（Voxel Sparsity-Aware Modulation Network），这是一个显式建模体素稀疏性与语义不平衡的统一框架。我们的方法引入了：（1）特征精炼虚拟捷径（Dummy Shortcut for Feature Refinement, DSFR）模块，通过共享的虚拟节点绕过空体素，同时利用可变形注意力精炼被占据的体素；（2）前景调制策略，结合前景丢弃（Foreground Dropout, FD）与文本引导图像滤波（Text-Guided Image Filter, TGIF），以缓解过拟合并增强类别相关特征。在公开基准数据集SemanticKITTI和SSCBench-KITTI-360上的大量实验表明，VoxSAMNet取得了最先进的性能，分别以18.2%和20.2%的mIoU分数超越了先前的单目及立体基线方法。我们的结果凸显了稀疏性感知与语义引导设计对于高效、准确的三维场景补全的重要性，为未来研究提供了有前景的方向。

摘要 (Abstract)

Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions, where over 93% of voxels are empty and foreground classes are rare, poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; and (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.

关键词: 3D Semantic Scene Completion, Voxel Sparsity, Monocular SSC, Deformable Attention, Foreground Modulation, Semantic Imbalance, Autonomous Driving, VoxSAMNet

204. ❌ PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization

作者: Shicai Wei, Chunbo Luo, Qiang Zhu, Yang Luo 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05773v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究多模态学习中的优化问题，提出了一种基于性能主导模态优先（PDMP）的策略。论文内容聚焦于多模态模型训练中的梯度调制方法，与所有评分关键词（均涉及大模型技术原理、训练方法、推理优化、应用领域等）无直接关联。论文未涉及语言模型、专家混合、缩放定律、预训练/后训练、对齐、高效微调、检索增强、上下文扩展、注意力优化、推理方法、智能体、量化、解码加速、幻觉缓解、可解释性、世界模型、模型合并、上下文学习或科学AI等主题。

!!! tip deepseek-chat TL;DR

该论文针对多模态学习中存在的优化不足问题，提出了一种性能主导模态优先（PDMP）策略，通过梯度调制使性能优势模态主导优化，从而提升多模态模型性能。

摘要翻译

多模态学习因其实际应用价值日益受到关注。然而，该领域常面临优化不足的问题，即多模态模型的表现甚至不及单一模态模型。现有方法将此问题归因于模态间学习不平衡，并通过梯度调制加以解决。本文认为，平衡学习并非多模态学习的最优设定；相反，由具有更优单模态性能的“性能主导模态”驱动的非平衡学习，反而有助于提升多模态性能。而优化不足问题正是由性能主导模态学习不充分所导致。为此，我们提出“性能主导模态优先”策略以辅助多模态学习。具体而言，该策略首先通过独立训练的单模态模型性能排序挖掘性能主导模态，随后引入非对称系数调制各模态梯度，使性能主导模态主导优化过程。由于该策略仅依赖于单模态性能排序，其独立于多模态模型的结构与融合方法，在实际场景中具有较大应用潜力。最终，在多组数据集上的大量实验验证了该策略的优越性。

摘要 (Abstract)

Multimodal learning has attracted increasing attention due to its practicality. However, it often suffers from insufficient optimization, where the multimodal model underperforms even compared to its unimodal counterparts. Existing methods attribute this problem to the imbalanced learning between modalities and solve it by gradient modulation. This paper argues that balanced learning is not the optimal setting for multimodal learning. On the contrary, imbalanced learning driven by the performance-dominant modality that has superior unimodal performance can contribute to better multimodal performance. And the under-optimization problem is caused by insufficient learning of the performance-dominant modality. To this end, we propose the Performance-Dominant Modality Prioritization (PDMP) strategy to assist multimodal learning. Specifically, PDMP firstly mines the performance-dominant modality via the performance ranking of the independently trained unimodal model. Then PDMP introduces asymmetric coefficients to modulate the gradients of each modality, enabling the performance-dominant modality to dominate the optimization. Since PDMP only relies on the unimodal performance ranking, it is independent of the structures and fusion methods of the multimodal model and has great potential for practical scenarios. Finally, extensive experiments on various datasets validate the superiority of PDMP.

关键词: Multimodal Learning, Performance-Dominant Modality, Gradient Modulation, Imbalanced Learning, Under-optimization, Unimodal Performance, Asymmetric Coefficients, Fusion Methods

205. ❌ Improving Controllable Generation: Faster Training and Better Performance via $x_0$-Supervision

作者: Amadou S. Sangare, Adrien Maglo, Mohamed Chaouch, Bertrand Luvison 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05761v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于文本到图像扩散模型的训练优化，特别是通过x0监督加速收敛并提升性能，其核心内容涉及扩散模型、可控生成和训练目标，但未涉及任何大语言模型、深度学习技术原理创新或科学领域应用，与所有评分关键词均无直接关联。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为x0监督的新训练目标，通过直接监督干净目标图像或等效重加权扩散损失，显著加速了可控文本到图像扩散模型的收敛速度（最高达2倍），并同时提升了视觉质量和条件准确性。

摘要翻译

文本到图像（Text-to-Image, T2I）扩散/流模型近期在视觉保真度与文本对齐方面取得了显著进展。然而，当用户需要精确控制图像布局时，这些模型仍存在局限，而仅凭自然语言往往无法可靠地表达此类布局需求。可控生成方法通过引入更易于描述场景的额外条件，对初始T2I模型进行了增强。先前的研究通常直接使用与初始模型相同的损失函数来训练增强后的网络。尽管初看之下这种做法很自然，但在某些情况下，它可能导致收敛所需的训练时间非常长。在本研究中，我们通过对可控扩散模型去噪动态的详细分析，重新审视了其训练目标。我们发现，对清晰目标图像（称为$x_0$-监督）进行直接监督，或对扩散损失进行等效的重新加权，能够实现更快的收敛。在多种控制设置下的实验表明，根据我们提出的新指标（平均收敛曲线下面积 - mAUCC），我们的方法将收敛速度提升了最高达2$\times$，同时改善了视觉质量与条件准确性。我们的代码公开于https://github.com/CEA-LIST/x0-supervision。

摘要 (Abstract)

Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed $x_0$-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2$\times$ according to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy. Our code is available at https://github.com/CEA-LIST/x0-supervision

关键词: Controllable Generation, Diffusion Models, Training Objective, x0-supervision, Convergence Acceleration, Text-to-Image, Denoising Dynamics, Conditioning Accuracy

206. ❌ SVC 2026: the Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge

作者: Dongliang Zhu, Zhiyi Niu, Bo Zhao, Jiajian Huang, Shuo Ye, Xun Lin, Hui Ma, Taorui Wang, Jiayu Zhang, Chunmei Zhu, Junzhe Cao, Yingjie Ma, Rencheng Song, Albert Clapés, Sergio Escalera, Dan Guo, Zitong Yu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05748v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要介绍了一个计算机视觉领域的挑战赛（SVC 2026），专注于微妙视觉信号的分析，包括跨模态欺骗检测和远程光电容积描记（rPPG）估计。论文内容完全围绕计算机视觉、多模态学习和挑战赛组织展开，未涉及任何大语言模型（LLM）、深度学习技术原理创新或AI在科学领域的应用。所有评分关键词均与大模型技术、训练方法、推理优化、对齐、代理系统等主题相关，与本文的计算机视觉挑战赛主题无直接关联，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文介绍了SVC 2026挑战赛，旨在通过组织跨域多模态欺骗检测和远程生理测量任务，促进对微妙视觉信号的鲁棒表示学习研究，并发布了基线模型和22支参赛团队的结果。

摘要翻译

细微视觉信号虽难以通过肉眼直接感知，却蕴含着能够揭示视觉数据中隐藏模式的重要信息。这类信号在生物特征安全、多媒体取证、医疗诊断、工业检测与情感计算等诸多应用中发挥着关键作用。随着计算机视觉与表征学习技术的快速发展，检测并解读此类细微信号已成为一个新兴的研究方向。然而，现有研究多聚焦于特定任务或模态，模型在处理真实场景中微弱、细微的信号时，仍在鲁棒性、表征能力与泛化性方面面临挑战。为推进该领域研究，我们组织了“细微视觉挑战赛”，旨在学习针对细微视觉信号的鲁棒表征。本次挑战赛包含两项任务：跨模态多模态欺骗检测与远程光电容积描记术（remote photoplethysmography, rPPG）估计。我们希望此次挑战能促进开发更具鲁棒性和泛化能力的细微视觉理解模型，并进一步推动计算机视觉与多模态学习的研究。共有22支团队向本次研讨会竞赛提交了最终结果，相应的基线模型已在\href{https://sites.google.com/view/svc-cvpr26}{MMDD2026平台}\footnote{https://sites.google.com/view/svc-cvpr26}上发布。

摘要 (Abstract)

Subtle visual signals, although difficult to perceive with the naked eye, contain important information that can reveal hidden patterns in visual data. These signals play a key role in many applications, including biometric security, multimedia forensics, medical diagnosis, industrial inspection, and affective computing. With the rapid development of computer vision and representation learning techniques, detecting and interpreting such subtle signals has become an emerging research direction. However, existing studies often focus on specific tasks or modalities, and models still face challenges in robustness, representation ability, and generalization when handling subtle and weak signals in real-world environments. To promote research in this area, we organize the Subtle visual Challenge, which aims to learn robust representations for subtle visual signals. The challenge includes two tasks: cross-domain multimodal deception detection and remote photoplethysmography (rPPG) estimation. We hope that this challenge will encourage the development of more robust and generalizable models for subtle visual understanding, and further advance research in computer vision and multimodal learning. A total of 22 teams submitted their final results to this workshop competition, and the corresponding baseline models have been released on the \href{https://sites.google.com/view/svc-cvpr26}{MMDD2026 platform}\footnote{https://sites.google.com/view/svc-cvpr26}

关键词: subtle visual signals, multimodal deception detection, remote photoplethysmography, cross-domain generalization, representation learning, computer vision, robust models, challenge competition

207. ❌ ASSR-Net: Anisotropic Structure-Aware and Spectrally Recalibrated Network for Hyperspectral Image Fusion

作者: Qiya Song, Hongzhi Zhou, Lishan Tan, Renwei Dian, Shutao Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05742v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于高光谱图像融合的计算机视觉任务，提出了一种名为ASSR-Net的深度学习网络，用于解决空间结构重建和光谱失真的问题。论文内容与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、对齐、代理系统等）完全无关，因为这些关键词均针对自然语言处理或通用大模型领域。唯一略有相关的是’AI for Science OR Bioinformatics OR Cheminformatics’，因为高光谱图像分析可视为科学计算或遥感领域的一个应用，但论文本身并未明确强调’AI for Science’的广义范畴，也未涉及生物信息学或化学信息学，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为ASSR-Net的深度学习网络，通过各向异性结构感知和光谱重新校准来解决高光谱图像融合中的空间细节模糊和光谱失真问题，并在多个基准数据集上实现了优于现有方法的性能。

摘要翻译

高光谱图像融合旨在通过整合多源输入的互补信息，重建高空间分辨率高光谱图像（HR-HSI）。尽管近期取得进展，现有方法仍面临两个关键挑战：（1）各向异性空间结构重建不足，导致细节模糊和空间质量受损；（2）融合过程中的光谱失真，阻碍了细粒度光谱表征。为解决这些问题，我们提出ASSR-Net：一种面向高光谱图像融合的各向异性结构感知与光谱重校准网络。ASSR-Net采用包含各向异性结构感知空间增强（ASSE）和分层先验引导光谱校准（HPSC）的两阶段融合策略。在第一阶段，方向感知融合模块沿多个方向自适应捕获结构特征，有效重建各向异性空间模式。在第二阶段，光谱重校准模块以原始低分辨率HSI作为光谱先验，显式校正融合结果中的光谱偏差，从而提升光谱保真度。在多个基准数据集上的大量实验表明，ASSR-Net始终优于现有先进方法，实现了更优的空间细节保持与光谱一致性。

摘要 (Abstract)

Hyperspectral image fusion aims to reconstruct high-spatial-resolution hyperspectral images (HR-HSI) by integrating complementary information from multi-source inputs. Despite recent progress, existing methods still face two critical challenges: (1) inadequate reconstruction of anisotropic spatial structures, resulting in blurred details and compromised spatial quality; and (2) spectral distortion during fusion, which hinders fine-grained spectral representation. To address these issues, we propose \textbf{ASSR-Net}: an Anisotropic Structure-Aware and Spectrally Recalibrated Network for Hyperspectral Image Fusion. ASSR-Net adopts a two-stage fusion strategy comprising anisotropic structure-aware spatial enhancement (ASSE) and hierarchical prior-guided spectral calibration (HPSC). In the first stage, a directional perception fusion module adaptively captures structural features along multiple orientations, effectively reconstructing anisotropic spatial patterns. In the second stage, a spectral recalibration module leverages the original low-resolution HSI as a spectral prior to explicitly correct spectral deviations in the fused results, thereby enhancing spectral fidelity. Extensive experiments on various benchmark datasets demonstrate that ASSR-Net consistently outperforms state-of-the-art methods, achieving superior spatial detail preservation and spectral consistency.

关键词: Hyperspectral image fusion, Anisotropic structure-aware, Spectral recalibration, Deep learning network, Spatial enhancement, Spectral fidelity, Image reconstruction, Computer vision

208. ❌ Single-Stage Signal Attenuation Diffusion Model for Low-Light Image Enhancement and Denoising

作者: Ying Liu, Junchao Zhang, Caiyun Wu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05727v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的低光图像增强和去噪，提出了一种基于扩散模型的信号衰减扩散模型（SADM）。虽然论文涉及深度学习技术（扩散模型），但所有评分关键词均针对大语言模型（LLM）及其相关技术（如MoE、SFT、RLHF、RAG、Agent等）、大模型优化技术（如量化、推理加速）或特定科学领域AI应用（如生物信息学）。论文内容完全不涉及语言模型、文本生成、对齐、微调、代理系统或任何评分关键词中提到的技术概念，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种信号衰减扩散模型（SADM），通过将信号衰减机制整合到扩散过程中，实现了单阶段的低光图像增强和去噪，消除了现有方法对额外校正模块或分阶段训练的需求。

摘要翻译

扩散模型通过前向加噪与反向去噪的概率建模，在图像复原任务中表现卓越，其处理复杂噪声同时保留细节的能力使其特别适用于低照度图像增强。主流的基于扩散模型的低照度增强方法通常采用两阶段流程或引入辅助校正网络来优化U-Net输出，这割裂了增强与去噪之间的内在联系，并因优化目标不一致导致性能受限。为解决这些问题，我们提出信号衰减扩散模型，这是一种将信号衰减机制整合到扩散流程中的新型扩散过程，能够在单阶段内同步实现亮度调整与噪声抑制。
具体而言，信号衰减系数在前向加噪过程中模拟低照度退化的固有信号衰减特性，编码了低照度退化的物理先验，从而显式引导反向去噪过程同时优化亮度恢复与噪声抑制，无需依赖现有方法中的额外校正模块或分阶段训练。我们通过多尺度金字塔采样验证了所提设计与去噪扩散隐式模型保持理论一致性，在可解释性、复原质量与计算效率之间取得了平衡。

摘要 (Abstract)

Diffusion models excel at image restoration via probabilistic modeling of forward noise addition and reverse denoising, and their ability to handle complex noise while preserving fine details makes them well-suited for Low-Light Image Enhancement (LLIE). Mainstream diffusion based LLIE methods either adopt a two-stage pipeline or an auxiliary correction network to refine U-Net outputs, which severs the intrinsic link between enhancement and denoising and leads to suboptimal performance owing to inconsistent optimization objectives. To address these issues, we propose the Signal Attenuation Diffusion Model (SADM), a novel diffusion process that integrates the signal attenuation mechanism into the diffusion pipeline, enabling simultaneous brightness adjustment and noise suppression in a single stage. Specifically, the signal attenuation coefficient simulates the inherent signal attenuation of low-light degradation in the forward noise addition process, encoding the physical priors of low-light degradation to explicitly guide reverse denoising toward the concurrent optimization of brightness recovery and noise suppression, thereby eliminating the need for extra correction modules or staged training relied on by existing methods. We validate that our design maintains consistency with Denoising Diffusion Implicit Models(DDIM) via multi-scale pyramid sampling, balancing interpretability, restoration quality, and computational efficiency.

关键词: Diffusion Model, Low-Light Image Enhancement, Denoising, Signal Attenuation, Single-Stage, Image Restoration, DDIM, Noise Suppression

209. ❌ Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

作者: Yusung Ro, Jaehyun Choi, Junmo Kim 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05724v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究稀疏自编码器（SAEs）在解释CLIP视觉编码器内部表示中的应用，提出了信息范围（information scope）作为可解释性的新维度，并开发了上下文依赖分数（CDS）来量化特征的空间稳定性。论文核心贡献在于模型可解释性方法，与’Mechanistic Interpretability OR Explainable AI’高度相关（10分），因为该研究直接推进了对神经网络内部表示的理解和解释。然而，论文专注于CLIP视觉编码器的特定可解释性方法，不涉及大语言模型（LLMs）、训练技术、推理优化、对齐方法、代理系统、模型压缩、科学AI应用等其他关键词领域，因此这些关键词均得0分。

!!! tip deepseek-chat TL;DR

该论文提出了信息范围作为解释CLIP视觉编码器中稀疏自编码器特征的新维度，并开发了上下文依赖分数来区分局部和全局范围特征，发现不同范围的特征对CLIP预测和置信度有系统性影响。

摘要翻译

稀疏自编码器已成为解释CLIP视觉编码器内部表征的有力工具，但现有分析主要集中于单个特征的语义含义。我们引入信息范围作为可解释性的补充维度，用以描述SAE特征聚合视觉证据的广度——从局部、特定图像块线索到全局、图像级信号。研究发现，部分SAE特征在空间扰动下保持稳定响应，而另一些特征则会因细微输入变化产生不可预测的偏移，这揭示了其底层信息范围的根本差异。为量化这一现象，我们提出上下文依赖度评分，该指标能够区分位置稳定的局部范围特征与位置可变的全局范围特征。实验表明，不同信息范围的特征对CLIP的预测结果及置信度产生系统性差异的影响。这些发现确立了信息范围作为理解CLIP表征的关键新维度，并为SAE衍生特征提供了更深入的诊断视角。

摘要 (Abstract)

Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpreting the internal representations of CLIP vision encoders, yet existing analyses largely focus on the semantic meaning of individual features. We introduce information scope as a complementary dimension of interpretability that characterizes how broadly an SAE feature aggregates visual evidence, ranging from localized, patch-specific cues to global, image-level signals. We observe that some SAE features respond consistently across spatial perturbations, while others shift unpredictably with minor input changes, indicating a fundamental distinction in their underlying scope. To quantify this, we propose the Contextual Dependency Score (CDS), which separates positionally stable local scope features from positionally variant global scope features. Our experiments show that features of different information scopes exert systematically different influences on CLIP’s predictions and confidence. These findings establish information scope as a critical new axis for understanding CLIP representations and provide a deeper diagnostic view of SAE-derived features.

关键词: Sparse Autoencoders, CLIP, Interpretability, Information Scope, Contextual Dependency Score, Vision Encoders, Feature Analysis, Representation Understanding

210. ❌ GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance

作者: Weiqi Zhang, Junsheng Zhou, Haotian Geng, Kanle Shi, Shenkun Xu, Yi Fang, Yu-Shen Liu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05721v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文GaussianGrow专注于3D高斯生成和点云处理，使用扩散模型进行文本引导的3D重建。所有评分关键词均针对大语言模型（LLM）及其相关技术（如训练、对齐、推理、应用等），而本文研究的是计算机视觉和3D重建领域，未涉及任何LLM技术、原理或应用，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为GaussianGrow的新方法，通过文本引导从3D点云生长3D高斯，解决了3D高斯生成中几何精度不足的问题，并在合成和真实扫描点云上实现了高质量的文本引导高斯生成。

摘要翻译

三维高斯泼溅技术在渲染效率与质量方面已展现出卓越性能，然而在缺乏适当几何先验的情况下，三维高斯的生成仍具挑战。现有方法尝试通过预测点云图作为几何参考来推断高斯基元，但不可靠的几何估计可能导致生成效果不佳。本研究提出GaussianGrow，这是一种通过从易获取的三维点云中学习生长高斯基元来生成三维高斯的新方法，该方法在生成过程中自然保证了几何准确性。具体而言，我们设计了一种文本引导的高斯生长方案，利用多视角扩散模型从输入点云合成具有一致外观的图像以提供监督。为减少多视角融合产生的伪影，我们通过在重叠区域识别非预设相机位姿生成新视角，并对其进行约束。针对难以观测的区域，我们提出通过迭代检测相机位姿来观察点云中最大未生长区域，并借助预训练的二维扩散模型对渲染视图进行修复以填补缺失部分。该过程持续进行直至完整的高斯模型生成。我们在合成乃至真实扫描点云的文本引导高斯生成任务上对GaussianGrow进行了广泛评估。项目页面：https://weiqi-zhang.github.io/GaussianGrow

摘要 (Abstract)

3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians still remains a challenge without proper geometric priors. Existing methods have explored predicting point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain novel views generated at non-preset camera poses identified in overlapping regions across different views. For completing the hard-to-observe regions, we propose to iteratively detect the camera pose by observing the largest un-grown regions in point clouds and inpainting them by inpainting the rendered view with a pretrained 2D diffusion model. The process continues until complete Gaussians are generated. We extensively evaluate GaussianGrow on text-guided Gaussian generation from synthetic and even real-scanned point clouds. Project Page: https://weiqi-zhang.github.io/GaussianGrow

关键词: 3D Gaussian Splatting, 3D point clouds, text-guided generation, multi-view diffusion model, novel view synthesis, camera pose detection, rendered view inpainting, geometry-aware generation

211. ❌ MPM: Mutual Pair Merging for Efficient Vision Transformers

作者: Simon Ravé, Pejman Rasti, David Rousseau 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05718v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	2.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是Vision Transformers（ViT）的token聚合方法MPM，用于加速语义分割任务。与绝大多数关键词无关，因为关键词主要针对大语言模型（LLM）及其相关技术（如对齐、推理、代理等）。唯一的相关点是’KV Cache Compression OR Linear Attention OR FlashAttention’，因为论文在实验中使用了FlashAttention-2进行基准测试，但这只是实验设置的一部分，并非论文的核心创新内容，因此给2分（微弱关联）。论文主题是计算机视觉中的Transformer加速，不属于大模型或深度学习在科学领域的应用创新。

!!! tip deepseek-chat TL;DR

该论文提出了MPM（Mutual Pair Merging），一种无需训练、基于互近邻配对的token聚合方法，用于加速Vision Transformers在语义分割任务中的推理，在保持mIoU下降小于3%的同时，在Raspberry Pi 5上最高降低60%延迟，在H100 GPU上最高提升20%吞吐量。

摘要翻译

减少序列长度是加速Transformer模型的常用方法，但先前的令牌缩减研究多针对分类任务，且常使用代理指标而非端到端延迟进行报告。对于语义分割任务，令牌缩减进一步受到重建密集、像素对齐特征需求的限制，并且在现代加速器上，计算合并映射的开销可能抵消预期收益。我们提出互配对合并（Mutual Pair Merging，MPM），这是一种无需训练的令牌聚合模块，其在余弦空间中形成互最近邻配对，对每对令牌进行平均，并记录合并映射，从而在解码器前实现基于收集操作的重建，使得现有分割头无需改动即可使用。MPM不引入任何可学习参数，也不包含连续的压缩调节机制（无需保留率或阈值）。其速度-精度权衡通过离散的插入调度来设定。我们在NVIDIA H100 GPU（启用与未启用FlashAttention-2）和树莓派5上，基于标准分割数据集对端到端延迟进行了基准测试。在ADE20K数据集上，MPM使ViT-Tiny模型在树莓派5上的单图像延迟降低最高达60%，并在启用FlashAttention-2的H100上提升吞吐量最高达20%，同时保持mIoU下降低于3%。这些结果表明，当明确考虑开销时，简单、具备重建意识且无需训练的令牌合并方法能够为分割任务带来实际可观测的加速收益。

摘要 (Abstract)

Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.

关键词: Vision Transformers, token merging, semantic segmentation, inference acceleration, training-free, mutual nearest-neighbor, FlashAttention, latency reduction

212. ❌ In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting

作者: Wenhui Xiao, Ethan Goan, Rodrigo Santa Cruz, David Ahmedt-Aristizabal, Olivier Salvado, Clinton Fookes, Leo Lebrat 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05715v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文专注于计算机视觉领域的3D重建和渲染技术，特别是Gaussian Splatting与单目深度估计的结合，不涉及大语言模型、深度学习技术原理创新或科学领域AI应用。唯一的相关性在于广义的’AI for Science’，因为该研究属于计算机视觉的AI应用，但并非生物信息学或化学信息学等具体科学领域，因此仅给予5分。其他所有关键词均与论文内容完全无关，得0分。

!!! tip deepseek-chat TL;DR

该论文解决了在Gaussian Splatting渲染中可靠利用单目深度先验的挑战，通过引入一个训练框架来整合尺度模糊和噪声深度先验，实现了几何精度的提升和更高质量的渲染。

摘要翻译

在3D高斯泼溅（Gaussian Splatting）中使用精确的深度先验有助于缓解因训练数据稀疏和表面纹理缺失引起的伪影。然而，获取精确深度图需要专门的采集系统。基础单目深度估计模型提供了一种经济高效的替代方案，但它们存在尺度模糊性、多视角不一致性以及局部几何不准确的问题，若直接应用会降低渲染性能。本文致力于解决如何可靠地利用单目深度先验来增强高斯泼溅（GS）渲染的挑战。为此，我们提出了一种训练框架，将尺度模糊且含噪声的深度先验整合到几何监督中。我们强调了从弱对齐的深度变化中学习的重要性。我们引入了一种方法来隔离不适定几何区域，以进行选择性单目深度正则化，从而限制深度误差向重建良好的3D结构传播。在多个数据集上的广泛实验表明，该方法在不同GS变体和单目深度骨干网络测试中，均能持续提升几何精度，从而实现更可靠的深度估计和更高的渲染质量。

摘要 (Abstract)

Using accurate depth priors in 3D Gaussian Splatting helps mitigate artifacts caused by sparse training data and textureless surfaces. However, acquiring accurate depth maps requires specialized acquisition systems. Foundation monocular depth estimation models offer a cost-effective alternative, but they suffer from scale ambiguity, multi-view inconsistency, and local geometric inaccuracies, which can degrade rendering performance when applied naively. This paper addresses the challenge of reliably leveraging monocular depth priors for Gaussian Splatting (GS) rendering enhancement. To this end, we introduce a training framework integrating scale-ambiguous and noisy depth priors into geometric supervision. We highlight the importance of learning from weakly aligned depth variations. We introduce a method to isolate ill-posed geometry for selective monocular depth regularization, restricting the propagation of depth inaccuracies into well-reconstructed 3D structures. Extensive experiments across diverse datasets show consistent improvements in geometric accuracy, leading to more faithful depth estimation and higher rendering quality across different GS variants and monocular depth backbones tested.

关键词: Gaussian Splatting, monocular depth estimation, 3D reconstruction, depth priors, rendering enhancement, geometric supervision, scale ambiguity, multi-view consistency

213. ❌ Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

作者: Chongyu Wang, Ting Huang, Chunyu Sun, Xinyu Ning, Di Wang, Hao Tang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05695v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于多模态大语言模型（MLLMs）的几何先验融合，核心涉及大模型（LLMs）在视觉空间感知中的应用。摘要明确提到’Multimodal Large Language Models (MLLMs)’，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。其他关键词如MoE、SLMs、Scaling Laws、各种训练方法（预训练、微调、对齐等）、推理技术（CoT、MCTS）、代理、压缩、幻觉缓解、可解释性、科学AI等，论文均未涉及或提及，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型在物理空间感知上的局限性，提出了一种名为GUIDE的渐进式几何先验注入框架，通过多级采样和层间对齐融合几何信息，显著提升了模型在复杂空间推理和感知任务上的性能。

摘要翻译

多模态大语言模型（MLLMs）在二维视觉任务中取得了显著进展，但在处理真实世界视觉流时仍表现出有限的物理空间感知能力。近期，通过隐式提取几何先验的前馈式几何基础模型为解决该问题提供了新途径。然而，现有的几何感知MLLMs主要受限于单一深层提取与输入级融合的范式。这种扁平化融合会导致局部几何细节丢失，并在早期层引发语义失配。为突破此瓶颈，我们提出GUIDE（几何先验在MLLM早期层的渐进式注入框架），一种渐进式几何先验注入框架。GUIDE在几何编码器内部进行多层级采样，全面捕获从局部边缘到全局拓扑的多粒度特征。随后，我们将这些多层级几何先验与大语言模型的早期层进行严格对齐并逐步融合。基于多粒度几何信息的注入，该设计引导模型渐进学习从二维到三维的过渡过程。此外，我们引入了一种上下文感知门控机制，使模型能够根据当前语义获取必要的空间线索，从而最大化空间先验的利用效率，并有效抑制冗余几何噪声。大量实验表明，GUIDE在多项复杂空间推理与感知任务上显著优于现有基线，为将三维几何先验整合至大模型建立了新范式。

摘要 (Abstract)

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in 2D visual tasks but still exhibit limited physical spatial awareness when processing real-world visual streams. Recently, feed-forward geometric foundation models, which implicitly extract geometric priors, have provided a new pathway to address this issue. However, existing geometry-aware MLLMs are predominantly constrained by the paradigm of single deep-layer extraction and input-level fusion. This flattened fusion leads to the loss of local geometric details and causes semantic mismatches in the early layers. To break this bottleneck, we propose GUIDE (Geometric Unrolling Inside MLLM Early-layers), a progressive geometric priors injection framework. GUIDE performs multi-level sampling within the geometric encoder, comprehensively capturing multi-granularity features ranging from local edges to global topologies. Subsequently, we rigorously align and fuse these multi-level geometric priors step-by-step with the early layers of the MLLM. Building upon the injection of multi-granularity geometric information, this design guides the model to progressively learn the 2D-to-3D transitional process. Furthermore, we introduce a context-aware gating that enables the model to fetch requisite spatial cues based on current semantics, thereby maximizing the utilization efficiency of spatial priors and effectively suppressing redundant geometric noise. Extensive experiments demonstrate that GUIDE significantly outperforms existing baselines on multiple complex spatial reasoning and perception tasks, establishing a novel paradigm for integrating 3D geometric priors into large models.

关键词: Multimodal Large Language Models, Geometric Priors, Spatial Awareness, Progressive Fusion, 3D Geometric Integration, Spatial Reasoning, MLLMs, Geometric Foundation Models

214. ❌ 3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models

作者: Xinye Zheng, Fei Wang, Yiqi Nie, Kun Li, Junjie Chen, Jiaqi Zhao, Yanyan Wei, Zhiliang Wu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05687v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究3D烟雾场景重建，使用Nano-Banana-Pro（可能是一种视觉增强模型）和3D高斯泼溅技术，属于计算机视觉和3D重建领域。摘要中未提及任何大语言模型（LLM）、深度学习技术原理创新或大模型在不同领域的应用。唯一相关的是’AI for Science’，因为3D场景重建可视为AI在科学计算或环境模拟中的应用，但并非核心内容，因此给5分。其他关键词均与论文内容无关，给0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种结合视觉先验和3D高斯泼溅的框架，用于从烟雾退化的多视角图像中重建3D场景，并生成清晰一致的新视角视图。

摘要翻译

从受烟雾退化的多视角图像重建三维场景尤为困难，因为烟雾会引入强烈的散射效应、视角依赖的外观变化以及严重的跨视角一致性退化。为解决这些问题，我们提出一个将视觉先验与高效三维场景建模相结合的框架。我们采用Nano-Banana-Pro来增强烟雾退化图像，为重建提供更清晰的视觉观测，并开发了Smoke-GS——一种面向烟雾场景重建与复原导向新视角合成的介质感知三维高斯泼溅框架。Smoke-GS使用显式三维高斯模型对场景进行建模，并引入一个轻量级的视角依赖介质分支，以捕捉由烟雾引起的方向依赖性外观变化。我们的方法在保持三维高斯泼溅渲染效率的同时，提升了对烟雾所致退化的鲁棒性。实验结果表明，在具有挑战性的烟雾环境中，我们的方法能有效生成一致且视觉清晰的新视角图像。

摘要 (Abstract)

Reconstructing 3D scenes from smoke-degraded multi-view images is particularly difficult because smoke introduces strong scattering effects, view-dependent appearance changes, and severe degradation of cross-view consistency. To address these issues, we propose a framework that integrates visual priors with efficient 3D scene modeling. We employ Nano-Banana-Pro to enhance smoke-degraded images and provide clearer visual observations for reconstruction and develop Smoke-GS, a medium-aware 3D Gaussian Splatting framework for smoke scene reconstruction and restoration-oriented novel view synthesis. Smoke-GS models the scene using explicit 3D Gaussians and introduces a lightweight view-dependent medium branch to capture direction-dependent appearance variations caused by smoke. Our method preserves the rendering efficiency of 3D Gaussian Splatting while improving robustness to smoke-induced degradation. Results demonstrate the effectiveness of our method for generating consistent and visually clear novel views in challenging smoke environments.

关键词: 3D scene reconstruction, smoke-degraded images, multimodal large language models, 3D Gaussian Splatting, novel view synthesis, visual priors, smoke environment, view-dependent appearance

215. ❌ Probing Intrinsic Medical Task Relationships: A Contrastive Learning Perspective

作者: Jonas Muth, Zdravko Marinov, Simon Reiß 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05651v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于医学计算机视觉任务的关系分析，使用对比学习框架（Task-Contrastive Learning）探索30个医学视觉任务（如分割、检测、生成、变换）在表示空间中的内在结构和关系。论文与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、对齐、代理系统等）完全无关，因为这些关键词针对的是自然语言处理或通用大模型领域，而本文是纯粹的计算机视觉研究。唯一的相关关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于AI在医学成像（科学领域）的应用，但并非核心创新点，只是应用场景，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文研究了医学视觉任务之间的内在关系，通过提出的Task-Contrastive Learning框架将30个任务嵌入共享表示空间，揭示了任务在嵌入空间中的相似性、差异性和结构特性。

摘要翻译

尽管医学计算机视觉领域的研究大多聚焦于提升特定任务的性能，任务间的内在关联——即它们在表征层面如何相互联系、重叠或差异——在很大程度上仍未得到探索。本研究旨在探究医学视觉任务之间的这些本质关系，具体而言，我们考察了30项任务，包括语义任务（如分割与检测）、图像生成任务（如去噪、修复或着色）以及图像变换任务（如几何变换）。我们的目标是探索一个数据驱动的表征空间能否捕捉到任务间的潜在结构，这些任务涵盖来自39个差异极大的医学影像模态数据集，包括计算机断层扫描（CT）、磁共振成像（MRI）、电子显微镜、X射线、超声等。通过揭示任务间的相互关联，我们旨在深入理解其基本特性与内在联系。为此，我们提出了任务对比学习（Task-Contrastive Learning, TaCo），这是一个专为将任务嵌入共享表征空间而设计的对比学习框架。借助TaCo，我们将来自不同模态的异构任务映射到一个联合空间中，并分析其特性：识别哪些任务具有独特表征，哪些任务相互融合，以及任务的迭代修改如何反映在嵌入空间中。本研究为理解医学视觉任务的内在结构奠定了基础，从而在嵌入空间中更深入地揭示了任务间的相似性及其相互关联的本质属性。

摘要 (Abstract)

While much of the medical computer vision community has focused on advancing performance for specific tasks, the underlying relationships between tasks, i.e., how they relate, overlap, or differ on a representational level, remain largely unexplored. Our work explores these intrinsic relationships between medical vision tasks, specifically, we investigate 30 tasks, such as semantic tasks (e.g., segmentation and detection), image generative tasks (e.g., denoising, inpainting, or colorization), and image transformation tasks (e.g., geometric transformations). Our goal is to probe whether a data-driven representation space can capture an underlying structure of tasks across a variety of 39 datasets from wildly different medical imaging modalities, including computed tomography, magnetic resonance, electron microscopy, X-ray ultrasound and more. By revealing how tasks relate to one another, we aim to provide insights into their fundamental properties and interconnectedness. To this end, we introduce Task-Contrastive Learning (TaCo), a contrastive learning framework designed to embed tasks into a shared representation space. Through TaCo, we map these heterogeneous tasks from different modalities into a joint space and analyze their properties: identifying which tasks are distinctly represented, which blend together, and how iterative alterations to tasks are reflected in the embedding space. Our work provides a foundation for understanding the intrinsic structure of medical vision tasks, offering a deeper understanding of task similarities and their interconnected properties in embedding spaces.

关键词: medical computer vision, task relationships, contrastive learning, representation space, medical imaging, Task-Contrastive Learning, embedding space, task similarities

216. ❌ PanopticQuery: Unified Query-Time Reasoning for 4D Scenes

作者: Ruilin Tang, Yang Zhou, Zhong Ye, Wenxi Liu, Yan Huang, Shengfeng He 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05638v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于4D场景理解、动态重建和自然语言查询的语义接地，使用4D高斯泼溅和神经场优化等技术，但未涉及大模型、深度学习技术原理或AI for Science的具体创新。所有关键词均与大模型技术、训练方法、推理优化、代理系统或科学AI应用相关，而本文研究的是计算机视觉和场景理解的特定问题，与这些关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了PanopticQuery框架，通过4D高斯泼溅重建和多视图语义共识机制，解决了动态4D场景中复杂自然语言查询的语义接地问题，并在新基准上取得了最先进性能。

摘要翻译

通过自然语言查询理解动态4维环境，不仅需要精确的场景重建，还要求跨越空间、时间和视角的鲁棒语义定位。尽管近期基于神经表征的方法推动了4维重建的发展，但其在上下文推理方面仍存在局限，尤其对于交互、时序动作和空间关系等复杂语义的理解。核心挑战在于如何将含噪声的、视角依赖的预测转化为全局一致的4维解释。我们提出PanopticQuery框架，实现4维场景中统一查询时序推理。该方法基于高保真动态重建技术4维高斯泼溅构建，并引入多视角语义共识机制：通过聚合多视角与多时间帧的2维语义预测，将自然语言查询锚定至场景中。该过程筛选不一致输出，强化几何一致性，并通过神经场优化将2维语义提升为结构化的4维语义基元。为支持评估，我们提出Panoptic-L4D新基准，专门针对动态场景中基于语言的查询任务。实验表明，PanopticQuery在复杂语言查询（包括属性、动作、空间关系及多对象交互）上实现了最优性能。视频演示详见补充材料。

摘要 (Abstract)

Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.

关键词: 4D scene understanding, natural language query, dynamic reconstruction, semantic grounding, 4D Gaussian Splatting, multi-view consensus, neural field optimization, Panoptic-L4D benchmark

217. ❌ Towards Athlete Fatigue Assessment from Association Football Videos

作者: Xavier Bou, Nathan Correger, Alexandre Cloots, Cédric Gavage, Silvio Giancola, Cédric Schwartz, François Delvaux, Rudi Cloots, Marc Van Droogenbroeck, Anthony Cioppa 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05636v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究足球运动员疲劳评估，使用计算机视觉和运动学分析从视频中提取轨迹和速度/加速度数据，属于AI在体育科学中的应用。所有关键词均与大模型、深度学习技术原理、训练方法、推理优化、对齐、代理系统等直接相关，而本文完全不涉及这些内容。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为体育科学可视为科学应用的一个领域，但论文未明确使用AI术语，主要基于传统计算机视觉和信号处理，因此给予5分（有一定关联）。其他关键词均得0分（完全无关）。

!!! tip deepseek-chat TL;DR

该论文研究是否可以从单目广播视频中提取运动学信号来评估足球运动员的疲劳，结果表明视频数据可以支持疲劳分析，但受轨迹噪声和校准误差影响。

摘要翻译

疲劳监测在足球运动中至关重要，因其与伤病风险及战术表现密切相关。然而，客观的疲劳相关指标通常来源于主观自我报告数据、实验室测试得出的生物标志物，或近年来使用的心率监测仪、GPS追踪数据等侵入式传感器。本文研究单目转播视频能否提供足够质量的时空信号，以支持面向疲劳的分析。基于先进的比赛状态重建方法，我们提取了球场坐标系下的球员轨迹，并提出一种新颖的运动学处理算法，从重建轨迹中获取时间一致的速度和加速度估计值。随后，我们利用这些信号构建加速度-速度曲线，并分析其作为疲劳相关表现指标的行为特征。我们在公开的SoccerNet-GSR基准上评估了完整流程，同时考虑了30秒片段和完整的45分钟半场比赛，以检验短期可靠性和长期时间一致性。研究结果表明，单目比赛状态重建能够恢复与加速度-速度曲线分析兼容的运动学模式，同时也揭示了其对转播视频固有的轨迹噪声、校准误差和时间不连续性的敏感性。这些发现支持将单目转播视频作为疲劳分析的低成本基础，并明确了未来研究面临的方法学挑战。

摘要 (Abstract)

Fatigue monitoring is central in association football due to its links with injury risk and tactical performance. However, objective fatigue-related indicators are commonly derived from subjective self-reported metrics, biomarkers derived from laboratory tests, or, more recently, intrusive sensors such as heart monitors or GPS tracking data. This paper studies whether monocular broadcast videos can provide spatio-temporal signals of sufficient quality to support fatigue-oriented analysis. Building on state-of-the-art Game State Reconstruction methods, we extract player trajectories in pitch coordinates and propose a novel kinematics processing algorithm to obtain temporally consistent speed and acceleration estimates from reconstructed tracks. We then construct acceleration–speed (A-S) profiles from these signals and analyze their behavior as fatigue-related performance indicators. We evaluate the full pipeline on the public SoccerNet-GSR benchmark, considering both 30-second clips and a complete 45-minute half to examine short-term reliability and longer-term temporal consistency. Our results indicate that monocular GSR can recover kinematic patterns that are compatible with A-S profiling while also revealing sensitivity to trajectory noise, calibration errors, and temporal discontinuities inherent to broadcast footage. These findings support monocular broadcast video as a low-cost basis for fatigue analysis and delineate the methodological challenges for future research.

关键词: fatigue assessment, association football, monocular broadcast videos, player trajectories, kinematics processing, acceleration-speed profiles, SoccerNet-GSR benchmark, temporal consistency

218. ❌ SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection

作者: Letian Bai, Chengyu Tao, Juan Du 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05632v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文SGANet专注于多模态多视角异常检测，提出了一种结合语义和几何对齐的计算机视觉方法。虽然属于AI应用领域，但所有给定的关键词均与大语言模型（LLM）技术、训练方法、推理优化、对齐技术、代理系统等直接相关，而本文完全不涉及任何语言模型、文本处理或相关技术，仅使用视觉和几何特征进行异常检测，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为SGANet的语义和几何对齐网络，用于解决多视角异常检测中的特征不一致问题，通过在SiM3D和Eyecandies数据集上的实验验证了其有效性。

摘要翻译

多视角异常检测旨在利用从多个视角采集的观测数据识别复杂物体表面的缺陷。然而，现有的无监督方法常因视角变化与模态差异导致的特征不一致问题而受限。为解决这些挑战，我们提出了一种语义与几何对齐网络（Semantic and Geometric Alignment Network, SGANet），这是一个用于多模态多视角异常检测的统一框架，通过有效结合语义对齐与几何对齐，学习跨视角与跨模态的物理连贯特征表示。SGANet包含三个核心组件：选择性跨视角特征细化模块（Selective Cross-view Feature Refinement Module, SCFRM）通过选择性聚合相邻视角的信息化局部特征以增强跨视角特征交互；语义-结构局部对齐（Semantic-Structural Patch Alignment, SSPA）在保持视角变换下结构一致性的同时，强制实现跨模态的语义对齐；多视角几何对齐（Multi-View Geometric Alignment, MVGA）进一步对齐跨视角的几何对应局部区域。通过联合建模特征交互、语义与结构一致性以及全局几何对应关系，SGANet在多模态多视角场景中显著提升了异常检测性能。在SiM3D和Eyecandies数据集上的大量实验表明，SGANet在异常检测与定位任务上均达到了最先进的性能，验证了其在真实工业场景中的有效性。

摘要 (Abstract)

Multi-view anomaly detection aims to identify surface defects on complex objects using observations captured from multiple viewpoints. However, existing unsupervised methods often suffer from feature inconsistency arising from viewpoint variations and modality discrepancies. To address these challenges, we propose a Semantic and Geometric Alignment Network (SGANet), a unified framework for multimodal multi-view anomaly detection that effectively combines semantic and geometric alignment to learn physically coherent feature representations across viewpoints and modalities. SGANet consists of three key components. The Selective Cross-view Feature Refinement Module (SCFRM) selectively aggregates informative patch features from adjacent views to enhance cross-view feature interaction. The Semantic-Structural Patch Alignment (SSPA) enforces semantic alignment across modalities while maintaining structural consistency under viewpoint transformations. The Multi-View Geometric Alignment (MVGA) further aligns geometrically corresponding patches across viewpoints. By jointly modeling feature interaction, semantic and structural consistency, and global geometric correspondence, SGANet effectively enhances anomaly detection performance in multimodal multi-view settings. Extensive experiments on the SiM3D and Eyecandies datasets demonstrate that SGANet achieves state-of-the-art performance in both anomaly detection and localization, validating its effectiveness in realistic industrial scenarios.

关键词: multimodal anomaly detection, multi-view anomaly detection, semantic alignment, geometric alignment, feature refinement, cross-view interaction, industrial inspection, 3D object analysis

作者: Yongchuan Cui, Peng Liu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05629v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文提出LLaRS，一个用于遥感图像恢复和融合的统一基础模型，核心创新包括：1）使用语言提示（language prompting）控制多任务处理，属于大模型应用（Foundation Models得10分）；2）采用三层混合专家架构（Mixture of Experts得10分）；3）使用低秩适配器（LoRA）进行参数高效微调（PEFT得10分）；4）属于AI for Science在遥感领域的应用（AI for Science得10分）。其他关键词如预训练、微调有一定关联（各得5分），但非核心。其余关键词（如SLMs、RLHF、RAG等）未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了首个用于多模态遥感图像恢复和融合的统一基础模型LLaRS，通过语言提示和混合专家架构实现了多任务处理，并在构建的百万级数据集上验证了其优越性能和高效迁移能力。

摘要翻译

遥感影像常受云层、雾霾、噪声、分辨率限制及传感器异质性影响。现有复原与融合方法通常针对每种退化类型分别训练独立模型。本研究提出语言条件化大规模遥感复原模型（LLaRS），这是首个面向多模态、多任务遥感低层视觉的统一基础模型。LLaRS采用Sinkhorn-Knopp最优传输方法将异质波段对齐至语义匹配的槽位，通过三层互补的混合专家模块进行特征路由（卷积专家处理空间模式，通道混合专家保障光谱保真度，配备低秩适配器的注意力专家捕获全局上下文），并借助步级动态权重调整机制稳定联合训练。为训练LLaRS，我们构建了LLaRS1M数据集——一个涵盖十一类复原与增强任务的百万级多任务数据集，整合了真实配对观测数据、受控合成退化数据以及多样化的自然语言提示。实验表明，LLaRS在七种竞争模型中均取得稳定优势，参数高效微调实验进一步验证了其在未见数据上具备强大的迁移能力与适应效率。项目仓库：https://github.com/yc-cui/LLaRS

摘要 (Abstract)

Remote sensing imagery suffers from clouds, haze, noise, resolution limits, and sensor heterogeneity. Existing restoration and fusion approaches train separate models per degradation type. In this work, we present Language-conditioned Large-scale Remote Sensing restoration model (LLaRS), the first unified foundation model for multi-modal and multi-task remote sensing low-level vision. LLaRS employs Sinkhorn-Knopp optimal transport to align heterogeneous bands into semantically matched slots, routes features through three complementary mixture-of-experts layers (convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context), and stabilizes joint training via step-level dynamic weight adjustment. To train LLaRS, we construct LLaRS1M, a million-scale multi-task dataset spanning eleven restoration and enhancement tasks, integrating real paired observations and controlled synthetic degradations with diverse natural language prompts. Experiments show LLaRS consistently outperforms seven competitive models, and parameter-efficient finetuning experiments demonstrate strong transfer capability and adaptation efficiency on unseen data. Repo: https://github.com/yc-cui/LLaRS

关键词: foundation model, remote sensing, image restoration, multi-modal fusion, language prompting, mixture of experts, parameter-efficient fine-tuning, low-rank adapters

220. ❌ FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos

作者: Alexandros Delitzas, Chenyangguang Zhang, Alexey Gavryushin, Tommaso Di Mario, Boyang Sun, Rishabh Dabral, Leonidas Guibas, Christian Theobalt, Marc Pollefeys, Francis Engelmann, Daniel Barath 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05621v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文FunRec专注于从第一人称交互视频中重建功能性3D场景，涉及计算机视觉、3D重建、运动跟踪和几何建模，但完全不涉及大语言模型、深度学习技术原理或AI for Science等关键词。所有关键词均与大模型、深度学习技术或特定科学AI应用相关，而该论文属于纯粹的计算机视觉和3D重建领域，与这些关键词无任何关联，因此所有关键词评分为0。

!!! tip deepseek-chat TL;DR

FunRec提出了一种从第一人称RGB-D交互视频中重建功能性3D室内场景数字孪生的方法，自动发现铰接部件、估计运动参数并重建几何，在真实和模拟基准测试中显著优于现有方法。

摘要翻译

本文提出FunRec方法，该方法可直接从第一人称视角的RGB-D交互视频中重建室内场景的功能性三维数字孪生。与现有依赖受控环境、多状态采集或CAD先验知识的关节化重建方法不同，FunRec直接基于真实世界的人类交互序列进行操作，以恢复可交互的三维场景。它能自动发现关节部件，估算其运动学参数，跟踪其三维运动，并在规范空间中重建静态与动态几何结构，最终生成兼容仿真的网格模型。在全新的真实与模拟基准测试中，FunRec大幅超越先前工作，在部件分割上实现了高达+50 mIoU的提升，关节与姿态误差降低5至10倍，并显著提高了重建精度。我们进一步展示了其在仿真用URDF/USD导出、手部引导的功用性映射以及机器人-场景交互等方面的应用潜力。

摘要 (Abstract)

We present FunRec, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunRec operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunRec surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10 times lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction.

关键词: 3D reconstruction, egocentric video, functional scenes, articulated parts, kinematic parameters, digital twins, interaction videos, simulation-compatible meshes

221. ❌ ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

作者: Zhaohong Huang, Wenjing Liu, Yuxin Zhang, Fei Chao, Rongrong Ji 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05601v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文ID-Selection专注于大型视觉语言模型（LVLMs）的高效推理加速，通过视觉令牌剪枝技术减少冗余令牌，属于大模型推理优化领域。与’Large Language Models OR LLMs OR Foundation Models’高度相关（8分），因为LVLMs是大语言模型在视觉领域的扩展应用。与’Speculative Decoding OR Inference Acceleration’高度相关（8分），因为论文核心是减少FLOPs、加速推理。其他关键词如MoE、SFT、RAG、量化等均未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为ID-Selection的视觉令牌选择策略，通过结合重要性估计和多样性感知迭代选择，在极端剪枝比率下（如保留仅16个令牌）显著减少大型视觉语言模型的推理计算量（FLOPs降低97%以上），同时保持91.8%的原始性能。

摘要翻译

近期研究探索了通过视觉令牌剪枝来加速大型视觉语言模型（LVLMs）的推理。然而，现有方法往往难以平衡令牌的重要性与多样性：基于重要性的方法倾向于保留冗余令牌，而基于多样性的方法则可能忽略信息丰富的令牌。在高剪枝率下，仅保留极少视觉令牌子集至关重要，这一权衡问题尤为突出。为解决此问题，我们提出ID-Selection，一种简单而有效的令牌选择策略，用于实现高效的LVLM推理。其核心思想是将重要性估计与多样性感知的迭代选择相结合：首先为每个令牌分配重要性分数，随后逐一选择高分令牌，同时逐步抑制相似令牌的分数。通过这种方式，ID-Selection在统一的选取过程中既保留了信息丰富的令牌，又减少了冗余。在5种LVLM主干模型和16个主流基准测试上的大量实验表明，ID-Selection始终能实现更优的性能与效率，尤其在极端剪枝率下表现突出。例如，在LLaVA-1.5-7B模型上，ID-Selection剪除了97.2%的视觉令牌，仅保留16个令牌，在减少超过97%推理浮点运算量的同时，保持了原模型91.8%的性能，且无需额外训练。

摘要 (Abstract)

Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.

关键词: visual token pruning, large vision-language models, inference acceleration, importance-diversity selection, token redundancy reduction, efficient LVLM inference, FLOPs reduction, extreme pruning ratios

222. ❌ BPC-Net: Annotation-Free Skin Lesion Segmentation via Boundary Probability Calibration

作者: Yujie Yao, Yuhaohang He, Junjie Huang, Zhou Liu, Jiangzhao Li, Yan Qiao, Wen Xiao, Yunsen Liang, Xiaofan Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05594v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于皮肤病变分割的计算机视觉任务，提出了一种无标注分割框架BPC-Net，核心贡献在于边界概率校准和特征解耦解码器设计。论文与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、智能体等）完全无关，因为这些关键词特指自然语言处理或通用大模型领域。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于AI在生物医学图像分析（皮肤病变分割）的应用，属于AI for Science范畴，但并非核心聚焦于大模型或深度学习技术原理创新，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文解决了无标注皮肤病变分割中边界概率欠置信的问题，提出了BPC-Net框架，通过高斯概率平滑和特征解耦解码器，在多个数据集上达到了无监督方法的先进性能。

摘要翻译

无标注皮肤病灶分割对于资源有限的皮肤镜部署具有吸引力。然而，其性能仍受限于三个相互关联的挑战：噪声伪标签监督、有限目标域数据下的不稳定迁移以及边界概率置信不足。现有无标注方法主要关注伪标签去噪，而压缩的边界概率对最终掩码质量的影响尚未得到充分关注，尽管它直接影响轮廓完整性，且无法仅通过全局阈值调整充分修正。为解决这一问题，我们提出BPC-Net——一种用于无标注皮肤病灶分割的边界概率校准框架。该框架的核心是高斯概率平滑（Gaussian Probability Smoothing, GPS），该方法在阈值化前执行局部概率空间校准，以恢复置信不足的病灶边界，同时避免引发无差别的前景扩张。为在噪声伪监督和跨域迁移场景下支持该校准，我们进一步整合了两个辅助设计：特征解耦解码器（分别处理上下文抑制、细节恢复和边界细化）以及交互分支适应策略（仅更新伪标签交互分支，同时保留已部署的纯图像分割路径）。在严格的无标注协议下，训练或目标域适应过程中均未使用人工标注掩码，且验证标签（若可用）仅用于最终操作点选择。在ISIC-2017、ISIC-2018和PH2数据集上的实验表明，所提框架在已发表的无监督方法中达到最先进性能，宏观平均戴斯系数（Dice coefficient）和杰卡德指数（Jaccard index）分别达到85.80%和76.97%，同时在PH2数据集上接近有监督参考方法的性能。

摘要 (Abstract)

Annotation-free skin lesion segmentation is attractive for low-resource dermoscopic deployment. However, its performance remains constrained by three coupled challenges: noisy pseudo-label supervision, unstable transfer under limited target-domain data, and boundary probability under-confidence. Most existing annotation-free methods primarily focus on pseudo-label denoising. In contrast, the effect of compressed boundary probabilities on final mask quality has received less explicit attention, although it directly affects contour completeness and cannot be adequately corrected by global threshold adjustment alone. To address this issue, we propose BPC-Net, a boundary probability calibration framework for annotation-free skin lesion segmentation. The core of the framework is Gaussian Probability Smoothing (GPS), which performs localized probability-space calibration before thresholding to recover under-confident lesion boundaries without inducing indiscriminate foreground expansion. To support this calibration under noisy pseudo-supervision and cross-domain transfer, we further incorporate two auxiliary designs: a feature-decoupled decoder that separately handles context suppression, detail recovery, and boundary refinement, and an interaction-branch adaptation strategy that updates only the pseudo-label interaction branch while preserving the deployed image-only segmentation path. Under a strictly annotation-free protocol, no manual masks are used during training or target-domain adaptation, and validation labels, when available, are used only for final operating-point selection. Experiments on ISIC-2017, ISIC-2018, and PH2 show that the proposed framework achieves state-of-the-art performance among published unsupervised methods, reaching a macro-average Dice coefficient and Jaccard index of 85.80% and 76.97%, respectively, while approaching supervised reference performance on PH2.

关键词: skin lesion segmentation, annotation-free, boundary probability calibration, Gaussian Probability Smoothing, feature-decoupled decoder, unsupervised learning, dermoscopic images, medical image analysis

223. ❌ Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher

作者: Pengcheng Weng, Yanyu Qian, Yangxin Xu, Fei Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05584v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是鲁棒多模态人类感知中的模态缺失问题，提出了一种基于元学习和知识扩散的’Purify-then-Align’框架。虽然属于深度学习在人类感知（可视为AI应用）领域的研究，但论文内容专注于多模态融合、特征对齐、知识蒸馏和元学习，并未涉及任何大语言模型（LLM）相关技术、训练方法（如预训练、微调、对齐）、推理优化、代理系统或AI for Science的具体子领域（如生物信息学）。所有评分关键词均与大模型技术或指定的科学AI应用无关，因此相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对多模态人类感知中模态缺失导致的表示差距和污染效应问题，提出了一种'Purify-then-Align'框架，通过元学习加权和扩散知识蒸馏来净化知识源并跨模态对齐特征，从而在缺失模态场景下显著提升了单模态编码器的鲁棒性和性能。

摘要翻译

鲁棒的多模态人体感知必须克服模态缺失这一关键挑战。其面临的两个主要障碍是异构数据间的表征鸿沟与低质量模态带来的污染效应。这两种障碍存在因果关联，因为污染引入的噪声从根本上阻碍了表征差异的缩小。本文提出PTA，一种新颖的“先净化后对齐”框架，该框架通过元学习与知识扩散的协同整合来解决这一因果依赖问题。为净化知识源，PTA首先采用元学习驱动的加权机制，动态学习降低噪声大、贡献度低的模态影响。随后，为对齐不同模态，PTA引入一种基于扩散的知识蒸馏范式：通过净化后的共识形成一个信息丰富的纯净教师模型，进而细化每个学生模态的特征。这种“先净化后对齐”策略的最终收益，是创建出蕴含跨模态知识的、异常强大的单模态编码器。在表征鸿沟与污染效应显著的大规模MM-Fi和XRF55数据集上的综合实验表明，PTA实现了最先进的性能，并在多种模态缺失场景中显著提升了单模态模型的鲁棒性。

摘要 (Abstract)

Robust multimodal human sensing must overcome the critical challenge of missing modalities. Two principal barriers are the Representation Gap between heterogeneous data and the Contamination Effect from low-quality modalities. These barriers are causally linked, as the corruption introduced by contamination fundamentally impedes the reduction of representation disparities. In this paper, we propose PTA, a novel “Purify-then-Align” framework that solves this causal dependency through a synergistic integration of meta-learning and knowledge diffusion. To purify the knowledge source, PTA first employs a meta-learning-driven weighting mechanism that dynamically learns to down-weight the influence of noisy, low-contributing modalities. Subsequently, to align different modalities, PTA introduces a diffusion-based knowledge distillation paradigm in which an information-rich clean teacher, formed from this purified consensus, refines the features of each student modality. The ultimate payoff of this “Purify-then-Align” strategy is the creation of exceptionally powerful single-modality encoders imbued with cross-modal knowledge. Comprehensive experiments on the large-scale MM-Fi and XRF55 datasets, under pronounced Representation Gap and Contamination Effect, demonstrate that PTA achieves state-of-the-art performance and significantly improves the robustness of single-modality models in diverse missing-modality scenarios.

关键词: multimodal human sensing, missing modalities, representation gap, contamination effect, knowledge distillation, meta-learning, diffusion models, robustness

224. ❌ WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval

作者: Yizhuo Xu, Chaojian Yu, Yuanjie Shao, Tongliang Liu, Qinmu Peng, Xinge You 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05583v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于视觉语言预训练模型的微调方法（特别是针对组合图像检索任务），仅与关键词’Post-training OR Supervised Fine-tuning OR SFT’高度相关（10分），因为论文核心是提出一种新的权重正则化微调网络WRF4CIR来解决过拟合问题。其他关键词均与论文内容无关，论文不涉及大语言模型、推理方法、对齐技术、代理系统、模型压缩等主题，也不属于生物信息学等科学AI应用领域。

!!! tip deepseek-chat TL;DR

该论文针对组合图像检索任务中基于视觉语言预训练模型微调时存在的严重过拟合问题，提出了一种权重正则化微调网络WRF4CIR，通过在微调过程中向模型权重施加对抗性扰动来增加训练难度，从而有效缩小泛化差距并显著提升检索性能。

摘要翻译

组合图像检索任务旨在根据参考图像和修改文本来检索目标图像。当前CIR方法主要依赖于对视觉语言预训练模型进行微调。然而，我们发现这些方法普遍存在严重的过拟合问题，这对数据有限的三元组CIR任务构成了挑战。为深入理解该问题，我们对基于VLP的CIR中的过拟合现象进行了系统性研究，揭示了不同模型与数据集之间存在显著且此前被忽视的泛化差距。基于这些发现，我们提出了WRF4CIR——一种用于CIR的权重正则化微调网络。具体而言，在微调过程中，我们对模型权重施加对抗性扰动以实现正则化，这些扰动沿梯度下降的反方向生成。直观上，WRF4CIR增加了训练数据拟合的难度，从而有助于在有限的三元组监督下缓解CIR中的过拟合问题。在基准数据集上的大量实验表明，WRF4CIR显著缩小了泛化差距，并较现有方法实现了实质性提升。

摘要 (Abstract)

Composed Image Retrieval (CIR) task aims to retrieve target images based on reference images and modification texts. Current CIR methods primarily rely on fine-tuning vision-language pre-trained models. However, we find that these approaches commonly suffer from severe overfitting, posing challenges for CIR with limited triplet data. To better understand this issue, we present a systematic study of overfitting in VLP-based CIR, revealing a significant and previously overlooked generalization gap across different models and datasets. Motivated by these findings, we introduce WRF4CIR, a Weight-Regularized Fine-tuning network for CIR. Specifically, during the fine-tuning process, we apply adversarial perturbations to the model weights for regularization, where these perturbations are generated in the opposite direction of gradient descent. Intuitively, WRF4CIR increases the difficulty of fitting the training data, which helps mitigate overfitting in CIR under limited triplet supervision. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.

关键词: Composed Image Retrieval, Fine-tuning, Overfitting, Weight Regularization, Vision-Language Pre-trained Models, Generalization Gap, Adversarial Perturbations, Triplet Data

225. ❌ High-Resolution Single-Shot Polarimetric Imaging Made Easy

作者: Shuangfan Zhou, Chu Zhou, Heng Guo, Youwei Lyu, Boxin Shi, Zhanyu Ma, Imari Sato 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05581v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是偏振成像技术，具体涉及多相机系统设计和基于物理引导的神经网络重建方法，属于计算机视觉和计算成像领域。所有评分关键词均与大模型、深度学习技术原理或AI在科学领域的应用相关，而本文完全不涉及这些主题，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为EasyPolar的多视角偏振成像框架，通过三相机硬件设计和置信度引导的偏振重建网络，解决了单次拍摄偏振成像中空间分辨率降低和伪影的问题，实现了高质量的偏振图像重建并提升了下游任务性能。

摘要翻译

基于偏振的视觉技术因能提供超越RGB图像的更丰富物理线索而日益受到关注。虽然单次拍摄捕获在实际应用中极具吸引力，但现有的焦平面分割（DoFP）传感器由于其空间复用机制，本质上存在空间分辨率降低和伪影问题。为了在不牺牲快照能力的前提下克服这些限制，我们提出了EasyPolar——一种多视角偏振成像框架。该系统的物理基础在于：三个独立的强度测量足以完整表征线性偏振。基于此原理，我们设计了一种三相机配置，包含三个同步的RGB相机，分别捕获一个非偏振视图和两个不同偏振方向的偏振视图。在此硬件设计基础上，我们进一步提出了一种置信度引导的偏振重建网络，以解决多视角融合中可能存在的错位问题。该网络在置信度感知的物理引导机制下执行多模态特征融合，有效抑制了形变引起的伪影，并对解空间施加了显式的几何约束。实验结果表明，我们的方法能够获得高质量的重建结果，并有利于多种下游任务。

摘要 (Abstract)

Polarization-based vision has gained increasing attention for providing richer physical cues beyond RGB images. While achieving single-shot capture is highly desirable for practical applications, existing Division-of-Focal-Plane (DoFP) sensors inherently suffer from reduced spatial resolution and artifacts due to their spatial multiplexing mechanism. To overcome these limitations without sacrificing the snapshot capability, we propose EasyPolar, a multi-view polarimetric imaging framework. Our system is grounded in the physical insight that three independent intensity measurements are sufficient to fully characterize linear polarization. Guided by this, we design a triple-camera setup consisting of three synchronized RGB cameras that capture one unpolarized view and two polarized views with distinct orientations. Building upon this hardware design, we further propose a confidence-guided polarization reconstruction network to address the potential misalignment in multi-view fusion. The network performs multi-modal feature fusion under a confidence-aware physical guidance mechanism, which effectively suppresses warping-induced artifacts and enforces explicit geometric constraints on the solution space. Experimental results demonstrate that our method achieves high-quality results and benefits various downstream tasks.

关键词: polarimetric imaging, single-shot capture, multi-view fusion, confidence-guided reconstruction, polarization reconstruction, DoFP sensors, triple-camera setup, downstream tasks

226. ❌ Physics-Aligned Spectral Mamba: Decoupling Semantics and Dynamics for Few-Shot Hyperspectral Target Detection

作者: Luqi Gong, Qixin Xie, Yue Chen, Ziqiang Chen, Fanda Fan, Shuai Zhao, Chao Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05562v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文专注于高光谱目标检测的少样本学习，属于AI for Science（遥感/地球科学应用）。核心创新是参数高效微调（PEFT）方法（DCTMA适配器），与’PEFT OR LoRA OR Parameter-efficient Fine-tuning’高度相关（10分）。论文涉及领域适应和微调，与’Pre-training OR Continual Pre-training OR Domain Adaptation’和’Post-training OR Supervised Fine-tuning OR SFT’有一定关联（各5分）。其他关键词主要涉及大语言模型（LLM）技术、推理、对齐、代理等，与论文的计算机视觉/遥感焦点无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对少样本高光谱目标检测中全参数微调效率低、易过拟合的问题，提出了SpecMamba框架，通过离散余弦变换Mamba适配器实现参数高效的频谱适应，在多个数据集上实现了优于现有方法的检测精度和跨域泛化能力。

摘要翻译

元学习促进了少样本高光谱目标检测（HTD）的发展，但深度骨干网络的适应仍具挑战性。全参数微调效率低下且易过拟合，现有方法大多忽略了高光谱数据的频域结构和光谱波段连续性，限制了光谱适应与跨域泛化能力。为解决这些问题，我们提出了SpecMamba，一个参数高效且频率感知的框架，将稳定的语义表征与敏捷的光谱适应解耦。具体而言，我们在冻结的Transformer表征之上引入了离散余弦变换曼巴适配器（DCTMA）。通过离散余弦变换（DCT）将光谱特征投影至频域，并利用曼巴（Mamba）的线性复杂度状态空间递归，DCTMA显式地捕获全局光谱依赖性和波段连续性，同时避免了全微调的冗余。此外，为解决有限样本量导致的原型漂移问题，我们设计了一种先验引导三编码器（PGTE），允许实验室光谱先验在不破坏稳定语义特征空间的前提下指导可学习适配器的优化。最后，我们开发了一种自监督伪标签映射（SSPLM）策略用于测试时适应，通过不确定性感知采样和双路径一致性约束实现高效的决策边界细化。在多个公开数据集上的大量实验表明，SpecMamba在检测精度和跨域泛化能力上均持续优于现有先进方法。

摘要 (Abstract)

Meta-learning facilitates few-shot hyperspectral target detection (HTD), but adapting deep backbones remains challenging. Full-parameter fine-tuning is inefficient and prone to overfitting, and existing methods largely ignore the frequency-domain structure and spectral band continuity of hyperspectral data, limiting spectral adaptation and cross-domain generalization.To address these challenges, we propose SpecMamba, a parameter-efficient and frequency-aware framework that decouples stable semantic representation from agile spectral adaptation. Specifically, we introduce a Discrete Cosine Transform Mamba Adapter (DCTMA) on top of frozen Transformer representations. By projecting spectral features into the frequency domain via DCT and leveraging Mamba’s linear-complexity state-space recursion, DCTMA explicitly captures global spectral dependencies and band continuity while avoiding the redundancy of full fine-tuning. Furthermore, to address prototype drift caused by limited sample sizes, we design a Prior-Guided Tri-Encoder (PGTE) that allows laboratory spectral priors to guide the optimization of the learnable adapter without disrupting the stable semantic feature space. Finally, a Self-Supervised Pseudo-Label Mapping (SSPLM) strategy is developed for test-time adaptation, enabling efficient decision boundary refinement through uncertainty-aware sampling and dual-path consistency constraints. Extensive experiments on multiple public datasets demonstrate that SpecMamba consistently outperforms state-of-the-art methods in detection accuracy and cross-domain generalization.

关键词: hyperspectral target detection, few-shot learning, parameter-efficient fine-tuning, Mamba, domain adaptation, spectral adaptation, cross-domain generalization, self-supervised learning

227. ❌ Evaluation Before Generation: A Paradigm for Robust Multimodal Sentiment Analysis with Missing Modalities

作者: Rongfei Chen, Tingting Zhang, Xiaoyu Shen, Wei Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05558v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究多模态情感分析中的缺失模态问题，提出了一种基于提示学习的框架。论文与大多数关键词无关，因为这些关键词主要针对大语言模型（LLM）的技术原理、训练方法、推理优化、对齐技术、代理系统等。唯一相关的关键词是’Pre-training OR Continual Pre-training OR Domain Adaptation’，因为论文使用了预训练模型（pretrained models）并涉及领域适应（missing modality adaptation），但这不是论文的核心创新点，只是基础技术手段，因此给予5分（有一定关联）。论文未涉及大模型在不同领域的创新应用，也未涉及LLM技术原理的创新。

!!! tip deepseek-chat TL;DR

该论文针对多模态情感分析中缺失模态导致性能下降的问题，提出了一种基于提示学习的缺失模态适应框架，通过评估缺失模态重要性、解耦模态特定提示和动态加权等方法，在三个公开基准上实现了最先进的性能。

摘要翻译

模态缺失问题是多模态情感分析领域的一项基础性挑战，在实际场景中会显著降低模型的准确性与泛化能力。现有方法主要通过提示学习与预训练模型提升鲁棒性，但仍存在两点局限：其一，生成缺失模态的必要性缺乏严谨评估；其二，多模态提示间的结构依赖关系及其全局连贯性尚未得到充分探索。为应对这些问题，本文提出一种基于提示的缺失模态自适应框架。该框架在输入阶段引入缺失模态评估器，利用预训练模型与伪标签动态评估缺失模态的重要性，从而避免低质量的数据填补。在此基础上，模态不变提示解耦模块将共享提示分解为模态特定的私有提示，以捕捉内在的局部相关性并提升表征质量。此外，动态提示加权模块通过跨注意力输出计算基于互信息的权重，以自适应抑制缺失模态的干扰。为增强全局一致性，多层级提示动态连接模块通过残差连接将共享提示与自注意力输出相融合，利用全局提示先验强化关键引导特征。在CMU MOSI、CMU MOSEI和CH-SIMS三个公开基准上的大量实验表明，所提框架在不同模态缺失设定下均实现了最先进的性能与稳定的结果。代码已开源：https://github.com/rongfei-chen/ProMMA

摘要 (Abstract)

The missing modality problem poses a fundamental challenge in multimodal sentiment analysis, significantly degrading model accuracy and generalization in real world scenarios. Existing approaches primarily improve robustness through prompt learning and pre trained models. However, two limitations remain. First, the necessity of generating missing modalities lacks rigorous evaluation. Second, the structural dependencies among multimodal prompts and their global coherence are insufficiently explored. To address these issues, a Prompt based Missing Modality Adaptation framework is proposed. A Missing Modality Evaluator is introduced at the input stage to dynamically assess the importance of missing modalities using pretrained models and pseudo labels, thereby avoiding low quality data imputation. Building on this, a Modality invariant Prompt Disentanglement module decomposes shared prompts into modality specific private prompts to capture intrinsic local correlations and improve representation quality. In addition, a Dynamic Prompt Weighting module computes mutual information based weights from cross attention outputs to adaptively suppress interference from missing modalities. To enhance global consistency, a Multi level Prompt Dynamic Connection module integrates shared prompts with self attention outputs through residual connections, leveraging global prompt priors to strengthen key guidance features. Extensive experiments on three public benchmarks, including CMU MOSI, CMU MOSEI, and CH SIMS, demonstrate that the proposed framework achieves state of the art performance and stable results under diverse missing modality settings. The implementation is available at https://github.com/rongfei-chen/ProMMA

关键词: multimodal sentiment analysis, missing modalities, prompt learning, pretrained models, domain adaptation, robustness, state-of-the-art performance

228. ❌ Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

作者: Jiahua Ma, Yiran Qin, Xin Wen, Yixiong Li, Yuyu Sun, Yulan Guo, Liang Lin, Ruimao Zhang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05544v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于机器人视觉运动策略学习，提出了一种基于扩散模型的闭环框架ReV，用于增强机器人操作的鲁棒性和实时轨迹重规划。虽然属于AI应用领域，但论文内容与所有评分关键词（均围绕大模型、深度学习技术原理及其特定应用方法）无直接关联：未涉及任何语言模型、模型架构、训练方法、推理技术、对齐、代理系统、模型优化或科学AI应用等主题。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于扩散模型的Referring-Aware Visuomotor Policy（ReV）框架，通过耦合扩散头和轨迹引导策略，在仅使用专家演示训练的情况下，实现了机器人操作中对意外情况的闭环适应和实时轨迹重规划，在模拟和真实任务中取得了更高的成功率。

摘要翻译

本文探讨了机器人操作中视觉运动策略学习的一个基础性问题：如何在模型仅依赖原始专家示范进行训练的情况下，增强其在分布外执行误差或动态重规划轨迹时的鲁棒性。我们提出了参照感知视觉运动策略（Referring-Aware Visuomotor Policy, ReV），这是一个能够通过即时整合由人类或高层推理规划器提供的稀疏参照点，以适应未预见情况的闭环框架。具体而言，ReV利用耦合的扩散头来保持标准的任务执行模式，同时通过轨迹引导策略无缝集成稀疏参照。在接收到特定参照点后，全局扩散头首先生成一系列全局一致但时序稀疏的动作锚点，并确定该参照点在此序列中的精确时序位置。随后，局部扩散头根据当前时序位置针对具体任务自适应地插值相邻锚点。这一闭环过程在每一步执行时重复进行，从而实现对场景动态变化的实时轨迹重规划。在实际应用中，ReV无需依赖精细标注，仅通过对专家示范施加针对性扰动进行训练。在无需任何额外数据或微调方案的情况下，ReV在具有挑战性的仿真与真实世界任务中均实现了更高的成功率。

摘要 (Abstract)

This paper addresses a fundamental problem of visuomotor policy learning for robotic manipulation: how to enhance robustness in out-of-distribution execution errors or dynamically re-routing trajectories, where the model relies solely on the original expert demonstrations for training. We introduce the Referring-Aware Visuomotor Policy (ReV), a closed-loop framework that can adapt to unforeseen circumstances by instantly incorporating sparse referring points provided by a human or a high-level reasoning planner. Specifically, ReV leverages the coupled diffusion heads to preserve standard task execution patterns while seamlessly integrating sparse referring via a trajectory-steering strategy. Upon receiving a specific referring point, the global diffusion head firstly generates a sequence of globally consistent yet temporally sparse action anchors, while identifies the precise temporal position for the referring point within this sequence. Subsequently, the local diffusion head adaptively interpolates adjacent anchors based on the current temporal position for specific tasks. This closed-loop process repeats at every execution step, enabling real-time trajectory replanning in response to dynamic changes in the scene. In practice, rather than relying on elaborate annotations, ReV is trained only by applying targeted perturbations to expert demonstrations. Without any additional data or fine-tuning scheme, ReV achieve higher success rates across challenging simulated and real-world tasks.

关键词: visuomotor policy learning, robotic manipulation, closed-loop framework, diffusion models, trajectory replanning, referring points, expert demonstrations, real-time adaptation

229. ❌ EchoAgent: Towards Reliable Echocardiography Interpretation with “Eyes”,“Hands” and “Minds”

作者: Qin Wang, Zhiqing He, Yu Liu, Bowen Guo, Zeju Li, Miao Zhao, Wenhao Ju, Zhiling Luo, Xianhong Shu, Yi Guo, Yuanyuan Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05541v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文提出EchoAgent系统，属于AI for Science（生物信息学/医学AI应用）领域，高度相关（10分）。该系统是代理系统（LLM Agents），核心内容（10分），涉及工具使用（Tool Use）进行分割和测量（8分）。系统包含推理中心，进行可解释推理，涉及Chain of Thought和System 2 Thinking（各8分）。摘要提到multimodal large language models，与LLMs相关（8分），以及explainable inferences与Explainable AI相关（8分）。其他关键词如MoE、SFT、RAG等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该研究提出了EchoAgent代理系统，通过整合知识库、视觉解析和推理中心，实现了类似心脏超声医师的端到端超声心动图解释，在多个数据集上达到最高80.00%的准确率。

摘要翻译

超声心动图（Echo）的可靠解读对于评估心脏功能至关重要，这要求临床医生同步协调多种能力，包括视觉观察（眼）、手动测量（手）以及专业知识学习与推理（脑）。当前针对特定任务的深度学习方法与多模态大语言模型虽已通过自动分割或推理在辅助Echo分析方面展现出潜力，但其仍局限于单一技能组合，即“眼-手”或“眼-脑”，从而限制了临床可靠性与实用性。为解决这些问题，我们提出了EchoAgent——一个专为端到端Echo解读设计的智能体系统，它实现了完全协调的“眼-手-脑”工作流，能够像心脏超声医师一样学习、观察、操作与推理。首先，我们引入了一个专业知识驱动的认知引擎，使智能体能够自动吸收可信的Echo指南并构建为结构化知识库，从而形成定制化的Echo解读思维（mind）。其次，我们设计了一个分层协作工具包，赋予EchoAgent“眼-手”能力，使其能自动解析Echo视频流、识别心脏切面、执行解剖结构分割及定量测量。第三，我们将感知到的多模态证据与专属知识库整合至一个协同推理中心，以进行可解释的推断。我们在涵盖14个心脏解剖区域、48种不同超声心动图切面的CAMUS和MIMIC-EchoQA数据集上评估EchoAgent。实验结果表明，EchoAgent在多种结构分析中均取得最优性能，总体准确率最高达80.00%。重要的是，EchoAgent使单一系统具备了像心脏超声医师一样学习、观察、操作与推理的能力，这为可靠的Echo解读带来了广阔前景。

摘要 (Abstract)

Reliable interpretation of echocardiography (Echo) is crucial for assessing cardiac function, which demands clinicians to synchronously orchestrate multiple capabilities, including visual observation (eyes), manual measurement (hands), and expert knowledge learning and reasoning (minds). While current task-specific deep-learning approaches and multimodal large language models have demonstrated promise in assisting Echo analysis through automated segmentation or reasoning, they remain focused on restricted skills, i.e., eyes-hands or eyes-minds, thereby limiting clinical reliability and utility. To address these issues, we propose EchoAgent, an agentic system tailored for end-to-end Echo interpretation, which achieves a fully coordinated eyes-hands-minds workflow that learns, observes, operates, and reasons like a cardiac sonographer. First, we introduce an expertise-driven cognition engine where our agent can automatically assimilate credible Echo guidelines into a structured knowledge base, thus constructing an Echo-customized mind. Second, we devise a hierarchical collaboration toolkit to endow EchoAgent with eyes-hands, which can automatically parse Echo video streams, identify cardiac views, perform anatomical segmentation, and quantitative measurement. Third, we integrate the perceived multimodal evidence with the exclusive knowledge base into an orchestrated reasoning hub to conduct explainable inferences. We evaluate EchoAgent on CAMUS and MIMIC-EchoQA datasets, which cover 48 distinct echocardiographic views spanning 14 cardiac anatomical regions. Experimental results show that EchoAgent achieves optimal performance across diverse structure analyses, yielding overall accuracy of up to 80.00%. Importantly, EchoAgent empowers a single system with abilities to learn, observe, operate and reason like an echocardiologist, which holds great promise for reliable Echo interpretation.

关键词: EchoAgent, agentic system, echocardiography interpretation, multimodal large language models, explainable inference, AI for medical imaging, cardiac analysis, end-to-end workflow

230. ❌ Cross-Resolution Diffusion Models via Network Pruning

作者: Jiaxuan Ren, Junhan Zhu, Huan Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05524v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究扩散模型在跨分辨率图像生成中的参数修剪方法，与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、对齐、代理等）完全无关。仅与’Quantization OR Model Compression OR Low-bit Weights’有一定关联（5分），因为网络修剪属于模型压缩的一种形式，但论文未涉及量化或低比特权重。

!!! tip deepseek-chat TL;DR

论文针对扩散模型在训练分辨率外生成图像时质量下降的问题，提出了一种通过参数修剪来提升跨分辨率视觉一致性的新方法CR-Diff。

摘要翻译

扩散模型已展现出卓越的图像合成性能，但许多基于UNet的模型仅在特定固定分辨率下训练。当生成超出训练分辨率的图像时，其质量往往下降。我们将此问题归因于分辨率依赖的参数行为：在默认分辨率下表现良好的权重，在空间尺度变化时可能产生负面影响，削弱语义对齐并导致UNet架构的结构不稳定。基于此分析，本文提出CR-Diff——一种通过剪裁扩散模型部分参数以提升跨分辨率视觉一致性的新方法。具体而言，CR-Diff包含两个阶段：首先进行块级剪裁以选择性消除不利权重；随后执行剪裁输出放大，进一步纯化剪裁后的预测结果。实验表明，大量实证研究验证了CR-Diff能在多种扩散骨干网络和未见分辨率上提升感知保真度与语义连贯性，同时基本保持默认分辨率下的性能。此外，CR-Diff支持基于提示词（prompt）的细化调整，可按需实现质量增强。

摘要 (Abstract)

Diffusion models have demonstrated impressive image synthesis performance, yet many UNet-based models are trained at certain fixed resolutions. Their quality tends to degrade when generating images at out-of-training resolutions. We trace this issue to resolution-dependent parameter behaviors, where weights that function well at the default resolution can become adverse when spatial scales shift, weakening semantic alignment and causing structural instability in the UNet architecture. Based on this analysis, this paper introduces CR-Diff, a novel method that improves the cross-resolution visual consistency by pruning some parameters of the diffusion model. Specifically, CR-Diff has two stages. It first performs block-wise pruning to selectively eliminate adverse weights. Then, a pruned output amplification is conducted to further purify the pruned predictions. Empirically, extensive experiments suggest that CR-Diff can improve perceptual fidelity and semantic coherence across various diffusion backbones and unseen resolutions, while largely preserving the performance at default resolutions. Additionally, CR-Diff supports prompt-specific refinement, enabling quality enhancement on demand.

关键词: Diffusion Models, Cross-Resolution Generation, Network Pruning, UNet Architecture, Parameter Behaviors, Semantic Alignment, Structural Instability, CR-Diff

231. ❌ Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

作者: Xuanguang Liu, Lei Ding, Yujie Li, Chenguang Dai, Zhenchao Zhang, Mengmeng Li, Ziyi Yang, Yifan Sun, Yongqi Sun, Hanyun Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05527v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 该论文专注于遥感图像的多模态变化检测，属于AI在科学领域的应用（特别是地球科学/遥感），与关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（评分8分），因为其应用场景（土地监测、灾害评估）属于科学应用范畴。论文提到使用预训练的基础模型（pre-trained foundational models）来获取语义先验，这与’Pre-training OR Continual Pre-training OR Domain Adaptation’有弱关联（评分5分），但论文核心并非大模型技术本身，而是利用其作为特征提取器。其他所有关键词均与大模型技术原理、训练方法、推理优化、代理系统等直接相关，而本文主要研究计算机视觉中的多模态融合和变化检测，未涉及大模型架构、训练、对齐、推理加速等核心技术，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为STSF-Net的框架，用于解决光学和SAR遥感图像在多模态变化检测中跨模态交互不足和模态特性利用不充分的问题，通过联合建模模态特定特征和时空共同特征，并引入基于预训练基础模型语义先验的自适应融合策略，在多个数据集上实现了优于现有方法的性能。

摘要翻译

多模态变化检测（Multimodal Change Detection, MMCD）旨在从多模态遥感（Remote Sensing, RS）数据中识别变化区域，在土地利用监测、灾害评估和城市可持续发展中展现出重要应用价值。然而，现有MMCD方法在跨模态交互和利用模态特异性特征方面存在局限，导致对细粒度变化信息建模不足，从而阻碍了多模态数据中语义变化的精确检测。针对上述问题，我们提出STSF-Net，一个专为光学与合成孔径雷达（SAR）图像间MMCD设计的框架。STSF-Net联合建模模态特异性特征与时空共性特征以增强变化表征。具体而言，通过挖掘模态特异性特征以捕捉真实的语义变化信号，同时嵌入时空共性特征以抑制由成像机制差异引起的伪变化。此外，我们引入一种光学与SAR特征融合策略，该策略基于预训练基础模型获得的语义先验自适应调整特征重要性，实现语义引导的自适应多模态信息融合。另外，我们提出了Delta-SN6数据集，这是首个公开可用的多类别MMCD基准数据集，由甚高分辨率（Very-High-Resolution, VHR）全极化SAR图像与光学图像构成。在Delta-SN6、BRIGHT和Wuhan-Het数据集上的实验结果表明，我们的方法在平均交并比（mIoU）指标上分别以3.21%、1.08%和1.32%优于当前最优（State-of-the-Art, SOTA）方法。相关代码与Delta-SN6数据集将在以下地址发布：https://github.com/liuxuanguang/STSF-Net。

摘要 (Abstract)

Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing (RS) data, demonstrating significant application value in land use monitoring, disaster assessment, and urban sustainable development. However, literature MMCD approaches exhibit limitations in cross-modal interaction and exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes in multimodal data. To address the above problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts feature importance based on semantic priors obtained from pre-trained foundational models, enabling semantic-guided adaptive fusion of multi-modal information. In addition, we introduce the Delta-SN6 dataset, the first openly-accessible multiclass MMCD benchmark consisting of very-high-resolution (VHR) fully polarimetric SAR and optical images. Experimental results on Delta-SN6, BRIGHT, and Wuhan-Het datasets demonstrate that our method outperforms the state-of-the-art (SOTA) by 3.21%, 1.08%, and 1.32% in mIoU, respectively. The associated code and Delta-SN6 dataset will be released at: https://github.com/liuxuanguang/STSF-Net.

关键词: multimodal change detection, optical-SAR images, modality-specific features, spatio-temporal common features, semantic priors, adaptive fusion, STSF-Net, Delta-SN6 dataset

232. ❌ Geometrical Cross-Attention and Nonvoid Voxelization for Efficient 3D Medical Image Segmentation

作者: Chenxin Yuan, Shoupeng Chen, Haojiang Ye, Yiming Miao, Limei Peng, Pin-Han Ho 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05515v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于3D医学图像分割，提出了一种结合几何交叉注意力和非空体素化的高效分割框架GCNV-Net。论文内容与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化、对齐、代理等）完全无关，因为这些关键词主要针对自然语言处理领域的大语言模型技术。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为该论文属于AI在生物医学（医学图像分析）领域的应用，属于AI for Science范畴，因此给予10分（高度相关）。

!!! tip deepseek-chat TL;DR

该研究提出了一种名为GCNV-Net的新型3D医学图像分割框架，通过集成三向动态非空体素变换器、几何交叉注意力模块和非空体素化技术，在多个公开数据集上实现了最先进的准确性和效率平衡，显著降低了计算开销和推理延迟。

摘要翻译

三维医学影像的精确分割对于临床诊断与治疗规划至关重要，然而现有方法往往难以在不同解剖结构与成像模态间同时实现高精度与高计算效率。为解决这些挑战，我们提出GCNV-Net——一种新型三维医学分割框架，其整合了三维动态非空体素变换器（Tri-directional Dynamic Nonvoid Voxel Transformer, 3DNVT）、几何交叉注意力模块（Geometrical Cross-Attention module, GCA）与非空体素化（Nonvoid Voxelization）技术。3DNVT沿三个正交解剖平面（即横断面、矢状面与冠状面）动态划分相关体素，从而有效建模复杂的三维空间依赖关系。GCA机制在多尺度特征融合过程中显式融入几何位置信息，显著提升了细粒度解剖结构的分割精度。同时，非空体素化技术仅处理信息丰富的区域，在保证分割质量的前提下大幅减少冗余计算，与传统体素化方法相比实现了56.13%的浮点运算量（FLOPs）降低与68.49%的推理延迟缩减。我们在多个广泛使用的基准数据集上评估GCNV-Net：BraTS2021、ACDC、MSD Prostate、MSD Pancreas及AMOS2022。实验表明，该方法在所有数据集上均达到最先进的分割性能，在Dice系数上超越现有最佳方法0.65%，交并比（IoU）提升0.63%，归一化表面距离（NSD）提高1%，豪斯多夫距离（HD95）相对降低14.5%。所有结果证明GCNV-Net有效平衡了精度与效率，其在不同器官、疾病状态及成像模态间的鲁棒性凸显出强大的临床部署潜力。

摘要 (Abstract)

Accurate segmentation of 3D medical scans is crucial for clinical diagnostics and treatment planning, yet existing methods often fail to achieve both high accuracy and computational efficiency across diverse anatomies and imaging modalities. To address these challenges, we propose GCNV-Net, a novel 3D medical segmentation framework that integrates a Tri-directional Dynamic Nonvoid Voxel Transformer (3DNVT), a Geometrical Cross-Attention module (GCA), and Nonvoid Voxelization. The 3DNVT dynamically partitions relevant voxels along the three orthogonal anatomical planes, namely the transverse, sagittal, and coronal planes, enabling effective modeling of complex 3D spatial dependencies. The GCA mechanism explicitly incorporates geometric positional information during multi-scale feature fusion, significantly enhancing fine-grained anatomical segmentation accuracy. Meanwhile, Nonvoid Voxelization processes only informative regions, greatly reducing redundant computation without compromising segmentation quality, and achieves a 56.13% reduction in FLOPs and a 68.49% reduction in inference latency compared to conventional voxelization. We evaluate GCNV-Net on multiple widely used benchmarks: BraTS2021, ACDC, MSD Prostate, MSD Pancreas, and AMOS2022. Our method achieves state-of-the-art segmentation performance across all datasets, outperforming the best existing methods by 0.65% on Dice, 0.63% on IoU, 1% on NSD, and relatively 14.5% on HD95. All results demonstrate that GCNV-Net effectively balances accuracy and efficiency, and its robustness across diverse organs, disease conditions, and imaging modalities highlights strong potential for clinical deployment.

关键词: 3D medical image segmentation, Geometrical Cross-Attention, Nonvoid Voxelization, Tri-directional Dynamic Nonvoid Voxel Transformer, computational efficiency, clinical diagnostics, state-of-the-art performance, GCNV-Net

233. ❌ Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

作者: Yanming Xiu, Zhengayuan Jiang, Neil Zhenqiang Gong, Maria Gorlatova 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05510v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是增强现实（AR）中视觉语言模型（VLMs）对矛盾虚拟内容攻击的鲁棒性评估，虽然涉及AI模型评估，但所有关键词都专注于大语言模型（LLMs）及其相关技术（如训练方法、推理技术、优化等），而论文明确研究的是视觉语言模型（VLMs），属于多模态模型而非纯文本大语言模型，因此与所有关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了ContrAR基准，用于评估视觉语言模型在增强现实环境中对矛盾虚拟内容攻击的鲁棒性，发现现有模型在检测和推理对抗性内容方面仍有改进空间。

摘要翻译

增强现实（AR）技术在过去十年中迅速发展。随着AR日益融入日常生活，其安全性与可靠性已成为关键挑战。在各类威胁中，矛盾性虚拟内容攻击通过向用户视野注入恶意或不一致的虚拟元素，构成独特风险——其可能误导用户、引发语义混淆或传递有害信息。本研究系统化建模了此类攻击，并提出ContrAR这一新颖基准，用于评估视觉语言模型（VLMs）在AR环境中抵御虚拟内容篡改与矛盾信息的能力。ContrAR包含312段经10位人类参与者验证的真实AR视频，并进一步对11个VLM（包括商业与开源模型）进行基准测试。实验结果表明，尽管当前VLM对矛盾性虚拟内容展现出合理的理解能力，但在检测和推理AR环境中的对抗性内容篡改方面仍有提升空间。此外，平衡检测精度与处理延迟仍是亟待解决的挑战。

摘要 (Abstract)

Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user’s view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, room still remains for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.

关键词: Augmented Reality, Vision-Language Models, Contradictory Virtual Content Attacks, Robustness Benchmarking, AR Security, Virtual Content Manipulation, Model Evaluation

234. ❌ CLIP-Guided Data Augmentation for Night-Time Image Dehazing

作者: Xining Ge, Weijun Yuan, Gengjia Chang, Xuyang Li, Shuhong Liu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05500v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于夜间图像去雾的计算机视觉任务，使用CLIP进行数据筛选和NAFNet进行训练，属于传统的深度学习应用而非大模型研究。仅与’Pre-training OR Continual Pre-training OR Domain Adaptation’有中等关联（涉及域适应和预训练模型CLIP的使用），与’AI for Science OR Bioinformatics OR Cheminformatics’有弱关联（可视为AI在科学/图像处理领域的应用）。其他关键词均与大模型技术、训练方法、推理优化等无关，故评分为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于CLIP引导数据增强和两阶段训练的框架，用于解决夜间图像去雾中数据稀缺和域偏移的问题，并在NTIRE 2026挑战中实现了有效去雾。

摘要翻译

夜间图像去雾相较于日间场景面临更为复杂的退化模式，这是由于雾霾散射与低光照、非均匀照明以及强光干扰相互耦合所致。在监督数据有限的情况下，这种复杂性加剧了域偏移和训练不稳定性，因为目标域样本稀缺，而简单引入外部数据可能因分布失配削弱适应效果。本文介绍了我们为NTIRE 2026夜间图像去雾挑战赛提出的解决方案，该方案构建为一个统一框架，集成了域对齐数据构建、分阶段训练和推理时增强。具体而言，我们利用预训练的CLIP视觉编码器通过相似度筛选候选外部样本，以构建更接近目标域的训练数据。随后采用NAFNet进行两阶段训练：首先适应目标域，进而扩展至更广泛的退化模式。在推理阶段，结合TLC、x8自集成和加权快照融合技术以提升输出稳定性。本框架未依赖复杂的网络重新设计，而是为夜间图像去雾提供了一条实用且高效的流程路径。

摘要 (Abstract)

Nighttime image dehazing faces a more complex degradation pattern than its daytime counterpart, as haze scattering couples with low illumination, non-uniform lighting, and strong light interference. Under limited supervision, this complexity aggravates domain drift and training instability, since target-domain samples are scarce while naively introducing external data may weaken adaptation due to distribution mismatch. This paper presents our solution to the NTIRE 2026 Night Time Image Dehazing Challenge, built as a unified framework that integrates domain-aligned data construction, stage-wise training, and inference-time enhancement. Specifically, a pre-trained CLIP visual encoder screens candidate external samples by similarity to construct training data closer to the target domain. NAFNet is then trained in two stages, first adapting to the target domain and then expanding to broader degradation patterns. At inference time, TLC, x8 self-ensemble, and weighted snapshot fusion are combined to improve output stability. Rather than relying on complex network redesign, the proposed framework offers a practical and effective pipeline for nighttime image dehazing.

关键词: nighttime image dehazing, CLIP-guided data augmentation, domain adaptation, NAFNet, stage-wise training, inference-time enhancement, NTIRE challenge, low illumination

235. ❌ A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures

作者: Wenbo Zhang, Zekun Long, Zican Liu, Yangchen Zeng, Keyi Hu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05490v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

评分理由: 该论文专注于使用深度学习进行地下缺陷检测的计算机视觉任务，具体针对探地雷达数据中的弱信号问题。论文提出的WSA-Net框架涉及信号处理、注意力机制和轻量级架构设计。所有评分关键词均与大语言模型（LLMs）、其训练/对齐技术、推理方法、代理系统或特定科学AI子领域（如生物信息学）直接相关。本论文的研究内容（雷达信号处理、目标检测）与这些大模型关键词在主题、方法或应用领域上均无直接关联，因此所有关键词的相关度评分为0。

!!! tip deepseek-chat TL;DR

该论文针对探地雷达地下缺陷检测中弱信号衍射双曲线难以识别的问题，提出了一个名为WSA-Net的轻量级检测框架，通过信号保留、杂波抑制等机制增强特征，在RTST数据集上实现了更高的检测精度（0.6958 mAP@0.5）和效率（164 FPS）。

摘要翻译

探地雷达对地下缺陷的检测面临“弱信号”挑战，即信杂比低、波场相似性高且几何形态退化的微弱绕射双曲线。现有轻量化检测器侧重效率而牺牲灵敏度，难以保留低频结构或解耦异质杂波。我们提出WSA-Net框架，通过物理特征重建增强微弱信号特征。该框架超越简单的参数削减，融合四项机制：采用部分卷积保持信号完整性；通过异质分组注意力抑制杂波；几何重建以锐化双曲线弧；上下文锚定解决语义模糊性。在RTST数据集上的评估表明，WSA-Net仅以2.412 M参数量实现0.6958 mAP@0.5精度与164 FPS速度。结果证明，轻量化架构中以信号为核心的感知机制能有效降低基础设施检测中的漏报率。

摘要 (Abstract)

Subsurface defect detection via Ground Penetrating Radar is challenged by “weak signals” faint diffraction hyperbolas with low signal-to-clutter ratios, high wavefield similarity, and geometric degradation. Existing lightweight detectors prioritize efficiency over sensitivity, failing to preserve low-frequency structures or decouple heterogeneous clutter. We propose WSA-Net, a framework designed to enhance faint signatures through physical-feature reconstruction. Moving beyond simple parameter reduction, WSA-Net integrates four mechanisms: Signal preservation using partial convolutions; Clutter suppression via heterogeneous grouping attention; Geometric reconstruction to sharpen hyperbolic arcs; Context anchoring to resolve semantic ambiguities. Evaluations on the RTSTdataset show WSA-Net achieves 0.6958 mAP@0.5 and 164 FPS with only 2.412 M parameters. Results prove that signal-centric awareness in lightweight architectures effectively reduces false negatives in infrastructure inspection.

关键词: subsurface defect detection, ground penetrating radar, weak signal, lightweight detector, WSA-Net, signal-to-clutter ratio, hyperbolic signature, infrastructure inspection

236. ❌ CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

作者: Li Kang, Yutao Fan, Rui Li, Heng Zhou, Yiran Qin, Zhemeng Zhang, Songtao Huang, Xiufeng Song, Zaibin Zhang, Bruno N. Y. Chen, Zhenfei Yin, Dongzhan Zhou, Wangmeng Zuo, Lei Bai 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05484v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究多智能体具身协作系统，核心是提出CoEnv框架，通过组合环境（真实+仿真）实现多机器人协作。与绝大多数关键词（涉及大模型技术原理、训练方法、推理优化等）完全无关。唯一相关的是’Multi-agent Systems OR Agent Coordination’，因为论文专注于多智能体协作，但未涉及大模型驱动的智能体，因此给10分（高度相关，核心内容）。

!!! tip deepseek-chat TL;DR

该论文针对多智能体具身协作中的空间协调、时序推理和共享工作空间感知等挑战，提出了CoEnv框架，通过组合环境实现仿真策略探索和真实世界安全部署，在复杂多臂操作基准测试中实现了高任务成功率和执行效率。

摘要翻译

多智能体具身系统在复杂协同操作领域展现出潜力，但其空间协调、时序推理与共享工作空间感知仍面临关键挑战。受人类协作中认知规划与物理执行分离的启发，我们提出组合环境（compositional environment）的概念——通过真实世界与仿真组件的协同整合，使多个机器人智能体能够在统一决策空间中感知意图并执行操作。基于此概念，我们提出CoEnv框架，该框架利用仿真进行安全策略探索，同时确保在现实世界中的可靠部署。CoEnv通过三个阶段运行：真实到仿真的场景重建（实现物理工作空间的数字化）、基于视觉语言模型（VLM）驱动的动作合成（支持通过高级接口进行实时规划，以及通过代码化轨迹生成进行迭代规划），以及经过验证的仿真到真实迁移（通过碰撞检测确保安全部署）。在具有挑战性的多机械臂操作基准测试中进行的广泛实验表明，CoEnv在实现高任务成功率和执行效率方面具有显著优势，为多智能体具身人工智能建立了新范式。

摘要 (Abstract)

Multi-agent embodied systems hold promise for complex collaborative manipulation, yet face critical challenges in spatial coordination, temporal reasoning, and shared workspace awareness. Inspired by human collaboration where cognitive planning occurs separately from physical execution, we introduce the concept of compositional environment – a synergistic integration of real-world and simulation components that enables multiple robotic agents to perceive intentions and operate within a unified decision-making space. Building on this concept, we present CoEnv, a framework that leverages simulation for safe strategy exploration while ensuring reliable real-world deployment. CoEnv operates through three stages: real-to-sim scene reconstruction that digitizes physical workspaces, VLM-driven action synthesis supporting both real-time planning with high-level interfaces and iterative planning with code-based trajectory generation, and validated sim-to-real transfer with collision detection for safe deployment. Extensive experiments on challenging multi-arm manipulation benchmarks demonstrate CoEnv’s effectiveness in achieving high task success rates and execution efficiency, establishing a new paradigm for multi-agent embodied AI.

关键词: Multi-agent Systems, Embodied AI, Robotic Collaboration, Sim-to-Real Transfer, Compositional Environment, VLM-driven Action Synthesis, Multi-arm Manipulation, Agent Coordination

237. ❌ A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator

作者: Kidus Zewde, Yuchen Zhou, Dennis Ng, Neo Tiangratanakul, Tommy Duong, Ankit Raj, Yuxin Zhang, Xingyu Shen, Simiao Ren 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05475v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	3.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	2.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	1.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究合成眼动数据生成，属于AI在行为建模和认知科学中的应用，与"AI for Science"有一定相关性（5分）。论文提到大视觉语言模型（Large vision-language models），与"Large Language Models"有间接关联（3分）。论文涉及数据稀缺问题，与"Scaling Laws AND Data Quality"中的数据质量方面有微弱联系（2分）。论文提到预训练，但非核心（1分）。其他关键词（如MoE、SFT、RAG等）与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了一种通过3D模拟器生成合成眼动视频数据的管道，用于解决行为模态数据稀缺问题，并发布了用于脚本阅读检测的数据集和评估工具。

摘要翻译

大型视觉语言模型通过海量互联网规模数据的训练已展现出卓越能力，但一个根本性的不对称问题依然存在：尽管大语言模型能够利用丰富的文本和图像数据进行自监督预训练，但许多行为模态却无法实现同等条件。基于视频的行为数据——包括手势、眼动、社交信号等——仍然稀缺、标注成本高昂且涉及隐私敏感问题。一种具有前景的替代方案是仿真技术：通过受控的合成生成替代真实数据采集，从而大规模生产自动标注的数据。
我们为此范式构建了一套应用于眼动行为的基础设施。眼动作为一种行为信号，在视觉语言建模、虚拟现实、机器人技术、无障碍系统及认知科学等领域具有广泛应用。我们提出了一种生成合成标注眼动视频的流程：首先从参考视频中提取真实人眼虹膜运动轨迹，随后通过无头浏览器自动化技术在三维眼动仿真器中重放这些轨迹。将该流程应用于视频面试场景下的脚本阅读检测任务，我们发布了final_dataset_v1数据集：包含144个会话（72个阅读场景，72个对话场景），总计12小时、25帧/秒的合成眼动视频。
评估表明，生成的轨迹保留了源数据的时间动态特性（所有指标的KS检验D值均小于0.14）。通过逐帧匹配对比发现，三维仿真器在阅读级运动范围内表现出有限的灵敏度，这归因于缺乏耦合的头部运动——这一发现为未来仿真器设计提供了参考。我们公开了该流程、数据集及评估工具，以支持行为建模与视觉语言系统交叉领域下游行为分类器的开发。

摘要 (Abstract)

Large vision-language models have achieved remarkable capabilities by training on massive internet-scale data, yet a fundamental asymmetry persists: while LLMs can leverage self-supervised pretraining on abundant text and image data, the same is not true for many behavioral modalities. Video-based behavioral data – gestures, eye movements, social signals – remains scarce, expensive to annotate, and privacy-sensitive. A promising alternative is simulation: replace real data collection with controlled synthetic generation to produce automatically labeled data at scale. We introduce infrastructure for this paradigm applied to eye movement, a behavioral signal with applications across vision-language modeling, virtual reality, robotics, accessibility systems, and cognitive science. We present a pipeline for generating synthetic labeled eye movement video by extracting real human iris trajectories from reference videos and replaying them on a 3D eye movement simulator via headless browser automation. Applying this to the task of script-reading detection during video interviews, we release final_dataset_v1: 144 sessions (72 reading, 72 conversation) totaling 12 hours of synthetic eye movement video at 25fps. Evaluation shows that generated trajectories preserve the temporal dynamics of the source data (KS D < 0.14 across all metrics). A matched frame-by-frame comparison reveals that the 3D simulator exhibits bounded sensitivity at reading-scale movements, attributable to the absence of coupled head movement – a finding that informs future simulator design. The pipeline, dataset, and evaluation tools are released to support downstream behavioral classifier development at the intersection of behavioral modeling and vision-language systems.

关键词: synthetic eye movement, behavioral data, 3D simulator, vision-language models, script-reading detection, data generation pipeline, behavioral modeling, eye movement video

238. ❌ Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning

作者: Kang Ding, Hongsong Wang, Jie Gui, Lei He 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05449v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文《Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning》专注于自动驾驶领域，提出了一种名为GameAD的风险感知博弈规划框架，用于解决多智能体交互中的风险优先级问题。论文的核心内容包括风险感知拓扑锚定、战略负载适配器、极小极大风险感知稀疏注意力等机制，以及规划风险暴露度量。虽然论文涉及多智能体系统和注意力机制，但其研究内容与提供的关键词列表（主要围绕大语言模型、深度学习技术原理及其在科学领域的应用）完全无关。所有关键词均未在论文标题或摘要中出现，也未隐含相关概念，因此所有关键词的相关度评分均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为GameAD的风险感知博弈规划框架，用于解决自动驾驶中多智能体交互的风险优先级问题，实验表明该方法在轨迹安全性方面显著优于现有方法。

摘要翻译

端到端自动驾驶的核心并非感知与规划的简单集成，而在于统一表征空间内的动态多智能体博弈。现有大多数端到端模型对所有交通参与者进行均等处理，这阻碍了从复杂背景中解耦真实碰撞威胁的能力。为解决这一问题，我们引入了风险优先博弈规划的概念，并提出GameAD这一新颖框架，将端到端自动驾驶建模为风险感知的博弈问题。GameAD集成了风险感知拓扑锚定、策略负载适配器、极小化极大风险感知稀疏注意力以及风险一致均衡稳定化模块，以实现基于风险优先级交互的博弈论决策。我们还提出了规划风险暴露度指标，用于量化长时域内规划轨迹的累积风险强度，以保障自动驾驶的安全性。在nuScenes和Bench2Drive数据集上的大量实验表明，我们的方法显著优于现有最优方法，尤其在轨迹安全性方面表现突出。

摘要 (Abstract)

End-to-end autonomous driving resides not in the integration of perception and planning, but rather in the dynamic multi-agent game within a unified representation space. Most existing end-to-end models treat all agents equally, hindering the decoupling of real collision threats from complex backgrounds. To address this issue, We introduce the concept of Risk-Prioritized Game Planning, and propose GameAD, a novel framework that models end-to-end autonomous driving as a risk-aware game problem. The GameAD integrates Risk-Aware Topology Anchoring, Strategic Payload Adapter, Minimax Risk-Aware Sparse Attention, and Risk Consistent Equilibrium Stabilization to enable game theoretic decision making with risk prioritized interactions. We also present the Planning Risk Exposure metric, which quantifies the cumulative risk intensity of planned trajectories over a long horizon for safe autonomous driving. Extensive experiments on the nuScenes and Bench2Drive datasets show that our approach significantly outperforms state-of-the-art methods, especially in terms of trajectory safety.

关键词: autonomous driving, multi-agent game, risk-prioritized planning, GameAD, risk-aware attention, trajectory safety, end-to-end planning, risk exposure metric

239. ❌ Human Interaction-Aware 3D Reconstruction from a Single Image

作者: Gwanghyun Kim, Junghun James Kim, Suh Yoon Jeon, Jason Park, Se Young Chun 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05436v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的3D人体重建，特别是多人物交互场景下的重建问题。论文使用了扩散模型、几何优化、物理先验等计算机视觉技术，但完全不涉及大语言模型、深度学习技术原理创新、AI for Science等关键词领域。所有关键词均与大模型、深度学习技术原理、科学AI应用无关，因此全部评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种从单张图像重建交互多人3D人体模型的方法，通过结合群体上下文建模和物理交互先验，解决了现有方法在多人场景中产生的重叠、遮挡和交互失真问题，显著提升了重建的物理合理性和保真度。

摘要翻译

从单张图像重建具有纹理的三维人体模型是增强现实/虚拟现实（AR/VR）与数字人应用的基础。然而，现有方法大多聚焦于单人场景，因此在多人场景中表现不佳；简单组合多个独立重建结果常导致诸如不真实的重叠、被遮挡区域几何信息缺失以及交互关系扭曲等伪影。这些局限凸显了需要引入群体级上下文与交互先验的方法。我们提出了一种整体性方法，显式地对群体级和实例级信息进行建模。为减轻透视引起的几何畸变，我们首先将输入转换至规范的正交投影空间（canonical orthographic space）。我们的核心组件——人体群体-实例多视图扩散模型（Human Group-Instance Multi-View Diffusion, HUG-MVD）——随后通过联合建模个体与群体上下文来生成完整的多视图法线贴图与图像，以解决遮挡与邻近问题。接着，人体群体-实例几何重建模块（Human Group-Instance Geometric Reconstruction, HUG-GR）利用显式的、基于物理的交互先验来优化几何形状，以确保物理合理性并精确建模人际接触。最后，多视图图像被融合成高保真纹理。这些组件共同构成了我们的完整框架——HUG3D。大量实验表明，HUG3D显著优于单人与现有多人重建方法，能够从单张图像中生成物理合理、高保真的交互人体三维重建结果。项目页面：https://jongheean11.github.io/HUG3D_project

摘要 (Abstract)

Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry by leveraging explicit, physics-based interaction priors to enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image. Project page: https://jongheean11.github.io/HUG3D_project

关键词: 3D human reconstruction, multi-human scenes, interaction modeling, diffusion models, physics-based priors, occlusion handling, textured 3D models, single image reconstruction

240. ❌ Few-Shot Semantic Segmentation Meets SAM3

作者: Yi-Jen Tsai, Yen-Yu Lin, Chien-Yao Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05433v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究Few-Shot Semantic Segmentation (FSS)与Segment Anything Model 3 (SAM3)的结合，属于计算机视觉领域，特别是图像分割任务。所有评分关键词均针对大语言模型(LLMs)及其相关技术（如训练、对齐、推理优化、代理系统等），而本文专注于视觉基础模型（SAM3）在少样本分割中的应用，未涉及任何语言模型技术、训练方法、推理加速或代理系统等内容，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文研究了如何利用Segment Anything Model 3 (SAM3)作为免训练解决方案来解决少样本语义分割问题，通过简单的空间拼接策略实现了最先进的性能，并发现负提示在少样本设置中可能适得其反。

摘要翻译

小样本语义分割（Few-Shot Semantic Segmentation, FSS）旨在仅通过少量标注样本分割新物体类别。现有方法大多依赖大量情景式训练来学习可迁移的表征，这种方式计算成本高昂且对分布偏移敏感。本研究从现代视觉基础模型的角度重新审视FSS，探索了Segment Anything Model 3（SAM3）作为免训练解决方案的潜力。通过重新利用其可提示概念分割（Promptable Concept Segmentation, PCS）能力，我们采用简单的空间拼接策略，将支持图像与查询图像置于共享画布中，使完全冻结的SAM3无需任何微调或架构修改即可执行分割。在PASCAL-$5^i$和COCO-$20^i$数据集上的实验表明，这种极简设计已取得最先进的性能，超越了许多经过复杂设计的方法。除实证效果外，我们发现负向提示在小样本场景中可能产生反效果：尽管其本意是抑制干扰物，却往往会弱化目标表征并导致预测崩溃。这些发现表明，强大的跨图像推理能力可以通过简单的空间组合方式实现，同时也揭示了当前基础模型在处理冲突提示信号时的局限性。代码地址：https://github.com/WongKinYiu/FSS-SAM3

摘要 (Abstract)

Few-Shot Semantic Segmentation (FSS) focuses on segmenting novel object categories from only a handful of annotated examples. Most existing approaches rely on extensive episodic training to learn transferable representations, which is both computationally demanding and sensitive to distribution shifts. In this work, we revisit FSS from the perspective of modern vision foundation models and explore the potential of Segment Anything Model 3 (SAM3) as a training-free solution. By repurposing its Promptable Concept Segmentation (PCS) capability, we adopt a simple spatial concatenation strategy that places support and query images into a shared canvas, allowing a fully frozen SAM3 to perform segmentation without any fine-tuning or architectural changes. Experiments on PASCAL-$5^i$ and COCO-$20^i$ show that this minimal design already achieves state-of-the-art performance, outperforming many heavily engineered methods. Beyond empirical gains, we uncover that negative prompts can be counterproductive in few-shot settings, where they often weaken target representations and lead to prediction collapse despite their intended role in suppressing distractors. These findings suggest that strong cross-image reasoning can emerge from simple spatial formulations, while also highlighting limitations in how current foundation models handle conflicting prompt signals. Code at: https://github.com/WongKinYiu/FSS-SAM3

关键词: Few-Shot Semantic Segmentation, Segment Anything Model 3, SAM3, Promptable Concept Segmentation, Training-free Solution, Spatial Concatenation, Negative Prompts, Cross-image Reasoning

241. ❌ Cross-Stage Attention Propagation for Efficient Semantic Segmentation

作者: Beoungwoo Kang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05431v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的语义分割任务，提出了一种高效的解码器框架（Cross-Stage Attention Propagation），旨在减少多尺度注意力计算中的冗余。论文内容与所有评分关键词（均围绕大模型、深度学习技术原理及其在科学领域的应用）完全无关，未涉及任何大模型、语言模型、对齐、微调、推理、代理、压缩等主题，也未涉及生物信息学或科学AI应用。

!!! tip deepseek-chat TL;DR

该论文针对语义分割中多尺度解码器注意力计算冗余的问题，提出了跨阶段注意力传播（CSAP）框架，通过在最深特征尺度计算注意力并传播到浅层，显著降低了计算成本，并在多个数据集上取得了更好的性能与效率平衡。

摘要翻译

近期轻量化语义分割方法通过将紧凑骨干网络与高效解码器头相结合取得了显著进展。然而，大多数多尺度解码器在各特征尺度上独立计算注意力，由于跨尺度的注意力分布存在强相关性，这引入了大量冗余。我们提出跨阶段注意力传播（Cross-Stage Attention Propagation, CSAP），这是一种在深层特征尺度计算注意力，并将生成的注意力图传播至较浅阶段的解码器框架，从而完全绕过这些阶段的查询-键值计算。该设计在保留多尺度上下文推理能力的同时，显著降低了解码器的计算成本。CSAP-Tiny 在 ADE20K 数据集上仅用 5.5 GFLOPs 即达到 42.9% mIoU，在 Cityscapes 数据集上用 21.5 GFLOPs 达到 80.5%，在 COCO-Stuff 164K 数据集上用 5.5 GFLOPs 达到 40.9% mIoU，其在 ADE20K 上超越 SegNeXt-Tiny 达 +1.8%，同时所需浮点运算量减少 16.8%。

摘要 (Abstract)

Recent lightweight semantic segmentation methods have made significant progress by combining compact backbones with efficient decoder heads. However, most multi-scale decoders compute attention independently at each feature scale, introducing substantial redundancy since the resulting attention distributions across scales are strongly correlated. We propose Cross-Stage Attention Propagation (CSAP), a decoder framework that computes attention at the deepest feature scale and propagates the resulting attention maps to shallower stages, bypassing query-key computation at those stages entirely. This design preserves multi-scale contextual reasoning while substantially reducing the decoder’s computational cost. CSAP-Tiny achieves 42.9% mIoU on ADE20K with only 5.5 GFLOPs, 80.5% on Cityscapes with 21.5 GFLOPs, and 40.9% on COCO-Stuff 164K with 5.5 GFLOPs, surpassing SegNeXt-Tiny by +1.8% on ADE20K while requiring 16.8% fewer floating-point operations.

关键词: semantic segmentation, attention propagation, multi-scale decoder, computational efficiency, lightweight model, cross-stage attention, feature scales, redundancy reduction

242. ❌ VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

作者: Honghao Fu, Miao Xu, Yiwei Wang, Dailing Zhang, Liu Jun, Yujun Cai 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05418v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究长视频理解中的检索增强生成（RAG）框架，与’Retrieval-Augmented Generation’高度相关（10分）。论文针对多模态大语言模型（MLLMs）处理长视频时受限于上下文窗口的问题，与’Large Language Models’和’Context Window Extension’相关（各8分）。论文强调从扁平化语义匹配转向结构化、意图感知的推理，涉及推理过程，与’Chain of Thought’和’System 2 Thinking’有一定关联（各5分）。其他关键词如MoE、量化、对齐等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了VideoStir框架，通过时空图结构和意图感知的检索增强生成来解决多模态大语言模型在长视频理解中上下文窗口有限的问题，实验表明其在不依赖辅助信息的情况下与最先进基线方法竞争。

摘要翻译

将多模态大语言模型（MLLMs）扩展至长视频领域受到有限上下文窗口的制约。检索增强生成（RAG）通过将查询相关的视觉证据组织为紧凑的上下文，是一种有前景的解决方案，但现有方法大多（i）将视频扁平化为独立片段，破坏了其固有的时空结构，且（ii）依赖于显式的语义匹配，可能遗漏与查询意图隐式相关的线索。为克服这些局限，我们提出了VideoStir——一个结构化且意图感知的长视频RAG框架。该框架首先在片段级别将视频构建为时空图，随后通过多跳检索聚合跨距离但语境相关的事件证据。此外，它引入了一个基于MLLM的意图相关性评分器，依据帧与查询推理意图的对齐程度进行检索。为支持此功能，我们构建了IR-600K，一个专为学习帧-查询意图对齐而定制的大规模数据集。实验表明，VideoStir在不依赖辅助信息的情况下可与最先进的基线方法竞争，这凸显了将长视频RAG从扁平化语义匹配转向结构化、意图感知推理的潜力。代码与模型权重已发布于Github。

摘要 (Abstract)

Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) is a promising remedy by organizing query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query’s intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It firstly structures a video as a spatio-temporal graph at clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query’s reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Codes and checkpoints are available at Github.

关键词: Multimodal Large Language Models, Long Video Understanding, Retrieval-Augmented Generation, Spatio-Temporal Graph, Intent-Aware Reasoning, Context Window Limitation, Multi-hop Retrieval, IR-600K Dataset

243. ❌ Learning to Synergize Semantic and Geometric Priors for Limited-Data Wheat Disease Segmentation

作者: Shijie Wang, Zijian Wang, Yadan Luo, Scott Chapman, Xin Yu, Zi Huang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05415v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于农业领域的计算机视觉任务（小麦病害分割），属于AI for Science的应用范畴，因此与’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分）。论文使用了预训练的DINOv2模型，并设计了适配器进行微调，这与’Pre-training OR Continual Pre-training OR Domain Adaptation’（5分）和’PEFT OR LoRA OR Parameter-efficient Fine-tuning’（5分）有一定关联。然而，论文的核心是视觉模型（DINOv2和SAM）的应用与集成，并未涉及大语言模型（LLMs）、MoE、推理、对齐、RAG、代理等关键词所描述的技术，因此这些关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为SGPer的语义-几何先验协同框架，通过结合预训练的DINOv2和SAM模型，并设计疾病敏感适配器，有效解决了数据有限条件下小麦病害分割因生长阶段外观变化带来的挑战，在多个基准测试中取得了最先进的性能。

摘要翻译

小麦病害分割是精准农业的基础任务，但其面临的主要挑战在于病害类别在作物不同生长阶段存在显著的类内时序外观差异。这种大幅度的外观变化使得从头收集具有代表性的训练数据集既费力又不切实际。为此，我们提出SGPer——一个语义-几何先验协同框架，将有限数据下的小麦病害分割视为病害特异性语义感知与病害边界定位的耦合任务。我们的核心见解是：预训练的DINOv2模型能够提供鲁棒的类别感知语义先验以处理外观变化，这些先验可转化为粗略的空间提示来引导SAM模型实现病害边界的精确定位。具体而言，SGPer设计了配备多个病害友好型滤波器的病害敏感适配器，并将其同时嵌入DINOv2与SAM中，以使它们的预训练表征与病害特异性特征对齐。为实现这种协同，SGPer将DINOv2提取的特征转化为密集的、类别特定的点提示，以确保对所有病害区域的全面空间覆盖。随后，为消除提示冗余并确保高精度掩模生成，该框架通过交叉参考SAM的迭代掩模置信度与DINOv2衍生的类别特定语义一致性，动态筛选这些密集候选提示。最终，SGPer提炼出高信息量的提示集合来激活SAM的几何先验，实现精确且鲁棒的分割，并严格保持对时序外观变化的不变性。大量实验评估表明，SGPer在小麦病害及器官分割基准测试中持续取得最先进的性能，尤其在数据受限场景下表现突出。

摘要 (Abstract)

Wheat disease segmentation is fundamental to precision agriculture but faces severe challenges from significant intra-class temporal variations across growth stages. Such substantial appearance shifts make collecting a representative dataset for training from scratch both labor-intensive and impractical. To address this, we propose SGPer, a Semantic-Geometric Prior Synergization framework that treats wheat disease segmentation under limited data as a coupled task of disease-specific semantic perception and disease boundary localization. Our core insight is that pretrained DINOv2 provides robust category-aware semantic priors to handle appearance shifts, which can be converted into coarse spatial prompts to guide SAM for the precise localization of disease boundaries. Specifically, SGPer designs disease-sensitive adapters with multiple disease-friendly filters and inserts them into both DINOv2 and SAM to align their pretrained representations with disease-specific characteristics. To operationalize this synergy, SGPer transforms DINOv2-derived features into dense, category-specific point prompts to ensure comprehensive spatial coverage of all disease regions. To subsequently eliminate prompt redundancy and ensure highly accurate mask generation, it dynamically filters these dense candidates by cross-referencing SAM’s iterative mask confidence with the category-specific semantic consistency derived from DINOv2. Ultimately, SGPer distills a highly informative set of prompts to activate SAM’s geometric priors, achieving precise and robust segmentation that remains strictly invariant to temporal appearance changes. Extensive evaluations demonstrate that SGPer consistently achieves state-of-the-art performance on wheat disease and organ segmentation benchmarks, especially in data-constrained scenarios.

关键词: wheat disease segmentation, semantic-geometric prior, limited data, DINOv2, SAM, adapter, precision agriculture, computer vision

244. ❌ Training Without Orthogonalization, Inference With SVD: A Gradient Analysis of Rotation Representations

作者: Chris Choy 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05414v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于深度学习中的旋转表示和正交化方法的梯度分析，特别是针对3×3矩阵和SO(3)投影的SVD正交化。论文内容涉及深度学习训练优化、梯度分析、矩阵分解等基础技术，但完全不涉及大语言模型（LLMs）、大模型技术原理、AI for Science应用或任何评分关键词中列出的具体大模型相关技术（如MoE、Scaling Laws、RLHF、RAG、Agent等）。论文属于深度学习基础理论研究，与评分关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文通过梯度分析解释了为什么在深度学习旋转估计中，训练时移除SVD正交化而仅在推理时应用SVD能获得更好效果，并证明了9D表示优于6D Gram-Schmidt的理论原因。

摘要翻译

近期研究表明，在训练过程中移除正交化步骤，仅在推理阶段应用该操作，能够提升深度学习中的旋转估计性能，且实验证据支持采用带奇异值分解（SVD）投影的9维表示方法。然而，对于为何SVD正交化会特别阻碍训练过程，以及为何在推理阶段应优先选择SVD而非格拉姆-施密特（Gram-Schmidt）正交化，其理论理解尚不完善。本文针对$3 \times 3$矩阵及$SO(3)$投影场景，对SVD正交化进行了详细的梯度分析。我们的核心结果推导出了SVD反向传播雅可比矩阵的精确谱：该矩阵秩为$3$（与$SO(3)$维度一致），非零奇异值为$2/(s_i + s_j)$，条件数$κ= (s_1 + s_2)/(s_2 + s_3)$，这导致了可量化的梯度失真，且在预测矩阵远离$SO(3)$时（例如训练初期$s_3 \approx 0$的情况）最为严重。我们进一步证明，即使经过稳定的SVD梯度计算仍会引入梯度方向误差，而从训练循环中移除SVD则可完全避免这种权衡。同时，我们证明了6维格拉姆-施密特正交化的雅可比矩阵具有非对称谱：其参数接收到不均衡的梯度信号，这解释了为何9维参数化更为可取。综上，这些结果为采用直接9维回归进行训练、仅在推理阶段应用SVD投影的方法提供了理论基础。

摘要 (Abstract)

Recent work has shown that removing orthogonalization during training and applying it only at inference improves rotation estimation in deep learning, with empirical evidence favoring 9D representations with SVD projection. However, the theoretical understanding of why SVD orthogonalization specifically harms training, and why it should be preferred over Gram-Schmidt at inference, remains incomplete. We provide a detailed gradient analysis of SVD orthogonalization specialized to $3 \times 3$ matrices and $SO(3)$ projection. Our central result derives the exact spectrum of the SVD backward pass Jacobian: it has rank $3$ (matching the dimension of $SO(3)$) with nonzero singular values $2/(s_i + s_j)$ and condition number $κ= (s_1 + s_2)/(s_2 + s_3)$, creating quantifiable gradient distortion that is most severe when the predicted matrix is far from $SO(3)$ (e.g., early in training when $s_3 \approx 0$). We further show that even stabilized SVD gradients introduce gradient direction error, whereas removing SVD from the training loop avoids this tradeoff entirely. We also prove that the 6D Gram-Schmidt Jacobian has an asymmetric spectrum: its parameters receive unequal gradient signal, explaining why 9D parameterization is preferable. Together, these results provide the theoretical foundation for training with direct 9D regression and applying SVD projection only at inference.

关键词: rotation estimation, SVD orthogonalization, gradient analysis, SO(3) projection, 9D representation, Gram-Schmidt, backward pass Jacobian, training optimization

245. ❌ CRISP: Rank-Guided Iterative Squeezing for Robust Medical Image Segmentation under Domain Shift

作者: Yizhou Fang, Pujin Cheng, Yixiang Liu, Xiaoying Tang, Longxi Zhou 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05409v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于医学图像分割中的领域适应问题，提出了CRISP框架。与关键词的相关性分析如下：1）与"AI for Science OR Bioinformatics OR Cheminformatics"高度相关（10分），因为论文直接应用于生物医学成像（心脏MRI和肺部CT），属于AI在科学领域的应用；2）与"Pre-training OR Continual Pre-training OR Domain Adaptation"有一定关联（8分），因为论文核心是解决分布偏移（domain shift）问题，属于领域适应范畴，但未涉及预训练或持续预训练技术；3）其他关键词（如LLMs、MoE、RLHF等）均与论文内容无关（0分），因为论文未涉及大语言模型、专家混合、强化学习等技术，而是专注于传统的医学图像分割方法。

!!! tip deepseek-chat TL;DR

该论文针对医学图像分割中的分布偏移问题，提出了基于秩稳定性的CRISP框架，通过无参数、模型无关的方法显著提升了多中心、人口统计和模态偏移下的分割鲁棒性。

摘要翻译

医学影像中的分布偏移问题，仍是制约医疗人工智能临床转化的核心瓶颈。若未能有效应对，将导致模型在未见环境中性能严重下降，并加剧健康不平等。现有的域适应方法本质上受限于通过模拟偏移或伪监督穷举预定义可能性，这类策略在开放且不可预测的真实世界中难以应对近乎无限的分布偏移。为应对这一挑战，我们提出一项称为“阳性区域秩稳定性”的经验规律，其指出在分布偏移下，阳性体素预测概率的相对排序保持稳定。基于此原则，我们提出了CRISP——一个无需参数、与模型无关且不依赖目标域信息的框架。CRISP是首个基于排序而非概率进行分割的框架。它通过潜在特征扰动模拟分布偏移下的模型行为，其中体素概率排序呈现两种稳定模式：始终维持高概率的区域（根据规律视为“注定阳性”）以及持续保持低概率的区域（可安全判为阴性）。基于这些模式，我们构建了高精度（HP）与高召回率（HR）先验，并在扰动下递归优化它们。随后设计了一种迭代训练框架，使HP与HR逐步“挤压”至最终分割结果。在多中心心脏MRI和基于CT的肺血管分割上的广泛评估表明，CRISP具有卓越的鲁棒性，在多中心、人口统计学和模态偏移场景中，其HD95指标分别显著降低了0.14（提升7.0%）、1.90（提升13.1%）和8.39（提升38.9%）像素，明显优于现有最优方法。

摘要 (Abstract)

Distribution shift in medical imaging remains a central bottleneck for the clinical translation of medical AI. Failure to address it can lead to severe performance degradation in unseen environments and exacerbate health inequities. Existing methods for domain adaptation are inherently limited by exhausting predefined possibilities through simulated shifts or pseudo-supervision. Such strategies struggle in the open-ended and unpredictable real world, where distribution shifts are effectively infinite. To address this challenge, we introduce an empirical law called Rank Stability of Positive Regions'', which states that the relative rank of predicted probabilities for positive voxels remains stable under distribution shift. Guided by this principle, we propose CRISP, a parameter-free and model-agnostic framework requiring no target-domain information. CRISP is the first framework to make segmentation based on rank rather than probabilities. CRISP simulates model behavior under distribution shift via latent feature perturbation, where voxel probability rankings exhibit two stable patterns: regions that consistently retain high probabilities (destined positives according to the principle) and those that remain low-probability (can be safely classified as negatives). Based on these patterns, we construct high-precision (HP) and high-recall (HR) priors and recursively refine them under perturbation. We then design an iterative training framework, making HP and HR progressively squeeze’’ to the final segmentation. Extensive evaluations on multi-center cardiac MRI and CT-based lung vessel segmentation demonstrate CRISP’s superior robustness, significantly outperforming state-of-the-art methods with striking HD95 reductions of up to 0.14 (7.0% improvement), 1.90 (13.1% improvement), and 8.39 (38.9% improvement) pixels across multi-center, demographic, and modality shifts, respectively.

关键词: medical image segmentation, domain shift, distribution shift, rank stability, parameter-free framework, multi-center evaluation, robustness improvement, cardiac MRI

246. ❌ Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

作者: Hongsheng Li, Lingfeng Zhang, Zexian Yang, Liang Li, Rong Yin, Xiaoshuai Hao, Wenbo Ding 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05405v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于多模态3D目标检测，特别是LiDAR和4D雷达在恶劣天气条件下的融合。其核心贡献是提出了一种天气条件分支路由框架，通过条件令牌和轻量级路由器动态调整模态偏好。论文与绝大多数关键词（涉及大模型、训练技术、推理方法、对齐、代理等）完全无关，因为这些关键词主要针对自然语言处理和通用大模型领域。唯一略有相关的是’Mechanistic Interpretability OR Explainable AI’，因为论文提到其方法提供了’高度可解释的见解’和’透明地揭示’模态偏好，但这并非论文的核心焦点，只是附带特性，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出了一种天气条件分支路由框架，用于在恶劣天气下动态融合LiDAR和4D雷达数据，以提升3D目标检测的鲁棒性，并在K-Radar基准上实现了最先进的性能。

摘要翻译

在恶劣天气条件下实现鲁棒的三维物体检测极具挑战性，这主要源于不同传感器可靠性的动态变化。现有的激光雷达-4D雷达融合方法虽提升了鲁棒性，但大多依赖固定或弱自适应的处理流程，无法随环境条件变化动态调整模态偏好。为弥补这一不足，我们将多模态感知重新定义为一种天气条件引导的分支路由问题。我们的框架并未计算单一融合输出，而是显式维护三个并行的三维特征流：一个纯激光雷达分支、一个纯4D雷达分支以及一个条件门控融合分支。通过从视觉与语义提示中提取的条件令牌引导，一个轻量级路由器动态预测样本特异性权重，以软聚合方式整合这些表征。此外，为防止分支退化，我们引入了一种天气监督学习策略，结合辅助分类与多样性正则化，以强制形成具有区分度且依赖条件的分支路由行为。在K-Radar基准数据集上的大量实验表明，我们的方法取得了最先进的性能。更重要的是，该方法为模态偏好提供了显式且高度可解释的洞察，透明地揭示了自适应路由如何在多样化的恶劣天气场景中，鲁棒地切换激光雷达与4D雷达之间的依赖关系。源代码将予以公开。

摘要 (Abstract)

Robust 3D object detection in adverse weather is highly challenging due to the varying reliability of different sensors. While existing LiDAR-4D radar fusion methods improve robustness, they predominantly rely on fixed or weakly adaptive pipelines, failing to dy-namically adjust modality preferences as environmental conditions change. To bridge this gap, we reformulate multi-modal perception as a weather-conditioned branch routing problem. Instead of computing a single fused output, our framework explicitly maintains three parallel 3D feature streams: a pure LiDAR branch, a pure 4D radar branch, and a condition-gated fusion branch. Guided by a condition token extracted from visual and semantic prompts, a lightweight router dynamically predicts sample-specific weights to softly aggregate these representations. Furthermore, to prevent branch collapse, we introduce a weather-supervised learning strategy with auxiliary classification and diversity regularization to enforce distinct, condition-dependent routing behaviors. Extensive experiments on the K-Radar benchmark demonstrate that our method achieves state-of-the-art performance. Furthermore, it provides explicit and highly interpretable insights into modality preferences, transparently revealing how adaptive routing robustly shifts reliance between LiDAR and 4D radar across diverse adverse-weather scenarios. The source code with be released.

关键词: 3D object detection, LiDAR-radar fusion, adverse weather, branch routing, weather-conditioned, modality preference, robust perception, K-Radar benchmark

247. ❌ LSGS-Loc: Towards Robust 3DGS-Based Visual Localization for Large-Scale UAV Scenarios

作者: Xiang Zhang, Tengfei Wang, Fang Xu, Xin Wang, Zongqian Zhan 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05402v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的3D高斯泼溅（3DGS）技术在大规模无人机场景中的视觉定位问题，提出了LSGS-Loc方法，包括尺度感知位姿初始化和拉普拉斯可靠性掩码机制。所有评分关键词均与大语言模型（LLM）、深度学习技术原理创新或AI在科学领域的应用直接相关，而本文研究的是3D场景表示和视觉定位，属于传统计算机视觉范畴，与评分关键词列表中的大模型技术、训练方法、推理优化、AI for Science等主题完全无关。

!!! tip deepseek-chat TL;DR

该论文针对大规模无人机场景中基于3D高斯泼溅的视觉定位方法存在的位姿初始化困难和渲染伪影敏感问题，提出了LSGS-Loc方法，通过尺度感知位姿初始化和拉普拉斯可靠性掩码机制，在基准测试中实现了最先进的精度和鲁棒性。

摘要翻译

大规模无人机场景中的视觉定位是自主系统的关键能力，但由于几何复杂性和环境变化，该任务仍具挑战性。尽管三维高斯泼溅（3DGS）已成为一种前景广阔的场景表示方法，但现有基于3DGS的视觉定位方法在大规模场景中仍面临鲁棒的姿态初始化和对渲染伪影敏感的难题。为应对这些局限，我们提出了LSGS-Loc，一种专为大规模3DGS场景设计的新型视觉定位流程。具体而言，我们引入了一种尺度感知的姿态初始化策略，该策略将场景无关的相对姿态估计与显式的3DGS尺度约束相结合，从而无需针对特定场景进行训练即可实现几何基础扎实的定位。此外，在姿态优化阶段，为减轻模糊和漂浮物等重建伪影的影响，我们开发了一种基于拉普拉斯算子的可靠性掩蔽机制，引导光度优化聚焦于高质量区域。在大规模无人机基准数据集上的大量实验表明，我们的方法在无序图像查询中实现了最先进的精度与鲁棒性，显著优于现有基于3DGS的方法。代码发布于：https://github.com/xzhang-z/LSGS-Loc

摘要 (Abstract)

Visual localization in large-scale UAV scenarios is a critical capability for autonomous systems, yet it remains challenging due to geometric complexity and environmental variations. While 3D Gaussian Splatting (3DGS) has emerged as a promising scene representation, existing 3DGS-based visual localization methods struggle with robust pose initialization and sensitivity to rendering artifacts in large-scale settings. To address these limitations, we propose LSGS-Loc, a novel visual localization pipeline tailored for large-scale 3DGS scenes. Specifically, we introduce a scale-aware pose initialization strategy that combines scene-agnostic relative pose estimation with explicit 3DGS scale constraints, enabling geometrically grounded localization without scene-specific training. Furthermore, in the pose refinement, to mitigate the impact of reconstruction artifacts such as blur and floaters, we develop a Laplacian-based reliability masking mechanism that guides photometric refinement toward high-quality regions. Extensive experiments on large-scale UAV benchmarks demonstrate that our method achieves state-of-the-art accuracy and robustness for unordered image queries, significantly outperforming existing 3DGS-based approaches. Code is available at: https://github.com/xzhang-z/LSGS-Loc

关键词: 3D Gaussian Splatting, visual localization, UAV scenarios, pose initialization, pose refinement, Laplacian-based reliability masking, large-scale scenes, photometric refinement

248. ❌ Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

作者: Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Jun Gao, Weiming Hu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05393v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于计算机视觉领域的组合图像检索任务，提出了一种新的细粒度检索任务OACIR和相应的AdaFocal框架。虽然属于AI研究范畴，但论文内容完全围绕视觉检索、注意力机制和基准构建，未涉及任何大语言模型、深度学习技术原理创新或科学领域应用。所有评分关键词均与大模型技术、训练方法、推理优化、对齐、代理系统等主题相关，与本文的视觉检索研究无直接关联，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对组合图像检索中语义匹配优先导致实例级一致性不足的问题，提出了对象锚定组合图像检索任务OACIR，并构建了大规模基准OACIRR，同时开发了AdaFocal框架通过上下文感知注意力调制器自适应增强指定实例区域的注意力，显著提升了实例级保真度性能。

摘要翻译

组合图像检索（Composed Image Retrieval, CIR）通过支持结合参考图像与修改文本的灵活多模态查询，已展现出显著潜力。然而，CIR本质上优先考虑语义匹配，难以在不同情境中可靠地检索用户指定的具体实例。在实践中，相较于宽泛的语义，强调具体实例的保真度往往更为关键。本文提出对象锚定组合图像检索（Object-Anchored Composed Image Retrieval, OACIR），这是一种新颖的细粒度检索任务，要求严格的实例级一致性。为推进该任务的研究，我们构建了OACIRR（基于真实图像的OACIR），这是首个大规模、多领域的基准数据集，包含超过16万个四元组以及四个具有挑战性的候选库，其中引入了困难负例实例干扰项。每个四元组通过一个边界框增强了组合查询，该边界框在视觉上将参考图像中的对象锚定，为确保实例保留提供了一种精确而灵活的方式。针对OACIR任务，我们提出AdaFocal框架，其核心为上下文感知注意力调制器，能够自适应地增强对指定实例区域的注意力，动态平衡锚定实例与更广泛组合上下文之间的关注焦点。大量实验表明，AdaFocal显著优于现有的组合检索模型，尤其在保持实例级保真度方面表现突出，从而为这一挑战性任务建立了坚实的基线，同时为开发更灵活、具备实例感知能力的检索系统开辟了新方向。

摘要 (Abstract)

Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct OACIRR (OACIR on Real-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To address the OACIR task, we propose AdaFocal, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.

关键词: Composed Image Retrieval, Instance-level Consistency, Object-Anchored CIR, Fine-grained Retrieval, Attention Modulation, Benchmark Construction, Multimodal Queries, Hard-negative Distractors

249. ❌ LUMOS: Universal Semi-Supervised OCT Retinal Layer Segmentation with Hierarchical Reliable Mutual Learning

作者: Yizhou Fang, Jian Zhong, Li Lin, Xiaoying Tang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05388v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文专注于医学图像分割（OCT视网膜层分割），属于AI在生物医学领域的应用，与’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（评分5分），但未涉及大模型、深度学习技术原理创新或其他关键词（评分0分）。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为LUMOS的半监督通用OCT视网膜层分割框架，通过双解码器网络和分层可靠互学习策略，有效解决了标注稀缺和标签粒度异构问题，在多个数据集上实现了优越的性能和泛化能力。

摘要翻译

光学相干断层扫描（OCT）视网膜层分割面临标注稀缺以及不同数据集间标签粒度异构的挑战。尽管半监督学习有助于缓解标签稀缺问题，但现有方法通常假设固定的分割粒度，未能充分利用跨粒度监督信息。本文提出LUMOS，一种基于双解码器网络与分层提示策略（Dual-Decoder Network with a Hierarchical Prompting Strategy, DDN-HPS）及可靠渐进式多粒度学习（Reliable Progressive Multi-granularity Learning, RPML）的半监督通用OCT视网膜层分割框架。DDN-HPS结合了双分支架构与多粒度提示策略，有效抑制了伪标签噪声传播；同时，RPML引入了区域级可靠性加权机制和渐进式训练方法，引导模型从易到难学习任务，确保跨粒度一致性目标的可靠选择，从而实现稳定的跨粒度对齐。在六个OCT数据集上的实验表明，LUMOS显著优于现有方法，并展现出卓越的跨域与跨粒度泛化能力。

摘要 (Abstract)

Optical Coherence Tomography (OCT) layer segmentation faces challenges due to annotation scarcity and heterogeneous label granularities across datasets. While semi-supervised learning helps alleviate label scarcity, existing methods typically assume a fixed granularity, failing to fully exploit cross-granularity supervision. This paper presents LUMOS, a semi-supervised universal OCT retinal layer segmentation framework based on a Dual-Decoder Network with a Hierarchical Prompting Strategy (DDN-HPS) and Reliable Progressive Multi-granularity Learning (RPML). DDN-HPS combines a dual-branch architecture with a multi-granularity prompting strategy to effectively suppress pseudo-label noise propagation. Meanwhile, RPML introduces region-level reliability weighing and a progressive training approach that guides the model from easier to more difficult tasks, ensuring the reliable selection of cross-granularity consistency targets, thereby achieving stable cross-granularity alignment. Experiments on six OCT datasets demonstrate that LUMOS largely outperforms existing methods and exhibits exceptional cross-domain and cross-granularity generalization capability.

关键词: OCT retinal layer segmentation, semi-supervised learning, cross-granularity supervision, dual-decoder network, hierarchical prompting, reliable progressive learning, medical image analysis, domain generalization

250. ❌ UAVReason: A Unified, Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation

作者: Jintao Sun, Hu Zhang, Donglin Di, Gangyi Ding, Zhedong Zheng 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05377v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于无人机（UAV）场景下的多模态视觉语言模型（VLM）基准构建和性能评估，研究内容涉及视觉问答（VQA）、图像生成、分割等任务，并提出了一个统一的多任务学习基线模型。然而，论文的核心是特定领域（无人机视角）的视觉语言基准和模型适配，并未涉及任何大语言模型（LLM）或深度学习技术原理的创新。所有评分关键词均与大语言模型、模型训练技术、推理优化、对齐方法、代理系统等直接相关，而本文研究的是视觉语言模型在无人机领域的应用，属于计算机视觉与自然语言处理的交叉，但未触及大模型技术本身。因此，所有关键词均得0分，加权总分为0。

!!! tip deepseek-chat TL;DR

该论文针对无人机高空视角下视觉语言模型性能下降的问题，提出了首个统一的大规模多模态基准UAVReason，并通过多任务学习建立了一个强基线模型，显著提升了无人机场景下的视觉理解和生成性能。

摘要翻译

视觉-语言模型（VLMs）在平视视角的视觉理解中展现出卓越能力，但在部署于高空无人机（UAVs）时往往表现不佳。这种失效主要源于显著的领域偏移，其特征包括微小密集的物体、重复纹理以及模糊的俯视方向。这些因素严重干扰了语义基础，阻碍了空间推理与可控生成。为弥合这一关键差距，我们提出了UAVReason——首个专为俯视视角无人机场景设计的统一大规模多模态基准，该基准源自高保真无人机仿真平台。与现有主要孤立且专注于目标检测或分割等单一任务的无人机基准不同，UAVReason创新性地整合了超过27.3万个视觉问答（VQA）对，其中包括2.36万张带详细描述的单帧图像、6.82万个双帧时序序列以及18.88万个跨模态生成样本。该基准从时空维度探究了22种不同的推理类型，同时评估了跨RGB、深度和分割模态的高保真生成能力。我们进一步通过多任务学习建立了一个强大的统一基线模型。大量实验验证了我们统一方法在多种指标上的有效性，例如VQA的EM/F1分数、分割的mIoU以及生成的CLIP分数。这些结果揭示了通用领域视觉-语言模型的局限性，并表明统一的多任务学习能显著提升无人机原生场景的性能。所有数据、代码和评估工具将公开发布，以推动无人机多模态研究的发展。

摘要 (Abstract)

Vision-Language models (VLMs) have demonstrated remarkable capability in ground-view visual understanding but often fracture when deployed on high-altitude Unmanned Aerial Vehicles (UAVs). The failure largely stems from a pronounced domain shift, characterized by tiny and densely packed objects, repetitive textures, and ambiguous top-down orientations. These factors severely disrupt semantic grounding and hinder both spatial reasoning and controllable generation. To bridge this critical gap, we introduce UAVReason, the first unified large-scale multi-modal benchmark dedicated to nadir-view UAV scenarios, derived from a high-fidelity UAV simulation platform. In contrast to existing UAV benchmarks, which are largely siloed and focus on single tasks like object detection or segmentation, UAVReason uniquely consolidates over 273K Visual Question Answering (VQA) pairs, including 23.6K single frames with detailed captions, 68.2K 2-frame temporal sequences, and 188.8K cross-modal generation samples. The benchmark probes 22 diverse reasoning types across spatial and temporal axes while simultaneously evaluating high-fidelity generation across RGB, depth, and segmentation modalities. We further establish a strong, unified baseline model via multi-task learning. Extensive experiments validate the efficacy of our unified approach across diverse metrics, such as EM/F1 for VQA, mIoU for segmentation, and CLIP Score for generation. These results indicate limitations of general-domain vision-language models and show that unified multi-task learning substantially improves UAV-native performance. All data, code, and evaluation tools will be publicly released to advance UAV multimodal research.

关键词: UAV, Vision-Language Models, Multimodal Benchmark, Visual Question Answering, Domain Shift, Nadir-view, Multi-task Learning, Aerial Scene Reasoning

251. ❌ 3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models

作者: Jae Joong Lee 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05366v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于3D重建模型的量化压缩技术，仅与关键词’Quantization OR Model Compression OR Low-bit Weights’高度相关（10分），因为其核心贡献是提出了一种无需训练、数据无关的量化方法。其他关键词均与论文内容无关（0分），因为论文不涉及大语言模型、训练技术、推理方法、对齐、代理系统或科学AI应用。

!!! tip deepseek-chat TL;DR

该论文提出了一种无需训练、数据无关的量化方法3DTurboQuant，用于压缩3D重建模型（如3D高斯溅射和DUSt3R），在保持高保真度的同时实现3.5倍至7.9倍的压缩，且无需代码本学习或校准数据。

摘要翻译

现有所有用于压缩3D高斯泼溅、神经辐射场或基于Transformer的3D重建模型的方法，都需要通过逐场景微调来学习数据相关的码本。我们证明这是不必要的。在这些模型中占据存储主导地位的参数向量——3DGS中的45维球谐函数系数与DUSt3R中的1024维键值向量——均处于特定维度范围，使得单一随机旋转可将任意输入变换为服从已知贝塔分布的坐标。这一特性使得预计算的、数据无关的劳埃德-麦克斯量化接近最优，其压缩效率达到信息论下界的2.7倍以内。我们提出3DTurboQuant方法，推导出：（1）维度相关判据，可在实验前预测哪些参数可被量化及对应比特宽度；（2）连接量化均方误差与逐场景渲染峰值信噪比的范数分离界；（3）将基于旋转的量化扩展至二维哈希网格特征的条目分组策略；（4）具有闭式压缩比的可组合剪枝-量化流程。在NeRF Synthetic数据集上，3DTurboQuant将3DGS压缩3.5倍（峰值信噪比损失仅0.02分贝），并将DUSt3R键值缓存压缩7.9倍（点云保真度达39.7分贝）。该方法无需训练、无需学习码本、无需校准数据，压缩过程仅需数秒。代码已开源（https://github.com/JaeLee18/3DTurboQuant）。

摘要 (Abstract)

Every existing method for compressing 3D Gaussian Splatting, NeRF, or transformer-based 3D reconstructors requires learning a data-dependent codebook through per-scene fine-tuning. We show this is unnecessary. The parameter vectors that dominate storage in these models, 45-dimensional spherical harmonics in 3DGS and 1024-dimensional key-value vectors in DUSt3R, fall in a dimension range where a single random rotation transforms any input into coordinates with a known Beta distribution. This makes precomputed, data-independent Lloyd-Max quantization near-optimal, within a factor of 2.7 of the information-theoretic lower bound. We develop 3D, deriving (1) a dimension-dependent criterion that predicts which parameters can be quantized and at what bit-width before running any experiment, (2) norm-separation bounds connecting quantization MSE to rendering PSNR per scene, (3) an entry-grouping strategy extending rotation-based quantization to 2-dimensional hash grid features, and (4) a composable pruning-quantization pipeline with a closed-form compression ratio. On NeRF Synthetic, 3DTurboQuant compresses 3DGS by 3.5x with 0.02dB PSNR loss and DUSt3R KV caches by 7.9x with 39.7dB pointmap fidelity. No training, no codebook learning, no calibration data. Compression takes seconds. The code will be released (https://github.com/JaeLee18/3DTurboQuant)

关键词: 3D reconstruction, quantization, model compression, 3D Gaussian Splatting, NeRF, training-free, data-independent, KV cache compression

252. ❌ Rethinking IRSTD: Single-Point Supervision Guided Encoder-only Framework is Enough for Infrared Small Target Detection

作者: Rixiang Ni, Boyang Li, Jun Chen, Yonghao Li, Feiyu Ren, Yuji Wang, Haoyang Yuan, Wujiao He, Wei An 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05363v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于红外小目标检测（IRSTD），提出了一种基于单点监督的编码器框架SPIRE。论文内容与绝大多数关键词（如LLMs、MoE、RLHF、RAG等）完全无关，因为这些关键词涉及大语言模型、训练技术、推理优化等，而本文是计算机视觉中的目标检测任务。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为红外小目标检测可视为AI在科学或工程领域的应用（如遥感、军事监测），但并非核心匹配，故给5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文针对红外小目标检测中传统分割方法忽略目标像素稀少和边界模糊的问题，提出了一种基于单点监督的编码器框架SPIRE，将其重新定义为质心回归任务，在多个基准测试中实现了竞争性的检测性能、低误报率和显著降低的计算成本。

摘要翻译

红外小目标检测（IRSTD）旨在从复杂背景杂波中分离出小目标。现有研究主要集中于像素级监督引导的“编码器-解码器”分割范式。尽管已取得良好性能，这些方法忽略了小目标仅占据少量像素且常因背景杂波导致边界模糊的事实。基于此观察，我们认为IRSTD的首要原则应是目标定位，而非分离所有伴随难以区分背景噪声的目标区域。本文重新将IRSTD定义为质心回归任务，并提出一种新颖的单点监督引导红外概率响应编码方法（简称SPIRE）。该方法因简化监督网络与等效输出之间的不匹配而极具挑战性。具体而言，我们首先设计点响应先验监督模块，将单点标注转化为符合红外点目标响应特性的概率响应图；同时构建高分辨率概率编码器，实现无需解码器重建的纯编码器端到端回归。通过保持高分辨率特征并提升有效监督密度，SPIRE缓解了稀疏目标分布下的优化不稳定问题。最终，在包括SIRST-UAVB和SIRST4在内的多个IRSTD基准数据集上的大量实验表明，SPIRE在目标级检测性能上具有竞争力，同时保持稳定的低虚警率并显著降低计算成本。代码已公开于：https://github.com/NIRIXIANG/SPIRE-IRSTD。

摘要 (Abstract)

Infrared small target detection (IRSTD) aims to separate small targets from clutter backgrounds. Extensive research is dedicated to the pixel-level supervision-guided “encoder-decoder” segmentation paradigm. Although having achieved promising performance, they neglect the fact that small targets only occupy a few pixels and are usually accompanied with blurred boundary caused by clutter backgrounds. Based on this observation, we argue that the first principle of IRSTD should be target localization instead of separating all target region accompanied with indistinguishable background noise. In this paper, we reformulate IRSTD as a centroid regression task and propose a novel Single-Point Supervision guided Infrared Probabilistic Response Encoding method (namely, SPIRE), which is indeed challenging due to the mismatch between reduced supervision network and equivalent output. Specifically, we first design a Point-Response Prior Supervision (PRPS), which transforms single-point annotations into probabilistic response map consistent with infrared point-target response characteristics, with a High-Resolution Probabilistic Encoder (HRPE) that enables encoder-only, end-to-end regression without decoder reconstruction. By preserving high-resolution features and increasing effective supervision density, SPIRE alleviates optimization instability under sparse target distributions. Finally, extensive experiments on various IRSTD benchmarks, including SIRST-UAVB and SIRST4 demonstrate that SPIRE achieves competitive target-level detection performance with consistently low false alarm rate (Fa) and significantly reduced computational cost. Code is publicly available at: https://github.com/NIRIXIANG/SPIRE-IRSTD.

关键词: Infrared small target detection, Single-point supervision, Encoder-only framework, Centroid regression, Probabilistic response encoding, False alarm rate, Computational efficiency

253. ❌ GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy

作者: Yang Yi, Xieyuanli Chen, Jinpu Zhang, Hui Shen, Dewen Hu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05359v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文《GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy》专注于计算机视觉中的局部特征检测与描述，提出了一种结合语义和几何线索的多线索引导框架，包括语义-深度感知关键点机制和统一三线索融合模块。所有评分关键词均与大语言模型、深度学习技术原理创新或AI在科学领域的应用直接相关，而本文研究的是传统计算机视觉任务，未涉及大模型、深度学习技术原理创新或AI在生物信息学等科学领域的应用，因此所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对计算机视觉中局部特征检测与描述任务，提出了一种多线索引导的局部特征学习框架，通过语义和几何线索的协同作用来增强检测鲁棒性和描述符区分度，并在四个基准测试中验证了其有效性。

摘要翻译

鲁棒局部特征检测与描述是计算机视觉领域的基础任务。现有方法主要依赖单一外观线索进行建模，导致关键点不稳定且描述符区分度不足。本文提出一种多线索引导的局部特征学习框架，通过语义与几何线索协同增强检测鲁棒性与描述符区分度。具体而言，我们在轻量级骨干网络基础上构建了联合语义-法向预测头与深度稳定性预测头。前者利用共享的三维矢量场深度耦合语义与法向线索，从而解决异构不一致性带来的优化干扰；后者从几何一致性角度量化局部区域的可靠性，为鲁棒关键点选择提供确定性指导。基于这些预测，我们提出语义-深度感知关键点（Semantic-Depth Aware Keypoint, SDAK）机制进行特征检测。通过将语义可靠性与深度稳定性耦合，SDAK对关键点响应进行重加权，以抑制不可靠区域的伪特征。在描述符构建方面，我们设计了统一三线索融合（Unified Triple-Cue Fusion, UTCF）模块，该模块采用语义调度门控机制自适应注入多属性特征，提升描述符区分度。在四个基准数据集上的大量实验验证了所提框架的有效性。源代码与预训练模型将在以下地址公开：https://github.com/yiyscut/GESS.git。

摘要 (Abstract)

Robust local feature detection and description are foundational tasks in computer vision. Existing methods primarily rely on single appearance cues for modeling, leading to unstable keypoints and insufficient descriptor discriminability. In this paper, we propose a multi-cue guided local feature learning framework that leverages semantic and geometric cues to synergistically enhance detection robustness and descriptor discriminability. Specifically, we construct a joint semantic-normal prediction head and a depth stability prediction head atop a lightweight backbone. The former leverages a shared 3D vector field to deeply couple semantic and normal cues, thereby resolving optimization interference from heterogeneous inconsistencies. The latter quantifies the reliability of local regions from a geometric consistency perspective, providing deterministic guidance for robust keypoint selection. Based on these predictions, we introduce the Semantic-Depth Aware Keypoint (SDAK) mechanism for feature detection. By coupling semantic reliability with depth stability, SDAK reweights keypoint responses to suppress spurious features in unreliable regions. For descriptor construction, we design a Unified Triple-Cue Fusion (UTCF) module, which employs a semantic-scheduled gating mechanism to adaptively inject multi-attribute features, improving descriptor discriminability. Extensive experiments on four benchmarks validate the effectiveness of the proposed framework. The source code and pre-trained model will be available at: https://github.com/yiyscut/GESS.git.

关键词: local feature learning, multi-cue guidance, semantic-geometric synergy, keypoint detection, feature descriptor, computer vision, robustness, discriminability

254. ❌ Topological Characterization of Churn Flow and Unsupervised Correction to the Wu Flow-Regime Map in Small-Diameter Vertical Pipes

作者: Brady Koenig, Sushovan Majhi, Atish Mitra, Abigail Stein, Burt Todd 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06167v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究垂直两相流中的搅混流（churn flow）拓扑表征，使用Euler Characteristic Surfaces（ECS）和Multiple Kernel Learning（MKL）进行无监督流态发现，属于流体力学和工程应用领域。论文与绝大多数关键词（涉及大模型、深度学习、AI技术原理等）完全无关，仅与最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联，因为论文应用了无监督机器学习方法（MKL）解决科学问题（流体力学），但未涉及大模型、深度学习或生物信息学/化学信息学具体技术，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文首次提出了基于拓扑学的搅混流数学定义，通过Euler Characteristic Surfaces和无监督多核学习框架，成功识别并修正了小型垂直管道中现有流态模型的预测偏差。

摘要翻译

段塞流-垂直两相流中混沌、振荡的流态-四十余年来一直缺乏定量的数学定义。我们首次引入基于拓扑学的特征描述方法，采用欧拉特征曲面（Euler Characteristic Surfaces，简称ECS）。我们将无监督流态发现构建为多核学习（Multiple Kernel Learning，简称MKL）问题，融合了两个互补的ECS衍生核函数-时间对齐核（基于χ(s,t)曲面的L^1距离）和振幅统计核（尺度上的均值、标准差、最大值、最小值）-以及气体速度。将此框架应用于蒙大拿理工大学37组未标记的气水两相流实验数据，自校准框架学习得到的权重为β_ECS=0.14、β_amp=0.50、β_ugs=0.36，总权重的64%分配给了拓扑衍生特征（β_ECS + β_amp）。ECS推断的段塞流/段塞流转换点比Wu等人（2017）在2英寸管径中的预测值高出+3.81 m/s，这量化了现有报告指出的现象：在界面张力和管壁间相互作用主导流动的小直径管道中，现有模型低估了段塞流的持续性。基于德克萨斯农工大学947幅图像的跨设施验证证实，段塞流相比段塞流具有1.9倍更高的拓扑复杂度（p < 10^-5）。将此框架应用于45组TAMU伪实验数据，同一无监督框架实现了95.6%的四分类准确率和100%的段塞流召回率-无需任何标记训练数据-达到甚至超越了需要数千标注样本的有监督基线方法。此项工作首次提供了段塞流的数学定义，并证明无监督拓扑描述符能够挑战并修正广泛采用的机理模型。

摘要 (Abstract)

Churn flow-the chaotic, oscillatory regime in vertical two-phase flow-has lacked a quantitative mathematical definition for over $40$ years. We introduce the first topology-based characterization using Euler Characteristic Surfaces (ECS). We formulate unsupervised regime discovery as Multiple Kernel Learning (MKL), blending two complementary ECS-derived kernels-temporal alignment ($L^1$ distance on the $χ(s,t)$ surface) and amplitude statistics (scale-wise mean, standard deviation, max, min)-with gas velocity. Applied to $37$ unlabeled air-water trials from Montana Tech, the self-calibrating framework learns weights $β_{ECS}=0.14$, $β_{amp}=0.50$, $β_{ugs}=0.36$, placing $64%$ of total weight on topology-derived features ($β_{ECS} + β_{amp}$). The ECS-inferred slug/churn transition lies $+3.81$ m/s above Wu et al.’s (2017) prediction in $2$-in. tubing, quantifying reports that existing models under-predict slug persistence in small-diameter pipes where interfacial tension and wall-to-wall interactions dominate flow. Cross-facility validation on $947$ Texas A&M University images confirms $1.9\times$ higher topological complexity in churn vs. slug ($p < 10^{-5}$). Applied to $45$ TAMU pseudo-trials, the same unsupervised framework achieves $95.6%$ $4$-class accuracy and $100%$ churn recall-without any labeled training data-matching or exceeding supervised baselines that require thousands of annotated examples. This work provides the first mathematical definition of churn flow and demonstrates that unsupervised topological descriptors can challenge and correct widely adopted mechanistic models.

关键词: churn flow, topological characterization, Euler Characteristic Surfaces, unsupervised regime discovery, Multiple Kernel Learning, vertical two-phase flow, small-diameter pipes, Wu flow-regime map

255. ❌ A Large-Scale Empirical Comparison of Meta-Learners and Causal Forests for Heterogeneous Treatment Effect Estimation in Marketing Uplift Modeling

作者: Aman Singh 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06123v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于营销领域的异质处理效应估计，使用传统机器学习方法（LightGBM、Causal Forest）进行因果推断和提升建模，未涉及任何大模型、深度学习技术原理或AI for Science应用，与所有评分关键词完全无关。

!!! tip deepseek-chat TL;DR

该研究通过大规模实证比较S-Learner、T-Learner、X-Learner和Causal Forest四种异质处理效应估计方法在营销提升建模中的性能，发现S-Learner在Criteo数据集上表现最佳，为大规模提升建模管道的方法选择提供了实证指导。

摘要翻译

在个体层面估计条件平均处理效应（CATE）是精准营销的核心，然而在工业规模上对增益建模方法进行系统性基准测试仍然有限。本文提出UpliftBench，对四种CATE估计器进行实证评估：S-Learner、T-Learner、X-Learner（均采用LightGBM基学习器）以及因果森林（Causal Forest，来自EconML），并应用于包含1398万条客户记录的Criteo Uplift v2.1数据集。近乎随机的处理分配（倾向得分AUC = 0.509）为因果估计提供了强有力的内部效度。通过Qini系数和累积增益曲线评估，S-Learner取得了最高的Qini分数0.376，其中按预测CATE排序的前20%客户捕获了全部增量转化的77.7%，相比随机触达提升了3.9倍。SHAP分析显示，在12个匿名化协变量中，f8是主导异质性处理效应（HTE）的关键驱动因素。因果森林的不确定性量化表明，1.9%的客户为高确信可说服群体（95%置信区间下限 > 0），0.1%为高确信沉睡群体（95%置信区间上限 < 0）。我们的研究结果为从业者在大规模增益建模流程中的方法选择提供了基于实证的指导。

摘要 (Abstract)

Estimating Conditional Average Treatment Effects (CATE) at the individual level is central to precision marketing, yet systematic benchmarking of uplift modeling methods at industrial scale remains limited. We present UpliftBench, an empirical evaluation of four CATE estimators: S-Learner, T-Learner, X-Learner (all with LightGBM base learners), and Causal Forest (EconML), applied to the Criteo Uplift v2.1 dataset comprising 13.98 million customer records. The near-random treatment assignment (propensity AUC = 0.509) provides strong internal validity for causal estimation. Evaluated via Qini coefficient and cumulative gain curves, the S-Learner achieves the highest Qini score of 0.376, with the top 20% of customers ranked by predicted CATE capturing 77.7% of all incremental conversions, a 3.9x improvement over random targeting. SHAP analysis identifies f8 as the dominant heterogeneous treatment effect (HTE) driver among the 12 anonymized covariates. Causal Forest uncertainty quantification reveals that 1.9% of customers are confident persuadables (lower 95% CI > 0) and 0.1% are confident sleeping dogs (upper 95% CI < 0). Our results provide practitioners with evidence-based guidance on method selection for large-scale uplift modeling pipelines.

关键词: Heterogeneous Treatment Effect Estimation, Uplift Modeling, CATE Estimators, Causal Forest, S-Learner, Qini Coefficient, Precision Marketing, Criteo Dataset

256. ❌ Learning $\mathsf{AC}^0$ Under Graphical Models

作者: Gautam Chandrasekaran, Jason Gaitonde, Ankur Moitra, Arsen Vasilyan 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06109v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是经典计算学习理论问题——在相关分布（如图形模型）下学习AC^0电路，属于理论计算机科学领域。所有评分关键词均涉及大模型、深度学习及相关技术（如训练方法、推理优化、应用等），而本文完全不涉及这些现代AI技术，专注于传统电路学习算法和理论分析，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文解决了在具有强空间混合性的多项式增长图形模型下学习AC^0电路的长期挑战，提出了准多项式时间算法，将均匀分布下的低度算法推广到相关分布设置。

摘要翻译

在一项里程碑式的研究中，Linial、Mansour和Nisan（J. ACM 1993）提出了一种拟多项式时间算法，用于在均匀分布下通过独立同分布标注样本学习恒定深度电路。他们的工作对计算学习理论产生了深远而持久的影响，特别是引入了$\textit{低度算法}$。然而，该领域许多结果与技术的一个重要缺陷在于其对乘积结构的依赖，而这种结构在实际场景中往往难以成立。针对更自然的关联分布获得类似的学习保证，一直是该领域长期存在的挑战。
具体而言，我们提出了拟多项式时间算法，用于在远超乘积设定的条件下学习$\mathsf{AC}^0$电路——当输入来自任何具有多项式增长且呈现强空间混合特性的图模型时。主要技术挑战在于为傅里叶分析提供替代方案，我们通过展示新的采样算法如何将均匀设定下的低度多项式逼近结论迁移至图模型来实现这一点。我们的方法具有足够的普适性，可扩展至其他被深入研究的函数类，如单调函数和半空间。

摘要 (Abstract)

In a landmark result, Linial, Mansour and Nisan (J. ACM 1993) gave a quasipolynomial-time algorithm for learning constant-depth circuits given labeled i.i.d. samples under the uniform distribution. Their work has had a deep and lasting legacy in computational learning theory, in particular introducing the $\textit{low-degree algorithm}$. However, an important critique of many results and techniques in the area is the reliance on product structure, which is unlikely to hold in realistic settings. Obtaining similar learning guarantees for more natural correlated distributions has been a longstanding challenge in the field. In particular, we give quasipolynomial-time algorithms for learning $\mathsf{AC}^0$ substantially beyond the product setting, when the inputs come from any graphical model with polynomial growth that exhibits strong spatial mixing. The main technical challenge is in giving a workaround to Fourier analysis, which we do by showing how new sampling algorithms allow us to transfer statements about low-degree polynomial approximation under the uniform setting to graphical models. Our approach is general enough to extend to other well-studied function classes, like monotone functions and halfspaces.

关键词: AC^0 circuits, graphical models, learning algorithms, quasipolynomial-time, spatial mixing, low-degree algorithm, correlated distributions, computational learning theory

257. ❌ Pixel-Translation-Equivariant Quantum Convolutional Neural Networks via Fourier Multiplexers

作者: Dmitry Chirkov, Igor Lobanov 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06094v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究量子卷积神经网络（QCNN）的平移等变性，属于量子机器学习领域，与深度学习和大模型技术原理有间接关联（如神经网络架构设计），但论文内容完全不涉及任何关键词中列出的具体大模型技术（如LLM、MoE、RLHF等）或应用场景。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为量子计算可视为科学计算的一个前沿方向，但论文未明确涉及生物信息学或化学信息学，因此给予5分（有一定关联）。其他关键词均与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文解决了量子卷积神经网络（QCNN）中像素平移等变性的不匹配问题，通过量子傅里叶变换构建了精确满足像素循环移位对称性的QCNN层，并证明了在深度扩展机制下可避免梯度消失。

摘要翻译

卷积神经网络的成功在很大程度上归功于其对平移等变性的硬编码。量子卷积神经网络作为近期可实现的量子类比被提出，但其相关的平移概念取决于数据编码方式。对于如FRQI这类地址/幅度编码，像素平移表现为索引寄存器的模加法运算，而许多受MERA启发的QCNN仅对物理量子位的循环置换具有等变性。我们形式化了这种不匹配性，并构建了与编码所诱导的像素循环平移对称性精确对易的QCNN层。我们的主要技术成果是对所有PCS等变酉算子的构造性刻画：通过量子傅里叶变换的共轭作用将平移操作对角化，因此任何PCS等变层均可表示为傅里叶模式复用器后接逆量子傅里叶变换的结构。基于此刻画，我们提出了一种具有测量诱导池化、延迟条件处理和层间QFT抵消机制的深层PCS-QCNN架构。同时，我们分析了随机初始化下的可训练性，证明了期望梯度平方范数下界在深度缩放机制中保持恒定，这从该意义上排除了深度引起的梯度消失高原现象。

摘要 (Abstract)

Convolutional neural networks owe much of their success to hard-coding translation equivariance. Quantum convolutional neural networks (QCNNs) have been proposed as near-term quantum analogues, but the relevant notion of translation depends on the data encoding. For address/amplitude encodings such as FRQI, a pixel shift acts as modular addition on an index register, whereas many MERA-inspired QCNNs are equivariant only under cyclic permutations of physical qubits. We formalize this mismatch and construct QCNN layers that commute exactly with the pixel cyclic shift (PCS) symmetry induced by the encoding. Our main technical result is a constructive characterization of all PCS-equivariant unitaries: conjugation by the quantum Fourier transform (QFT) diagonalizes translations, so any PCS-equivariant layer is a Fourier-mode multiplexer followed by an inverse QFT (IQFT). Building on this characterization, we introduce a deep PCS-QCNN with measurement-induced pooling, deferred conditioning, and inter-layer QFT cancellation. We also analyze trainability at random initialization and prove a lower bound on the expected squared gradient norm that remains constant in a depth-scaling regime, ruling out a depth-induced barren plateau in that sense.

关键词: Quantum Convolutional Neural Networks, QCNN, Translation Equivariance, Pixel Cyclic Shift, Quantum Fourier Transform, Fourier Multiplexers, Barren Plateau, Gradient Norm

258. ❌ eVTOL Aircraft Energy Overhead Estimation under Conflict Resolution in High-Density Airspaces

作者: Alex Zongo, Peng Wei 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06093v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文研究eVTOL飞机在高密度空域冲突解决下的能耗估计，使用基于物理的功率模型和交通模拟，并开发了一个机器学习模型进行能耗预测。所有关键词均与大模型、深度学习技术原理或AI在科学领域的应用相关，但论文仅涉及机器学习模型（未明确为大模型）在航空工程领域的应用，与’AI for Science’有一定关联（5分），其他关键词均不涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文研究了在高密度空域中，基于Modified Voltage Potential算法的冲突解决对eVTOL飞机能耗的影响，发现该算法能耗效率高（中位能耗开销低于1.5%），并开发了一个机器学习模型来预测能耗开销和不确定性边界，为先进空中交通的能源储备规划提供定量指导。

摘要翻译

在高密度城市空域运行的电动垂直起降（eVTOL）航空器需通过战术冲突解脱来保持安全间隔，但此类机动操作的能量消耗尚未得到系统量化。本文研究了在修正电压势（Modified Voltage Potential, MVP）算法下的冲突解脱机动如何影响eVTOL能量消耗。通过将基于物理的功率模型整合到交通仿真中，我们分析了空域扇区内约71,767个航路段，覆盖10-60架航空器同时运行的不同交通密度。主要发现表明，基于MVP的冲突解脱具有能量效率：在所有密度水平下，能量开销中位数均低于1.5%，且扇区内大部分航段飞行所受影响可忽略不计。然而，能量开销分布呈现明显的右偏特征，在高密度场景下，由于持续的多机冲突，极端情况下的能量开销可达44%。第95百分位数的开销范围在3.84%至5.3%之间，这表明预留4-5%的能量裕度可覆盖绝大多数战术冲突解脱场景。为支持运行规划，我们开发了一个机器学习模型，用于在任务起始阶段预估能量开销。由于冲突结果取决于无法预先获知的未来交通交互，该模型同时提供点估计和不确定性边界。这些边界具有保守性：实际结果落入预测范围的频率高于标称置信水平，使其适用于安全关键的储备能量规划。综上，这些结果验证了MVP算法适用于能量受限的eVTOL运行，并为先进空中交通（Advanced Air Mobility）中的储备能量确定提供了量化指导。

摘要 (Abstract)

Electric vertical takeoff and landing (eVTOL) aircraft operating in high-density urban airspace must maintain safe separation through tactical conflict resolution, yet the energy cost of such maneuvers has not been systematically quantified. This paper investigates how conflict-resolution maneuvers under the Modified Voltage Potential (MVP) algorithm affect eVTOL energy consumption. Using a physics-based power model integrated within a traffic simulation, we analyze approximately 71,767 en route sections within a sector, across traffic densities of 10-60 simultaneous aircraft. The main finding is that MVP-based deconfliction is energy-efficient: median energy overhead remains below 1.5% across all density levels, and the majority of en route flights within the sector incur negligible penalty. However, the distribution exhibits pronounced right-skewness, with tail cases reaching 44% overhead at the highest densities due to sustained multi-aircraft conflicts. The 95th percentile ranges from 3.84% to 5.3%, suggesting that a 4-5% reserve margin accommodates the vast majority of tactical deconfliction scenarios. To support operational planning, we develop a machine learning model that estimates energy overhead at mission initiation. Because conflict outcomes depend on future traffic interactions that cannot be known in advance, the model provides both point estimates and uncertainty bounds. These bounds are conservative; actual outcomes fall within the predicted range more often than the stated confidence level, making them suitable for safety-critical reserve planning. Together, these results validate MVP’s suitability for energy-constrained eVTOL operations and provide quantitative guidance for reserve energy determination in Advanced Air Mobility.

关键词: eVTOL, energy overhead, conflict resolution, Modified Voltage Potential, high-density airspace, machine learning model, traffic simulation, reserve energy

259. ❌ A machine learning framework for uncovering stochastic nonlinear dynamics from noisy data

作者: Matteo Bosso, Giovanni Franzese, Kushal Swamy, Maarten Theulings, Alejandro M. Aragón, Farbod Alijani 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06081v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于开发一种混合符号回归-概率机器学习框架，用于从噪声数据中推断随机非线性动力学系统的控制方程和参数不确定性。论文的核心是机器学习方法（特别是符号回归和概率建模）在科学建模中的应用，属于’AI for Science’的范畴，因此该关键词得5分。然而，论文完全不涉及大语言模型（LLMs）、深度学习技术原理创新、模型训练/微调方法、推理优化、智能体系统、对齐技术等任何其他关键词所描述的技术领域。论文使用的是传统的符号回归和概率机器学习方法，而非基于大模型的技术。

!!! tip deepseek-chat TL;DR

该研究提出了一种结合深度符号回归和高斯过程的混合机器学习框架，用于从噪声数据中同时恢复随机非线性动力系统的控制方程符号形式和参数不确定性，并在数值基准和生物振荡器实验中得到验证。

摘要翻译

对现实世界系统进行建模需要考虑噪声——无论是源于金融市场的不可预测波动、生物系统的不规则节律，还是生态系统的环境变异性。虽然此类系统的行为常可用随机微分方程描述，但核心挑战在于理解噪声如何影响从数据中推断系统参数与动力学的过程。传统符号回归方法能够揭示控制方程，但通常忽略不确定性；反之，高斯过程虽能提供严格的不确定性量化，却难以揭示底层动力学机制。本研究通过一种混合符号回归-概率机器学习框架弥合了这一鸿沟，该框架在恢复控制方程符号形式的同时，能推断系统参数的不确定性。该框架将深度符号回归与基于高斯过程的最大似然估计相结合，分别对确定性动力学与噪声结构进行建模，且无需预先假设其函数形式。我们在数值基准测试（包括简谐振子、杜芬振子和范德波尔振子）上验证了该方法，并在展现同步现象的耦合生物振子实验系统中进行了验证——算法成功识别了符号分量与随机分量。该框架具有数据高效性（仅需100-1000个数据点）和噪声鲁棒性，在不确定性内禀存在且必须同时理解动力学系统结构与变异性的领域中展现出广阔的应用潜力。

摘要 (Abstract)

Modeling real-world systems requires accounting for noise - whether it arises from unpredictable fluctuations in financial markets, irregular rhythms in biological systems, or environmental variability in ecosystems. While the behavior of such systems can often be described by stochastic differential equations, a central challenge is understanding how noise influences the inference of system parameters and dynamics from data. Traditional symbolic regression methods can uncover governing equations but typically ignore uncertainty. Conversely, Gaussian processes provide principled uncertainty quantification but offer little insight into the underlying dynamics. In this work, we bridge this gap with a hybrid symbolic regression-probabilistic machine learning framework that recovers the symbolic form of the governing equations while simultaneously inferring uncertainty in the system parameters. The framework combines deep symbolic regression with Gaussian process-based maximum likelihood estimation to separately model the deterministic dynamics and the noise structure, without requiring prior assumptions about their functional forms. We verify the approach on numerical benchmarks, including harmonic, Duffing, and van der Pol oscillators, and validate it on an experimental system of coupled biological oscillators exhibiting synchronization, where the algorithm successfully identifies both the symbolic and stochastic components. The framework is data-efficient, requiring as few as 100-1000 data points, and robust to noise - demonstrating its broad potential in domains where uncertainty is intrinsic and both the structure and variability of dynamical systems must be understood.

关键词: stochastic differential equations, symbolic regression, Gaussian processes, uncertainty quantification, nonlinear dynamics, machine learning framework, parameter inference, biological oscillators

260. ❌ Value Mirror Descent for Reinforcement Learning

作者: Zhichao Jia, Guanghui Lan 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06039v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于强化学习（RL）中的价值迭代方法，提出了一种结合镜像下降的新算法VMD，并分析了其收敛性和样本复杂度。论文内容完全围绕传统强化学习理论、优化算法和样本复杂度分析，不涉及任何大语言模型（LLM）、深度学习、大模型技术原理或AI在科学领域的应用。所有关键词均与大模型、深度学习及相关技术相关，与该论文的强化学习理论工作无直接关联，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种结合镜像下降的价值迭代方法（VMD）用于强化学习，在确定性和随机设置下证明了线性收敛性和近最优样本复杂度，并首次在随机值迭代方法中保持了生成策略与最优策略之间的Bregman散度有界性。

摘要翻译

值迭代类方法在强化学习中为计算近似最优值函数已被广泛研究。在生成式采样模型下，这些方法能够获得比策略优化方法更优的样本复杂度，尤其是在对折扣因子的依赖性方面。在实践中，它们常被用于离线训练或模拟环境中。本文考虑具有状态空间 S、动作空间 A、折扣因子 $γ\in(0,1)$ 以及成本在 $[0,1]$ 范围内的折扣马尔可夫决策过程。我们提出了一种新的值优化方法，称为值镜像下降（value mirror descent, VMD），该方法将凸优化中的镜像下降融入经典的值迭代框架。在转移核已知的确定性设定下，我们证明了 VMD 具有线性收敛性。对于具有生成模型的随机设定，我们开发了一种随机变体 SVMD，它结合了随机值迭代类方法中常用的方差缩减技术。对于具有一般凸正则项的强化学习问题，SVMD 达到了近乎最优的样本复杂度 $\tilde{O}(|S||A|(1-γ)^{-3}ε^{-2})$。此外，我们证明了生成策略与最优策略之间的布雷格曼散度在整个迭代过程中保持有界。这一性质在现有的随机值迭代类方法中并不存在，但对于实现离线训练后有效的在线（持续）学习至关重要。在强凸正则项下，SVMD 实现了 $\tilde{O}(|S||A|(1-γ)^{-5}ε^{-1})$ 的样本复杂度，在高精度场景下提升了性能。进一步地，我们证明了生成策略会收敛到最优策略。总体而言，所提出的方法、其分析以及所得的理论保证，为强化学习和优化文献提供了新的贡献。

摘要 (Abstract)

Value iteration-type methods have been extensively studied for computing a nearly optimal value function in reinforcement learning (RL). Under a generative sampling model, these methods can achieve sharper sample complexity than policy optimization approaches, particularly in their dependence on the discount factor. In practice, they are often employed for offline training or in simulated environments. In this paper, we consider discounted Markov decision processes with state space S, action space A, discount factor $γ\in(0,1)$ and costs in $[0,1]$. We introduce a novel value optimization method, termed value mirror descent (VMD), which integrates mirror descent from convex optimization into the classical value iteration framework. In the deterministic setting with known transition kernels, we show that VMD converges linearly. For the stochastic setting with a generative model, we develop a stochastic variant, SVMD, which incorporates variance reduction commonly used in stochastic value iteration-type methods. For RL problems with general convex regularizers, SVMD attains a near-optimal sample complexity of $\tilde{O}(|S||A|(1-γ)^{-3}ε^{-2})$. Moreover, we establish that the Bregman divergence between the generated and optimal policies remains bounded throughout the iterations. This property is absent in existing stochastic value iteration-type methods but is important for enabling effective online (continual) learning following offline training. Under a strongly convex regularizer, SVMD achieves sample complexity of $\tilde{O}(|S||A|(1-γ)^{-5}ε^{-1})$, improving performance in the high-accuracy regime. Furthermore, we prove convergence of the generated policy to the optimal policy. Overall, the proposed method, its analysis, and the resulting guarantees, constitute new contributions to the RL and optimization literature.

关键词: Reinforcement Learning, Value Iteration, Mirror Descent, Sample Complexity, Markov Decision Processes, Convex Optimization, Bregman Divergence, Policy Optimization

261. ❌ Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification

作者: Courtney Franzen, Farhad Pourkamali-Anaraki 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06032v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于传统神经网络分类器的预测不确定性估计方法（基于集成学习和狄利克雷建模），不涉及大语言模型（LLMs）、深度学习技术原理创新或AI在科学领域的应用。所有关键词均与大模型、深度学习技术或AI科学应用相关，而本文研究的是基础神经网络分类器的统计不确定性方法，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于集成学习和狄利克雷建模的方法，用于改进神经网络分类器的预测不确定性估计，从而提升下游任务（如选择性分类）的性能。

摘要翻译

使用交叉熵损失训练的神经网络分类器能够实现较强的预测准确性，但缺乏提供内在预测不确定性估计的能力，因此需要借助外部技术来获取这些估计。此外，真实类别的softmax分数在不同独立训练运行之间可能存在显著差异，这限制了下游任务中基于不确定性决策的可靠性。证据深度学习旨在通过单次前向传播生成不确定性估计来解决这些局限，但证据训练对损失函数构建、先验正则化和激活函数等设计选择高度敏感。因此，本研究提出了一种替代的狄利克雷参数估计策略：通过对softmax输出集合应用矩估计方法，并可选地结合最大似然优化步骤。这种基于集成学习的构建方式将不确定性估计与脆弱的证据损失设计解耦，同时减轻了单次交叉熵训练结果的变异性，从而生成显式的狄利克雷预测分布。在多个数据集上的实验表明，这种从集成学习推导的狄利克雷估计具有更稳定的特性和更优的预测不确定性表现，进而转化为下游不确定性引导应用（如预测置信度评分和选择性分类）中更强的性能表现。

摘要 (Abstract)

Neural network classifiers trained with cross-entropy loss achieve strong predictive accuracy but lack the capability to provide inherent predictive uncertainty estimates, thus requiring external techniques to obtain these estimates. In addition, softmax scores for the true class can vary substantially across independent training runs, which limits the reliability of uncertainty-based decisions in downstream tasks. Evidential Deep Learning aims to address these limitations by producing uncertainty estimates in a single pass, but evidential training is highly sensitive to design choices including loss formulation, prior regularization, and activation functions. Therefore, this work introduces an alternative Dirichlet parameter estimation strategy by applying a method of moments estimator to ensembles of softmax outputs, with an optional maximum-likelihood refinement step. This ensemble-based construction decouples uncertainty estimation from the fragile evidential loss design while also mitigating the variability of single-run cross-entropy training, producing explicit Dirichlet predictive distributions. Across multiple datasets, we show that the improved stability and predictive uncertainty behavior of these ensemble-derived Dirichlet estimates translate into stronger performance in downstream uncertainty-guided applications such as prediction confidence scoring and selective classification.

关键词: predictive uncertainty, selective classification, Dirichlet modeling, ensemble methods, neural network classifiers, uncertainty estimation, method of moments, evidential deep learning

262. ❌ Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating

作者: Dipan Maity, Suman Mondal, Arindam Roy 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06014v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的混合视觉Transformer架构创新，结合了Swin Transformer的窗口注意力机制和Retentive Networks的空间衰减机制，并引入了输入依赖的门控机制。论文的核心贡献在于视觉Transformer的架构改进和注意力机制优化，所有关键词均针对大语言模型（LLMs）及其相关技术（如训练方法、推理优化、对齐、代理系统等），而本文研究的是纯视觉模型（Vision Transformer），不涉及任何语言模型、文本处理或LLM相关技术，因此所有关键词的相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Gated-SwinRMT的混合视觉Transformer架构，通过结合Swin Transformer的窗口注意力和Retentive Networks的曼哈顿距离空间衰减，并引入输入依赖的门控机制，在Mini-ImageNet和CIFAR-10图像分类任务上取得了优于基线的性能。

摘要翻译

我们提出了Gated-SwinRMT，一个混合视觉Transformer家族，它结合了Swin Transformer的移位窗口注意力与Retentive Networks（RMT）的曼哈顿距离空间衰减机制，并通过输入依赖的门控进行增强。自注意力在每个移位窗口内被分解为连续的宽度方向和高度方向保留传递，其中每个头部的指数衰减掩码提供了二维局部性先验，而无需学习位置偏置。
我们提出了两种变体。Gated-SwinRMT-SWAT 使用sigmoid激活替代softmax，通过乘法后激活空间衰减实现平衡的ALiBi斜率，并通过SwiGLU对值投影进行门控；其归一化输出隐式地抑制了信息量不足的注意力分数。Gated-SwinRMT-Retention 保留了使用加法对数空间衰减偏置的softmax归一化保留机制，并引入了一个显式的G1 sigmoid门——该门控从块输入投影而来，在局部上下文增强（LCE）之后、输出投影$W_O$之前应用——以缓解低秩$W_V !\cdot! W_O$瓶颈，并实现对注意力输出的输入依赖性抑制。
我们在相同训练协议下，于Mini-ImageNet（$224{\times}224$，100类）和CIFAR-10（$32{\times}32$，10类）数据集上评估了两种变体，由于资源限制，实验使用单GPU进行。在参数量约为$77$至$79$,M时，Gated-SwinRMT-SWAT在Mini-ImageNet上取得了$80.22%$的top-1测试准确率，Gated-SwinRMT-Retention为$78.20%$，而RMT基线为$73.74%$。在CIFAR-10上——其小尺寸特征图导致自适应窗口机制将注意力坍缩至全局范围——准确率优势从$+6.48$,个百分点压缩至$+0.56$,个百分点。

摘要 (Abstract)

We introduce Gated-SwinRMT, a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT), augmented by input-dependent gating. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases. Two variants are proposed. \textbf{Gated-SwinRMT-SWAT} substitutes softmax with sigmoid activation, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the Normalized output implicitly suppresses uninformative attention scores. \textbf{Gated-SwinRMT-Retention} retains softmax-normalized retention with an additive log-space decay bias and incorporates an explicit G1 sigmoid gate – projected from the block input and applied after local context enhancement (LCE) but prior to the output projection~$W_O$ – to alleviate the low-rank $W_V !\cdot! W_O$ bottleneck and enable input-dependent suppression of attended outputs. We assess both variants on Mini-ImageNet ($224{\times}224$, 100 classes) and CIFAR-10 ($32{\times}32$, 10 classes) under identical training protocols, utilizing a single GPU due to resource limitations. At ${\approx}77$–$79$,M parameters, Gated-SwinRMT-SWAT achieves $80.22%$ and Gated-SwinRMT-Retention $78.20%$ top-1 test accuracy on Mini-ImageNet, compared with $73.74%$ for the RMT baseline. On CIFAR-10 – where small feature maps cause the adaptive windowing mechanism to collapse attention to global scope – the accuracy advantage compresses from $+6.48$,pp to $+0.56$,pp.

关键词: Vision Transformer, Swin Transformer, Retentive Networks, Input-dependent Gating, Windowed Attention, Spatial Decay, Image Classification, Hybrid Architecture

263. ❌ Data Distribution Valuation Using Generalized Bayesian Inference

作者: Cuong N. Nguyen, Cuong V. Nguyen 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05993v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究数据分布估值问题，提出基于广义贝叶斯推断的框架，应用于标注者评估和数据增强等场景。所有关键词均与大模型、深度学习技术原理或科学AI应用无关，论文未涉及任何大模型技术、训练方法、推理优化、对齐、代理系统或特定科学领域应用。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于广义贝叶斯推断的数据分布估值框架，用于量化数据分布的价值，并成功应用于标注者评估和数据增强等实际问题。

摘要翻译

我们研究数据分布估值问题，该问题旨在通过样本量化数据分布的价值。这是一个近期提出的问题，与经典的数据估值相关但有所不同，可应用于多种场景。针对此问题，我们开发了一个名为广义贝叶斯估值（Generalized Bayes Valuation）的新框架，该框架利用基于可迁移性度量构建的损失函数进行广义贝叶斯推断。这一框架使我们能够以统一的方式解决看似无关的实际问题，例如标注者评估和数据增强。基于贝叶斯原理，我们通过将框架扩展至连续数据流设置，进一步提升了其适用性。实验结果证实了我们的框架在不同现实场景中的有效性与高效性。

摘要 (Abstract)

We investigate the data distribution valuation problem, which aims to quantify the values of data distributions from their samples. This is a recently proposed problem that is related to but different from classical data valuation and can be applied to various applications. For this problem, we develop a novel framework called Generalized Bayes Valuation that utilizes generalized Bayesian inference with a loss constructed from transferability measures. This framework allows us to solve, in a unified way, seemingly unrelated practical problems, such as annotator evaluation and data augmentation. Using the Bayesian principles, we further improve and enhance the applicability of our framework by extending it to the continuous data stream setting. Our experiment results confirm the effectiveness and efficiency of our framework in different real-world scenarios.

关键词: data distribution valuation, generalized Bayesian inference, transferability measures, annotator evaluation, data augmentation, continuous data stream, Bayesian principles, real-world scenarios

264. ❌ A deep learning framework for jointly solving transient Fokker-Planck equations with arbitrary parameters and initial distributions

作者: Xiaolong Wang, Jing Feng, Qi Liu, Chengli Tan, Yuanyuan Liu, Yong Xu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06001v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于使用深度学习（特别是约束保持自编码器和演化网络）解决参数化随机系统的Fokker-Planck方程，属于科学计算和物理信息机器学习领域。论文内容与绝大多数关键词（涉及大模型技术、训练方法、推理优化、对齐、代理等）完全无关，仅与最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联，因为其应用深度学习解决科学计算问题，属于AI for Science的广义范畴，但并非核心的生物信息学或化学信息学应用，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于深度学习的伪解析概率解法，通过约束保持自编码器和演化网络，高效求解具有任意参数和初始分布的非稳态Fokker-Planck方程，实现了比GPU加速蒙特卡洛模拟快四个数量级的推理速度，从而支持实时参数扫描和随机分岔分析。

摘要翻译

高效求解福克-普朗克方程（Fokker-Planck equation, FPE）是分析复杂参数化随机系统的核心。然而，现有数值方法缺乏跨不同条件的并行计算能力，严重限制了全面的参数探索与瞬态分析。本文提出一种基于深度学习的伪解析概率解法（pseudo-analytical probability solution, PAPS），通过单次训练过程，即可同时求解任意多峰初始分布、系统参数及时间点对应的瞬态FPE解。其核心思想是通过高斯混合分布（Gaussian mixture distributions, GMDs）统一初始、瞬态与稳态分布，并构建一个约束保持自编码器，将受约束的GMD参数双射映射到无约束的低维隐表征空间。在此表征空间中，不同初始条件与系统参数下的全景瞬态动力学可由单一演化网络建模。在多个典型系统上的大量实验表明，所提出的PAPS方法在保持高精度的同时，其推理速度比GPU加速的蒙特卡洛模拟快四个数量级。这一效率跃升使得以往难以实现的实时参数扫描与随机分岔系统研究成为可能。通过将表征学习与物理信息驱动的瞬态动力学解耦，本研究为多维参数化随机系统的概率建模建立了一个可扩展的范式。

摘要 (Abstract)

Efficiently solving the Fokker-Planck equation (FPE) is central to analyzing complex parameterized stochastic systems. However, current numerical methods lack parallel computation capabilities across varying conditions, severely limiting comprehensive parameter exploration and transient analysis. This paper introduces a deep learning-based pseudo-analytical probability solution (PAPS) that, via a single training process, simultaneously resolves transient FPE solutions for arbitrary multi-modal initial distributions, system parameters, and time points. The core idea is to unify initial, transient, and stationary distributions via Gaussian mixture distributions (GMDs) and develop a constraint-preserving autoencoder that bijectively maps constrained GMD parameters to unconstrained, low-dimensional latent representations. In this representation space, the panoramic transient dynamics across varying initial conditions and system parameters can be modeled by a single evolution network. Extensive experiments on paradigmatic systems demonstrate that the proposed PAPS maintains high accuracy while achieving inference speeds four orders of magnitude faster than GPU-accelerated Monte Carlo simulations. This efficiency leap enables previously intractable real-time parameter sweeps and systematic investigations of stochastic bifurcations. By decoupling representation learning from physics-informed transient dynamics, our work establishes a scalable paradigm for probabilistic modeling of multi-dimensional, parameterized stochastic systems.

关键词: Fokker-Planck equation, deep learning, stochastic systems, parameterized systems, transient analysis, autoencoder, Gaussian mixture distributions, physics-informed modeling

265. ❌ On Dominant Manifolds in Reservoir Computing Networks

作者: Noa Kaplan, Alberto Padoan, Anastasia Bizyaeva 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05967v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究储层计算网络中的主导流形，属于经典循环神经网络的时间序列建模领域，与所有关键词（均聚焦于大模型、深度学习技术原理创新及其在科学领域的应用）完全无关。论文未涉及任何大模型、语言模型、微调、对齐、推理、代理、压缩等现代大模型技术，也未涉及生物信息学等具体科学应用。

!!! tip deepseek-chat TL;DR

该论文研究了储层计算网络训练中低维主导流形的出现，揭示了训练数据的固有维度和信息内容如何决定主导模式的结构，并将训练后的储层主导模式与原始系统的Koopman特征函数近似联系起来。

摘要翻译

理解训练如何塑造循环网络动态的几何结构是时间序列建模的核心问题。本研究探讨了储层计算网络在时间预测任务训练中低维主导流形的涌现机制。针对简化的线性连续时间储层模型，我们将主导模态的维度与结构直接关联到训练数据的内在维度和信息含量。特别地，对于由自治动力系统生成的训练数据，我们将训练后储层的主导模态关联到原系统库普曼特征函数的近似表示，从而揭示了储层计算与动态模态分解算法之间的显式联系。我们通过仿真展示了训练过程中生成主导流形的特征值运动轨迹，并基于切向动力学和微分p主导性理论，讨论了该方法向非线性储层计算的推广。

摘要 (Abstract)

Understanding how training shapes the geometry of recurrent network dynamics is a central problem in time-series modeling. We study the emergence of low-dimensional dominant manifolds in the training of Reservoir Computing (RC) networks for temporal forecasting tasks. For a simplified linear and continuous-time reservoir model, we link the dimensionality and structure of the dominant modes directly to the intrinsic dimensionality and information content of the training data. In particular, for training data generated by an autonomous dynamical system, we relate the dominant modes of the trained reservoir to approximations of the Koopman eigenfunctions of the original system, illuminating an explicit connection between reservoir computing and the Dynamic Mode Decomposition algorithm. We illustrate the eigenvalue motion that generates the dominant manifolds during training in simulation, and discuss generalization to nonlinear RC via tangent dynamics and differential p-dominance.

关键词: Reservoir Computing, Dominant Manifolds, Temporal Forecasting, Koopman Eigenfunctions, Dynamic Mode Decomposition, Recurrent Networks, Time-series Modeling, Network Dynamics

266. ❌ QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization

作者: Changxin Ke, Rui Zhang, Jiaming Guo, Yuanbo Wen, Li Ding, Shuo Wang, Xuyuan Zhu, Xiong Peng, Di Huang, Zidong Du, Xing Hu, Qi Guo, Yunji Chen 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05963v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究基于LLMs的代码修复，因此与’Large Language Models’高度相关（10分）。方法涉及自我修复和优化，与’Self-Correction’有一定关联（8分）。实验提到结合speculative editing提高解码吞吐量，与’Speculative Decoding’有弱关联（5分）。其他关键词如MoE、SLMs、Scaling Laws等未在摘要中提及，完全无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对LLMs在代码修复中过度编辑的问题，提出了PRepair框架，通过自我破坏和自我修复结合编辑感知奖励优化，显著提高了修复精度和解码效率。

摘要翻译

大型语言模型（LLM）在程序修复任务中表现出色，但常存在过度编辑问题，即过多修改覆盖了正确代码并阻碍了错误定位。我们系统性地量化了其影响，并提出了精确修复任务，该任务在修复错误部分的同时最大化地复用正确代码。基于这一洞见，我们提出了PRepair框架，以缓解过度编辑并提升修复准确率。PRepair包含两个组件：一是“自破坏”，通过可控的错误注入和最小-最大采样生成多样化的错误程序；二是“自修复”，利用编辑感知奖励，通过编辑感知分组相对策略优化（Edit-Aware Group Relative Policy Optimization, EA-GRPO）训练模型，以鼓励最小化且正确的编辑。实验表明，PRepair在综合考虑修复正确性与修改范围的指标$\mathrm{fix}_1@1$下，将修复精确率提升了高达31.4%，并在与推测式编辑结合时显著提高了解码吞吐量，证明了其在精确且实用的代码修复方面的潜力。

摘要 (Abstract)

Large Language Models (LLMs) achieve strong program repair performance but often suffer from over-editing, where excessive modifications overwrite correct code and hinder bug localization. We systematically quantify its impact and introduce precise repair task, which maximizes reuse of correct code while fixing only buggy parts. Building on this insight, we propose PRepair, a framework that mitigates over-editing and improves repair accuracy. PRepair has two components: Self-Breaking, which generates diverse buggy programs via controlled bug injection and min-max sampling, and Self-Repairing, which trains models with Edit-Aware Group Relative Policy Optimization (EA-GRPO) using an edit-aware reward to encourage minimal yet correct edits. Experiments show that PRepair improves repair precision by up to 31.4% under $\mathrm{fix}_1@1$, a metric that jointly considers repair correctness and extent, and significantly increases decoding throughput when combined with speculative editing, demonstrating its potential for precise and practical code repair.

关键词: Large Language Models, Code Repair, Over-editing, Precise Repair, Self-Breaking, Self-Repairing, Edit-Aware Reward, Speculative Editing

267. ❌ A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis

作者: Sk Miraj Ahmed, Yuewei Lin, Chuntian Cao, Shinjae Yoo, Xinpei Wu, Won-Il Lee, Nikhil Tiwale, Dan N. Le, Thi Thu Huong Chu, Jiyoung Kim, Kevin G. Yager, Chang-Yong Nam 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05960v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文提出首个用于SEM图像的基础模型，属于大模型在科学领域的应用创新。与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分），因为论文明确构建了基础模型。与’Pre-training OR Continual Pre-training OR Domain Adaptation’高度相关（10分），因为模型在大规模科学图像上进行了预训练，并可适应下游任务。与’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分），因为研究聚焦于材料科学的SEM图像分析。与’Post-training OR Supervised Fine-tuning OR SFT’有一定关联（5分），因为提到模型可微调，但未详细描述。其他关键词（如MoE、SLMs、RAG等）与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文针对扫描电子显微镜（SEM）图像分析中任务特定模型和获取过程受限的问题，提出了首个SEM图像基础模型，通过自监督Transformer架构预训练学习可迁移表示，并在去焦到聚焦图像翻译任务中优于现有技术。

摘要翻译

扫描电子显微镜（SEM）在现代材料科学中不可或缺，能够在广泛的结构、化学与功能研究中实现高分辨率成像。然而，SEM成像仍受限于任务专用模型和劳动密集型的采集流程，这制约了其在不同应用中的可扩展性。本文首次提出了针对SEM图像的基础模型，该模型基于多仪器、多条件下的科学显微图像大型数据集进行预训练，从而实现了对不同材料体系和成像条件的泛化能力。通过采用自监督的Transformer架构，我们的模型学习了丰富且可迁移的表征，能够微调或适配于多种下游任务。作为一项有力验证，我们聚焦于散焦至聚焦的图像转换——这是自动化显微工作流程中至关重要却尚未充分探索的挑战。我们的方法不仅能在无需配对监督的情况下从散焦输入中恢复聚焦细节，还在多项评估指标上超越了现有先进技术。此项工作为新一代自适应SEM模型奠定了基础，通过将基础表征学习与实际成像需求相衔接，加速了材料发现进程。

摘要 (Abstract)

Scanning Electron Microscopy (SEM) is indispensable in modern materials science, enabling high-resolution imaging across a wide range of structural, chemical, and functional investigations. However, SEM imaging remains constrained by task-specific models and labor-intensive acquisition processes that limit its scalability across diverse applications. Here, we introduce the first foundation model for SEM images, pretrained on a large corpus of multi-instrument, multi-condition scientific micrographs, enabling generalization across diverse material systems and imaging conditions. Leveraging a self-supervised transformer architecture, our model learns rich and transferable representations that can be fine-tuned or adapted to a wide range of downstream tasks. As a compelling demonstration, we focus on defocus-to-focus image translation-an essential yet underexplored challenge in automated microscopy pipelines. Our method not only restores focused detail from defocused inputs without paired supervision but also outperforms state-of-the-art techniques across multiple evaluation metrics. This work lays the groundwork for a new class of adaptable SEM models, accelerating materials discovery by bridging foundational representation learning with real-world imaging needs.

关键词: foundation model, scanning electron microscopy, SEM image analysis, self-supervised transformer, pre-training, defocus-to-focus translation, materials science, representation learning

268. ❌ Transfer Learning for Neural Parameter Estimation applied to Building RC Models

作者: Fabian Raisch, Timo Germann, J. Nathan Kutz, Christoph Goebel, Benjamin Tischler 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05904v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文提出了一种基于预训练-微调范式的迁移学习框架，用于动态系统的参数估计，并应用于建筑RC热模型。该研究主要与’Pre-training OR Continual Pre-training OR Domain Adaptation’相关（评分8.0），因为它明确采用了预训练-微调范式进行迁移学习。与’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（评分5.0），因为它将AI应用于建筑科学（热模型），属于科学应用范畴，但并非核心的生物信息学或化学信息学。其他关键词（如LLMs、MoE、RLHF等）均未涉及，因此评分为0.0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于预训练-微调范式的迁移学习框架，用于动态系统的参数估计，应用于建筑RC热模型，实验表明该方法在仅使用12天训练数据时性能提升18.6-24.0%，使用72天数据时提升高达49.4%。

摘要翻译

动态系统的参数估计因非凸性及对初始参数猜测的敏感性而持续面临挑战。近期深度学习技术虽能实现快速精准的参数估计，却未能利用跨系统的可迁移知识。为此，我们提出一种基于预训练-微调范式的迁移学习神经参数估计框架。该方法提升了估计精度，并消除了对初始参数猜测的依赖。我们将此框架应用于建筑RC（电阻-电容）热模型构建，在八栋仿真建筑、一栋真实建筑、两种RC模型配置及四种训练数据长度下，与遗传算法及从头训练的神经基线模型进行比较评估。结果表明：仅使用12天训练数据时，性能提升达18.6-24.0%；使用72天数据时，性能提升最高可达49.4%。该方法不仅适用于建筑领域，更为动态系统的参数估计提供了一种新范式。

摘要 (Abstract)

Parameter estimation for dynamical systems remains challenging due to non-convexity and sensitivity to initial parameter guesses. Recent deep learning approaches enable accurate and fast parameter estimation but do not exploit transferable knowledge across systems. To address this, we introduce a transfer-learning-based neural parameter estimation framework based on a pretraining-fine-tuning paradigm. This approach improves accuracy and eliminates the need for an initial parameter guess. We apply this framework to building RC thermal models, evaluating it against a Genetic Algorithm and a from-scratch neural baseline across eight simulated buildings, one real-world building, two RC model configurations, and four training data lengths. Results demonstrate an 18.6-24.0% performance improvement with only 12 days of training data and up to 49.4% with 72 days. Beyond buildings, the proposed method represents a new paradigm for parameter estimation in dynamical systems.

关键词: transfer learning, neural parameter estimation, dynamical systems, pretraining-fine-tuning, RC thermal models, building energy, deep learning, parameter estimation

269. ❌ A Tensor-Train Framework for Bayesian Inference in High-Dimensional Systems: Applications to MIMO Detection and Channel Decoding

作者: Luca Schmid, Dominik Sulz, Shrinivas Chimmalgi, Laurent Schmalen 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05890v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于通信系统中的贝叶斯推断问题，提出了一种基于张量列分解的框架来解决高维离散输入加性噪声模型中的计算挑战。论文内容涉及张量网络方法、MIMO检测、信道解码等通信技术，但完全不涉及大语言模型、深度学习、AI for Science或任何评分关键词中的技术。所有关键词均与大模型、深度学习、AI应用或相关技术原理相关，而该论文属于传统通信信号处理领域，与评分关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于张量列分解的框架，用于解决高维离散输入加性噪声模型中的贝叶斯推断计算挑战，并在MIMO检测和信道解码应用中实现了接近最优的错误率性能。

摘要翻译

高维离散输入加性噪声模型中的贝叶斯推断是通信系统中的一个基础性挑战，因为所需联合后验概率质量函数的支撑集随未知变量数量呈指数级增长。本文提出一种张量链框架，用于在离散输入加性噪声模型中实现可处理的、接近最优的贝叶斯推断。核心洞见在于：联合对数后验概率质量函数在张量链格式下具有精确的低秩表示，从而支持紧凑存储与高效计算。为恢复符号层面的后验概率边缘分布，我们开发了一种实用的推断流程，该方法通过截断泰勒级数初始化张量链交叉算法，以近似对数后验的指数形式。为证明该方法的普适性，我们针对两个经典通信问题推导了显式的低秩张量链构造：一是应用于多输入多输出检测的加性高斯白噪声下的线性观测模型，二是二进制输入加性高斯白噪声信道中二进制线性分组纠错码的软判决译码。数值结果表明，该方法在较宽信噪比范围内均能实现接近最优的误码率性能，且仅需适中的张量链秩。这些结果凸显了张量网络方法在通信系统中实现高效贝叶斯推断的潜力。

摘要 (Abstract)

Bayesian inference in high-dimensional discrete-input additive noise models is a fundamental challenge in communication systems, as the support of the required joint a posteriori probability (APP) mass function grows exponentially with the number of unknown variables. In this work, we propose a tensor-train (TT) framework for tractable, near-optimal Bayesian inference in discrete-input additive noise models. The central insight is that the joint log-APP mass function admits an exact low-rank representation in the TT format, enabling compact storage and efficient computations. To recover symbol-wise APP marginals, we develop a practical inference procedure that approximates the exponential of the log-posterior using a TT-cross algorithm initialized with a truncated Taylor-series. To demonstrate the generality of the approach, we derive explicit low-rank TT constructions for two canonical communication problems: the linear observation model under additive white Gaussian noise (AWGN), applied to multiple-input multiple-output (MIMO) detection, and soft-decision decoding of binary linear block error correcting codes over the binary-input AWGN channel. Numerical results show near-optimal error-rate performance across a wide range of signal-to-noise ratios while requiring only modest TT ranks. These results highlight the potential of tensor-network methods for efficient Bayesian inference in communication systems.

关键词: tensor-train, Bayesian inference, high-dimensional systems, MIMO detection, channel decoding, additive noise models, APP marginals, communication systems

270. ❌ Weight-Informed Self-Explaining Clustering for Mixed-Type Tabular Data

作者: Lehao Li, Qiang Huang, Yihao Ang, Bryan Kian Hsiang Low, Anthony K. H. Tung, Xiaokui Xiao 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05857v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于混合类型表格数据的聚类方法，提出了WISE框架，包含二进制编码、特征加权和可解释性技术。论文与绝大多数大模型和深度学习技术关键词完全无关，因为这些关键词主要涉及语言模型架构、训练方法、推理优化、对齐技术等。唯一相关的关键词是’Mechanistic Interpretability OR Explainable AI’，因为论文强调可解释性，开发了Discriminative FreqItems（DFI）来提供特征级解释，但这不是大模型特有的可解释性，而是传统机器学习中的可解释性方法，因此给予5分（有一定关联）。其他关键词如AI for Science等虽然涉及科学应用，但论文未明确涉及生物信息学或化学信息学等具体领域。

!!! tip deepseek-chat TL;DR

该论文针对混合类型表格数据聚类中表示不匹配、特征权重不均和解释性差的问题，提出了WISE框架，通过统一表示、特征加权和可解释聚类，在多个真实数据集上实现了优于基线方法的聚类质量和可解释性。

摘要翻译

混合类型表格数据的聚类是探索性分析的基础任务，但由于数值与分类特征的表征错位、特征重要性不均衡且依赖上下文、以及聚类过程与事后解释脱节等问题，该任务仍具挑战性。我们提出WISE框架——一种权重信息自解释框架，它将表征学习、特征加权、聚类与解释统一在一个完全无监督且透明的流程中。WISE引入了带填充的二进制编码（Binary Encoding with Padding, BEP），以在统一的稀疏空间中对齐异构特征；采用留一特征策略（Leave-One-Feature-Out, LOFO）来感知多个高质量且多样化的特征权重视图；并通过两阶段权重感知聚类过程来整合不同的语义划分。为确保内在可解释性，我们进一步提出了判别性频繁项集（Discriminative FreqItems, DFI），该方法可生成从实例到聚类保持一致的、具有可加性分解保证的特征级解释。在六个真实数据集上的大量实验表明，WISE在聚类质量上持续优于经典及神经基线方法，同时保持高效性，并能基于驱动聚类的相同基本要素生成忠实、易于人类理解的解释。

摘要 (Abstract)

Clustering mixed-type tabular data is fundamental for exploratory analysis, yet remains challenging due to misaligned numerical-categorical representations, uneven and context-dependent feature relevance, and disconnected and post-hoc explanation from the clustering process. We propose WISE, a Weight-Informed Self-Explaining framework that unifies representation, feature weighting, clustering, and interpretation in a fully unsupervised and transparent pipeline. WISE introduces Binary Encoding with Padding (BEP) to align heterogeneous features in a unified sparse space, a Leave-One-Feature-Out (LOFO) strategy to sense multiple high-quality and diverse feature-weighting views, and a two-stage weight-aware clustering procedure to aggregate alternative semantic partitions. To ensure intrinsic interpretability, we further develop Discriminative FreqItems (DFI), which yields feature-level explanations that are consistent from instances to clusters with an additive decomposition guarantee. Extensive experiments on six real-world datasets demonstrate that WISE consistently outperforms classical and neural baselines in clustering quality while remaining efficient, and produces faithful, human-interpretable explanations grounded in the same primitives that drive clustering.

关键词: mixed-type tabular data, clustering, feature weighting, interpretability, unsupervised learning, self-explaining framework, binary encoding, discriminative freqitems

271. ❌ JD-BP: A Joint-Decision Generative Framework for Auto-Bidding and Pricing

作者: Linghui Meng, Chun Gan, Shengsheng Niu, Chengcheng Zhang, Chenchen Li, Chuan Yang, Yi Mao, Xin Zhu, Jie He, Zhangang Lin, Ching Law 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05845v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于在线广告自动竞价和定价的联合决策框架，属于特定应用领域的优化算法研究。论文使用了强化学习和生成模型方法，但未涉及大语言模型（LLM）或深度学习技术原理的创新。唯一的相关关键词是’RLHF OR RLAIF OR Direct Preference Optimization OR DPO’，因为论文提到了’Energy-Based Direct Preference Optimization method’，这是DPO的一种变体，但论文将其应用于广告竞价优化而非LLM对齐，因此给予5分（有一定关联）。其他所有关键词均与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为JD-BP的联合生成决策框架，用于解决在线广告自动竞价中的不确定性问题，通过联合输出竞价值和定价修正项，在离线实验中达到最先进性能，并在京东的在线A/B测试中实现了广告收入增加4.70%和目标成本改善6.48%。

摘要翻译

自动出价服务在关键绩效指标（如目标投资回报率与预算）约束下为广告主优化实时竞价策略。然而，模型预测误差与反馈延迟等不确定性因素可能导致出价策略偏离事后最优状态，从而造成分配效率低下。为解决该问题，我们提出JD-BP——一种面向竞价与定价的联合生成式决策框架。与现有方法不同，JD-BP联合输出出价值与定价修正项，该修正项可与广义第二价格等支付规则以加法形式结合。为缓解历史约束违规的负面影响，我们设计了无记忆的“未来回报估计量”，在鼓励出价行为最大化未来价值的同时，通过定价修正处理累积偏差。此外，我们提出轨迹增强算法，可从（可能任意的）基础出价策略生成联合竞价-定价轨迹，使得算法能够基于现有强化学习/生成式出价模型实现高效的即插即用部署。最后，我们采用基于能量的直接偏好优化方法，结合交叉注意力模块以增强竞价与定价修正的联合学习性能。在AuctionNet数据集上的离线实验表明，JD-BP实现了最先进的性能。在京东平台的在线A/B测试验证了其实际有效性，广告收入提升4.70%，目标成本改善6.48%。

摘要 (Abstract)

Auto-bidding services optimize real-time bidding strategies for advertisers under key performance indicator (KPI) constraints such as target return on investment and budget. However, uncertainties such as model prediction errors and feedback latency can cause bidding strategies to deviate from ex-post optimality, leading to inefficient allocation. To address this issue, we propose JD-BP, a Joint generative Decision framework for Bidding and Pricing. Unlike prior methods, JD-BP jointly outputs a bid value and a pricing correction term that acts additively with the payment rule such as GSP. To mitigate adverse effects of historical constraint violations, we design a memory-less Return-to-Go that encourages future value maximizing of bidding actions while the cumulated bias is handled by the pricing correction. Moreover, a trajectory augmentation algorithm is proposed to generate joint bidding-pricing trajectories from a (possibly arbitrary) base bidding policy, enabling efficient plug-and-play deployment of our algorithm from existing RL/generative bidding models. Finally, we employ an Energy-Based Direct Preference Optimization method in conjunction with a cross-attention module to enhance the joint learning performance of bidding and pricing correction. Offline experiments on the AuctionNet dataset demonstrate that JD-BP achieves state-of-the-art performance. Online A/B tests at JD.com confirm its practical effectiveness, showing a 4.70% increase in ad revenue and a 6.48% improvement in target cost.

关键词: Auto-bidding, Pricing, Joint-Decision Framework, Generative Model, Direct Preference Optimization, Reinforcement Learning, Online Advertising, AuctionNet

272. ❌ Modeling Patient Care Trajectories with Transformer Hawkes Processes

作者: Saumya Pandey, Varun Chandola 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05844v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文《Modeling Patient Care Trajectories with Transformer Hawkes Processes》专注于医疗健康领域的时序事件建模，使用Transformer架构结合Hawkes过程来预测患者护理事件。虽然论文涉及Transformer（一种深度学习架构），但研究重点并非大模型（LLM）技术本身，而是特定领域（医疗）的应用。因此，仅与关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（评5分），因为医疗健康可视为科学应用的一个子领域。其他关键词均与大模型技术原理、训练方法、推理优化、代理系统等无关，故评0分。

!!! tip deepseek-chat TL;DR

该研究解决了医疗患者护理轨迹中时序事件建模的挑战，通过结合Transformer和Hawkes过程，并引入不平衡感知训练策略，提高了对罕见临床重要事件的预测性能。

摘要翻译

患者医疗资源利用由一系列时间戳不规则的事件构成，例如门诊就诊、住院治疗和急诊接触，这些事件形成了个体化的诊疗轨迹。对这些轨迹进行建模对于理解医疗资源利用模式和预测未来诊疗需求至关重要，但由于时间不规则性和严重的类别不平衡问题，建模面临挑战。本研究基于Transformer霍克斯过程框架，对连续时间中的患者轨迹进行建模。通过将基于Transformer的历史编码与霍克斯过程动态相结合，该模型能够捕捉事件间的依赖关系，并联合预测事件类型和事件发生时间。针对极端不平衡问题，我们引入了一种基于逆平方根类别权重的不平衡感知训练策略。该方法在不改变数据分布的前提下，提升了对罕见但具有重要临床意义事件的识别敏感度。在真实世界数据上的实验表明，该模型提升了预测性能，并为识别高风险患者群体提供了具有临床意义的见解。

摘要 (Abstract)

Patient healthcare utilization consists of irregularly time-stamped events, such as outpatient visits, inpatient admissions, and emergency encounters, forming individualized care trajectories. Modeling these trajectories is crucial for understanding utilization patterns and predicting future care needs, but is challenging due to temporal irregularity and severe class imbalance. In this work, we build on the Transformer Hawkes Process framework to model patient trajectories in continuous time. By combining Transformer-based history encoding with Hawkes process dynamics, the model captures event dependencies and jointly predicts event type and time-to-event. To address extreme imbalance, we introduce an imbalance-aware training strategy using inverse square-root class weighting. This improves sensitivity to rare but clinically important events without altering the data distribution. Experiments on real-world data demonstrate improved performance and provide clinically meaningful insights for identifying high-risk patient populations.

关键词: patient care trajectories, Transformer Hawkes Process, healthcare utilization, event prediction, class imbalance, clinical insights, temporal irregularity, high-risk patient identification

273. ❌ Expectation Maximization (EM) Converges for General Agnostic Mixtures

作者: Avishek Ghosh 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05842v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是广义无监督混合模型的期望最大化（EM）算法收敛性，属于传统机器学习中的统计学习理论范畴。论文内容聚焦于混合线性回归、混合逻辑回归、混合支持向量机等经典模型的参数估计问题，并分析了梯度EM算法的收敛性。所有评分关键词均与大语言模型、深度学习技术原理创新、大模型在不同领域的应用等主题相关，而本论文完全不涉及这些现代大模型技术，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了在无监督设置下，使用梯度EM算法拟合k个参数化函数（包括混合线性回归、混合逻辑回归等）的收敛性问题，证明了在适当的初始化和分离条件下，该算法能以指数速度收敛到总体损失最小化器。

摘要翻译

混合线性回归在统计学与机器学习领域已得到深入研究，其数据点由$k$个线性模型以概率方式生成。期望最大化（EM）等算法可用于恢复该问题的真实回归器。近期，在文献\cite{pal2022learning,ghosh_agnostic}中，混合线性回归问题在不可知（agnostic）设定下被研究，其中不假设数据的生成模型。相反，给定一组数据点，目标是通过最小化合适的损失函数来\emph{拟合}$k$条直线。研究表明，即使在不可知设定下，改进的EM算法——即梯度EM——能以指数速度收敛到适当定义的损失最小化解。
本文研究将$k$个参数化函数\emph{拟合}到给定数据点集的问题。我们遵循不可知设定，但不再局限于配备二次损失的直线拟合，而是考虑配备强凸且光滑损失函数的任意参数化函数拟合。该框架涵盖包括混合线性回归（正则化）、混合线性分类器（混合逻辑回归、混合支持向量机）和混合广义线性回归在内的广泛问题类别。我们针对该问题提出并分析了梯度EM算法，证明在适当的初始化和分离条件下，梯度EM的迭代以高概率指数收敛到适当定义的总体损失最小化解。这体现了EM类算法在非生成式设定下的有效性，其能收敛至\emph{最优}解，且适用范围超越了混合线性回归问题。

摘要 (Abstract)

Mixture of linear regression is well studied in statistics and machine learning, where the data points are generated probabilistically using $k$ linear models. Algorithms like Expectation Maximization (EM) may be used to recover the ground truth regressors for this problem. Recently, in \cite{pal2022learning,ghosh_agnostic} the mixed linear regression problem is studied in the agnostic setting, where no generative model on data is assumed. Rather, given a set of data points, the objective is \emph{fit} $k$ lines by minimizing a suitable loss function. It is shown that a modification of EM, namely gradient EM converges exponentially to appropriately defined loss minimizer even in the agnostic setting. In this paper, we study the problem of \emph{fitting} $k$ parametric functions to given set of data points. We adhere to the agnostic setup. However, instead of fitting lines equipped with quadratic loss, we consider any arbitrary parametric function fitting equipped with a strongly convex and smooth loss. This framework encompasses a large class of problems including mixed linear regression (regularized), mixed linear classifiers (mixed logistic regression, mixed Support Vector Machines) and mixed generalized linear regression. We propose and analyze gradient EM for this problem and show that with proper initialization and separation condition, the iterates of gradient EM converge exponentially to appropriately defined population loss minimizers with high probability. This shows the effectiveness of EM type algorithm which converges to \emph{optimal} solution in the non-generative setup beyond mixture of linear regression.

关键词: Expectation Maximization, EM, Mixture Models, Agnostic Learning, Gradient EM, Convergence Analysis, Parametric Function Fitting, Non-generative Setup

274. ❌ Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning

作者: Tillmann Rheude, Stefan Hegselmann, Roland Eils, Benjamin Wild 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05834v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究多模态对比学习中的脆弱性问题，并提出门控机制改进方法。论文主题聚焦于多模态学习、对比学习、嵌入对齐和鲁棒性，与评分关键词列表中的大模型技术、训练方法、推理优化、对齐技术、AI科学应用等主题均无直接关联。所有关键词均未在论文标题、摘要或研究内容中出现，因此相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文揭示了多模态对比学习方法Symile在处理不可靠模态时的脆弱性，并提出了一种基于注意力的门控机制Gated Symile，在合成基准和真实数据集上实现了比原方法更高的检索准确率。

摘要 (Abstract)

Multimodal contrastive learning is increasingly enriched by going beyond image-text pairs. Among recent contrastive methods, Symile is a strong approach for this challenge because its multiplicative interaction objective captures higher-order cross-modal dependence. Yet, we find that Symile treats all modalities symmetrically and does not explicitly model reliability differences, a limitation that becomes especially present in trimodal multiplicative interactions. In practice, modalities beyond image-text pairs can be misaligned, weakly informative, or missing, and treating them uniformly can silently degrade performance. This fragility can be hidden in the multiplicative interaction: Symile may outperform pairwise CLIP even if a single unreliable modality silently corrupts the product terms. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions and incorporating an explicit NULL option when reliable cross-modal alignment is unlikely. Across a controlled synthetic benchmark that uncovers this fragility and three real-world trimodal datasets for which such failures could be masked by averages, Gated Symile achieves higher top-1 retrieval accuracy than well-tuned Symile and CLIP models. More broadly, our results highlight gating as a step toward robust multimodal contrastive learning under imperfect and more than two modalities.

关键词: multimodal contrastive learning, multiplicative interaction, fragility, gating mechanism, robustness, trimodal datasets, retrieval accuracy, embedding alignment

275. ❌ Bivariate Causal Discovery Using Rate-Distortion MDL: An Information Dimension Approach

作者: Tiago Brogueira, Mário A. T. Figueiredo 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05829v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于双变量因果发现方法，提出了一种基于率失真最小描述长度（RDMDL）的新方法，属于传统机器学习/统计学习领域。论文内容完全不涉及大模型、深度学习、AI for Science等关键词，所有关键词均与大模型技术、训练方法、推理优化、AI应用等无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于率失真最小描述长度（RDMDL）的新方法来解决双变量因果发现问题，并在Tübingen数据集上展示了具有竞争力的性能。

摘要翻译

基于最小描述长度（MDL）原则的双变量因果发现方法，通过近似计算每个因果方向模型的（不可计算的）柯尔莫哥洛夫复杂度，并选择总复杂度较低的方向作为因果推断结果。其前提是自然机制在其真实因果方向上更为简单。本质上，每个方向的描述长度（复杂度）包括对原因变量的描述以及对因果机制的描述。本文指出，当前最先进的基于MDL的方法未能正确解决原因变量描述长度的估计问题，实际上将决策权留给了因果机制的描述长度。基于率失真理论，我们提出了一种衡量原因描述长度的新方法，该方法对应于达到代表底层分布的失真水平所需的最小速率。该失真水平通过基于直方图的密度估计规则推导得出，而速率则基于渐近近似，利用信息维度的相关概念进行计算。将此方法与传统的因果机制描述方法相结合，我们提出了一种新的双变量因果发现方法，称为率失真最小描述长度（RDMDL）。实验表明，RDMDL在图宾根数据集上取得了具有竞争力的性能。所有代码与实验均已公开于github.com/tiagobrogueira/Causal-Discovery-In-Exchangeable-Data。

摘要 (Abstract)

Approaches to bivariate causal discovery based on the minimum description length (MDL) principle approximate the (uncomputable) Kolmogorov complexity of the models in each causal direction, selecting the one with the lower total complexity. The premise is that nature’s mechanisms are simpler in their true causal order. Inherently, the description length (complexity) in each direction includes the description of the cause variable and that of the causal mechanism. In this work, we argue that current state-of-the-art MDL-based methods do not correctly address the problem of estimating the description length of the cause variable, effectively leaving the decision to the description length of the causal mechanism. Based on rate-distortion theory, we propose a new way to measure the description length of the cause, corresponding to the minimum rate required to achieve a distortion level representative of the underlying distribution. This distortion level is deduced using rules from histogram-based density estimation, while the rate is computed using the related concept of information dimension, based on an asymptotic approximation. Combining it with a traditional approach for the causal mechanism, we introduce a new bivariate causal discovery method, termed rate-distortion MDL (RDMDL). We show experimentally that RDMDL achieves competitive performance on the Tübingen dataset. All the code and experiments are publicly available at github.com/tiagobrogueira/Causal-Discovery-In-Exchangeable-Data.

关键词: causal discovery, minimum description length, rate-distortion theory, bivariate causal inference, information dimension, Kolmogorov complexity, causal mechanism, Tübingen dataset

276. ❌ Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models

作者: Yiyang Zhang, Chaojian Yu, Ziming Hong, Yuanjie Shao, Qinmu Peng, Tongliang Liu, Xinge You 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05809v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究多模态预训练模型的后门攻击，提出了一种文本引导的后门攻击方法。与大多数关键词无关，仅与’Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联（5分），因为论文涉及多模态预训练模型的安全漏洞，属于预训练模型的应用安全研究，但并非技术原理创新。其他关键词主要涉及大模型技术原理、优化方法、应用场景等，与论文的安全攻击主题不相关。

!!! tip deepseek-chat TL;DR

该论文提出了一种针对多模态预训练模型的文本引导后门攻击方法，通过文本触发器和视觉对抗扰动实现隐蔽可调的攻击，在图像检索和视觉问答任务中验证了其有效性和安全性漏洞。

摘要翻译

多模态预训练模型易受后门攻击，但现有方法大多依赖视觉或多模态触发器，由于视觉嵌入触发器在现实数据中极少出现，其实用性受限。为突破这一局限，我们提出一种针对多模态预训练模型的新型文本引导后门攻击方法，该方法以文本描述中的常见词汇作为后门触发器，显著提升了攻击的隐蔽性与实用性。此外，我们在投毒样本中引入视觉对抗扰动，以调控模型对文本触发器的学习过程，从而实现可控可调节的文本引导后门攻击。基于多模态预训练模型的下游任务（包括组合图像检索与视觉问答）上的大量实验表明，文本引导后门攻击在多种现实场景中实现了高实用性与隐蔽性，其攻击成功率可调节，揭示了多模态预训练模型中存在的关键安全漏洞。

摘要 (Abstract)

Multimodal pretrained models are vulnerable to backdoor attacks, yet most existing methods rely on visual or multimodal triggers, which are impractical since visually embedded triggers rarely occur in real-world data. To overcome this limitation, we propose a novel Text-Guided Backdoor (TGB) attack on multimodal pretrained models, where commonly occurring words in textual descriptions serve as backdoor triggers, significantly improving stealthiness and practicality. Furthermore, we introduce visual adversarial perturbations on poisoned samples to modulate the model’s learning of textual triggers, enabling a controllable and adjustable TGB attack. Extensive experiments on downstream tasks built upon multimodal pretrained models, including Composed Image Retrieval (CIR) and Visual Question Answering (VQA), demonstrate that TGB achieves practicality and stealthiness with adjustable attack success rates across diverse realistic settings, revealing critical security vulnerabilities in multimodal pretrained models.

关键词: multimodal pretrained models, backdoor attacks, text-guided backdoor, stealthiness, adversarial perturbations, image retrieval, visual question answering, security vulnerabilities

277. ❌ Brain-to-Speech: Prosody Feature Engineering and Transformer-Based Reconstruction

作者: Mohammed Salah Al-Radhi, Géza Németh, Andon Tchechmedjiev, Binbin Xu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05751v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究脑电信号到语音的合成，属于AI在神经科学/生物医学工程领域的应用。论文使用了transformer架构，但并非大语言模型（LLM）或基础模型，而是专门设计的transformer编码器用于信号处理。所有关键词中，只有’AI for Science OR Bioinformatics OR Cheminformatics’与论文主题有一定关联（AI在科学/生物医学应用），但论文未涉及生物信息学或化学信息学的具体方法，因此给5分（有一定关联）。其他关键词均与大模型技术原理、训练方法、推理优化、智能体等无关，故均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种从颅内脑电图信号合成语音的新方法，通过创新的韵律特征提取和专门设计的transformer编码器架构，显著提高了语音重建的准确性和自然度。

摘要翻译

本章提出了一种基于颅内脑电图（iEEG）数据的大脑到语音（Brain-to-Speech, BTS）合成新方法，重点强调韵律感知的特征工程与基于先进Transformer模型的高保真语音重建。随着从大脑活动直接解码语音的研究日益受到关注，本工作融合神经科学、人工智能与信号处理技术，以生成准确且自然的语音。我们引入了一种新颖的处理流程，可直接从复杂的颅内脑电信号中提取关键韵律特征，包括语调、音高和节奏。为有效利用这些关键特征以生成自然语音，我们采用了先进的深度学习模型。此外，本章介绍了一种专为大脑到语音任务设计的新型Transformer编码器架构。与传统模型不同，该架构整合了提取的韵律特征，显著增强了语音重建效果，使生成的语音在可懂度和表现力上均得到提升。详细的评估表明，在定量指标与感知指标上，本方法均优于现有基线方法（如传统的Griffin-Lim算法和基于CNN的重建方法）。通过展示特征提取和基于Transformer的学习方面的这些进展，本章为人工智能驱动的神经假体这一不断发展的领域做出了贡献，为帮助言语障碍患者恢复沟通的辅助技术铺平了道路。最后，我们探讨了未来有前景的研究方向，包括扩散模型的整合与实时推理系统的开发。

摘要 (Abstract)

This chapter presents a novel approach to brain-to-speech (BTS) synthesis from intracranial electroencephalography (iEEG) data, emphasizing prosody-aware feature engineering and advanced transformer-based models for high-fidelity speech reconstruction. Driven by the increasing interest in decoding speech directly from brain activity, this work integrates neuroscience, artificial intelligence, and signal processing to generate accurate and natural speech. We introduce a novel pipeline for extracting key prosodic features directly from complex brain iEEG signals, including intonation, pitch, and rhythm. To effectively utilize these crucial features for natural-sounding speech, we employ advanced deep learning models. Furthermore, this chapter introduces a novel transformer encoder architecture specifically designed for brain-to-speech tasks. Unlike conventional models, our architecture integrates the extracted prosodic features to significantly enhance speech reconstruction, resulting in generated speech with improved intelligibility and expressiveness. A detailed evaluation demonstrates superior performance over established baseline methods, such as traditional Griffin-Lim and CNN-based reconstruction, across both quantitative and perceptual metrics. By demonstrating these advancements in feature extraction and transformer-based learning, this chapter contributes to the growing field of AI-driven neuroprosthetics, paving the way for assistive technologies that restore communication for individuals with speech impairments. Finally, we discuss promising future research directions, including the integration of diffusion models and real-time inference systems.

关键词: brain-to-speech synthesis, intracranial EEG, prosody feature engineering, transformer encoder, speech reconstruction, neuroprosthetics, deep learning, AI-driven assistive technology

278. ❌ Graph Topology Information Enhanced Heterogeneous Graph Representation Learning

作者: He Zhao, Zhiwei Zeng, Yongwei Wang, Chunyan Miao 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05732v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于异构图表示学习（Heterogeneous Graph Representation Learning）和图结构学习（Graph Structure Learning），属于图神经网络（GNN）领域。所有评分关键词均与大语言模型（LLM）、深度学习技术原理或AI for Science应用直接相关，而本文研究的是图神经网络中的特定问题（异构图结构优化），与LLM、MoE、Scaling Laws、训练技术、推理优化、AI代理、模型压缩等关键词无直接关联。论文未涉及生物信息学或化学信息学等科学AI应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对异构图结构噪声和内存消耗问题，提出了一个图拓扑学习增强的异构图表示学习框架（ToGRL），通过提取任务相关拓扑信息构建新图并利用提示调优，在五个真实数据集上显著优于现有方法。

摘要翻译

现实世界中的异质图本质上存在噪声，且通常不具备适合下游任务的最优图结构，这往往会对图表示学习模型在下游任务中的性能产生不利影响。尽管已有图结构学习方法被提出以同时学习图结构和下游任务，但现有方法主要针对同质图设计，而异质图的图结构学习在很大程度上仍未得到探索。在此背景下，两个挑战随之浮现。首先，与同质图模型相比，输入图结构的质量对基于图神经网络的异质图表示学习模型影响更为深远。其次，大多数现有的同质图表示学习模型在直接应用于异质图时会遇到内存消耗问题。本文提出了一种新颖的图拓扑学习增强型异质图表示学习框架。该框架通过整合任务相关的潜在拓扑信息，为下游任务学习高质量的图结构和表示。具体而言，首先提出一种新颖的图结构学习模块，从原始图结构中提取与下游任务相关的拓扑信息，并将其投影为拓扑嵌入。这些嵌入被用于构建具有平滑图信号的新图。这种两阶段的图结构学习方法将邻接矩阵的优化与节点表示学习分离，从而降低内存消耗。随后，表示学习模块以新构建的图为输入，学习适用于下游任务的嵌入表示。该框架还利用提示调优技术，以更好地利用学习表示中蕴含的知识，从而增强对下游任务的适应能力。在五个真实世界数据集上的大量实验表明，我们的框架以显著优势超越了现有最先进方法。

摘要 (Abstract)

Real-world heterogeneous graphs are inherently noisy and usually not in the optimal graph structures for downstream tasks, which often adversely affects the performance of GRL models in downstream tasks. Although Graph Structure Learning (GSL) methods have been proposed to learn graph structures and downstream tasks simultaneously, existing methods are predominantly designed for homogeneous graphs, while GSL for heterogeneous graphs remains largely unexplored. Two challenges arise in this context. Firstly, the quality of the input graph structure has a more profound impact on GNN-based heterogeneous GRL models compared to their homogeneous counterparts. Secondly, most existing homogenous GRL models encounter memory consumption issues when applied directly to heterogeneous graphs. In this paper, we propose a novel Graph Topology learning Enhanced Heterogeneous Graph Representation Learning framework (ToGRL).ToGRL learns high-quality graph structures and representations for downstream tasks by incorporating task-relevant latent topology information. Specifically, a novel GSL module is first proposed to extract downstream task-related topology information from a raw graph structure and project it into topology embeddings. These embeddings are utilized to construct a new graph with smooth graph signals. This two-stage approach to GSL separates the optimization of the adjacency matrix from node representation learning to reduce memory consumption. Following this, a representation learning module takes the new graph as input to learn embeddings for downstream tasks. ToGRL also leverages prompt tuning to better utilize the knowledge embedded in learned representations, thus enhancing adaptability to downstream tasks. Extensive experiments on five real-world datasets show that our ToGRL outperforms state-of-the-art methods by a large margin.

关键词: Heterogeneous Graph Representation Learning, Graph Structure Learning, Graph Topology Learning, Graph Neural Networks, Memory Consumption, Prompt Tuning, Downstream Tasks

279. ❌ Controllable Image Generation with Composed Parallel Token Prediction

作者: Jamie Stirling, Noura Al-Moubayed, Chris G. Willcocks, Hubert P. H. Shum 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05730v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于条件离散生成模型（特别是基于VQ-VAE/VQ-GAN的图像生成）的组成性问题，提出了一种理论驱动的组合方法，并应用于文本到图像生成的控制。所有评分关键词均与大语言模型（LLM）及其相关技术（如训练、对齐、推理、代理、压缩等）或特定科学领域AI应用直接相关，而本文研究的是计算机视觉中的生成模型，未涉及LLM、深度学习技术原理创新或科学领域应用，因此与所有关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文解决了条件离散生成模型在忠实组合多个输入条件方面的困难，提出了一种理论驱动的组合方法，在多个数据集上实现了错误率相对降低63.4%和FID改进-9.58，同时获得2.3倍至12倍的实时加速，并可应用于预训练的文本到图像模型以实现细粒度控制。

摘要翻译

条件离散生成模型难以忠实地组合多个输入条件。为解决此问题，我们推导出一种基于理论框架的离散概率生成过程组合公式，其中掩码生成（吸收扩散）为其特例。该公式能够精确指定训练数据之外的新型输入条件组合及数量，并通过概念加权实现对单个条件的强调或否定。结合VQ-VAE与VQ-GAN丰富组合性的学习词汇表，我们的方法在三个数据集（位置CLEVR、关系CLEVR和FFHQ）上平均实现了$63.4%$的错误率相对降低，同时获得平均绝对FID改进$-9.58$。此外，本方法相比同类模型实现了$2.3\times$至$12\times$的实时加速，并能直接应用于开源预训练的离散文本到图像模型，实现对文本到图像生成的细粒度控制。

摘要 (Abstract)

Conditional discrete generative models struggle to faithfully compose multiple input conditions. To address this, we derive a theoretically-grounded formulation for composing discrete probabilistic generative processes, with masked generation (absorbing diffusion) as a special case. Our formulation enables precise specification of novel combinations and numbers of input conditions that lie outside the training data, with concept weighting enabling emphasis or negation of individual conditions. In synergy with the richly compositional learned vocabulary of VQ-VAE and VQ-GAN, our method attains a $63.4%$ relative reduction in error rate compared to the previous state-of-the-art, averaged across 3 datasets (positional CLEVR, relational CLEVR and FFHQ), simultaneously obtaining an average absolute FID improvement of $-9.58$. Meanwhile, our method offers a $2.3\times$ to $12\times$ real-time speed-up over comparable methods, and is readily applied to an open pre-trained discrete text-to-image model for fine-grained control of text-to-image generation.

关键词: Controllable Image Generation, Composed Parallel Token Prediction, Conditional Discrete Generative Models, Masked Generation, Absorbing Diffusion, VQ-VAE, VQ-GAN, Text-to-Image Generation

280. ❌ Untargeted analysis of volatile markers of post-exercise fat oxidation in exhaled breath

作者: André Homeyer, Júlia Blanka Sziládi, Jan-Philipp Redlich, Jonathan Beauchamp, Y Lan Pham 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05707v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究运动后脂肪氧化的呼气生物标志物，属于生物医学/运动科学领域，与绝大多数大模型/深度学习技术关键词完全无关。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为该研究涉及生物标志物分析和数据关联，属于科学应用范畴，但论文未明确使用AI方法，仅使用了质谱分析技术，因此给予5分（有一定关联）。其他所有关键词均与大模型技术、训练方法、推理优化、代理系统等无关，评分为0分。

!!! tip deepseek-chat TL;DR

该研究通过呼气挥发性有机化合物分析寻找运动后脂肪氧化的生物标志物，发现丙酮是唯一强相关标志物，且运动期间的丙酮测量能预测运动后的脂肪氧化变化。

摘要翻译

呼气丙酮是一种用于监测运动期间脂肪氧化的有前景的无创生物标志物。然而，其应用受到混杂因素的限制，且其浓度仅在运动后数小时才发生显著变化，这使得实时评估变得困难。我们针对丙酮以外的挥发性有机化合物（VOCs）进行了非靶向筛选，以寻找可作为脂肪氧化标志物的物质，并探究运动期间的呼气测量能否预测运动后脂肪氧化的变化。十九名参与者完成了两次各25分钟的骑行运动，中间间隔5分钟短暂休息。使用质子转移反应飞行时间质谱（PTR-TOF-MS）在运动期间及90分钟恢复期后分析了VOCs排放。血液β-羟基丁酸（BOHB）浓度作为脂肪氧化的参考标志物。在PTR-TOF-MS检测到的773个相关分析特征中，仅四个信号与BOHB表现出强相关性（ρ ≥ 0.82，p = 0.0002）——这些信号均归因于丙酮或其同位素体或碎片。对这些信号的运动结束时测量值能够准确预测运动后BOHB发生显著变化的参与者（F1分数 ≥ 0.83，准确度 = 0.89）。我们的研究未发现任何新型的基于呼气的脂肪氧化生物标志物，但证实了丙酮是关键标志物。此外，我们的研究结果表明，运动期间的呼气丙酮测量或许已能对运动后脂肪氧化进行基本预测。

摘要 (Abstract)

Breath acetone represents a promising non-invasive biomarker for monitoring fat oxidation during exercise. However, its utility is limited by confounding factors, as well as by the fact that significant changes in concentration occur only hours post-exercise, which makes real-time assessment difficult. We performed an untargeted screening for volatile organic compounds (VOCs) that could serve as markers of fat oxidation beyond acetone, and investigated whether breath measurements taken during exercise could predict post-exercise changes in fat oxidation. Nineteen participants completed two 25-min cycling sessions separated by a brief 5-min rest period. VOC emissions were analysed using proton-transfer-reaction time-of-flight mass spectrometry (PTR-TOF-MS) during exercise and after a 90-min recovery period. Blood $β$-hydroxybutyrate (BOHB) concentrations served as the reference marker for fat oxidation. Among 773 relevant analytical features detected in the PTR-TOF-MS measurements, only four signals exhibited strong correlations with BOHB ($ρ$ $\geq$ 0.82, p = 0.0002)-all attributable to acetone or its isotopologues or fragments. End-of-exercise measurements of these signals enabled accurate prediction of participants with substantial post-exercise BOHB changes (F1 score $\geq$ 0.83, accuracy = 0.89). Our study did not reveal any novel breath-based biomarkers of fat oxidation, but it confirmed acetone as the key marker. Moreover, our findings suggest that breath acetone measurements during exercise may already enable basic predictions of post-exercise fat oxidation.

关键词: breath acetone, fat oxidation, volatile organic compounds, biomarker, exercise, PTR-TOF-MS, β-hydroxybutyrate, non-invasive monitoring

281. ❌ Intrinsic perturbation scale for certified oracle objectives with epigraphic information

作者: Karim Bounja, Boujemaâ Achchab, Abdeljalil Sakat 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05678v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是数学优化理论中的扰动分析和极小化集位移控制，属于纯数学优化领域，与所有关键词（均涉及大模型、深度学习、AI技术及其应用）完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于认证外延信息的自然位移控制方法，用于分析具有二次增长条件的预言机目标函数的极小化集扰动，得到了经典的平方根位移估计。

摘要翻译

我们针对具备认证拟图信息的预言机目标函数的最小化集，引入了一种自然的位移控制方法。形式上，我们摒弃了传统目标扰动所需的局部一致值控制——这种控制在缺乏额外结构时无法从有限点态信息中获得认证——转而采用严格更弱的圆柱局部化垂直拟图控制条件，该条件可由认证包络自然提供。在基于集合的二次增长条件下（允许非唯一最小化点），该方法可在不引入任何外部假设的情况下，得到经典的平方根位移估计，并保持最优指数1/2。

摘要 (Abstract)

We introduce a natural displacement control for minimizer sets of oracle objectives equipped with certified epigraphic information. Formally, we replace the usual local uniform value control of objective perturbations - uncertifiable from finite pointwise information without additional structure - by the strictly weaker requirement of a cylinder-localized vertical epigraphic control, naturally provided by certified envelopes. Under set-based quadratic growth (allowing nonunique minimizers), this yields the classical square-root displacement estimate with optimal exponent 1/2, without any extrinsic assumption.

关键词: perturbation analysis, minimizer sets, oracle objectives, certified epigraphic information, displacement control, quadratic growth, square-root estimate

282. ❌ Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert Space

作者: Li Kunpeng, Wan Chenguang, Qu Zhisong, Lim Kyungtak, Virginie Grandgirard, Xavier Garbet, Yu Hua, Ong Yew Soon 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05700v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于湍流场生成的深度生成模型（Functional Optimal Transport Conditional Flow Matching），属于AI for Science（科学AI）领域，与关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（评分5分），因为论文应用深度学习解决科学计算中的湍流建模问题。然而，论文未涉及大语言模型（LLMs）、MoE、SLMs、缩放定律、预训练/后训练、对齐、RLHF、PEFT、RAG、上下文扩展、推理加速、幻觉缓解、可解释性、世界模型、模型合并、上下文学习等具体的大模型技术或应用，因此其他所有关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种在无限维希尔伯特空间中直接定义的生成框架FOT-CFM，用于湍流场生成，解决了传统基于像素的生成模型在湍流计算中的局限性，并在多个混沌动力系统上实现了优于现有方法的高保真度湍流统计和能谱再现。

摘要翻译

高保真湍流建模需要捕捉复杂的时空动力学与多尺度间歇性，这对传统基于知识的系统构成了根本性挑战。尽管深度生成模型（如扩散模型和流匹配）已展现出良好性能，但其本质上受限于离散的、基于像素的特性。这一局限性制约了它们在湍流计算中的应用，因为湍流数据天然以函数形式存在。为弥补这一不足，我们提出了函数最优传输条件流匹配（Functional Optimal Transport Conditional Flow Matching, FOT-CFM），这是一种直接在无限维函数空间中定义的生成框架。与在固定网格上定义的传统方法不同，FOT-CFM将物理场视为无限维希尔伯特空间中的元素，并在概率测度层面直接学习分辨率不变的生成动力学。通过融合最优传输（Optimal Transport, OT）理论，我们在希尔伯特空间中构建了噪声测度与数据测度之间确定性的直线概率路径。该框架实现了免模拟训练，并显著加速了采样过程。我们在多种混沌动力系统上对所提系统进行了严格评估，包括纳维-斯托克斯方程（Navier-Stokes equations）、科莫哥洛夫流（Kolmogorov Flow）和长谷川-若谷方程（Hasegawa-Wakatani equations），这些系统均展现出丰富的多尺度湍流结构。实验结果表明，相较于现有先进基线方法，FOT-CFM在重构高阶湍流统计量和能谱方面具有更优的保真度。

摘要 (Abstract)

High-fidelity modeling of turbulent flows requires capturing complex spatiotemporal dynamics and multi-scale intermittency, posing a fundamental challenge for traditional knowledge-based systems. While deep generative models, such as diffusion models and Flow Matching, have shown promising performance, they are fundamentally constrained by their discrete, pixel-based nature. This limitation restricts their applicability in turbulence computing, where data inherently exists in a functional form. To address this gap, we propose Functional Optimal Transport Conditional Flow Matching (FOT-CFM), a generative framework defined directly in infinite-dimensional function space. Unlike conventional approaches defined on fixed grids, FOT-CFM treats physical fields as elements of an infinite-dimensional Hilbert space, and learns resolution-invariant generative dynamics directly at the level of probability measures. By integrating Optimal Transport (OT) theory, we construct deterministic, straight-line probability paths between noise and data measures in Hilbert space. This formulation enables simulation-free training and significantly accelerates the sampling process. We rigorously evaluate the proposed system on a diverse suite of chaotic dynamical systems, including the Navier-Stokes equations, Kolmogorov Flow, and Hasegawa-Wakatani equations, all of which exhibit rich multi-scale turbulent structures. Experimental results demonstrate that FOT-CFM achieves superior fidelity in reproducing high-order turbulent statistics and energy spectra compared to state-of-the-art baselines.

关键词: turbulent flow generation, functional optimal transport, flow matching, Hilbert space, generative models, Navier-Stokes equations, chaotic dynamical systems, high-fidelity modeling

283. ❌ Efficient machine unlearning with minimax optimality

作者: Jingyi Xie, Linjun Zhang, Sai Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05669v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于机器遗忘（machine unlearning）的统计框架和算法，特别是针对最小二乘损失的情况，提出了Unlearning Least Squares（ULS）方法并证明了其极小极大最优性。论文内容涉及数据移除、模型更新、统计推断等，但完全不涉及大语言模型（LLM）、深度学习技术原理、AI for Science等关键词。所有关键词均与大模型、深度学习、AI应用或相关技术直接相关，而本文研究的是通用机器学习模型（尤其是线性模型）的遗忘问题，与这些关键词无关联。

!!! tip deepseek-chat TL;DR

该论文提出了一种用于机器遗忘的统计框架和Unlearning Least Squares（ULS）方法，在仅需预训练估计器、遗忘样本和少量剩余数据的情况下，实现了对剩余数据模型参数的极小极大最优估计，其性能接近重新训练但数据访问量大幅减少。

摘要翻译

随着《通用数据保护条例》（GDPR）等法规的合规要求日益增长，以及为减少有偏见或损坏数据的影响，对高效数据移除的需求正不断扩大。这推动了机器遗忘领域的发展，其目标是在无需承担完全重新训练成本的前提下，消除特定数据子集的影响。在本研究中，我们提出了一个适用于通用损失函数的机器遗忘统计框架，并建立了理论保证。特别针对平方损失，我们开发了遗忘最小二乘法（Unlearning Least Squares, ULS），并证明了其在仅可获得预训练估计量、待遗忘样本以及剩余数据的一个小子样本时，用于估计剩余数据模型参数的极小极大最优性。我们的结果表明，估计误差可分解为一个理想项和一个由遗忘比例及遗忘模型偏差决定的遗忘成本。我们进一步建立了无需完全重新训练的渐近有效推断程序。数值实验和实际数据应用表明，所提方法在显著减少数据访问需求的同时，能达到接近重新训练的性能。

摘要 (Abstract)

There is a growing demand for efficient data removal to comply with regulations like the GDPR and to mitigate the influence of biased or corrupted data. This has motivated the field of machine unlearning, which aims to eliminate the influence of specific data subsets without the cost of full retraining. In this work, we propose a statistical framework for machine unlearning with generic loss functions and establish theoretical guarantees. For squared loss, especially, we develop Unlearning Least Squares (ULS) and establish its minimax optimality for estimating the model parameter of remaining data when only the pre-trained estimator, forget samples, and a small subsample of the remaining data are available. Our results reveal that the estimation error decomposes into an oracle term and an unlearning cost determined by the forget proportion and the forget model bias. We further establish asymptotically valid inference procedures without requiring full retraining. Numerical experiments and real-data applications demonstrate that the proposed method achieves performance close to retraining while requiring substantially less data access.

关键词: machine unlearning, statistical framework, minimax optimality, Unlearning Least Squares, data removal, forget samples, inference procedures, retraining efficiency

284. ❌ Parametric Nonconvex Optimization via Convex Surrogates

作者: Renzi Wang, Panagiotis Patrinos, Alberto Bemporad 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05640v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是非凸优化问题的凸代理方法，属于数学优化领域，与所有关键词（均涉及大模型、深度学习及其技术原理、应用）完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于学习的方法来构建近似给定参数化非凸优化问题的凸代理问题，并通过非凸路径跟踪问题的数值实验验证了该方法的近似质量。

摘要翻译

本文提出了一种基于学习的新型方法，用于构建近似给定参数化非凸优化问题的代理问题。该代理函数被设计为一组有限函数的最小值，这些函数由凸项与单调项复合而成，从而使代理问题能够通过并行凸优化直接求解。作为概念验证，在一个非凸路径跟踪问题上进行的数值实验证实了所提方法的近似质量。

摘要 (Abstract)

This paper presents a novel learning-based approach to construct a surrogate problem that approximates a given parametric nonconvex optimization problem. The surrogate function is designed to be the minimum of a finite set of functions, given by the composition of convex and monotonic terms, so that the surrogate problem can be solved directly through parallel convex optimization. As a proof of concept, numerical experiments on a nonconvex path tracking problem confirm the approximation quality of the proposed method.

关键词: parametric nonconvex optimization, convex surrogates, learning-based approach, parallel convex optimization, path tracking problem, approximation quality

285. ❌ From Uniform to Learned Knots: A Study of Spline-Based Numerical Encodings for Tabular Deep Learning

作者: Manish Kumar, Anton Frederik Thielmann, Christoph Weisser, Benjamin Säfken 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05635v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究表格深度学习中基于样条的数字编码方法，包括B样条、M样条和I样条，以及均匀、分位数、目标感知和可学习节点放置策略。论文专注于传统的表格深度学习预处理技术，未涉及大语言模型、深度学习技术原理创新或大模型在不同领域的应用。所有关键词均与大模型、深度学习创新技术或科学AI应用相关，而本文研究的是特定于表格数据的传统数值编码方法，与这些关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文研究了表格深度学习中基于样条的数字编码方法，比较了不同样条家族和节点放置策略在不同任务和骨干网络上的性能，发现编码效果取决于任务类型和网络架构，可学习节点策略虽稳定但计算成本较高。

摘要翻译

数值预处理仍然是表格深度学习的重要组成部分，其中连续特征的表示方式会显著影响下游性能。尽管其在经典统计和机器学习模型中的重要性已得到广泛认可，但显式数值预处理在表格深度学习中的作用仍不甚明确。在本研究中，我们聚焦于基于样条的数值编码方法，深入探讨了这一问题。我们研究了三种用于编码数值特征的样条族——B样条、M样条和积分样条（I样条），并考察了均匀节点、基于分位数的节点、目标感知节点以及可学习节点四种节点配置策略。对于可学习节点变体，我们采用了一种可微分的节点参数化方法，使得节点位置能够与主干网络共同进行稳定的端到端优化。我们在多样化的公共回归与分类数据集上，使用MLP、ResNet和FT-Transformer主干网络对这些编码方法进行了评估，并将其与常见的数值预处理基线方法进行了比较。我们的结果表明，数值编码的效果在很大程度上取决于任务、输出维度以及主干网络架构。对于分类任务，分段线性编码（PLE）总体上是最稳健的选择，而基于样条的编码方法仍具有竞争力。对于回归任务，没有单一的编码方法能普遍占优。相反，性能表现取决于样条族、节点配置策略和输出维度，其中MLP和ResNet通常比FT-Transformer获得更大的性能提升。我们进一步发现，在所提出的参数化方案下，可学习节点变体能够被稳定优化，但可能显著增加训练成本，尤其是对于M样条和I样条展开。总体而言，结果表明评估数值编码时，不仅应考虑预测性能，还需考虑其计算开销。

摘要 (Abstract)

Numerical preprocessing remains an important component of tabular deep learning, where the representation of continuous features can strongly affect downstream performance. Although its importance is well established for classical statistical and machine learning models, the role of explicit numerical preprocessing in tabular deep learning remains less well understood. In this work, we study this question with a focus on spline-based numerical encodings. We investigate three spline families for encoding numerical features, namely B-splines, M-splines, and integrated splines (I-splines), under uniform, quantile-based, target-aware, and learnable-knot placement. For the learnable-knot variants, we use a differentiable knot parameterization that enables stable end-to-end optimization of knot locations jointly with the backbone. We evaluate these encodings on a diverse collection of public regression and classification datasets using MLP, ResNet, and FT-Transformer backbones, and compare them against common numerical preprocessing baselines. Our results show that the effect of numerical encodings depends strongly on the task, output size, and backbone. For classification, piecewise-linear encoding (PLE) is the most robust choice overall, while spline-based encodings remain competitive. For regression, no single encoding dominates uniformly. Instead, performance depends on the spline family, knot-placement strategy, and output size, with larger gains typically observed for MLP and ResNet than for FT-Transformer. We further find that learnable-knot variants can be optimized stably under the proposed parameterization, but may substantially increase training cost, especially for M-spline and I-spline expansions. Overall, the results show that numerical encodings should be assessed not only in terms of predictive performance, but also in terms of computational overhead.

关键词: tabular deep learning, numerical preprocessing, spline-based encodings, B-splines, M-splines, I-splines, learnable knots, MLP/ResNet/FT-Transformer

286. ❌ Same Graph, Different Likelihoods: Calibration of Autoregressive Graph Generators via Permutation-Equivalent Encodings

作者: Laurits Fredsgaard, Aaron Thomas, Michael Riis Andersen, Mikkel N. Schmidt, Mahito Sugiyama 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05613v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文研究自回归图生成器的校准问题，通过线性化不确定性（LU）评估不同线性化策略下模型的一致性，并在分子图数据集QM9上验证。论文核心是图生成和模型校准，与大多数关键词（如LLM、MoE、RLHF等）无关。唯一相关的是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文在分子图基准QM9上应用，属于科学AI领域，但并非核心创新点，因此给5分（有一定关联）。其他关键词均未涉及，给0分。加权总分计算为5.0（仅一个关键词得5分，权重1.0）。

!!! tip deepseek-chat TL;DR

该论文研究了自回归图生成器中不同线性化策略导致的似然不一致问题，提出线性化不确定性（LU）作为校准指标，并在分子图数据集QM9上证明LU比负对数似然（NLL）更能可靠评估生成分子的质量。

摘要翻译

自回归图生成器通过序列化构建过程定义似然度，但这些似然度仅当其在同一图的所有线性化序列中保持一致时才有意义。分段欧拉邻域轨迹（Segmented Eulerian Neighborhood Trails, SENT）是一种近期提出的线性化方法，可将图转换为能够被语言模型完美解码并高效处理的序列，但同一图允许存在多个等效线性化序列。我们通过等效线性化序列间变异系数来量化所分配的负对数似然度（Negative Log-Likelihood, NLL）的不一致性，并将其称为线性化不确定性（Linearization Uncertainty, LU）。在两个数据集上使用四种线性化策略训练Transformer模型，我们发现带有偏置的排序在其原生顺序上获得了较低的NLL，但在随机排列下表现出高出两个数量级的预期校准误差（Expected Calibration Error, ECE），这表明这些模型学习的是其训练所用的线性化方式而非图本身的内在结构。在分子图基准QM9上，生成图的NLL与分子稳定性呈负相关（AUC $=0.43$），而LU则达到AUC $=0.85$，这表明基于排列的评估能为生成分子提供更可靠的质量检验。代码发布于https://github.com/lauritsf/linearization-uncertainty。

摘要 (Abstract)

Autoregressive graph generators define likelihoods via a sequential construction process, but these likelihoods are only meaningful if they are consistent across all linearizations of the same graph. Segmented Eulerian Neighborhood Trails (SENT), a recent linearization method, converts graphs into sequences that can be perfectly decoded and efficiently processed by language models, but admit multiple equivalent linearizations of the same graph. We quantify violations in assigned negative log-likelihood (NLL) using the coefficient of variation across equivalent linearizations, which we call Linearization Uncertainty (LU). Training transformers under four linearization strategies on two datasets, we show that biased orderings achieve lower NLL on their native order but exhibit expected calibration error (ECE) two orders of magnitude higher under random permutation, indicating that these models have learned their training linearization rather than the underlying graph. On the molecular graph benchmark QM9, NLL for generated graphs is negatively correlated with molecular stability (AUC $=0.43$), while LU achieves AUC $=0.85$, suggesting that permutation-based evaluation provides a more reliable quality check for generated molecules. Code is available at https://github.com/lauritsf/linearization-uncertainty

关键词: autoregressive graph generators, linearization uncertainty, calibration, negative log-likelihood, molecular graphs, QM9, transformers, permutation-equivalent encodings

287. ❌ Active noise cancellation on open-ear smart glasses

作者: Kuang Yuan, Freddy Yifei Liu, Tong Xiao, Yiwen Song, Chengyi Shen, Saksham Bhutani, Justin Chan, Swarun Kumar 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05519v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是智能眼镜上的开放式主动降噪系统，属于硬件系统、信号处理和嵌入式计算领域，与所有评分关键词（均涉及大模型、深度学习技术原理、AI科学应用等）完全无关。论文未涉及任何语言模型、模型训练、推理优化、AI代理或科学AI应用等内容。

!!! tip deepseek-chat TL;DR

该论文解决了开放式智能眼镜在嘈杂环境中无法使用传统主动降噪技术的问题，通过开发一个基于多麦克风阵列和低延迟计算管线的实时降噪系统，在100-1000Hz频率范围内实现了平均9.6-11.2dB的噪声降低。

摘要翻译

智能眼镜正日益成为普及的可穿戴平台，音频是其关键交互模态。然而，在嘈杂环境中实现清晰听觉仍具挑战性，因为智能眼镜配备的开放式扬声器不封闭耳道。此外，开放式设计与传统主动降噪技术不相容——传统技术依赖耳道内或入口处的误差麦克风来测量降噪后的残余声音。本文首次提出适用于开放式智能眼镜的实时主动降噪系统，该系统仅利用嵌入眼镜框架的麦克风阵列和微型开放式扬声器来抑制环境噪声。我们的低延迟计算流程通过分布在眼镜框架周围的八个麦克风阵列预估到达人耳的噪声，并实时生成反相声波以抵消环境噪声。我们开发了定制眼镜原型，并在100-1000赫兹（环境噪声集中频段）范围内，通过移动状态下的8种场景用户研究进行评估。结果显示，未经校准时平均降噪量为9.6分贝，经过简短的用户个性化校准后可达11.2分贝。

摘要 (Abstract)

Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the open-ear design is incompatible with conventional active noise cancellation (ANC) techniques, which rely on an error microphone inside or at the entrance of the ear canal to measure the residual sound heard after cancellation. Here we present the first real-time ANC system for open-ear smart glasses that suppresses environmental noise using only microphones and miniaturized open-ear speakers embedded in the glasses frame. Our low-latency computational pipeline estimates the noise at the ear from an array of eight microphones distributed around the glasses frame and generates an anti-noise signal in real-time to cancel environmental noise. We develop a custom glasses prototype and evaluate it in a user study across 8 environments under mobility in the 100–1000 Hz frequency range, where environmental noise is concentrated. We achieve a mean noise reduction of 9.6 dB without any calibration, and 11.2 dB with a brief user-specific calibration.

关键词: active noise cancellation, open-ear smart glasses, real-time ANC system, microphone array, low-latency computational pipeline, noise reduction, wearable platform, environmental noise suppression

288. ❌ Optimal Centered Active Excitation in Linear System Identification

作者: Kaito Ito, Alexandre Proutiere 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05518v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究线性系统辨识中的主动学习算法，属于经典控制理论/系统辨识领域，与所有大模型、深度学习、AI科学应用等关键词完全无关。论文未涉及任何神经网络、语言模型、AI技术或科学AI应用。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于最优中心噪声激励的线性系统辨识主动学习算法，通过普通最小二乘和半定规划实现了最小样本复杂度，并建立了匹配理论下界的紧致样本复杂度界限。

摘要翻译

我们提出一种用于线性系统辨识的主动学习算法，该算法采用最优中心噪声激励。值得注意的是，我们的算法基于普通最小二乘法和半定规划，在能够高效计算系统矩阵估计的同时，达到了最小的样本复杂度。具体而言，我们首先建立了任何主动学习算法达到预设精度与置信水平所需样本复杂度的下界。接着，我们推导了所提出算法的样本复杂度上界，该上界与任意算法的下界在通用常数因子内相匹配。我们得到的紧界易于解释，并明确显示了其对于状态维度等系统参数的依赖关系。

摘要 (Abstract)

We propose an active learning algorithm for linear system identification with optimal centered noise excitation. Notably, our algorithm, based on ordinary least squares and semidefinite programming, attains the minimal sample complexity while allowing for efficient computation of an estimate of a system matrix. More specifically, we first establish lower bounds of the sample complexity for any active learning algorithm to attain the prescribed accuracy and confidence levels. Next, we derive a sample complexity upper bound of the proposed algorithm, which matches the lower bound for any algorithm up to universal factors. Our tight bounds are easy to interpret and explicitly show their dependence on the system parameters such as the state dimension.

关键词: linear system identification, active learning, optimal excitation, sample complexity, ordinary least squares, semidefinite programming, lower bounds, upper bounds

289. ❌ AttnDiff: Attention-based Differential Fingerprinting for Large Language Models

作者: Haobo Zhang, Zhenhua Xu, Junxian Li, Shangfeng Sheng, Dezhang Kong, Meng Han 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05502v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	5.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文AttnDiff专注于大语言模型（LLMs）的知识产权保护，通过分析注意力模式来识别模型衍生关系。核心高度相关于’Large Language Models’（10分），因为论文完全围绕LLMs展开。与’Post-training’（5分）相关，因为涉及微调后的模型验证；与’RLHF/DPO’（5分）相关，因为提及DPO作为可能的清洗操作；与’Quantization/Model Compression’（5分）相关，因为提及压缩作为清洗操作；与’Mechanistic Interpretability’（5分）相关，因为使用注意力模式进行内部行为分析；与’Model Merging’（5分）相关，因为提及模型合并作为清洗操作。其他关键词如MoE、SLMs、Scaling Laws等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于注意力差异的指纹识别方法AttnDiff，用于验证开放权重大语言模型是否衍生自受害者模型，即使在经过微调、压缩或合并等清洗操作后，也能有效区分相关模型与不相关模型。

摘要翻译

保护开源权重大型语言模型（LLM）的知识产权，需要验证嫌疑模型是否源自受害模型，即使其经过常见的“洗白”操作，如微调（包括PPO/DPO）、剪枝/压缩和模型合并。我们提出 \textsc{AttnDiff}，一种数据高效的白盒框架，通过模型内在的信息路由行为提取指纹。\textsc{AttnDiff} 使用经过最小编辑的提示词对进行探测，这些提示词对会引发受控的语义冲突，捕获差异化的注意力模式，使用紧凑的谱描述符对其进行总结，并利用CKA（中心核对齐）进行模型比较。在Llama-2/3、Qwen2.5（3B–14B）以及其他开源模型系列上的实验表明，该方法能对相关衍生模型给出高相似度，同时有效区分不相关的模型系列（例如，使用 $M=60$ 个探针时，相似度 $>0.98$ 对比 $<0.22$）。仅需5至60个多领域探针，该方法即可支持实际的来源验证与责任认定。

摘要 (Abstract)

Protecting the intellectual property of open-weight large language models (LLMs) requires verifying whether a suspect model is derived from a victim model despite common laundering operations such as fine-tuning (including PPO/DPO), pruning/compression, and model merging. We propose \textsc{AttnDiff}, a data-efficient white-box framework that extracts fingerprints from models via intrinsic information-routing behavior. \textsc{AttnDiff} probes minimally edited prompt pairs that induce controlled semantic conflicts, captures differential attention patterns, summarizes them with compact spectral descriptors, and compares models using CKA. Across Llama-2/3 and Qwen2.5 (3B–14B) and additional open-source families, it yields high similarity for related derivatives while separating unrelated model families (e.g., $>0.98$ vs.\ $<0.22$ with $M=60$ probes). With 5–60 multi-domain probes, it supports practical provenance verification and accountability.

关键词: Large Language Models, Intellectual Property Protection, Attention Patterns, Model Fingerprinting, Provenance Verification, Fine-tuning, Model Compression, Model Merging

290. ❌ Transcriptomic Models for Immunotherapy Response Prediction Show Limited Cross-cohort Generalisability

作者: Yuheng Liang, Lucy Chuo, Ahmadreza Argha, Nona Farbehi, Lu Chen, Roohallah Alizadehsani, Mehdi Hosseinzadeh, Amin Beheshti, Thantrira Porntaveetusm, Youqiong Ye, Hamid Alinejad-Rokny 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05478v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	2.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文研究的是基于转录组学的免疫检查点抑制剂（ICI）反应预测模型，属于生物信息学（Bioinformatics）和AI for Science领域，与关键词’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（8分）。论文提到’domain adaptation’作为未来改进方向，与’Pre-training OR Continual Pre-training OR Domain Adaptation’有微弱关联（2分）。其他关键词均涉及大模型、深度学习技术原理或具体方法（如LLM、MoE、RLHF、RAG等），而本文专注于传统的机器学习/深度学习模型在生物医学领域的应用，未涉及这些前沿大模型技术，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文系统评估了九种基于转录组学的免疫检查点抑制剂反应预测模型，发现它们在跨队列泛化性和生物学一致性方面表现有限，强调了改进领域适应和标准化预处理的必要性。

摘要翻译

免疫检查点抑制剂（ICIs）已革新了癌症治疗；然而，相当一部分患者表现出内在或获得性耐药，这使得在治疗前准确预测疗效成为一项亟待满足的关键需求。基于转录组学的生物标志物，源自整体RNA测序（bulk RNA-seq）和单细胞RNA测序（scRNA-seq），为捕捉肿瘤-免疫相互作用提供了有前景的途径，但现有预测模型在不同队列间的泛化能力尚不明确。我们系统性地评估了九种先进的转录组学ICI疗效预测模型，包括五种基于整体RNA-seq的模型（COMPASS、IRNet、NetBio、IKCScore和TNBC-ICI）和四种基于单细胞RNA-seq的模型（PRECISE、DeepGeneX、Tres和scCURE），所使用的公开独立数据集在模型开发过程中均未接触过。总体而言，预测性能较为有限：整体RNA-seq模型在大多数队列中的表现处于或接近随机水平，而单细胞RNA-seq模型仅显示出微弱的改进。通路水平分析显示，各模型间的生物标志物信号稀疏且不一致。尽管基于单细胞RNA-seq的预测模型在免疫相关程序（如移植排斥）上表现出一定趋同性，但基于整体RNA-seq的模型之间可重复的重叠信号极少。PRECISE和NetBio识别出了最具一致性的免疫相关主题，而IRNet主要捕捉到与ICI生物学关联较弱的代谢通路。综上所述，这些研究结果表明，当前转录组学ICI预测模型在不同队列间的稳健性和生物学一致性有限，凸显了改进领域适应性、标准化预处理以及基于生物学的模型设计的必要性。

摘要 (Abstract)

Immune checkpoint inhibitors (ICIs) have transformed cancer therapy; yet substantial proportion of patients exhibit intrinsic or acquired resistance, making accurate pre-treatment response prediction a critical unmet need. Transcriptomics-based biomarkers derived from bulk and single-cell RNA sequencing (scRNA-seq) offer a promising avenue for capturing tumour-immune interactions, yet the cross-cohort generalisability of existing prediction models remains unclear.We systematically benchmark nine state-of-the-art transcriptomic ICI response predictors, five bulk RNA-seq-based models (COMPASS, IRNet, NetBio, IKCScore, and TNBC-ICI) and four scRNA-seq-based models (PRECISE, DeepGeneX, Tres and scCURE), using publicly available independent datasets unseen during model development. Overall, predictive performance was modest: bulk RNA-seq models performed at or near chance level across most cohorts, while scRNA-seq models showed only marginal improvements. Pathway-level analyses revealed sparse and inconsistent biomarker signals across models. Although scRNA-seq-based predictors converged on immune-related programs such as allograft rejection, bulk RNA-seq-based models exhibited little reproducible overlap. PRECISE and NetBio identified the most coherent immune-related themes, whereas IRNet predominantly captured metabolic pathways weakly aligned with ICI biology. Together, these findings demonstrate the limited cross-cohort robustness and biological consistency of current transcriptomic ICI prediction models, underscoring the need for improved domain adaptation, standardised preprocessing, and biologically grounded model design.

关键词: transcriptomic models, immunotherapy response prediction, cross-cohort generalisability, immune checkpoint inhibitors, RNA sequencing, biomarkers, domain adaptation, bioinformatics

291. ❌ Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game

作者: Tõnis Lees, Tambet Matiisen 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05476v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究将AlphaZero强化学习算法应用于非对称棋盘游戏Tablut，核心内容涉及强化学习、自对弈、神经网络架构修改（分离的策略/价值头）和训练稳定性技术（如数据增强、回放缓冲区）。所有评分关键词均与大语言模型（LLMs）、深度学习技术原理创新或科学AI应用相关，而本文专注于经典强化学习算法在特定游戏上的工程实现与改进，未涉及任何大模型、深度学习新技术或科学领域AI应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文研究了如何修改AlphaZero的架构（使用分离的策略/价值头）并应用稳定化技术（如C4数据增强），以成功将其自对弈强化学习框架迁移到高度非对称的棋盘游戏Tablut，最终模型在100次自对弈迭代中表现出稳定改进，达到了1235的BayesElo评分。

摘要翻译

本研究探讨了将AlphaZero强化学习算法应用于塔布勒特（Tablut）这一非对称历史棋盘游戏的适应性。该游戏具有棋子数量不均等及玩家目标差异（一方需俘获国王，另一方需协助国王逃脱）的特点。尽管原始AlphaZero架构在对称游戏中成功利用单一策略头与价值头进行学习，但将其应用于非对称环境会迫使网络学习两套相互冲突的评估函数，从而可能降低学习效率与性能。为解决此问题，本研究对核心架构进行了修改，为每个玩家角色分别设置独立的策略头与价值头，同时保留共享的残差主干网络以学习棋盘通用特征。在训练过程中，非对称结构引发了训练不稳定性，尤其是攻击方与防守方角色之间出现的灾难性遗忘现象。通过采用C4数据增强、扩大回放缓冲区容量，并安排模型在25%的训练对局中与随机抽取的历史检查点对战，这些不稳定问题得到了有效缓解。经过超过100轮自我对弈迭代，改进后的模型表现出稳定的进步，相对于随机初始化的基线模型获得了1235的贝叶斯Elo等级分。训练指标亦显示策略熵与平均剩余棋子数显著下降，反映出对局策略日趋集中且决策更为果断。最终，实验证实：只要采用独立的策略/价值头架构并辅以稳健的稳定化技术，AlphaZero的自我对弈框架能够有效迁移至高度非对称的游戏中。

摘要 (Abstract)

This work investigates the adaptation of the AlphaZero reinforcement learning algorithm to Tablut, an asymmetric historical board game featuring unequal piece counts and distinct player objectives (king capture versus king escape). While the original AlphaZero architecture successfully leverages a single policy and value head for symmetric games, applying it to asymmetric environments forces the network to learn two conflicting evaluation functions, which can hinder learning efficiency and performance. To address this, the core architecture is modified to use separate policy and value heads for each player role, while maintaining a shared residual trunk to learn common board features. During training, the asymmetric structure introduced training instabilities, notably catastrophic forgetting between the attacker and defender roles. These issues were mitigated by applying C4 data augmentation, increasing the replay buffer size, and having the model play 25 percent of training games against randomly sampled past checkpoints. Over 100 self-play iterations, the modified model demonstrated steady improvement, achieving a BayesElo rating of 1235 relative to a randomly initialized baseline. Training metrics also showed a significant decrease in policy entropy and average remaining pieces, reflecting increasingly focused and decisive play. Ultimately, the experiments confirm that AlphaZero’s self-play framework can transfer to highly asymmetric games, provided that distinct policy/value heads and robust stabilization techniques are employed.

关键词: AlphaZero, reinforcement learning, self-play, asymmetric board game, Tablut, policy-value heads, training stabilization, BayesElo rating

292. ❌ Task Ecologies and the Evolution of World-Tracking Representations in Large Language Models

作者: Giulio Valentino Dalla Riva 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05469v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究语言模型作为演化模型生物体，分析自回归下一个词学习何时选择世界追踪表示。高度相关关键词：1）Large Language Models（论文明确研究语言模型，包括transformer架构）；2）Small Language Models（论文明确使用小语言模型作为实验室生物体进行理论研究）；3）World Models（论文核心研究世界追踪表示和世界模型概念）。中等相关关键词：1）Mixture of Experts（论文提到冻结的MoE transformer满足分析条件）；2）Post-training（论文讨论后训练如何恢复差距区分）；3）Mechanistic Interpretability（论文涉及表示选择的理论分析）；4）In-context Learning（论文分析in-context learning不扩大模型的分离集）。其他关键词与论文内容无直接关联。

!!! tip deepseek-chat TL;DR

该论文研究语言模型作为演化模型生物体，分析自回归下一个词学习何时选择世界追踪表示，提出了生态真实性的精确概念，并通过理论分析和实验验证了静态分解、分裂合并阈值、生态外失败模式和双生态救援机制。

摘要翻译

我们将语言模型视为演化中的模型生物体，探讨自回归下一词元学习何时会选择世界追踪表征。对于潜在世界状态的任意编码，贝叶斯最优的下一词元交叉熵可分解为不可约的条件熵与一个詹森-香农超额项之和。当且仅当编码保持训练生态系统的等价类时，该超额项才会消失。这为语言模型提出了一个精确的生态真实性概念，并将最小复杂度的零超额解识别为训练等价类导出的商划分。随后，我们确定了这一固定编码分析何时适用于Transformer架构家族：冻结的稠密Transformer与冻结的混合专家（Mixture-of-Experts）Transformer满足条件，上下文学习不会扩大模型的分离集，而逐任务适应则会破坏前提。该框架预测了两种典型的失效模式：简洁性压力会优先消除低增益区分度，且训练最优模型在细化训练生态的部署生态中仍可能产生正超额。条件动态扩展表明，在明确的遗传、变异和选择假设下，模型间选择与训练后调整如何恢复此类差距区分度。通过精确的有限生态检验与受控微型GPT实验，我们在相关量可直接观测的机制中验证了静态分解、分裂-合并阈值、生态外失效模式以及双生态挽救机制。本研究的目标并非大规模建模前沿系统，而是将小型语言模型作为实验室生物体，用于研究表征选择的理论。

摘要 (Abstract)

We study language models as evolving model organisms and ask when autoregressive next-token learning selects for world-tracking representations. For any encoding of latent world states, the Bayes-optimal next-token cross-entropy decomposes into the irreducible conditional entropy plus a Jensen–Shannon excess term. That excess vanishes if and only if the encoding preserves the training ecology’s equivalence classes. This yields a precise notion of ecological veridicality for language models and identifies the minimum-complexity zero-excess solution as the quotient partition by training equivalence. We then determine when this fixed-encoding analysis applies to transformer families: frozen dense and frozen Mixture-of-Experts transformers satisfy it, in-context learning does not enlarge the model’s separation set, and per-task adaptation breaks the premise. The framework predicts two characteristic failure modes: simplicity pressure preferentially removes low-gain distinctions, and training-optimal models can still incur positive excess on deployment ecologies that refine the training ecology. A conditional dynamic extension shows how inter-model selection and post-training can recover such gap distinctions under explicit heredity, variation, and selection assumptions. Exact finite-ecology checks and controlled microgpt experiments validate the static decomposition, split-merge threshold, off-ecology failure pattern, and two-ecology rescue mechanism in a regime where the relevant quantities are directly observable. The goal is not to model frontier systems at scale, but to use small language models as laboratory organisms for theory about representational selection.

关键词: language models, world-tracking representations, ecological veridicality, autoregressive learning, transformer architectures, small language models, representational selection, task ecologies

293. ❌ Hierarchical Contrastive Learning for Multimodal Data

作者: Huichao Li, Junhan Yu, Doudou Zhou 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05462v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文专注于多模态表示学习，提出了一种分层对比学习框架（HCL），用于学习全局共享、部分共享和模态特定的表示。论文的核心是表示学习方法论，而非大模型技术。所有关键词（共27个）中，只有’AI for Science OR Bioinformatics OR Cheminformatics’与论文有一定关联，因为论文在电子健康记录（EHR）数据集上进行了实验，这属于生物信息学/科学AI的应用范畴，但论文本身并非主要关于AI for Science的方法论创新，而是将提出的通用表示学习方法应用于该领域。其他26个关键词均与大模型、训练技术、推理优化、代理系统等具体技术直接相关，而本文未涉及任何大模型、LLM相关技术或上述其他具体技术点。

!!! tip deepseek-chat TL;DR

该论文针对多模态表示学习中传统二元共享-私有分解的不足，提出了分层对比学习（HCL）框架，以学习全局共享、部分共享和模态特定的分层表示，并在理论上证明了其可识别性，在电子健康记录数据上验证了其能学习更具信息量的表示并提升预测性能。

摘要翻译

多模态表征学习通常建立在共享-私有分解框架上，将潜在信息视为所有模态共有或单一模态特有。这种二元视角往往存在不足：许多因素仅由部分模态共享，忽略这种部分共享可能导致无关信号的过度对齐并掩盖互补信息。我们提出层次化对比学习（Hierarchical Contrastive Learning, HCL），该框架能够在统一模型中学习全局共享、部分共享和模态特定的表征。HCL将层次化潜变量建模与结构化稀疏性相结合，并采用结构感知的对比目标，仅对齐真正共享潜在因子的模态。在潜变量无关的假设下，我们证明了层次分解的可识别性，建立了因子载荷矩阵的恢复保证，并推导了下游预测任务的参数估计与超额风险界。仿真实验表明该方法能准确恢复层次结构并有效选择任务相关成分。在多模态电子健康记录数据上，HCL能生成信息更丰富的表征，并持续提升预测性能。

摘要 (Abstract)

Multimodal representation learning is commonly built on a shared-private decomposition, treating latent information as either common to all modalities or specific to one. This binary view is often inadequate: many factors are shared by only subsets of modalities, and ignoring such partial sharing can over-align unrelated signals and obscure complementary information. We propose Hierarchical Contrastive Learning (HCL), a framework that learns globally shared, partially shared, and modality-specific representations within a unified model. HCL combines a hierarchical latent-variable formulation with structural sparsity and a structure-aware contrastive objective that aligns only modalities that genuinely share a latent factor. Under uncorrelated latent variables, we prove identifiability of the hierarchical decomposition, establish recovery guarantees for the loading matrices, and derive parameter estimation and excess-risk bounds for downstream prediction. Simulations show accurate recovery of hierarchical structure and effective selection of task-relevant components. On multimodal electronic health records, HCL yields more informative representations and consistently improves predictive performance.

关键词: Multimodal Representation Learning, Hierarchical Contrastive Learning, Shared-private Decomposition, Partial Sharing, Latent Variable Model, Electronic Health Records, Predictive Performance, Identifiability

294. ❌ MEC: Machine-Learning-Assisted Generalized Entropy Calibration for Semi-Supervised Mean Estimation

作者: Se Yoon Lee, Jae Kwang Kim 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05446v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是半监督统计推断方法（MEC），属于传统机器学习统计方法领域，专注于预测校准、样本加权和置信区间构建。所有关键词都涉及大模型、深度学习、AI应用或相关技术（如MoE、RLHF、RAG等），而本文完全不涉及这些内容，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为MEC的机器学习辅助广义熵校准方法，用于改进半监督均值估计的预测驱动推断，在模型误设下提高了效率并实现了接近名义覆盖率的置信区间。

摘要翻译

获取高质量标注数据的成本高昂，而未标注的协变量往往十分丰富，这推动了具有可靠不确定性量化的半监督推断方法的发展。预测驱动推断（PPI）利用在少量标注样本上训练的机器学习预测器来提升效率，但在模型设定错误时可能损失效率，并因标签复用而导致覆盖范围失真。我们提出了机器学习辅助的广义熵校准（MEC），这是一种交叉拟合、校准加权的PPI变体。MEC通过基于Bregman投影的原理性校准框架，对标注样本进行重新加权以更好地与目标总体对齐，从而提升效率。该方法对预测器的仿射变换具有鲁棒性，并通过用更弱的投影误差条件替代原始预测误差条件，放宽了对有效性的要求。因此，MEC在比现有PPI变体更弱的假设下达到了半参数效率界。在模拟和实际数据应用中，与CF-PPI及原始PPI相比，MEC实现了接近名义水平的覆盖概率和更紧密的置信区间。

摘要 (Abstract)

Obtaining high-quality labels is costly, whereas unlabeled covariates are often abundant, motivating semi-supervised inference methods with reliable uncertainty quantification. Prediction-powered inference (PPI) leverages a machine-learning predictor trained on a small labeled sample to improve efficiency, but it can lose efficiency under model misspecification and suffer from coverage distortions due to label reuse. We introduce Machine-Learning-Assisted Generalized Entropy Calibration (MEC), a cross-fitted, calibration-weighted variant of PPI. MEC improves efficiency by reweighting labeled samples to better align with the target population, using a principled calibration framework based on Bregman projections. This yields robustness to affine transformations of the predictor and relaxes requirements for validity by replacing conditions on raw prediction error with weaker projection-error conditions. As a result, MEC attains the semiparametric efficiency bound under weaker assumptions than existing PPI variants. Across simulations and a real-data application, MEC achieves near-nominal coverage and tighter confidence intervals than CF-PPI and vanilla PPI.

关键词: semi-supervised inference, prediction-powered inference, calibration-weighted, Bregman projections, efficiency bound, confidence intervals, model misspecification, label reuse

295. ❌ ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

作者: Jingwei Zuo, Xinze Feng, Zien Liu, Kaijian Wang, Fanjiang Ye, Ye Cao, Zhuang Wang, Yuke Wang 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05426v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文ALTO专注于LoRA（Low-Rank Adaptation）超参数调优的系统设计，与’PEFT OR LoRA OR Parameter-efficient Fine-tuning’高度相关（15分），因为LoRA是核心研究对象。与’Large Language Models OR LLMs OR Foundation Models’和’Post-training OR Supervised Fine-tuning OR SFT’相关（各10分），因为论文涉及大语言模型的微调。其他关键词如MoE、SLMs、RAG、量化等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

ALTO提出了一种自适应LoRA调优和编排系统，通过早期终止弱配置、融合计算和任务调度，在异构任务中加速LoRA超参数调优，实现高达13.8倍的加速而不牺牲适配器质量。

摘要翻译

低秩自适应（Low-Rank Adaptation，LoRA）是目前大语言模型参数高效微调的主流方法，但获得高质量的适配器通常需要进行系统的超参数调优，因为LoRA的性能对配置选择极为敏感。实践中，这导致大量并发的LoRA任务，且在多租户环境中常涉及异构任务。现有系统大多独立处理这些任务，既浪费计算资源于低潜力候选配置，又导致GPU利用率不足。本文提出ALTO（自适应LoRA调优与编排系统），这是一个协同设计的训练系统，能够加速LoRA超参数调优，同时实现异构任务间的高效集群共享。ALTO的核心洞见在于：当多个调优任务在共享的冻结主干模型上并发运行时，它们会暴露出单任务设计无法利用的优化机会。基于此，ALTO通过监控损失轨迹以提前终止无潜力的配置，采用分组融合通用矩阵乘法（fused grouped GEMM）与新型秩局部适配器并行策略，将存留的适配器共置并回收释放的GPU算力，并结合任务内与任务间调度机制，利用LoRA任务可预测的持续时间优化多任务部署。大量实验表明，ALTO在保证适配器质量的前提下，相比现有最优方法实现了高达13.8倍的加速。

摘要 (Abstract)

Low-Rank Adaptation (LoRA) is now the dominant method for parameter-efficient fine-tuning of large language models, but achieving a high-quality adapter often requires systematic hyperparameter tuning because LoRA performance is highly sensitive to configuration choices. In practice, this leads to many concurrent LoRA jobs, often spanning heterogeneous tasks in multi-tenant environments. Existing systems largely handle these jobs independently, which both wastes computation on weak candidates and leaves GPUs underutilized. We present ALTO (Adaptive LoRA Tuning and Orchestration), a co-designed training system that accelerates LoRA hyperparameter tuning while enabling efficient cluster sharing across heterogeneous tasks. The central insight behind ALTO is that when multiple tuning jobs run concurrently over a shared frozen backbone, they expose optimization opportunities that single-job designs cannot exploit. Building on this, ALTO monitors loss trajectories to terminate unpromising configurations early, uses fused grouped GEMM together with a new rank-local adapter parallelism to co-locate surviving adapters and reclaim freed GPU capacity, and combines intra-task and inter-task scheduling to improve multi-task placement by leveraging the predictable duration of LoRA jobs. Extensive evaluation shows that ALTO achieves up to $13.8\times$ speedup over state-of-the-art without sacrificing adapter quality.

关键词: LoRA, Parameter-efficient Fine-tuning, Hyperparameter Tuning, Training System, GPU Utilization, Multi-task Scheduling, Adapter Parallelism, Large Language Models

296. ❌ An Actor-Critic Framework for Continuous-Time Jump-Diffusion Controls with Normalizing Flows

作者: Liya Guo, Ruimeng Hu, Xu Yang, Yi Zhu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05398v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于连续时间跳跃扩散控制的强化学习框架，使用归一化流进行策略参数化，属于强化学习、随机控制和金融应用领域。与提供的关键词列表（主要围绕大语言模型技术及其应用）基本无关。唯一有微弱关联的是’Multi-agent Systems OR Agent Coordination’（评5分），因为论文提到了多智能体投资组合博弈，但这并非论文核心，且与LLM智能体无关。其他所有关键词（如LLM、MoE、Scaling Laws、微调、对齐、RAG、推理、压缩等）均未涉及。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于演员-评论家框架和条件归一化流的连续时间跳跃扩散控制求解方法，用于金融中的最优策略计算，并在投资组合优化和多智能体博弈中验证了其有效性。

摘要翻译

具有时间非齐次跳跃扩散动态的连续时间随机控制是金融与经济学的核心问题，但在显式时间依赖性、不连续冲击和高维度的条件下，计算最优策略十分困难。我们提出一种演员-评论家框架，作为带跳跃的熵正则化控制问题与随机博弈的无网格求解器。该方法建立在时间非齐次小q函数（little q-function）与恰当的占据测度之上，导出了一个能够兼容时变漂移项、波动率项与跳跃项的策略梯度表示。为了在连续动作空间中表达具有表现力的随机策略，我们使用条件标准化流（conditional normalizing flows）对演员网络进行参数化，从而在保留熵正则化与策略优化所需精确似然评估的同时，实现灵活的非高斯策略。我们在时间非齐次线性二次控制、默顿投资组合优化以及多智能体投资组合博弈问题上验证了该方法，并采用解析解或高精度基准进行对比。数值结果表明，该方法在跳跃不连续性下学习稳定，能准确逼近最优随机策略，并在维度与智能体数量方面展现出良好的扩展性。

摘要 (Abstract)

Continuous-time stochastic control with time-inhomogeneous jump-diffusion dynamics is central in finance and economics, but computing optimal policies is difficult under explicit time dependence, discontinuous shocks, and high dimensionality. We propose an actor-critic framework that serves as a mesh-free solver for entropy-regularized control problems and stochastic games with jumps. The approach is built on a time-inhomogeneous little q-function and an appropriate occupation measure, yielding a policy-gradient representation that accommodates time-dependent drift, volatility, and jump terms. To represent expressive stochastic policies in continuous-action spaces, we parameterize the actor using conditional normalizing flows, enabling flexible non-Gaussian policies while retaining exact likelihood evaluation for entropy regularization and policy optimization. We validate the method on time-inhomogeneous linear-quadratic control, Merton portfolio optimization, and a multi-agent portfolio game, using explicit solutions or high-accuracy benchmarks. Numerical results demonstrate stable learning under jump discontinuities, accurate approximation of optimal stochastic policies, and favorable scaling with respect to dimension and number of agents.

关键词: continuous-time stochastic control, jump-diffusion dynamics, actor-critic framework, normalizing flows, entropy-regularized control, portfolio optimization, multi-agent game, policy-gradient representation

297. ❌ Retrieve-then-Adapt: Retrieval-Augmented Test-Time Adaptation for Sequential Recommendation

作者: Xing Tang, Jingyang Bin, Ziqiang Cui, Xiaokun Zhang, Fuyuan Lyu, Jingyan Jiang, Dugang Liu, Chen Ma, Xiuqiang He 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05379v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于顺序推荐（SR）领域，提出了一种名为Retrieve-then-Adapt（ReAd）的测试时自适应框架。论文的核心是使用检索增强（retrieval-augmented）方法来动态适应测试分布，以提高推荐性能。这与关键词’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’有一定关联，因为都涉及检索机制来增强模型性能，但论文并非专注于大语言模型（LLMs）的生成任务，而是针对顺序推荐模型。其他关键词主要涉及大模型技术原理（如MoE、Scaling Laws、PEFT等）、对齐技术（如RLHF、Instruction Tuning）、推理方法（如CoT、System 2 Thinking）、代理系统、模型优化（如Quantization、Speculative Decoding）或科学AI应用，均与论文内容无关。论文未提及任何大模型或深度学习技术原理的创新，也未涉及生物医药等科学领域的AI应用，因此大部分关键词评分为0。

!!! tip deepseek-chat TL;DR

该论文针对顺序推荐模型在推理时难以适应实时偏好变化的问题，提出了一种检索增强的测试时自适应框架（ReAd），通过检索相似项目并融合增强嵌入来优化预测，实验表明其在多个基准数据集上优于现有方法。

摘要翻译

序列推荐任务旨在基于用户的历史交互序列预测下一个交互项目。传统模型通常在历史数据上进行训练，由于分布差异与参数化约束带来的挑战，在推理阶段往往难以适应实时偏好变化。现有解决方案包括测试时训练、测试时增强和检索增强微调。然而，这些方法要么引入显著计算开销，要么依赖随机增强策略，或需要精心设计的两阶段训练范式。本文认为，实现有效测试时适应的关键在于同时达成高效增强与高效适应。为此，我们提出检索后适应框架，该新颖框架通过检索到的用户偏好信号，动态调整已部署的序列推荐模型以适应测试数据分布。具体而言，给定训练完成的序列推荐模型，本框架首先从构建的协同记忆数据库中为测试用户检索协同相似项目。随后，轻量级检索学习模块将这些项目整合为信息增强嵌入，该嵌入同时捕获协同信号与预测优化线索。最终，通过融合该嵌入的机制对初始序列推荐预测进行优化。在五个基准数据集上的大量实验表明，本框架持续优于现有序列推荐方法。

摘要 (Abstract)

The sequential recommendation (SR) task aims to predict the next item based on users’ historical interaction sequences. Typically trained on historical data, SR models often struggle to adapt to real-time preference shifts during inference due to challenges posed by distributional divergence and parameterized constraints. Existing approaches to address this issue include test-time training, test-time augmentation, and retrieval-augmented fine-tuning. However, these methods either introduce significant computational overhead, rely on random augmentation strategies, or require a carefully designed two-stage training paradigm. In this paper, we argue that the key to effective test-time adaptation lies in achieving both effective augmentation and efficient adaptation. To this end, we propose Retrieve-then-Adapt (ReAd), a novel framework that dynamically adapts a deployed SR model to the test distribution through retrieved user preference signals. Specifically, given a trained SR model, ReAd first retrieves collaboratively similar items for a test user from a constructed collaborative memory database. A lightweight retrieval learning module then integrates these items into an informative augmentation embedding that captures both collaborative signals and prediction-refinement cues. Finally, the initial SR prediction is refined via a fusion mechanism that incorporates this embedding. Extensive experiments across five benchmark datasets demonstrate that ReAd consistently outperforms existing SR methods.

关键词: sequential recommendation, test-time adaptation, retrieval-augmented, collaborative memory, preference shifts, dynamic adaptation, fusion mechanism, benchmark datasets

298. ❌ LMI-Net: Linear Matrix Inequality–Constrained Neural Networks via Differentiable Projection Layers

作者: Sunbochen Tang, Andrea Goertzen, Navid Azizan 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05374v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文LMI-Net专注于线性矩阵不等式约束的神经网络，属于控制理论、优化和神经网络的交叉领域，核心是开发一个可微投影层来强制满足LMI约束，用于控制系统中的证书合成和控制器设计。所有评分关键词均与大语言模型、深度学习技术原理创新或AI在科学领域的应用直接相关，而本文研究的是特定数学约束下的神经网络，与这些关键词的主题（如LLM、MoE、对齐、推理、代理等）无直接关联，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为LMI-Net的可微投影层，用于在神经网络中强制满足线性矩阵不等式约束，从而在控制系统设计中实现可靠的证书合成和控制器优化，提高了模型在分布偏移下的可行性并保持了快速推理速度。

摘要翻译

线性矩阵不等式（LMIs）在验证动力系统的稳定性、鲁棒性和前向不变性方面发挥着核心作用。尽管基于学习的控制设计与证书合成方法发展迅速，现有方法往往难以保持形式化保证所需的严格矩阵不等式约束。我们提出LMI-Net，这是一种高效且模块化的可微分投影层，其通过结构设计强制满足LMI约束。该方法将LMI约束定义的集合提升至仿射等式约束与半正定锥的交集中，通过Douglas-Rachford分裂算法执行前向传播，并利用隐函数微分实现高效反向传播。我们建立了该投影层收敛到可行点的理论保证，从而证实LMI-Net可将通用神经网络转换为满足LMI约束的可靠模型。在包括不变椭球体合成以及受扰线性系统族的联合控制器与证书设计等实验评估中，LMI-Net在分布偏移条件下较软约束模型显著提升了可行性，同时保持了快速推理速度，从而弥合了基于半定规划的认证方法与现代学习技术之间的鸿沟。

摘要 (Abstract)

Linear matrix inequalities (LMIs) have played a central role in certifying stability, robustness, and forward invariance of dynamical systems. Despite rapid development in learning-based methods for control design and certificate synthesis, existing approaches often fail to preserve the hard matrix inequality constraints required for formal guarantees. We propose LMI-Net, an efficient and modular differentiable projection layer that enforces LMI constraints by construction. Our approach lifts the set defined by LMI constraints into the intersection of an affine equality constraint and the positive semidefinite cone, performs the forward pass via Douglas-Rachford splitting, and supports efficient backward propagation through implicit differentiation. We establish theoretical guarantees that the projection layer converges to a feasible point, certifying that LMI-Net transforms a generic neural network into a reliable model satisfying LMI constraints. Evaluated on experiments including invariant ellipsoid synthesis and joint controller-and-certificate design for a family of disturbed linear systems, LMI-Net substantially improves feasibility over soft-constrained models under distribution shift while retaining fast inference speed, bridging semidefinite-program-based certification and modern learning techniques.

关键词: Linear Matrix Inequalities, Neural Networks, Differentiable Projection, Control Systems, Certificate Synthesis, Douglas-Rachford Splitting, Implicit Differentiation, Feasibility Guarantees

299. ❌ LatentAudit: Real-Time White-Box Faithfulness Monitoring for Retrieval-Augmented Generation with Verifiable Deployment

作者: Zhe Yu, Wenpeng Xing, Meng Han 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05358v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心研究RAG系统的实时忠实度监控，直接涉及’Retrieval-Augmented Generation’和’Hallucination Mitigation’关键词，得10分。使用Llama-3-8B等模型，与’Large Language Models’高度相关，得10分。方法基于残差流激活分析，与’Mechanistic Interpretability’有一定关联，得5分。在PubMedQA等科学问答数据集上评估，与’AI for Science’有一定关联，得5分。其他关键词如MoE、SFT、量化等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了LatentAudit方法，通过分析大语言模型的残差流激活来实时监控检索增强生成系统的输出忠实度，在多个QA基准测试中实现了高精度检测并支持可验证部署。

摘要翻译

检索增强生成（RAG）虽能缓解幻觉问题，但并未完全消除：已部署的系统仍需在推理阶段判断其答案是否真正得到检索证据的支持。我们提出LatentAudit，一种白盒审计器，它汇集来自开放权重生成器的中后期残差流激活，并测量其与证据表征之间的马氏距离。由此产生的二次判别规则无需辅助判断模型，可在生成时实时运行，且结构简单，仅需少量预留数据集即可完成校准。我们证明残差流几何结构携带有效的忠实度信号，该信号能适应架构变化与真实检索失败场景，且同一规则仍适用于公开验证。在PubMedQA数据集上使用Llama-3-8B模型时，LatentAudit达到0.942的AUROC值，仅产生0.77毫秒额外开销。在三个问答基准测试和五个模型系列（Llama-2/3、Qwen-2.5/3、Mistral）中，该监测器保持稳定；在包含矛盾、检索遗漏和部分支持噪声的四向压力测试下，其在PubMedQA上达到0.9566–0.9815 AUROC，在HotpotQA上达到0.9142–0.9315 AUROC。采用16位定点精度时，审计规则能保留99.8%的FP16 AUROC性能，支持基于Groth16协议的公开验证，同时无需泄露模型权重或激活值。这些结果表明，残差流几何结构可作为实时RAG忠实度监测及可选可验证部署的实用基础。

摘要 (Abstract)

Retrieval-augmented generation (RAG) mitigates hallucination but does not eliminate it: a deployed system must still decide, at inference time, whether its answer is actually supported by the retrieved evidence. We introduce LatentAudit, a white-box auditor that pools mid-to-late residual-stream activations from an open-weight generator and measures their Mahalanobis distance to the evidence representation. The resulting quadratic rule requires no auxiliary judge model, runs at generation time, and is simple enough to calibrate on a small held-out set. We show that residual-stream geometry carries a usable faithfulness signal, that this signal survives architecture changes and realistic retrieval failures, and that the same rule remains amenable to public verification. On PubMedQA with Llama-3-8B, LatentAudit reaches 0.942 AUROC with 0.77,ms overhead. Across three QA benchmarks and five model families (Llama-2/3, Qwen-2.5/3, Mistral), the monitor remains stable; under a four-way stress test with contradictions, retrieval misses, and partial-support noise, it reaches 0.9566–0.9815 AUROC on PubMedQA and 0.9142–0.9315 on HotpotQA. At 16-bit fixed-point precision, the audit rule preserves 99.8% of the FP16 AUROC, enabling Groth16-based public verification without revealing model weights or activations. Together, these results position residual-stream geometry as a practical basis for real-time RAG faithfulness monitoring and optional verifiable deployment.

关键词: Retrieval-Augmented Generation, Hallucination Mitigation, Faithfulness Monitoring, Residual-stream Activations, Real-time Audit, White-box Monitoring, Verifiable Deployment, Large Language Models

300. ❌ Individual-heterogeneous sub-Gaussian Mixture Models

作者: Huan Qing 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05337v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是统计机器学习中的聚类方法（个体异质性亚高斯混合模型），属于传统机器学习领域，与所有关键词（均涉及大模型、深度学习、AI应用等）完全无关。

!!! tip deepseek-chat TL;DR

该论文针对传统高斯混合模型假设同质性而实际数据存在异质性的问题，提出了一种个体异质性亚高斯混合模型及高效谱方法，在合成和真实数据上优于现有聚类算法。

摘要翻译

经典高斯混合模型假设簇内具有同质性，这一假设在实际数据中往往难以成立，因为观测值天然表现出不同的尺度或强度。为解决此问题，我们提出了个体异质性亚高斯混合模型，该灵活框架为每个观测值分配独立的异质性参数，从而显式捕捉实际应用中的固有异质性。基于此模型，我们提出一种高效的谱方法，该方法在温和的分离条件下可证明实现真实聚类标签的精确恢复，即使在特征数量远超过样本数量的高维场景中亦然。在合成数据与真实数据上的数值实验表明，我们的方法始终优于现有聚类算法，包括那些为经典高斯混合模型设计的算法。

摘要 (Abstract)

The classical Gaussian mixture model assumes homogeneity within clusters, an assumption that often fails in real-world data where observations naturally exhibit varying scales or intensities. To address this, we introduce the individual-heterogeneous sub-Gaussian mixture model, a flexible framework that assigns each observation its own heterogeneity parameter, thereby explicitly capturing the heterogeneity inherent in practical applications. Built upon this model, we propose an efficient spectral method that provably achieves exact recovery of the true cluster labels under mild separation conditions, even in high-dimensional settings where the number of features far exceeds the number of samples. Numerical experiments on both synthetic and real data demonstrate that our method consistently outperforms existing clustering algorithms, including those designed for classical Gaussian mixture models.

关键词: Gaussian mixture model, heterogeneity, clustering, spectral method, high-dimensional, exact recovery, sub-Gaussian

301. ❌ Cross-Machine Anomaly Detection Leveraging Pre-trained Time-series Model

作者: Yangmeng Li, Kei Sano, Toshihiro Kitao, Ryoji Anzaki, Yukiya Saitoh, Hironori Moki, Dragan Djurdjanovic 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05335v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究工业制造中的时间序列异常检测，使用预训练基础模型MOMENT进行特征提取，与关键词’Pre-training OR Continual Pre-training OR Domain Adaptation’高度相关（10分），因为论文明确提到利用预训练模型进行领域适应。与’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（5分），因为论文涉及AI在工业科学（制造）中的应用。其他关键词主要涉及大语言模型（LLM）的特定技术（如MoE、RLHF、RAG等），论文未涉及这些内容，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种跨机器时间序列异常检测框架，通过预训练模型MOMENT提取领域不变特征，有效提升了在不同机器间的泛化能力，并在工业数据集上验证了其优于基线方法的性能。

摘要翻译

实现具备韧性与高质量的制造，需要可靠的数据驱动异常检测方法，这些方法必须能够处理名义相同且执行相同工艺的不同单台设备之间的行为差异。针对利用从执行相同工序的不同单台设备采集的传感数据进行设备异常检测的问题，本文提出了一种跨设备时间序列异常检测框架，该框架将域不变特征提取器与无监督异常检测模块相结合。该提取器利用预训练基础模型MOMENT，通过随机森林分类器将嵌入特征解耦为设备相关特征与工况相关特征，其中工况相关特征作为对单台设备间差异保持不变的表示。这些精炼后的特征使得下游异常检测器能够有效泛化至未见过的目标设备。在从三台执行名义相同操作的不同设备采集的工业数据集上的实验表明，所提方法优于基于原始信号和基于MOMENT嵌入特征的基线方法，证实了其在增强跨设备泛化能力方面的有效性。

摘要 (Abstract)

Achieving resilient and high-quality manufacturing requires reliable data-driven anomaly detection methods that are capable of addressing differences in behaviors among different individual machines which are nominally the same and are executing the same processes. To address the problem of detecting anomalies in a machine using sensory data gathered from different individual machines executing the same procedure, this paper proposes a cross-machine time-series anomaly detection framework that integrates a domain-invariant feature extractor with an unsupervised anomaly detection module. Leveraging the pre-trained foundation model MOMENT, the extractor employs Random Forest Classifiers to disentangle embeddings into machine-related and condition-related features, with the latter serving as representations which are invariant to differences between individual machines. These refined features enable the downstream anomaly detectors to generalize effectively to unseen target machines. Experiments on an industrial dataset collected from three different machines performing nominally the same operation demonstrate that the proposed approach outperforms both the raw-signal-based and MOMENT-embedding feature baselines, confirming its effectiveness in enhancing cross-machine generalization.

关键词: anomaly detection, time-series, cross-machine, pre-trained model, domain-invariant features, industrial manufacturing, MOMENT, generalization

302. ❌ A Theoretical Framework for Statistical Evaluability of Generative Models

作者: Shashaank Aiyer, Yishay Mansour, Shay Moran, Han Shao 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05324v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于生成模型的统计评估理论框架，研究评估指标（如IPMs、Rényi散度）的有限样本可评估性。所有关键词均涉及大模型/深度学习的具体技术、应用或评估方法（如对齐、推理、压缩、科学AI等），而本文是纯理论统计学习研究，不涉及任何特定模型架构、训练技术、应用领域或评估方法（如幻觉缓解、可解释性）。因此，所有关键词得分为0。

!!! tip deepseek-chat TL;DR

本文提出了一个评估生成模型的统计理论框架，证明了积分概率度量（IPMs）可以从有限样本中评估，而Rényi和KL散度则不能。

摘要翻译

统计评估旨在利用从真实分布中抽取的独立同分布保留测试数据来估计模型的泛化性能。在分类等监督学习场景中，错误率等性能指标有明确定义，且只要数据集足够大，测试误差就能可靠地近似总体误差。相比之下，生成模型的评估更具挑战性，因其具有开放特性：难以确定何种指标适用，以及这些指标能否通过有限样本可靠评估。
本文提出了一个评估生成模型的理论框架，并针对常用指标建立了可评估性结论。我们研究了两类指标：基于测试的指标（包括积分概率度量）以及Rényi散度。我们证明，对于任何有界测试函数类，积分概率度量均可通过有限样本进行估计，其近似误差可控制在乘性与加性范围内。此外，当测试函数类具有有限脂肪粉碎维数时，积分概率度量可实现任意精度的评估。与之相反，Rényi散度与KL散度无法通过有限样本评估，因为其数值可能由罕见事件决定。本文同时分析了困惑度作为评估方法的潜力与局限性。

摘要 (Abstract)

Statistical evaluation aims to estimate the generalization performance of a model using held-out i.i.d.\ test data sampled from the ground-truth distribution. In supervised learning settings such as classification, performance metrics such as error rate are well-defined, and test error reliably approximates population error given sufficiently large datasets. In contrast, evaluation is more challenging for generative models due to their open-ended nature: it is unclear which metrics are appropriate and whether such metrics can be reliably evaluated from finite samples. In this work, we introduce a theoretical framework for evaluating generative models and establish evaluability results for commonly used metrics. We study two categories of metrics: test-based metrics, including integral probability metrics (IPMs), and Rényi divergences. We show that IPMs with respect to any bounded test class can be evaluated from finite samples up to multiplicative and additive approximation errors. Moreover, when the test class has finite fat-shattering dimension, IPMs can be evaluated with arbitrary precision. In contrast, Rényi and KL divergences are not evaluable from finite samples, as their values can be critically determined by rare events. We also analyze the potential and limitations of perplexity as an evaluation method.

关键词: generative models, statistical evaluation, integral probability metrics, Rényi divergences, finite sample evaluability, test-based metrics, perplexity, theoretical framework

303. ❌ Jeffreys Flow: Robust Boltzmann Generators for Rare Event Sampling via Parallel Tempering Distillation

作者: Guang Lin, Christian Moya, Di Qi, Xuda Ye 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05303v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于物理系统采样中的生成模型方法（Jeffreys Flow），用于解决多模态分布中的罕见事件采样和模式崩溃问题，属于科学计算和统计物理领域。论文内容与绝大多数关键词（涉及大模型技术、训练方法、推理优化、对齐、智能体等）完全无关，因为这些关键词特指自然语言处理或通用AI领域的技术。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于AI在科学计算（物理系统模拟）中的应用，但并非生物信息学或化学信息学，因此给予中等相关度5分。

!!! tip deepseek-chat TL;DR

该论文针对物理系统中多模态分布采样时的模式崩溃问题，提出了Jeffreys Flow生成框架，通过Jeffreys散度蒸馏并行回火轨迹数据，有效平衡了局部精度和全局模式覆盖，并在高维非凸基准和量子热态路径积分蒙特卡洛中展示了其准确性和加速效果。

摘要翻译

对具有粗糙能量景观的物理系统进行采样常受限于稀有事件和亚稳态捕获问题。虽然玻尔兹曼生成器已提供了一种解决方案，但其对反向Kullback-Leibler散度的依赖常引发灾难性的模态坍缩，导致在多峰分布中遗漏特定模态。本文提出Jeffreys流——一种鲁棒的生成框架，通过使用对称Jeffreys散度从并行回火轨迹中提炼经验采样数据，有效缓解了上述缺陷。该框架在局部目标追踪精度与全局模态覆盖之间实现了有效平衡。我们证明，最小化Jeffreys散度能够抑制模态坍缩，并通过提炼经验参考数据在结构上修正固有偏差。我们在高度非凸的多维基准测试中验证了该框架的可扩展性与准确性，包括系统修正副本交换随机梯度朗之万动力学中的随机梯度偏差，以及对量子热态路径积分蒙特卡洛中精确重要性采样的显著加速。

摘要 (Abstract)

Sampling physical systems with rough energy landscapes is hindered by rare events and metastable trapping. While Boltzmann generators already offer a solution, their reliance on the reverse Kullback–Leibler divergence frequently induces catastrophic mode collapse, missing specific modes in multi-modal distributions. Here, we introduce the Jeffreys Flow, a robust generative framework that mitigates this failure by distilling empirical sampling data from Parallel Tempering trajectories using the symmetric Jeffreys divergence. This formulation effectively balances local target-seeking precision with global modes coverage. We show that minimizing Jeffreys divergence suppresses mode collapse and structurally corrects inherent inaccuracies via distillation of the empirical reference data. We demonstrate the framework’s scalability and accuracy on highly non-convex multidimensional benchmarks, including the systematic correction of stochastic gradient biases in Replica Exchange Stochastic Gradient Langevin Dynamics and the massive acceleration of exact importance sampling in Path Integral Monte Carlo for quantum thermal states.

关键词: Jeffreys Flow, Boltzmann generators, rare event sampling, mode collapse, Parallel Tempering, Jeffreys divergence, generative framework, quantum thermal states

304. ❌ Robust Learning of Heterogeneous Dynamic Systems

作者: Shuoxun Xu, Zijian Guo, Brooke R. Staveland, Robert T. Knight, Lexin Li 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05285v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文专注于使用分布鲁棒学习方法建模异质常微分方程系统，属于科学计算和统计学习领域。所有关键词均与大模型、深度学习技术原理或具体应用（如生物信息学）相关，但论文未涉及任何大模型、深度学习或AI技术，仅使用传统统计和优化方法处理动态系统数据。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文分析了颅内脑电图数据，属于科学领域的AI应用，但论文方法本身并非基于AI或深度学习，因此给予5分（有一定关联）。其他关键词完全无关，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种分布鲁棒学习方法，用于从多个异质动态系统中学习共享模式，通过最大化最坏情况奖励构建鲁棒动态系统，并证明了该方法在模拟和颅内脑电图数据分析中显著提高了泛化性能。

摘要翻译

常微分方程为广泛科学领域中动态系统的建模提供了强大的框架。然而，现有的大多数常微分方程方法主要针对单一系统，未能充分解决从多个异质动态系统中学习共享模式的问题。本文提出了一种新颖的分布鲁棒学习方法，用于建模异质常微分方程系统。具体而言，我们通过在一类由轨迹导数凸组合构成的不确定性集合上最大化最差情况下的奖励，构建了一个鲁棒动态系统。我们证明了所得估计量具有显式的加权平均表示形式，其中权重通过一个二次优化获得，该优化平衡了多个数据源的信息。我们进一步开发了一种双层稳定化程序，以解决估计中潜在的不稳定性。我们为所提方法建立了严格的理论保证，包括稳定化权重的一致性、鲁棒轨迹估计的误差界，以及逐点置信区间的渐近有效性。通过大量模拟实验和一项颅内脑电图数据分析，我们证明相较于其他解决方案，所提方法显著提升了泛化性能。

摘要 (Abstract)

Ordinary differential equations (ODEs) provide a powerful framework for modeling dynamic systems arising in a wide range of scientific domains. However, most existing ODE methods focus on a single system, and do not adequately address the problem of learning shared patterns from multiple heterogeneous dynamic systems. In this article, we propose a novel distributionally robust learning approach for modeling heterogeneous ODE systems. Specifically, we construct a robust dynamic system by maximizing a worst-case reward over an uncertainty class formed by convex combinations of the derivatives of trajectories. We show the resulting estimator admits an explicit weighted average representation, where the weights are obtained from a quadratic optimization that balances information across multiple data sources. We further develop a bi-level stabilization procedure to address potential instability in estimation. We establish rigorous theoretical guarantees for the proposed method, including consistency of the stabilized weights, error bound for robust trajectory estimation, and asymptotical validity of pointwise confidence interval. We demonstrate that the proposed method considerably improves the generalization performance compared to the alternative solutions through both extensive simulations and the analysis of an intracranial electroencephalogram data.

关键词: heterogeneous dynamic systems, ordinary differential equations, distributionally robust learning, worst-case reward, weighted average representation, generalization performance, intracranial electroencephalogram, trajectory estimation

305. ❌ Extending Tabular Denoising Diffusion Probabilistic Models for Time-Series Data Generation

作者: Umang Dobhal, Christina Garcia, Sozo Inoue 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05257v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于扩散模型在时间序列数据生成中的应用，属于深度学习在科学/工程领域的应用，但与所有大模型（LLM）相关技术关键词完全无关。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文涉及传感器数据分析（WISDM数据集）和合成数据生成，属于AI在科学/工程领域的应用，但并非核心生物信息学或化学信息学，因此给予5分（有一定关联）。其他关键词均未涉及大模型、训练方法、推理优化、对齐、代理等主题。

!!! tip deepseek-chat TL;DR

该研究解决了TabDDPM模型在生成时间序列数据时忽略时间依赖性的问题，通过引入时间适配器和上下文感知嵌入模块，成功生成了具有更高时间真实性和连贯性的合成传感器数据，并在WISDM数据集上验证了其有效性。

摘要翻译

扩散模型正日益被用于生成合成表格与时间序列数据，以实现隐私保护的数据增强。表格去噪扩散概率模型（TabDDPM）能够从异构表格数据集中生成高质量的合成数据，但其假设样本间相互独立，这限制了其在时间依赖性至关重要的时序数据领域的应用。为解决这一问题，我们提出了TabDDPM的时序扩展版本，通过引入轻量级时序适配器和上下文感知嵌入模块，使模型具备序列感知能力。通过将传感器数据重构为窗口化序列，并借助时间步嵌入、条件活动标签以及观测/缺失掩码显式建模时序上下文，我们的方法能够生成时序连贯的合成序列。与基线方法及插值技术相比，基于二元转移矩阵和自相关分析的验证表明，所生成数据在时序真实性、多样性和连贯性方面均有提升。在WISDM加速度计数据集上，所提出的系统生成的合成时间序列高度接近真实传感器模式，并取得了可比的分类性能（宏观F1分数0.64，准确率0.71）。这对于少数类别的数据表征以及保持与真实分布的统计对齐尤为有利。这些进展表明，基于扩散的模型在配备时序推理能力后，能为序列数据合成提供有效且适应性强的解决方案。未来工作将探索向更长序列的扩展以及集成更强大的时序架构。

摘要 (Abstract)

Diffusion models are increasingly being utilised to create synthetic tabular and time series data for privacy-preserving augmentation. Tabular Denoising Diffusion Probabilistic Models (TabDDPM) generate high-quality synthetic data from heterogeneous tabular datasets but assume independence between samples, limiting their applicability to time-series domains where temporal dependencies are critical. To address this, we propose a temporal extension of TabDDPM, introducing sequence awareness through the use of lightweight temporal adapters and context-aware embedding modules. By reformulating sensor data into windowed sequences and explicitly modeling temporal context via timestep embeddings, conditional activity labels, and observed/missing masks, our approach enables the generation of temporally coherent synthetic sequences. Compared to baseline and interpolation techniques, validation using bigram transition matrices and autocorrelation analysis shows enhanced temporal realism, diversity, and coherence. On the WISDM accelerometer dataset, the suggested system produces synthetic time-series that closely resemble real world sensor patterns and achieves comparable classification performance (macro F1-score 0.64, accuracy 0.71). This is especially advantageous for minority class representation and preserving statistical alignment with real distributions. These developments demonstrate that diffusion based models provide effective and adaptable solutions for sequential data synthesis when they are equipped for temporal reasoning. Future work will explore scaling to longer sequences and integrating stronger temporal architectures.

关键词: diffusion models, tabular data, time-series generation, temporal dependencies, synthetic data, sensor data, TabDDPM, temporal coherence

306. ❌ EAGLE: Edge-Aware Graph Learning for Proactive Delivery Delay Prediction in Smart Logistics Networks

作者: Zhiming Xue, Menghao Huo, Yujue Wang 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05254v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于物流网络中的交付延迟预测，提出了一种结合Transformer编码器和图注意力网络的混合深度学习框架。虽然属于深度学习应用，但论文内容完全围绕特定领域（智能物流）的预测任务，未涉及任何大语言模型（LLM）技术、大模型训练方法（如预训练、微调、对齐）、推理优化、智能体系统或AI for Science等关键词。所有关键词均与大模型或指定的科学领域应用无关，因此相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种结合Transformer和图注意力网络的混合深度学习框架，用于智能物流网络中的主动交付延迟预测，在真实数据集上取得了优于基线方法的预测准确性和训练稳定性。

摘要翻译

现代物流网络在每个仓库节点和运输通道上都会产生丰富的运营数据流——从订单时间戳、路径记录到货运清单——然而对交付延迟的预测仍主要停留在被动响应层面。现有的预测方法通常将这一问题视为表格分类任务（忽略网络拓扑结构），或作为时间序列异常检测任务（忽视供应链图的空间依赖性）。为弥补这一差距，我们提出了一种用于主动式供应链风险管理的混合深度学习框架。该方法通过轻量级Transformer补丁编码器联合建模时序订单流动态，并借助边感知图注意力网络（Edge-Aware Graph Attention Network, E-GAT）捕捉枢纽间关系依赖性，通过多任务学习目标进行优化。在真实世界DataCo智能供应链数据集上的评估表明，我们的框架相较于基线方法实现了持续改进，获得0.8762的F1分数和0.9773的AUC-ROC值。在四个独立随机种子实验中，该框架展现出仅0.0089的跨种子F1标准差——较最佳消融变体提升3.8倍——在所有评估模型中实现了预测准确性与训练稳定性的最优平衡。

摘要 (Abstract)

Modern logistics networks generate rich operational data streams at every warehouse node and transportation lane – from order timestamps and routing records to shipping manifests – yet predicting delivery delays remains predominantly reactive. Existing predictive approaches typically treat this problem either as a tabular classification task, ignoring network topology, or as a time-series anomaly detection task, overlooking the spatial dependencies of the supply chain graph. To bridge this gap, we propose a hybrid deep learning framework for proactive supply chain risk management. The proposed method jointly models temporal order-flow dynamics via a lightweight Transformer patch encoder and inter-hub relational dependencies through an Edge-Aware Graph Attention Network (E-GAT), optimized via a multi-task learning objective. Evaluated on the real-world DataCo Smart Supply Chain dataset, our framework achieves consistent improvements over baseline methods, yielding an F1-score of 0.8762 and an AUC-ROC of 0.9773. Across four independent random seeds, the framework exhibits a cross-seed F1 standard deviation of only 0.0089 – a 3.8 times improvement over the best ablated variant – achieving the strongest balance of predictive accuracy and training stability among all evaluated models.

关键词: delivery delay prediction, smart logistics, Transformer encoder, graph attention network, supply chain risk management, multi-task learning, temporal dynamics, spatial dependencies

307. ❌ Spike Hijacking in Late-Interaction Retrieval

作者: Karthik Suresh, Tushar Vatsa, Tracy King, Asim Kadav, Michael Friedrich 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05253v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是检索模型中的梯度路由和鲁棒性问题，聚焦于多向量检索系统中MaxSim池化方法的机制分析。所有关键词均与大语言模型、深度学习技术原理或科学应用相关，而本文研究的是传统检索模型（非大模型）的特定池化方法，与所有关键词均无直接关联。

!!! tip deepseek-chat TL;DR

该论文研究了基于MaxSim池化的后期交互检索模型中梯度集中和鲁棒性问题，发现硬最大池化会导致梯度过度集中并降低对文档长度的鲁棒性，提出了稀疏性与鲁棒性之间的权衡问题。

摘要翻译

晚期交互检索模型依赖硬性最大相似度（MaxSim）来聚合词元级相似性。尽管有效，这种赢者通吃的池化规则可能在结构上对训练动态产生偏差。我们对基于MaxSim的检索中的梯度路由与鲁棒性进行了机制性研究。在使用批内对比训练的受控合成环境中，我们证明MaxSim相较于更平滑的替代方案（如Top-k池化和softmax聚合）会引发显著更高的片段级梯度集中度。虽然稀疏路由能提升早期区分能力，但也增加了对文档长度的敏感性：随着文档片段数量增加，MaxSim的性能下降比温和平滑的变体更为急剧。我们在真实世界的多向量检索基准上验证了这些发现，其中受控的文档长度扫描揭示了硬最大池化下类似的脆弱性。综合而言，我们的研究将池化诱导的梯度集中度分离为晚期交互检索的结构特性，并揭示了稀疏性与鲁棒性之间的权衡。这些发现为多向量检索系统中替代硬最大池化的原则性方法提供了理论依据。

摘要 (Abstract)

Late-interaction retrieval models rely on hard maximum similarity (MaxSim) to aggregate token-level similarities. Although effective, this winner-take-all pooling rule may structurally bias training dynamics. We provide a mechanistic study of gradient routing and robustness in MaxSim-based retrieval. In a controlled synthetic environment with in-batch contrastive training, we demonstrate that MaxSim induces significantly higher patch-level gradient concentration than smoother alternatives such as Top-k pooling and softmax aggregation. While sparse routing can improve early discrimination, it also increases sensitivity to document length: as the number of document patches grows, MaxSim degrades more sharply than mild smoothing variants. We corroborate these findings on a real-world multi-vector retrieval benchmark, where controlled document-length sweeps reveal similar brittleness under hard max pooling. Together, our results isolate pooling-induced gradient concentration as a structural property of late-interaction retrieval and highlight a sparsity-robustness tradeoff. These findings motivate principled alternatives to hard max pooling in multi-vector retrieval systems.

关键词: late-interaction retrieval, MaxSim pooling, gradient concentration, multi-vector retrieval, sparsity-robustness tradeoff, hard max pooling, document length sensitivity, patch-level gradient

308. ❌ Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks

作者: Anas Jnini, Elham Kiyani, Khemraj Shukla, Jorge F. Urban, Nazanin Ahmadi Daryakenari, Johannes Muller, Marius Zeinhofer, George Em Karniadakis 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05230v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于物理信息神经网络（PINNs）的优化方法开发，属于科学机器学习领域，与大多数关键词（涉及大模型技术、训练方法、推理优化、对齐、代理等）完全无关。唯一的相关关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文研究PINNs在求解偏微分方程和常微分方程中的应用，属于AI for Science范畴，但并非核心的生物信息学或化学信息学应用，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出了针对物理信息神经网络（PINNs）的先进优化策略（如自然梯度、自缩放BFGS和Broyden优化器），以加速求解偏微分方程和常微分方程的收敛，并在多个物理问题上验证了其高效性和可扩展性。

摘要翻译

高效且鲁棒的优化对于神经网络至关重要，它能使科学机器学习模型快速收敛至极高精度——从而精确捕捉由微分方程支配的复杂物理行为。本研究提出了先进的优化策略，以加速物理信息神经网络（Physics-Informed Neural Networks, PINNs）在求解具有挑战性的偏微分方程（PDEs）和常微分方程（ODEs）时的收敛速度。具体而言，我们实现了自然梯度（Natural Gradient, NG）优化器、自缩放BFGS优化器以及Broyden优化器的高效版本，并在包括亥姆霍兹方程、斯托克斯流、无粘性伯格斯方程、高速流动的欧拉方程，以及药代动力学与药效学中出现的刚性常微分方程等一系列问题上展示了其性能。除了优化器的开发，我们还提出了基于PINN的新方法用于求解无粘性伯格斯方程和欧拉方程，并将所得解与高阶数值方法的结果进行比较，以提供严谨且公平的评估。最后，我们解决了将这些拟牛顿优化器扩展至批量训练所面临的挑战，从而为大规模数据驱动问题提供了高效且可扩展的解决方案。

摘要 (Abstract)

Efficient and robust optimization is essential for neural networks, enabling scientific machine learning models to converge rapidly to very high accuracy – faithfully capturing complex physical behavior governed by differential equations. In this work, we present advanced optimization strategies to accelerate the convergence of physics-informed neural networks (PINNs) for challenging partial (PDEs) and ordinary differential equations (ODEs). Specifically, we provide efficient implementations of the Natural Gradient (NG) optimizer, Self-Scaling BFGS and Broyden optimizers, and demonstrate their performance on problems including the Helmholtz equation, Stokes flow, inviscid Burgers equation, Euler equations for high-speed flows, and stiff ODEs arising in pharmacokinetics and pharmacodynamics. Beyond optimizer development, we also propose new PINN-based methods for solving the inviscid Burgers and Euler equations, and compare the resulting solutions against high-order numerical methods to provide a rigorous and fair assessment. Finally, we address the challenge of scaling these quasi-Newton optimizers for batched training, enabling efficient and scalable solutions for large data-driven problems.

关键词: Physics-Informed Neural Networks, Optimization, Natural Gradient, Quasi-Newton Methods, Partial Differential Equations, Ordinary Differential Equations, Scientific Machine Learning, Convergence Acceleration

309. ❌ fastml: Guarded Resampling Workflows for Safer Automated Machine Learning in R

作者: Selcuk Korkmaz, Dincer Goksuluk, Eda Karaismailoglu 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05225v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文《fastml: Guarded Resampling Workflows for Safer Automated Machine Learning in R》专注于传统机器学习（非深度学习）中的预处理泄漏问题，提出了一种通过保护性重采样来防止数据泄漏的R包。论文内容涉及自动化机器学习、数据预处理、模型评估和生存分析，但完全不涉及大语言模型（LLM）、深度学习技术原理、模型训练/微调方法（如RLHF、PEFT）、推理优化、AI代理或科学AI应用等关键词领域。所有评分关键词均与大模型或深度学习相关，而本文研究的是传统统计机器学习的工作流程工具，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文解决了机器学习中预处理泄漏导致性能虚高的问题，开发了fastml R包通过保护性重采样来防止数据泄漏，实验证明该方法能更准确地评估模型性能并简化工作流程。

摘要翻译

预处理泄漏发生在重采样前进行缩放、插补或其他依赖数据的转换估计时，它会夸大表面性能且难以被察觉。我们推出了fastml——一个R语言包，它通过“防护式重采样”提供了单次调用接口以实现泄漏感知的机器学习。该方法在每次重采样内部重新估计预处理步骤，并将其应用于相应的评估数据。该软件包支持分组和时间序列重采样，阻止高风险配置，审核外部依赖流程，并包含沙箱执行与集成模型解释功能。我们通过蒙特卡洛模拟（对比全局与折叠局部归一化）、在相同规范下与tidymodels进行可用性比较，以及在不同规模数据集上的生存分析基准测试来评估fastml。模拟表明，相对于防护式重采样，全局预处理会显著夸大表面性能。fastml在减少工作流复杂度的同时，达到了与tidymodels相当的保留集性能，并通过统一接口支持了对多类生存模型的一致性基准测试。

摘要 (Abstract)

Preprocessing leakage arises when scaling, imputation, or other data-dependent transformations are estimated before resampling, inflating apparent performance while remaining hard to detect. We present fastml, an R package that provides a single-call interface for leakage-aware machine learning through guarded resampling, where preprocessing is re-estimated inside each resample and applied to the corresponding assessment data. The package supports grouped and time-ordered resampling, blocks high-risk configurations, audits recipes for external dependencies, and includes sandboxed execution and integrated model explanation. We evaluate fastml with a Monte Carlo simulation contrasting global and fold-local normalization, a usability comparison with tidymodels under matched specifications, and survival benchmarks across datasets of different sizes. The simulation demonstrates that global preprocessing substantially inflates apparent performance relative to guarded resampling. fastml matched held-out performance obtained with tidymodels while reducing workflow orchestration, and it supported consistent benchmarking of multiple survival model classes through a unified interface.

关键词: preprocessing leakage, guarded resampling, automated machine learning, R package, model evaluation, survival analysis, workflow orchestration

310. ❌ Vehicle-as-Prompt: A Unified Deep Reinforcement Learning Framework for Heterogeneous Fleet Vehicle Routing Problem

作者: Shihong Huang, Shengjie Wang, Lei Gao, Hong Ma, Zhanluo Zhang, Feng Zhang, Weihua Zhou 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05195v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于解决异构车队车辆路径问题（HFVRP），使用深度强化学习（DRL）框架，并提出了Vehicle-as-Prompt机制。虽然涉及深度学习技术，但论文内容与所有评分关键词（均围绕大语言模型、对齐、推理、压缩、科学AI等特定主题）完全无关，未提及任何大模型、LLM、MoE、对齐技术、推理方法、模型压缩或科学AI应用。因此，所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对异构车队车辆路径问题（HFVRP）及其复杂变体，提出了一个统一的深度强化学习框架Vehicle-as-Prompt（VaP），通过VaP-CSMV模型显著优于现有DRL求解器，并在推理时间和零样本泛化方面表现出色。

摘要翻译

与传统同质车队路径规划问题不同，异质车队车辆路径问题（Heterogeneous Fleet Vehicle Routing Problem, HFVRP）涉及异质固定成本、可变行驶成本和容量约束，使得解的质量对车辆选择高度敏感。此外，现实物流应用通常包含额外的复杂约束，显著增加了计算复杂度。然而，现有大多数基于深度强化学习（Deep Reinforcement Learning, DRL）的方法仅限于同质场景，导致其在应用于HFVRP及其复杂变体时性能欠佳。为弥补这一差距，本研究探究复杂约束下的HFVRP，并开发了一个能够跨多种变体设置求解该问题的统一DRL框架。我们提出了车辆即提示（Vehicle-as-Prompt, VaP）机制，将问题构建为单阶段自回归决策过程。在此基础上，我们提出了VaP-CSMV框架，该框架配备跨语义编码器和多视图解码器，能有效处理各类问题变体，并捕捉车辆异质性与客户节点属性之间的复杂映射关系。大量实验结果表明，VaP-CSMV显著优于现有最先进的基于DRL的神经求解器，并与传统启发式求解器相比取得了具有竞争力的解质量，同时将推理时间缩短至数秒。此外，该框架在大规模及先前未见的问题变体上展现出强大的零样本泛化能力，而消融实验验证了各组件的重要贡献。

摘要 (Abstract)

Unlike traditional homogeneous routing problems, the Heterogeneous Fleet Vehicle Routing Problem (HFVRP) involves heterogeneous fixed costs, variable travel costs, and capacity constraints, rendering solution quality highly sensitive to vehicle selection. Furthermore, real-world logistics applications often impose additional complex constraints, markedly increasing computational complexity. However, most existing Deep Reinforcement Learning (DRL)-based methods are restricted to homogeneous scenarios, leading to suboptimal performance when applied to HFVRP and its complex variants. To bridge this gap, we investigate HFVRP under complex constraints and develop a unified DRL framework capable of solving the problem across various variant settings. We introduce the Vehicle-as-Prompt (VaP) mechanism, which formulates the problem as a single-stage autoregressive decision process. Building on this, we propose VaP-CSMV, a framework featuring a cross-semantic encoder and a multi-view decoder that effectively addresses various problem variants and captures the complex mapping relationships between vehicle heterogeneity and customer node attributes. Extensive experimental results demonstrate that VaP-CSMV significantly outperforms existing state-of-the-art DRL-based neural solvers and achieves competitive solution quality compared to traditional heuristic solvers, while reducing inference time to mere seconds. Furthermore, the framework exhibits strong zero-shot generalization capabilities on large-scale and previously unseen problem variants, while ablation studies validate the vital contribution of each component.

关键词: Heterogeneous Fleet Vehicle Routing Problem, Deep Reinforcement Learning, Vehicle-as-Prompt, Cross-semantic encoder, Multi-view decoder, Zero-shot generalization, Neural solvers, Autoregressive decision process

311. ❌ FNO$^{\angle θ}$: Extended Fourier neural operator for learning state and optimal control of distributed parameter systems

作者: Zhexian Li, Ketan Savla 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05187v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文提出了一种扩展的傅里叶神经算子（FNO）架构，用于学习偏微分方程（PDE）系统的状态和最优控制。论文的核心是改进FNO架构以更好地处理PDE控制问题，属于科学计算和AI for Science领域。所有关键词中，只有"AI for Science OR Bioinformatics OR Cheminformatics"与论文主题有一定关联，因为论文涉及AI在科学计算（PDE求解和控制）中的应用。其他关键词均与论文内容无关，因为论文不涉及大语言模型、模型训练、对齐、推理、代理、压缩等主题。

!!! tip deepseek-chat TL;DR

该论文提出了一种扩展的傅里叶神经算子（FNO）架构，用于学习偏微分方程系统的状态和最优控制，并在非线性Burgers方程上展示了比原始FNO更低的训练误差和更准确的边界值预测。

摘要翻译

我们提出一种扩展的傅里叶神经算子（FNO）架构，用于学习由偏微分方程描述的系统状态及线性二次加性最优控制。基于Ehrenpreis-Palamodov基本定理，我们证明了常系数线性偏微分方程的任意状态及最优控制均可表示为复域上的积分。该积分表示的被积函数包含与逆傅里叶变换中相同的指数项，而后者正用于构建FNO层中的卷积算子。受此启发，我们通过将逆傅里叶变换中的频率变量从实数域扩展至复数域，对FNO层进行改进，以捕捉基本定理所对应的积分表示。我们以非线性Burgers方程为例，展示了扩展FNO在学习状态与最优控制任务中的性能，其训练误差较原始FNO降低了一个数量级，并对非周期性边界值实现了更精确的预测。

摘要 (Abstract)

We propose an extended Fourier neural operator (FNO) architecture for learning state and linear quadratic additive optimal control of systems governed by partial differential equations. Using the Ehrenpreis-Palamodov fundamental principle, we show that any state and optimal control of linear PDEs with constant coefficients can be represented as an integral in the complex domain. The integrand of this representation involves the same exponential term as in the inverse Fourier transform, where the latter is used to represent the convolution operator in FNO layer. Motivated by this observation, we modify the FNO layer by extending the frequency variable in the inverse Fourier transform from the real to complex domain to capture the integral representation from the fundamental principle. We illustrate the performance of FNO in learning state and optimal control for the nonlinear Burgers’ equation, showing order of magnitude improvements in training errors and more accurate predictions of non-periodic boundary values over FNO.

关键词: Fourier neural operator, optimal control, partial differential equations, Burgers’ equation, Ehrenpreis-Palamodov principle, complex domain, non-periodic boundary, state learning

312. ❌ Cross-fitted Proximal Learning for Model-Based Reinforcement Learning

作者: Nishanth Venkatesh, Andreas A. Malikopoulos 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05185v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于模型强化学习中的统计估计方法，特别是针对存在隐藏混杂因素的离线设置下的桥函数估计问题。研究内容涉及强化学习、因果推断、统计估计和部分可观测马尔可夫决策过程，但完全不涉及大语言模型、深度学习技术原理或AI在科学领域的应用。所有关键词均与大模型、深度学习技术或AI科学应用相关，而本文研究的是传统强化学习中的统计方法，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

本文研究了在存在隐藏混杂因素的离线部分可观测马尔可夫决策过程中，通过提出一种K折交叉拟合的两阶段桥估计器来更高效地估计奖励-发射和观测-转移桥函数，从而改进模型强化学习的统计估计方法。

摘要翻译

基于模型的强化学习因其显式地估计奖励与转移模型，并通过模拟推演支持规划，在序列决策中具有吸引力。然而，在存在隐藏混杂因素的离线设置中，直接从观测数据学习得到的模型可能存在偏差。这一挑战在部分可观测系统中尤为突出，因为潜在因素可能共同影响动作、奖励及未来观测。近期研究表明，在此类混杂的部分可观测马尔可夫决策过程（POMDPs）中，策略评估可转化为估计满足条件矩限制（CMRs）的奖励-发射与观测-转移桥函数。本文研究这些桥函数的统计估计问题。我们将桥学习构建为一个条件矩限制问题，其干扰对象由条件均值嵌入与条件密度给出。随后，我们针对现有两阶段桥估计器提出了一种$K$折交叉拟合扩展方法。该方案保留了基于桥函数的原始识别策略，同时比单一样本分割更高效地利用可用数据。我们还推导了交叉拟合估计器的oracle比较界，并将最终误差分解为由干扰估计引起的第一阶段误差和由经验平均引起的第二阶段误差。

摘要 (Abstract)

Model-based reinforcement learning is attractive for sequential decision-making because it explicitly estimates reward and transition models and then supports planning through simulated rollouts. In offline settings with hidden confounding, however, models learned directly from observational data may be biased. This challenge is especially pronounced in partially observable systems, where latent factors may jointly affect actions, rewards, and future observations. Recent work has shown that policy evaluation in such confounded partially observable Markov decision processes (POMDPs) can be reduced to estimating reward-emission and observation-transition bridge functions satisfying conditional moment restrictions (CMRs). In this paper, we study the statistical estimation of these bridge functions. We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a $K$-fold cross-fitted extension of the existing two-stage bridge estimator. The proposed procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. We also derive an oracle-comparator bound for the cross-fitted estimator and decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.

关键词: model-based reinforcement learning, confounded POMDPs, bridge functions, conditional moment restrictions, cross-fitted estimator, statistical estimation, offline settings, hidden confounding

313. ❌ OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models

作者: Ali Aliev, Kamil Garifullin, Nikolay Yudin, Vera Soboleva, Alexander Molozhavenko, Ivan Oseledets, Aibek Alanov, Maxim Rakhuba 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05183v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于扩散模型的参数高效微调（PEFT）和模型合并技术，与关键词’PEFT OR LoRA OR Parameter-efficient Fine-tuning’高度相关（10分），因为全文讨论正交适配器的参数高效微调。与’Model Merging OR Model Soups OR Weight Averaging’高度相关（10分），因为核心贡献是训练自由的适配器合并方法。其他关键词均不相关（0分），因为论文不涉及大语言模型、推理、对齐、科学AI应用等主题，仅针对扩散模型的特定技术。

!!! tip deepseek-chat TL;DR

该论文解决了如何无需训练地合并扩散模型中针对不同任务（如主题和风格）调优的正交适配器的问题，提出了一种基于黎曼几何和谱恢复变换的融合方法，实现了概念和风格特征的联合。

摘要翻译

在模型训练这一快速发展的领域中，参数高效微调及利用少量训练数据使模型适应特定任务的各种技术始终受到广泛关注。然而，一个开放性问题依然存在：如何将针对不同任务微调的多个适配器合并为一个能在多项任务上均取得良好结果的统一适配器？具体而言，针对生成模型的主题适配器与风格适配器的融合问题尚未得到解决。本文旨在证明，在正交微调（OFT）的框架下，我们可以利用结构化正交参数化及其几何特性，推导出无需额外训练的适配器合并公式。特别地，我们分析了近期提出的分组洗牌（$\mathcal{GS}$）正交矩阵所形成的流形结构，并给出了近似测地线的高效计算公式。此外，我们提出了一种“谱恢复”变换，以恢复合并后适配器的谱特性，从而实现更高质量的融合。我们在主题驱动生成任务中进行了实验，结果表明，我们提出的合并两个$\mathcal{GS}$正交矩阵的技术能够有效整合不同适配器的概念特征与风格特征。据我们所知，这是首个无需训练即可合并乘法正交适配器的方法。代码可通过$\href{https://github.com/ControlGenAI/OrthoFuse}{链接}$获取。

摘要 (Abstract)

In a rapidly growing field of model training there is a constant practical interest in parameter-efficient fine-tuning and various techniques that use a small amount of training data to adapt the model to a narrow task. However, there is an open question: how to combine several adapters tuned for different tasks into one which is able to yield adequate results on both tasks? Specifically, merging subject and style adapters for generative models remains unresolved. In this paper we seek to show that in the case of orthogonal fine-tuning (OFT), we can use structured orthogonal parametrization and its geometric properties to get the formulas for training-free adapter merging. In particular, we derive the structure of the manifold formed by the recently proposed Group-and-Shuffle ($\mathcal{GS}$) orthogonal matrices, and obtain efficient formulas for the geodesics approximation between two points. Additionally, we propose a $\text{spectra restoration}$ transform that restores spectral properties of the merged adapter for higher-quality fusion. We conduct experiments in subject-driven generation tasks showing that our technique to merge two $\mathcal{GS}$ orthogonal matrices is capable of uniting concept and style features of different adapters. To the best of our knowledge, this is the first training-free method for merging multiplicative orthogonal adapters. Code is available via the $\href{https://github.com/ControlGenAI/OrthoFuse}{link}$.

关键词: Orthogonal Fine-Tuning, Adapter Merging, Diffusion Models, Parameter-efficient Fine-tuning, Training-free Fusion, Riemannian Geometry, Subject-driven Generation, Spectral Restoration

314. ❌ General Multimodal Protein Design Enables DNA-Encoding of Chemistry

作者: Jarrid Rector-Brooks, Théophile Lambert, Marta Skreta, Daniel Roth, Yueming Long, Zi-Qi Li, Xi Zhang, Miruna Cretu, Francesca-Zhoufan Li, Tanvi Ganapathy, Emily Jin, Avishek Joey Bose, Jason Yang, Kirill Neklyudov, Yoshua Bengio, Alexander Tong, Frances H. Arnold, Cheng-Hao Liu 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05181v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于蛋白质设计领域，提出了一种名为DISCO的多模态扩散模型，用于共同设计蛋白质序列和3D结构，并应用于酶催化新反应。论文的核心内容属于AI在生物科学（特别是生物信息学/化学信息学）领域的应用，与关键词列表中的’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（评10分）。然而，论文并未涉及大语言模型（LLMs）、模型架构（如MoE）、训练技术（如预训练、微调、对齐）、推理优化、智能体系统或其他通用大模型技术主题。因此，除上述一个关键词外，其余所有关键词均完全无关（评0分）。

!!! tip deepseek-chat TL;DR

该研究解决了如何设计能催化自然界不存在的新化学反应的酶的问题，通过开发多模态扩散模型DISCO，成功设计了多种新型血红素酶，实现了高效的碳烯转移反应，拓展了基因可编码转化的范围。

摘要翻译

进化是酶多样性产生的非凡引擎，然而其探索的化学反应范围仍远小于DNA所能编码的潜力。深度生成模型能够设计结合配体的新蛋白质，但尚未有模型能在不预先指定催化残基的情况下创造全新酶。我们提出了DISCO（序列-结构协同设计的扩散模型），这是一种多模态模型，能够围绕任意生物分子协同设计蛋白质序列与三维结构，并引入了在推理时优化双模态目标的缩放方法。仅以反应中间体为条件，DISCO设计出了具有新颖活性位点几何结构的多样化血红素酶。这些酶能催化自然界未曾发现的新型卡宾转移反应，包括烯烃环丙烷化、螺环丙烷化、B-H插入及C(sp$^3$)-H插入反应，其活性远超工程化酶。对选定设计的随机诱变进一步证实，通过定向进化可提升酶活性。DISCO为可进化酶的设计提供了可扩展的路径，从而拓宽了基因可编码转化的潜在范围。代码发布于https://github.com/DISCO-design/DISCO。

摘要 (Abstract)

Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre-specifying catalytic residues. We introduce DISCO (DIffusion for Sequence-structure CO-design), a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference-time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp$^3$)-H insertions, with high activities exceeding those of engineered enzymes. Random mutagenesis of a selected design further confirmed that enzyme activity can be improved through directed evolution. By providing a scalable route to evolvable enzymes, DISCO broadens the potential scope of genetically encodable transformations. Code is available at https://github.com/DISCO-design/DISCO.

关键词: protein design, enzyme design, diffusion model, multimodal model, carbene-transfer reactions, sequence-structure co-design, novel active-site geometries, directed evolution

315. ❌ Graph Signal Diffusion Models for Wireless Resource Allocation

作者: Yigit Berkay Uslu, Samar Hadou, Shirin Saeedi Bidokhti, Alejandro Ribeiro 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05175v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究无线网络中的资源分配问题，使用扩散模型和图神经网络技术，属于深度学习在通信领域的应用。但所有评分关键词均与大语言模型（LLM）相关，而论文完全不涉及LLM、自然语言处理或相关技术（如MoE、RLHF、RAG等），也未涉及AI for Science中的生物信息学或化学信息学。因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于扩散模型和图神经网络的无线网络资源分配方法，通过训练生成模型来近似最优资源分配策略，在功率控制案例中实现了接近最优的遍历和速率效用和可行性。

摘要翻译

我们研究具有图结构干扰的无线网络中的约束遍历资源优化问题。我们训练扩散模型策略以匹配专家在资源分配上的条件分布。通过利用原始-对偶（专家）算法，我们生成原始迭代序列，这些序列可作为每个训练网络实例对应专家条件分布的采样样本。我们将资源分配视为在已知信道状态图上支撑的随机图信号。扩散模型架构采用图神经网络（GNN）模块构成的U-Net层次结构实现，并以信道状态及附加节点状态为条件。在推理阶段，学习到的生成模型通过直接从接近最优的条件分布中采样分配向量，实现了对迭代专家策略的摊销计算。在功率控制的案例研究中，我们证明通过时间共享生成的功率分配方案，能够实现接近最优的遍历和速率效用与接近可行的遍历最小速率，并在不同网络状态间展现出强大的泛化与迁移能力。

摘要 (Abstract)

We consider constrained ergodic resource optimization in wireless networks with graph-structured interference. We train a diffusion model policy to match expert conditional distributions over resource allocations. By leveraging a primal-dual (expert) algorithm, we generate primal iterates that serve as draws from the corresponding expert conditionals for each training network instance. We view the allocations as stochastic graph signals supported on known channel state graphs. We implement the diffusion model architecture as a U-Net hierarchy of graph neural network (GNN) blocks, conditioned on the channel states and additional node states. At inference, the learned generative model amortizes the iterative expert policy by directly sampling allocation vectors from the near-optimal conditional distributions. In a power-control case study, we show that time-sharing the generated power allocations achieves near-optimal ergodic sum-rate utility and near-feasible ergodic minimum-rates, with strong generalization and transferability across network states.

关键词: wireless resource allocation, diffusion model, graph neural network, power control, ergodic optimization, graph-structured interference, generative model, primal-dual algorithm

316. ❌ Not All Turns Are Equally Hard: Adaptive Thinking Budgets For Efficient Multi-Turn Reasoning

作者: Neharika Jali, Anupam Nayak, Gauri Joshi 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05164v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM在多轮推理场景中的计算效率优化，通过自适应预算分配减少推理时的计算开销。与"Large Language Models"高度相关（10分），因为论文明确以LLM为研究对象；与"Chain of Thought"和"System 2 Thinking"高度相关（10分），因为论文研究多步推理和深度推理问题；与"Speculative Decoding"相关（10分），因为论文关注推理加速和计算效率优化。其他关键词如MoE、SLMs、训练方法、对齐、RAG、量化等与论文内容无直接关联，均给0分。

!!! tip deepseek-chat TL;DR

该论文针对LLM在多轮推理中计算效率低下的问题，提出了自适应预算分配方法TAB，在数学推理基准上节省35-40%的计算资源同时保持准确性。

摘要翻译

随着大语言模型推理性能进入平台期，提升推理时计算效率对于缓解模型在处理简单查询时出现的过度思考与冗长思维链问题至关重要。现有方法如长度正则化、自适应路由及基于难度的计算预算分配主要针对单轮推理场景，未能解决多轮推理中固有的序列依赖性问题。本研究将多轮推理建模为序列化计算资源分配问题，并将其形式化为多目标马尔可夫决策过程。我们提出TAB：轮次自适应预算分配策略，该策略通过群体相对策略优化训练，在遵循全局单问题令牌数约束的前提下学习最大化任务准确率。TAB以对话历史为输入，通过学习自适应地为较易轮次分配较少计算预算，同时为关键的高难度推理步骤预留适量令牌。在数学推理基准测试中，TAB实现了更优的准确率-令牌数权衡，相比静态基线及现成大语言模型预算分配方法，在保持准确率的同时最高可节省35%的令牌消耗。此外，对于可预先获取所有子问题规划的系统，我们进一步提出TAB All-SubQ策略，该策略基于对话历史及全部历史与未来子问题进行令牌预算分配，相比基线最高可节省40%的令牌消耗。

摘要 (Abstract)

As LLM reasoning performance plateau, improving inference-time compute efficiency is crucial to mitigate overthinking and long thinking traces even for simple queries. Prior approaches including length regularization, adaptive routing, and difficulty-based budget allocation primarily focus on single-turn settings and fail to address the sequential dependencies inherent in multi-turn reasoning.In this work, we formulate multi-turn reasoning as a sequential compute allocation problem and model it as a multi-objective Markov Decision Process. We propose TAB: Turn-Adaptive Budgets, a budget allocation policy trained via Group Relative Policy Optimization (GRPO) that learns to maximize task accuracy while respecting global per-problem token constraints. Consequently, TAB takes as input the conversation history and learns to adaptively allocate smaller budgets to easier turns and save appropriate number of tokens for the crucial harder reasoning steps. Our experiments on mathematical reasoning benchmarks demonstrate that TAB achieves a superior accuracy-tokens tradeoff saving up to 35% tokens while maintaining accuracy over static and off-the-shelf LLM budget baselines. Further, for systems where a plan of all sub-questions is available apriori, we propose TAB All-SubQ, a budget allocation policy that budgets tokens based on the conversation history and all past and future sub-questions saving up to 40% tokens over baselines.

关键词: LLM reasoning, multi-turn reasoning, compute efficiency, adaptive budget allocation, inference-time optimization, mathematical reasoning, token constraints, sequential dependencies

317. ❌ EffiPair: Improving the Efficiency of LLM-generated Code with Relative Contrastive Feedback

作者: Samira Hajizadeh, Suman Jana 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05137v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM生成的代码效率问题，提出无需微调的推理时反馈机制，因此与’Large Language Models’高度相关（10分），与’Self-Correction’相关（8分，涉及迭代改进），其他关键词如MoE、SLMs、训练方法、推理加速、科学AI等均未涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文针对LLM生成的代码效率低下问题，提出了无需模型微调的相对对比反馈机制EffiPair，在推理时通过对比程序对生成更高效的代码，实验表明能显著提升运行速度并减少token使用。

摘要翻译

大语言模型（LLM）生成的代码往往功能正确，但在运行时间和内存使用上效率低下。以往提升代码效率的方法通常依赖于绝对执行反馈，例如分析单个程序的运行时间或内存使用情况，这种方式成本高昂且对代码优化的指导作用有限。我们提出相对对比反馈（Relative Contrastive Feedback, RCF），一种无需模型微调或参数更新的推理时反馈机制。RCF 针对同一任务比较两个结构相似的代码，并突出显示与更高效率相关的差异。基于这一思路，我们引入了 EffiPair——一个完全在测试时运行的推理时迭代优化框架。该框架通过生成多个候选解决方案，识别出效率差距显著且具有信息量的代码对，将其执行差异总结为轻量级反馈，并利用这一信号生成更高效的解决方案。通过用成对对比比较替代孤立的标量反馈，EffiPair 在降低性能分析和提示开销的同时，提供了更直接的优化指导。在代码效率基准测试上的实验表明，EffiPair 能在保持代码正确性的同时持续提升效率。例如，在 DeepSeek-Chat V3.2 模型上，EffiPair 相比无性能反馈的代码生成实现了最高 1.5 倍的加速，同时与先前工作相比减少了超过 90% 的令牌使用量。

摘要 (Abstract)

Large language models (LLMs) often generate code that is functionally correct but inefficient in runtime and memory. Prior approaches to improving code efficiency typically rely on absolute execution feedback, such as profiling a single program’s runtime or memory usage, which is costly and provides weak guidance for refinement. We propose Relative Contrastive Feedback (RCF), an inference-time feedback mechanism that requires no model fine-tuning or parameter updates. RCF compares two structurally similar programs for the same task and highlights the differences associated with better efficiency. Building on this idea, we introduce EffiPair, an inference-time iterative refinement framework that operates entirely at test time by generating multiple candidate solutions, identifying informative program pairs with large efficiency gaps, summarizing their execution differences into lightweight feedback, and using this signal to produce more efficient solutions. By replacing isolated scalar feedback with pairwise contrastive comparisons, EffiPair provides more direct guidance while reducing profiling and prompting overhead. Experiments on code-efficiency benchmarks show that EffiPair consistently improves efficiency while preserving correctness. For instance, with DeepSeek-Chat V3.2, EffiPair achieves up to 1.5x speedup over generation without performance feedback, while reducing token usage by more than 90% compared to prior work.

关键词: LLM-generated code, code efficiency, Relative Contrastive Feedback, inference-time refinement, iterative refinement, execution feedback, DeepSeek-Chat, token reduction

318. ❌ Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning

作者: Lucas Dionisopoulos, Nicklas Majamaki, Prithviraj Ammanabrolu 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05134v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型在象棋任务中的推理能力演化，重点关注监督微调(SFT)和强化学习(RL)对推理质量的影响。高度相关的关键词包括：LLMs（研究对象）、SFT（核心方法）、CoT Reasoning（研究推理过程）、System 2 Thinking（涉及深度推理分析）、Hallucination Mitigation（研究幻觉减少）。Mechanistic Interpretability获得5分，因为论文分析了推理机制但非主要焦点。其他关键词与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文研究如何通过监督微调和强化学习使语言模型在象棋任务中发展推理能力，发现直接预测最佳移动的微调能产生最强性能但导致不忠实推理，而多移动轨迹训练能实现可比较性能并保持忠实推理。

摘要翻译

如何让语言模型在其原本难以应对的任务中进行推理？我们通过分析一系列理论启发型数据集如何影响语言模型在国际象棋中的表现，研究了推理能力在语言模型中的演化过程——从监督微调（SFT）到强化学习（RL）。研究发现，对模型进行直接预测最佳棋步的微调能带来有效的强化学习及最强的下游性能，但强化学习步骤会引发不忠实的推理（即推理与所选棋步不一致）。另一种方案是，基于多步棋轨迹进行训练可在保持忠实推理和更稳定强化学习的同时，获得可比的下游性能。我们证明强化学习能显著提升棋步质量的分布，并作为副作用降低幻觉率。最后，我们发现若干SFT检查点指标——涵盖评估性能、幻觉率及推理质量的度量——能够预测强化学习后模型的性能。我们发布了检查点、最终模型以及训练数据、评估工具和代码，这些资源使我们能够凭借一个70亿参数的模型在国际象棋领域超越领先的开源推理模型。

摘要 (Abstract)

How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model – from supervised fine-tuning (SFT) to reinforcement learning (RL) – by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance – however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect. Finally, we find several SFT-checkpoint metrics – metrics spanning evaluation performance, hallucination rates, and reasoning quality – to be predictive of post-RL model performance. We release checkpoints and final models as well as training data, evaluations, and code which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.

关键词: language model reasoning, supervised fine-tuning, reinforcement learning, chess reasoning, faithful reasoning, hallucination reduction, reasoning evolution, multi-move trajectories

319. ❌ On the Exploitability of FTRL Dynamics

作者: Yiheng Su, Emmanouil-Vasileios Vlatakis-Gkaragkounis 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05129v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是博弈论中Follow-the-Regularized-Leader (FTRL)算法的可剥削性分析，属于优化理论和博弈论领域，与所有提供的大模型、深度学习、AI应用等关键词完全无关。论文没有涉及任何大模型技术、训练方法、推理优化、AI应用或科学AI相关内容。

!!! tip deepseek-chat TL;DR

该论文研究了在两人零和博弈中，具有恒定步长的FTRL学习者在面对全知优化者时的可剥削性，证明了剥削性是FTRL算法的固有特性，并分析了不同正则化器对剥削行为的影响。

摘要翻译

本文研究了在$T$轮次中，具有恒定步长$η$的跟随正则化领导者（Follow-the-Regularized-Leader, FTRL）学习者在$n\times m$双人零和博弈中对抗一个具有完全信息的优化者时的可剥削性。与先前分析不同，我们证明了可剥削性是FTRL算法族的一个固有特征，而非特定实例的产物。首先，对于固定的优化者，我们建立了一个阶为$Ω(N/η)$的普遍规律，证明剥削程度与学习者次优行动的数量$N$成正比，且在没有次优行动时消失。其次，对于交替行动的优化者，在随机博弈中，无论均衡结构如何，都能以高概率保证获得$Ω(ηT/\mathrm{poly}(n,m))$的额外收益。我们的分析再次揭示了尖锐的几何二分现象：非陡峭的正则化器允许优化者通过有限时间内消除次优行动来获取最大收益，而陡峭的正则化器则引入一个趋于零的修正项，可能延迟剥削的发生。最后，我们讨论了在双边收益不确定的情况下这种杠杆效应是否持续存在，并提出了易感性度量以量化哪些正则化器最容易受到策略性操纵的影响。

摘要 (Abstract)

In this paper we investigate the exploitability of a Follow-the-Regularized-Leader (FTRL) learner with constant step size $η$ in $n\times m$ two-player zero-sum games played over $T$ rounds against a clairvoyant optimizer. In contrast with prior analysis, we show that exploitability is an inherent feature of the FTRL family, rather than an artifact of specific instantiations. First, for fixed optimizer, we establish a sweeping law of order $Ω(N/η)$, proving that exploitation scales to the number of the learner’s suboptimal actions $N$ and vanishes in their absence. Second, for alternating optimizer, a surplus of $Ω(ηT/\mathrm{poly}(n,m))$ can be guaranteed regardless of the equilibrium structure, with high probability, in random games. Our analysis uncovers once more the sharp geometric dichotomy: non-steep regularizers allow the optimizer to extract maximum surplus via finite-time elimination of suboptimal actions, whereas steep ones introduce a vanishing correction that may delay exploitation. Finally, we discuss whether this leverage persists under bilateral payoff uncertainty and we propose susceptibility measure to quantify which regularizers are most vulnerable to strategic manipulation.

关键词: FTRL, exploitability, zero-sum games, regularizers, optimization, game theory, strategic manipulation

320. ❌ Governance-Aware Agent Telemetry for Closed-Loop Enforcement in Multi-Agent AI Systems

作者: Anshul Pathak, Nishant Jain 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05119v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于多智能体系统的治理和可观测性架构，与大多数关键词无关。仅与’LLM Agents OR Autonomous Agents OR Agentic Workflow’和’Multi-agent Systems OR Agent Coordination’高度相关（10分），因为论文核心是解决多智能体系统中的治理和执行问题。其他关键词涉及模型训练、推理优化、特定应用领域等，均未在论文中涉及。

!!! tip deepseek-chat TL;DR

该论文提出了Governance-Aware Agent Telemetry (GAAT)架构，解决了企业多智能体AI系统中观测工具只能检测而无法实时执行治理策略的问题，实现了从遥测收集到自动化策略执行的闭环。

摘要翻译

企业多智能体AI系统每小时产生数千次智能体间交互，但现有可观测性工具仅能捕获这些依赖关系而无法实施任何约束。OpenTelemetry与Langfuse虽能收集遥测数据，却将治理视为下游分析问题而非实时执行目标，导致形成“只观测不干预”的断层——策略违规往往仅在损害发生后才能被察觉。
本文提出治理感知型智能体遥测框架，该参考架构通过四项核心创新实现了多智能体系统遥测收集与自动化策略执行的闭环联动：采用扩展OpenTelemetry的治理遥测规范，为遥测数据注入治理属性；构建基于OPA兼容声明式规则、延迟低于200毫秒的实时违规检测引擎；设计具备分级干预能力的治理执行总线；建立搭载密码学溯源机制的可信遥测平面。

摘要 (Abstract)

Enterprise multi-agent AI systems produce thousands of inter-agent interactions per hour, yet existing observability tools capture these dependencies without enforcing anything. OpenTelemetry and Langfuse collect telemetry but treat governance as a downstream analytics concern, not a real-time enforcement target. The result is an “observe-but-do-not-act” gap where policy violations are detected only after damage is done. We present Governance-Aware Agent Telemetry (GAAT), a reference architecture that closes the loop between telemetry collection and automated policy enforcement for multi-agent systems. GAAT introduces (1) a Governance Telemetry Schema (GTS) extending OpenTelemetry with governance attributes; (2) a real-time policy violation detection engine using OPA-compatible declarative rules under sub-200 ms latency; (3) a Governance Enforcement Bus (GEB) with graduated interventions; and (4) a Trusted Telemetry Plane with cryptographic provenance.

关键词: multi-agent systems, governance, telemetry, policy enforcement, real-time detection, OpenTelemetry, agent coordination, trusted telemetry

321. ❌ Offline RL for Adaptive Policy Retrieval in Prior Authorization

作者: Ruslan Sharifullin, Maxim Gorshkov, Hannah Clay 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05125v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究医疗授权（Prior Authorization）中的自适应策略检索问题，将其建模为马尔可夫决策过程（MDP），并使用离线强化学习方法（包括CQL、IQL和DPO）进行训练。论文与大多数大模型技术关键词无关，但与’RLHF OR RLAIF OR Direct Preference Optimization OR DPO’高度相关（10分），因为论文明确使用了DPO方法。与’Retrieval-Augmented Generation OR RAG OR Retrieval-Generation’有一定关联（5分），因为论文涉及检索增强系统，但重点在自适应检索而非生成。其他关键词均未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文将医疗授权中的政策检索建模为序列决策问题，使用离线强化学习方法（包括DPO）训练自适应检索策略，在保持高准确率的同时显著减少了检索步骤。

摘要翻译

预先授权（PA）要求对复杂且分散的覆盖政策进行解读，然而现有的检索增强系统依赖于静态的 top-$K$ 策略，其检索的章节数量固定。这种固定检索方式可能效率低下，并可能收集到不相关或不足的信息。我们将 PA 的政策检索建模为一个序贯决策问题，将自适应检索表述为马尔可夫决策过程（MDP）。在我们的系统中，智能体迭代地从 top-$K$ 候选集中选择政策片段，或选择停止并给出决策。奖励机制在决策正确性与检索成本之间进行权衡，以捕捉准确性与效率之间的平衡。我们在离线强化学习（RL）设置下，使用保守 Q 学习（CQL）、隐式 Q 学习（IQL）和直接偏好优化（DPO）来训练策略，训练数据是基于从公开可用的 CMS（美国医疗保险和医疗补助服务中心）覆盖数据衍生的合成 PA 请求、由基线检索策略生成的已记录轨迹。在一个包含 10 项 CMS 手术、共 186 个政策片段的语料库上，CQL 通过穷举检索实现了 92% 的决策准确率（比最佳固定-$K$ 基线高出 30 个百分点），而 IQL 在减少 44% 检索步骤的同时达到了与最佳基线相当的准确率，并且是所有策略中唯一获得正向回合回报的。在转移层面，DPO 在减少 47% 检索步骤（10.6 步 vs. 20.0 步）的情况下，与 CQL 一样达到了 92% 的准确率，占据了帕累托前沿上的一个“选择性-准确”区域，其表现优于 CQL 和行为克隆（BC）。行为克隆基线表现与 CQL 相当，这证实了需要基于优势加权或偏好的策略提取方法来学习选择性检索。对步骤成本 $λ\in {0.05, 0.1, 0.2}$ 进行的 Lambda 消融实验揭示了一个明显的准确率-效率拐点：仅在 $λ= 0.2$ 时，CQL 才从穷举检索转变为选择性检索。

摘要 (Abstract)

Prior authorization (PA) requires interpretation of complex and fragmented coverage policies, yet existing retrieval-augmented systems rely on static top-$K$ strategies with fixed numbers of retrieved sections. Such fixed retrieval can be inefficient and gather irrelevant or insufficient information. We model policy retrieval for PA as a sequential decision-making problem, formulating adaptive retrieval as a Markov Decision Process (MDP). In our system, an agent iteratively selects policy chunks from a top-$K$ candidate set or chooses to stop and issue a decision. The reward balances decision correctness against retrieval cost, capturing the trade-off between accuracy and efficiency. We train policies using Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Direct Preference Optimization (DPO) in an offline RL setting on logged trajectories generated from baseline retrieval strategies over synthetic PA requests derived from publicly available CMS coverage data. On a corpus of 186 policy chunks spanning 10 CMS procedures, CQL achieves 92% decision accuracy (+30 percentage points over the best fixed-$K$ baseline) via exhaustive retrieval, while IQL matches the best baseline accuracy using 44% fewer retrieval steps and achieves the only positive episodic return among all policies. Transition-level DPO matches CQL’s 92% accuracy while using 47% fewer retrieval steps (10.6 vs. 20.0), occupying a “selective-accurate” region on the Pareto frontier that dominates both CQL and BC. A behavioral cloning baseline matches CQL, confirming that advantage-weighted or preference-based policy extraction is needed to learn selective retrieval. Lambda ablation over step costs $λ\in {0.05, 0.1, 0.2}$ reveals a clear accuracy-efficiency inflection: only at $λ= 0.2$ does CQL transition from exhaustive to selective retrieval.

关键词: Prior Authorization, Adaptive Retrieval, Offline Reinforcement Learning, Markov Decision Process, Direct Preference Optimization, Policy Retrieval, Conservative Q-Learning, Implicit Q-Learning

322. ❌ Probabilistic Tree Inference Enabled by FDSOI Ferroelectric FETs

作者: Pengyu Ren, Xingtian Wang, Boyang Cheng, Jiahui Duan, Giuk Kim, Xuezhong Niu, Halid Mulaosmanovic, Stefan Duenkel, Sven Beyer, X. Sharon Hu, Ningyuan Cao, Kai Ni 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05115v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是基于FDSOI-FeFET硬件的贝叶斯决策树（BDT）实现，属于硬件加速和机器学习交叉领域。论文核心是硬件架构创新（FeFET器件、ACAM、GRNG），用于加速贝叶斯决策树推理，而非大模型或深度学习技术。所有关键词均围绕大模型、深度学习技术原理、训练方法、推理优化、对齐、代理等主题，与论文的硬件加速贝叶斯决策树研究无直接关联。论文虽涉及AI应用（如医疗诊断），但未涉及生物信息学或化学信息学等具体科学AI应用。因此，所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于FDSOI-FeFET的硬件平台，用于高效实现贝叶斯决策树，在MNIST数据集上相比传统决策树提升40%以上分类准确率，并在速度和能效上相比CPU/GPU基线有多个数量级改进。

摘要翻译

人工智能在自动驾驶、医疗诊断和金融系统中的应用日益要求机器学习模型能够提供稳健的不确定性量化、可解释性及噪声鲁棒性。贝叶斯决策树因其兼具概率推理、可解释的决策过程以及对噪声的鲁棒性，在这些任务中展现出显著优势。然而，现有基于CPU和GPU的贝叶斯决策树硬件实现受限于内存瓶颈和不规则的处理模式，而利用模拟内容可寻址存储器（ACAM，analog content-addressable memory）与高斯随机数发生器（GRNG，Gaussian random number generator）的多平台解决方案则带来了集成复杂性和能耗开销。本文报道了一种单片FDSOI-FeFET硬件平台，该平台原生支持ACAM和GRNG功能。FeFET的铁电极化特性为ACAM实现了紧凑、高效的多比特存储，而栅漏重叠区的带带隧穿效应及随后在浮空体中存储的空穴为GRNG提供了高质量熵源。系统级评估表明，所提出的架构在保证高能效的同时，提供了稳健的不确定性估计、可解释性和噪声容忍度。在数据集噪声和器件变异的影响下，该架构在MNIST数据集上的分类准确率较传统决策树提升超过40%。此外，其处理速度较CPU和GPU基准方案提升超过两个数量级，能效提升超过四个数量级，从而为在资源受限和安全关键环境中部署贝叶斯决策树提供了一种可扩展的解决方案。

摘要 (Abstract)

Artificial intelligence applications in autonomous driving, medical diagnostics, and financial systems increasingly demand machine learning models that can provide robust uncertainty quantification, interpretability, and noise resilience. Bayesian decision trees (BDTs) are attractive for these tasks because they combine probabilistic reasoning, interpretable decision-making, and robustness to noise. However, existing hardware implementations of BDTs based on CPUs and GPUs are limited by memory bottlenecks and irregular processing patterns, while multi-platform solutions exploiting analog content-addressable memory (ACAM) and Gaussian random number generators (GRNGs) introduce integration complexity and energy overheads. Here we report a monolithic FDSOI-FeFET hardware platform that natively supports both ACAM and GRNG functionalities. The ferroelectric polarization of FeFETs enables compact, energy-efficient multi-bit storage for ACAM, and band-to-band tunneling in the gate-to-drain overlap region and subsequent hole storage in the floating body provides a high-quality entropy source for GRNG. System-level evaluations demonstrate that the proposed architecture provides robust uncertainty estimation, interpretability, and noise tolerance with high energy efficiency. Under both dataset noise and device variations, it achieves over 40% higher classification accuracy on MNIST compared to conventional decision trees. Moreover, it delivers more than two orders of magnitude speedup over CPU and GPU baselines and over four orders of magnitude improvement in energy efficiency, making it a scalable solution for deploying BDTs in resource-constrained and safety-critical environments.

关键词: FDSOI-FeFET, Bayesian decision trees, hardware acceleration, uncertainty quantification, analog content-addressable memory, Gaussian random number generator, energy efficiency, MNIST classification

323. ❌ Good Rankings, Wrong Probabilities: A Calibration Audit of Multimodal Cancer Survival Models

作者: Sajad Ghawami 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.04239v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文研究多模态深度学习模型在癌症生存预测中的校准问题，属于AI在生物医学领域的应用，仅与’AI for Science OR Bioinformatics OR Cheminformatics’关键词相关（评8分，因涉及生物信息学应用但非核心大模型技术），其他关键词均涉及大模型技术原理（如LLM、MoE、训练方法等），论文未涉及这些内容，故评0分。

!!! tip deepseek-chat TL;DR

该论文首次系统审计了融合全切片组织病理图像和基因组数据的多模态深度学习模型在癌症生存预测中的概率校准问题，发现多数模型校准失败，并提出门控融合和Platt缩放可改善校准而不影响判别性能。

摘要翻译

融合全切片组织病理学图像与基因组数据的多模态深度学习模型，在癌症生存预测方面已展现出强大的判别性能（以一致性指数衡量）。然而，这些模型衍生的生存概率——无论是直接从原始输出获得，还是通过标准事后重建得到——是否经过校准，在很大程度上仍未得到检验。据我们所知，我们首次对多模态WSI-基因组学生存预测架构进行了系统性的折叠水平一阶校准审计，评估了原始的离散时间生存输出（实验A：在TCGA-BRCA数据集上评估3个模型）以及从标量风险分数通过Breslow方法重建的生存曲线（实验B：跨越5种TCGA癌症类型评估11种架构）。在实验A中，所有三个模型在大多数折叠上都未能通过一阶校准（经过Benjamini-Hochberg校正后，15项折叠水平测试中有12项拒绝原假设）。在全部290项折叠水平测试中，有166项在Benjamini-Hochberg校正（错误发现率FDR = 0.05）后，于中位事件时间点拒绝了正确校准的原假设。例如，MCAT模型在GBMLGG数据集上取得了一致性指数0.817，但在全部五个折叠上都未能通过一阶校准。基于门控机制的融合方式与更好的校准性能相关；而双线性和拼接融合方式则不然。事后普拉特缩放法能在不影响判别能力的情况下，减少在评估时间点上的校准错误（例如，MCAT模型：未能通过的折叠从5/5减少至2/5）。仅凭一致性指数不足以评估旨在临床应用的生存预测模型。

摘要 (Abstract)

Multimodal deep learning models that fuse whole-slide histopathology images with genomic data have achieved strong discriminative performance for cancer survival prediction, as measured by the concordance index. Yet whether the survival probabilities derived from these models - either directly from native outputs or via standard post-hoc reconstruction - are calibrated remains largely unexamined. We conduct, to our knowledge, the first systematic fold-level 1-calibration audit of multimodal WSI-genomics survival architectures, evaluating native discrete-time survival outputs (Experiment A: 3 models on TCGA-BRCA) and Breslow-reconstructed survival curves from scalar risk scores (Experiment B: 11 architectures across 5 TCGA cancer types). In Experiment A, all three models fail 1-calibration on a majority of folds (12 of 15 fold-level tests reject after Benjamini-Hochberg correction). Across the full 290 fold-level tests, 166 reject the null of correct calibration at the median event time after Benjamini-Hochberg correction (FDR = 0.05). MCAT achieves C-index 0.817 on GBMLGG yet fails 1-calibration on all five folds. Gating-based fusion is associated with better calibration; bilinear and concatenation fusion are not. Post-hoc Platt scaling reduces miscalibration at the evaluated horizon (e.g., MCAT: 5/5 folds failing to 2/5) without affecting discrimination. The concordance index alone is insufficient for evaluating survival models intended for clinical use.

关键词: multimodal deep learning, cancer survival prediction, calibration audit, whole-slide histopathology, genomic data, concordance index, gating-based fusion, Platt scaling

324. ❌ Multidimensional physical fitness is associated with reduced dementia risk through proteomic and neuroimaging pathways: a prospective cohort study of the UK Biobank

作者: Yiqing Sun, Runyu Lin, Jiayue Qin, Feiyue Pan, Bingjie Li, Zhigang Yao 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.03952v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文是一项关于多维体能（握力、心肺功能、肺功能）与痴呆风险关系的流行病学、蛋白质组学和神经影像学研究，使用UK Biobank数据进行前瞻性队列分析。论文主题属于生物医学/公共卫生领域，主要涉及痴呆预防、体能评估、蛋白质组学分析和脑成像。所有关键词均与大模型、深度学习技术原理或AI技术直接相关，而该论文未涉及任何AI模型、算法或技术应用。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文使用了蛋白质组学数据分析（可视为生物信息学相关），但论文本身并未应用或开发AI方法，因此给予5分（有一定关联）。其他关键词完全无关，均为0分。

!!! tip deepseek-chat TL;DR

该研究通过UK Biobank队列发现，较高的握力、心肺功能和肺功能均独立与降低痴呆风险相关，并通过蛋白质组学和神经影像学分析揭示了这些关联涉及神经炎症、神经血管和脑结构（如海马体积）等多重机制。

摘要翻译

全球有超过5500万人受痴呆症影响，然而身体适能的不同维度是否通过共享或相异的生物学机制独立发挥神经保护作用，目前尚不明确。本研究利用英国生物样本库（UK Biobank，n = 51,517；12年随访期），整合流行病学、蛋白质组学和神经影像学分析，系统性地刻画了多维适能与痴呆症之间的关联。更高的握力、心肺适能（cardiorespiratory fitness）和肺功能均独立与降低的痴呆风险相关（最高与最低三分位数相比，风险比HR分别为0.50、0.62和0.73），且在女性和较年轻个体中关联更强。血浆蛋白质组学分析揭示了领域特异性的分子特征——神经丝轻链（neurofilament light chain）在肌肉适能与心肺适能中占主导，而肺功能则与包括生长分化因子15（GDF15）在内的炎症介质相关——每个领域有22-40种蛋白质能独立预测痴呆风险，并共同指向神经炎症和神经血管通路。脑部磁共振成像（MRI）分析发现海马体积是重要的结构中介因子（中介比例：3.7-10.1%），表明结构保存是多重机制通路之一。人群归因分数分析估计，欠佳的适能状态可能贡献约26%的痴呆病例。这些发现揭示，多维身体适能通过不同但汇聚的神经炎症、神经血管及脑结构机制影响痴呆风险，对全生命周期的预防具有启示意义。

摘要 (Abstract)

Dementia affects over 55 million people worldwide, yet whether distinct domains of physical fitness independently protect against neurodegeneration through shared or divergent biological mechanisms remains unknown. Using the UK Biobank (n = 51,517; 12-year follow-up), we integrated epidemiological, proteomic, and neuroimaging analyses to systematically characterize the multidimensional fitness-dementia relationship. Higher handgrip strength, cardiorespiratory fitness, and pulmonary function were each independently associated with reduced dementia risk (HRs 0.50, 0.62, and 0.73, respectively, for highest vs. lowest tertiles), with stronger associations in women and younger individuals. Plasma proteomic profiling revealed domain-specific molecular signatures–neurofilament light chain predominating for muscular and cardiorespiratory fitness, and inflammatory mediators including GDF15 for pulmonary function–with 22-40 proteins per domain independently predicting dementia, converging on neuroinflammatory and neurovascular pathways. Brain MRI analyses identified hippocampal volume as a significant structural mediator (proportion mediated: 3.7-10.1%), indicating structural preservation as one of multiple mechanistic pathways. Population attributable fraction analyses estimated that suboptimal fitness may account for approximately 26% of dementia cases. These findings reveal that multidimensional physical fitness shapes dementia risk through distinct yet converging neuroinflammatory, neurovascular, and structural brain mechanisms, with implications for life-course prevention.

关键词: dementia risk, physical fitness, proteomic profiling, neuroimaging, UK Biobank, hippocampal volume, neuroinflammation, population attributable fraction

325. ❌ Align Your Structures: Generating Trajectories with Structure Pretraining for Molecular Dynamics

作者: Aniketh Iyengar, Jiaqi Han, Pengwei Sun, Mingjian Jiang, Jianwen Xie, Stefano Ermon 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.03911v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文专注于分子动力学轨迹生成的深度学习应用，属于AI for Science领域，与"AI for Science OR Bioinformatics OR Cheminformatics"高度相关（10分）。其他关键词主要涉及大语言模型技术、推理、对齐、优化等，论文未涉及这些内容，因此均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种利用结构预训练生成分子动力学轨迹的新框架，通过扩散模型生成结构并引入插值器确保时间一致性，在QM9和DRUGS数据集上验证了其生成化学真实轨迹的有效性。

摘要翻译

利用深度生成模型生成分子动力学轨迹已受到越来越多的关注，但由于分子动力学数据可用性有限以及高维分子动力学分布建模的复杂性，该任务本质上仍具挑战性。为克服这些挑战，我们提出了一种新颖框架，该框架利用结构预训练进行分子动力学轨迹生成。具体而言，我们首先在一个大规模构象数据集上训练一个基于扩散的结构生成模型，在此基础上，我们引入一个在分子动力学轨迹数据上训练的插值器模块，该模块旨在确保生成结构间的时间一致性。我们的方法有效利用了丰富的结构数据以缓解分子动力学轨迹数据的稀缺性，并将复杂的分子动力学建模任务有效分解为两个可处理的子问题：结构生成与时间对齐。我们在QM9和DRUGS小分子数据集上，针对无条件生成、正向模拟和插值任务全面评估了我们的方法，并将框架与分析进一步扩展到四肽和蛋白质单体系统。实验结果证实，我们的方法在生成化学上真实的分子动力学轨迹方面表现卓越，几何、动力学和能量测量精度的显著提升为此提供了有力证据。

摘要 (Abstract)

Generating molecular dynamics (MD) trajectories using deep generative models has attracted increasing attention, yet remains inherently challenging due to the limited availability of MD data and the complexities involved in modeling high-dimensional MD distributions. To overcome these challenges, we propose a novel framework that leverages structure pretraining for MD trajectory generation. Specifically, we first train a diffusion-based structure generation model on a large-scale conformer dataset, on top of which we introduce an interpolator module trained on MD trajectory data, designed to enforce temporal consistency among generated structures. Our approach effectively harnesses abundant structural data to mitigate the scarcity of MD trajectory data and effectively decomposes the intricate MD modeling task into two manageable subproblems: structural generation and temporal alignment. We comprehensively evaluate our method on the QM9 and DRUGS small-molecule datasets across unconditional generation, forward simulation, and interpolation tasks, and further extend our framework and analysis to tetrapeptide and protein monomer systems. Experimental results confirm that our approach excels in generating chemically realistic MD trajectories, as evidenced by remarkable improvements of accuracy in geometric, dynamical, and energetic measurements.

关键词: molecular dynamics, trajectory generation, structure pretraining, diffusion model, temporal consistency, deep generative models, small-molecule datasets, protein monomer systems

326. ❌ Probing of Core Excitons in Solid NaF with Polarization-Selective Attosecond Time-Resolved Four-Wave Mixing Spectroscopy

作者: Kevin Gulu Xiong, Rafael Quintero-Bermudez, Vincent Eggers, Hugo Laurell, Melody Wu, Stephen R. Leone 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06112v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究固体NaF中核心激子的超快动力学，使用阿秒四波混频光谱技术，属于实验物理化学领域，与所有大模型、深度学习、AI技术关键词完全无关；仅与’AI for Science’有微弱关联（5分），因为该研究属于科学实验，但未涉及AI方法或应用。

!!! tip deepseek-chat TL;DR

该研究使用阿秒四波混频光谱技术揭示了氟化钠中核心激子的超快退相干过程，发现退相干时间快于仪器响应极限，并利用偏振控制区分了亮暗激子的轨道角动量特征。

摘要翻译

非线性四波混频过程是揭示固态系统中超快动力学的强大技术。本文采用阿秒四波混频光谱技术，结合一束极紫外泵浦光和两束独立延时、非共线的近红外探测光，解析了氟化钠中钠离子L2,3边处偶极允许与偶极禁戒核心激子的超快退相干过程。观测发现核心激子的退相干时间远快于仪器响应时间的8飞秒极限，这归因于强激子-声子耦合。此外，通过对近红外探测光进行偏振控制（垂直与平行偏振），研究揭示亮核心激子呈现s型轨道角动量特征，而通过双光子激发抵达的暗核心激子则呈现p型角动量特征。

摘要 (Abstract)

Nonlinear Four-wave mixing processes are a powerful technique to unravel ultrafast dynamics in solid-state systems. Here, we employ attosecond four-wave mixing spectroscopy with one extreme ultraviolet (XUV) pump and two independently delayed, noncollinear near-infrared (NIR) probes to resolve the ultrafast decoherence of both dipole-allowed and dipole-forbidden core excitons at the Na+ L2,3 edge in sodium fluoride (NaF). The decoherence times of the core excitons are observed to be much faster than the 8 fs limit of the instrument response time, which is attributed to strong exciton-phonon coupling. Furthermore, polarization control of the NIR probes (Perpendicular and parallel polarizations) reveals that the bright core excitons exhibit s-like orbital angular momentum, while dark core excitons, reached by two-photon excitation, exhibit p-like angular momentum.

关键词: attosecond spectroscopy, four-wave mixing, core excitons, sodium fluoride, decoherence, exciton-phonon coupling, polarization control, orbital angular momentum

327. ❌ Dissociative Single and Double Ionization of Pyridine

作者: Sitanath Mondal, Brendan Wouterlood, Gustavo A. Garcia, Laurent Nahon, Frank Stienkemeier, Sebastian Hartweg 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05824v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究吡啶分子的解离单电离和双电离过程，属于实验物理化学领域，使用双成像光电子光离子符合光谱和量子化学计算。所有关键词均与大模型、深度学习技术原理或AI应用无关，仅最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’与科学应用有一定关联，但论文未使用AI方法，仅涉及传统计算化学，因此给予5分（有一定关联）。其余关键词完全无关，均为0分。

!!! tip deepseek-chat TL;DR

该研究使用双成像光电子光离子符合光谱和量子化学计算，详细分析了吡啶分子的解离单电离和双电离过程，为理解复杂环境中此类分子的辐射损伤机制提供了基础。

摘要翻译

吡啶等简单杂环分子的解离电离过程对于理解复杂凝聚态环境中天然发生的生物材料辐射损伤过程具有重要意义。因此，吡啶可被视为核碱基的简化类似物，其相关环状结构存在于许多重要生物分子中。本文结合量子化学计算，利用双成像光电子光离子符合光谱，对解离性单光子单电离与双电离过程进行了详细研究。在单电离研究中，我们将先前描述的阳离子态与在23 eV光子能量下观测到的相应离子解离产物相关联，提供了超出已报道离子出现能之外的补充信息。对于36 eV光子引发的双电离过程，通过分析电子-离子-离子三重符合事件，我们获得了各种解离性双电离路径起始点的详细信息，这些路径往往仅因单个氢原子的位置不同而产生差异。深入理解吡啶的解离性单电离与双电离过程，是未来研究此类分子在复杂环境中辐射损伤过程的必要前提。

摘要 (Abstract)

Dissociative ionization processes of simple heterocyclic molecules like pyridine are relevant for an understanding of radiation damage processes in biological material that occur naturally in complex condensed environments. Pyridine can thereby be considered a simple analogue of nucleobases and related ring structures are included in many important biomolecules. We present here a detailed study of dissociative single-photon single and double ionization processes using double imaging photoelectron photoion coincidence spectroscopy, supported by quantum chemical calculations. In the case of single ionization we correlate previously described cationic states to their corresponding ionic dissociation products observed at a photon energy of 23 eV, providing additional information beyond previously reported ion appearance energies. For the case of double ionization by 36 eV photons the analysis of electron-ion-ion triple coincidences provides detailed information on the onsets of various dissociative double ionization pathways, often only different by the locations of single hydrogen atoms. The detailed understanding of dissociative single and double ionization of pyridine is a prerequisite for future studies addressing radiation damage processes of such molecules in complex environments.

关键词: dissociative ionization, pyridine, photoelectron photoion coincidence spectroscopy, quantum chemical calculations, radiation damage, biological material, double ionization, ion appearance energies

328. ❌ Reference Energies for Non-Relativistic Core Ionization Potentials

作者: Antoine Marie, Loris Burth, Pierre-François Loos 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05920v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于计算化学领域，研究核心电离势的理论计算和基准建立，使用全组态相互作用、耦合簇方法等量子化学方法。论文内容与绝大多数关键词（涉及大模型架构、训练、推理、对齐、应用等）完全无关。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于计算化学/理论化学研究，可视为广义的’AI for Science’（科学计算）范畴，但论文并未明确使用AI或深度学习技术，而是传统量子化学计算方法，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该研究通过全组态相互作用方法计算了84个非相对论核心电离势，建立了高精度理论基准数据集，用于评估和验证各种近似量子化学方法的性能。

摘要翻译

深埋层芯电子携带高度局域化、位点特异性的信息，构成X射线光电子能谱学的基础。精确预测其相关的芯层电离能是一项高要求的理论任务，需要平衡处理强轨道弛豫、电子关联效应和相对论效应。多年来，已发展出多种方法，从态特定波函数方法到线性响应形式体系以及格林函数技术。然而，这些方法的评估通常依赖于与实验数据的比较，其中多种误差来源（基组不完备性、相对论修正和振动效应）相互纠缠，难以分离关联处理方法的性能。在本工作中，我们通过在全组态相互作用水平下，采用经紧致芯层和弥散函数增强的大型关联一致基组（aug-cc-pCVXZ），在芯-价分离近似框架内计算了84个非相对论电离能值（73个第二周期和11个第三周期元素的电离能），从而为芯层电离能建立了一套一致的理论基准。这些结果定义了固定有限基组下的理论最佳估计值，为方法发展和验证提供了化学精度的参考。重要的是，我们的数据集允许进行系统的理论间比较，从而将关联和弛豫效应与其他物理贡献分离开来。在此基础上，我们评估了广泛使用的近似方法的性能，包括包含至四重激发的运动方程耦合簇方法、一步$G_0W_0$方案以及态特定方法。

摘要 (Abstract)

Deep-lying core electrons carry highly localized, site-specific information that forms the basis of X-ray photoelectron spectroscopy. Accurately predicting their associated core ionization potentials (IPs) is a demanding theoretical task, requiring a balanced treatment of strong orbital relaxation, electron correlation, and relativistic effects. Over the years, a variety of approaches have been developed, ranging from state-specific wave function methods to linear-response formalisms and Green’s function techniques. However, their assessment has often relied on comparisons with experiment, where multiple sources of error (basis set incompleteness, relativistic corrections, and vibrational effects) are entangled, making it difficult to isolate the performance of correlation treatments. In the present work, we establish a consistent, theory-based benchmark for core IPs by computing 84 non-relativistic values (73 second-row and 11 third-row IPs) at the full configuration interaction level within the core-valence separation approximation, using large correlation-consistent basis sets augmented with tight-core and diffuse functions (aug-cc-pCVXZ). These results define theoretical best estimates within a fixed finite basis set, providing a chemically accurate reference for method development and validation. Importantly, our dataset allows for systematic, theory-versus-theory comparisons that disentangle correlation and relaxation effects from other physical contributions. On this basis, we assess the performance of widely used approximate methods, including equation-of-motion coupled-cluster approaches up to the inclusion of quadruple excitations, the one-shot $G_0W_0$ scheme, as well as state-specific methods.

关键词: core ionization potentials, full configuration interaction, benchmark dataset, equation-of-motion coupled-cluster, G0W0, core-valence separation, theoretical reference, method validation

329. ❌ The BOS-Lig Dataset: Accurate Ligand Charges from a Consensus Approach for 66,810 Experimentally Synthesized Ligands

作者: Roland G. St. Michel, Ryan J. Jang, Aaron G. Garrison, Ilia Kevlishvili, Heather J. Kulik 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.06043v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于化学信息学领域，通过构建BOS-Lig数据集，为过渡金属配合物的配体提供准确的电荷分配和功能应用分类。论文内容与大多数关键词（涉及大模型技术原理、训练方法、推理优化、智能体等）完全无关，因为这些关键词均未在标题或摘要中出现，且研究焦点是化学数据集的构建和分析，而非大模型或深度学习技术。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于化学信息学（Cheminformatics）范畴，涉及计算筛选和数据驱动的配体设计，但论文本身未明确使用AI或深度学习方法，仅提及’computational screening’和’topic-modeling workflow’（后者可能涉及自然语言处理，但未详细说明），因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该研究解决了过渡金属配合物中配体电荷信息缺失或不一致的问题，通过构建BOS-Lig数据集，为66,810个配体分配了准确的净电荷，并链接了功能应用领域，为计算筛选和数据驱动的配体设计提供了基础。

摘要翻译

理解配体性质对于过渡金属配合物的计算高通量筛选至关重要。然而，在晶体学数据集中，配体性质（如净电荷）及其应用领域等信息常常缺失或记录不一致。本研究从剑桥结构数据库中筛选的126,985个单核过渡金属配合物构建了配体数据集。通过采用结合配合物电荷、金属氧化态及晶体学观测共识的迭代电荷平衡工作流程，我们在94,581个已识别的独特配体结构中，为66,810个配体可靠地分配了净电荷，从而构建了波士顿开壳层配体（BOS-Lig）数据集。该工作流程首先确定均配配合物中的配体电荷，随后在异配环境中迭代传递这些电荷分配，使得即使在直接电荷信息缺失的情况下也能推断电荷。我们分析了八隅体规则等简单启发式方法会失效的案例，并引入纯度指标以识别电荷分配可能错误的情况。每个配体还按其金属配位原子及是否存在多种变体（即半不稳定性）进行分类。随后，我们将配合物与其相关期刊摘要关联，应用主题建模工作流程将25,146个配体与反应活性、氧化还原化学、生物化学及光物理化学等功能应用领域相关联。本研究最终提供了基于实验数据的配体化学空间数据集，该数据集将电荷与功能应用相连接，为计算筛选和数据驱动的配体设计奠定了基础。

摘要 (Abstract)

Understanding ligand properties is essential for computational high-throughput screening of transition metal complexes. However, ligand properties such as net charge and other information such as their application area are often absent or inconsistently recorded in crystallographic datasets. Here, we construct a ligand dataset from 126,985 mononuclear transition metal complexes curated from the Cambridge Structural Database. Using an iterative charge-balancing workflow that combines complex charges, metal oxidation states, and consensus across crystallographic observations, we confidently assign net charges to 66,810 ligands among 94,581 identified unique ligand structures to curate the Boston Open-Shell Ligand (BOS-Lig) dataset. The workflow assigns ligand charges in homoleptic complexes first and then iteratively propagates these assignments across heteroleptic environments, allowing charges to be inferred even when direct charge information is unavailable. We analyze cases where simple heuristics such as the octet rule would have failed and introduce a purity metric to identify when our charge assignments may be incorrect. Each ligand is also classified in terms of its metal coordinating atoms and whether there are multiple variants (i.e., hemilability). We then link complexes to their associated journal abstracts and apply a topic-modeling workflow to link 25,146 ligands with functional application areas spanning reactivity, redox chemistry, biological chemistry, and photophysical chemistry. Together, we provide an experimentally grounded dataset of ligand chemical space that connects charge and functional application as a foundation for computational screening and data-driven ligand design.

关键词: ligand dataset, transition metal complexes, charge assignment, computational screening, data-driven design, chemical informatics, topic modeling, functional application

330. ❌ Valence and Rydberg excited state bond dissociation curves of CO2 from orbital-optimized density functional calculations

作者: Darío Barreiro-Lage, Gianluca Levi, Hannes Jonssón, Thanja Lamberts 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05802v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于使用轨道优化密度泛函理论计算CO2分子的价态和里德堡激发态键解离曲线，属于计算化学领域。所有关键词均与大模型、深度学习、AI技术原理或应用直接相关，而本文完全不涉及这些主题。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为该研究属于计算化学（与化学信息学相关），但论文并未使用AI或机器学习方法，而是基于传统的量子化学计算，因此仅给予5分（有一定关联）。其他所有关键词与论文内容完全无关，均得0分。

!!! tip deepseek-chat TL;DR

该研究使用轨道优化密度泛函方法计算了CO2分子的价态和里德堡激发态及其键解离曲线，结果表明该方法在计算成本和精度上优于传统线性响应时间依赖密度泛函理论，为模拟凝聚相CO2的光弛豫过程提供了有前景的途径。

摘要翻译

本文采用轨道变分优化的密度泛函方法，对CO₂分子的最低价层π*激发态、3s及更高能量的3pσ里德伯激发态进行了计算，该方法所需计算量相对较小。研究使用了五种交换程度不同的泛函，并结合实数值或复数值轨道进行优化，通过寻找电子能面上对应激发态的鞍点来实现轨道优化。当采用PBE泛函与复数值轨道结合时，计算得到的激发能与多参考组态相互作用参考值的偏差在0.3 eV以内，且使用杂化泛函时结果进一步改善。相比之下，线性响应含时密度泛函理论计算对于最弥散的3pσ激发态误差高达1.9 eV，并且对激发态特性及所用泛函的依赖性更强。采用PBE泛函及轨道优化方法计算得到的C-O解离曲线与已报道的多参考组态相互作用、运动方程耦合簇单双激发计算方法的结果高度吻合。得益于较低的计算成本，这些结果表明轨道优化密度泛函计算可为凝聚相CO₂的光弛豫建模提供有前景的途径，例如在涉及高能里德伯态的星际宇宙射线辐射驱动过程研究中。

摘要 (Abstract)

Calculations of the lowest valence π* as well as the 3s and higher energy 3pσ Rydberg excited states of the CO2 molecule are carried out using density functionals with variational optimization of the orbitals, an approach involving relatively little computational effort. Five functionals with varying degree of exchange are used in combination with real or complex-valued orbitals that are optimized by finding saddle points on the electronic energy surface corresponding to the excited states. When the PBE functional is used in combination with complex orbitals, the calculated excitation energy is found to be within 0.3 eV of multireference configuration interaction reference values, and the results are further improved with hybrid functionals. In contrast, linear-response time-dependent density functional theory calculations give errors up to 1.9 eV for the most diffuse 3pσ excitation and exhibit stronger dependence on both the excitation character and the functional used. Calculated C-O dissociation curves using the PBE functional and the orbital-optimized approach compare remarkably well with the reported multireference configuration interaction and equation-of-motion coupled-cluster singles and doubles calculations. Thanks to the low computational cost, these results demonstrate that orbital-optimized density functional calculations can be a promising route for modelling photorelaxation in condensed-phase CO2, for example in the context of interstellar cosmic-ray radiation driven process involving high-energy Rydberg states.

关键词: CO2, excited states, density functional theory, orbital optimization, bond dissociation curves, Rydberg states, photorelaxation, computational chemistry

331. ❌ ORION: Unifying Top-Down and Bottom-Up Chemical Space Sampling for a Universal Organic Force Field

作者: Zherui Chen, Jiayu Zhang, Yuxuan Tian, Zhoulin Liu, Sining Dai, Yanghui Li, Cong Chen, Dingyuan Tang, Yajun Deng, Qingxia Liu 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05769v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文专注于开发一种用于分子模拟的机器学习力场（ORION），属于AI for Science（科学人工智能）领域，具体应用于化学和材料科学。论文的核心是机器学习力场技术，而不是大语言模型（LLM）或深度学习技术原理的创新。所有关键词中，只有"AI for Science OR Bioinformatics OR Cheminformatics"高度相关（10分），因为论文直接属于科学AI在化学信息学领域的应用。其他关键词均与大语言模型、模型训练、推理优化、代理系统等LLM特定技术完全无关，因此评分为0分。

!!! tip deepseek-chat TL;DR

该研究开发了ORION，一种基于神经进化势能框架的通用机器学习力场，通过结合自上而下和自下而上的化学空间采样策略进行训练，实现了接近密度泛函理论的精度，同时比ReaxFF快215.5倍，为化学和材料科学中的预测性模拟提供了实用且通用的工具。

摘要翻译

经验性力场仍是大规模分子模拟的主要工具，但其有限的灵活性和可迁移性常阻碍化学复杂凝聚相体系中的预测性建模。本文提出ORION——一种在神经进化势能（Neuroevolution Potential，NEP）框架下开发的适用于C、H、O、N、S、P体系的通用机器学习力场。为提升其在不同化学环境中的可迁移性，ORION通过整合自上而下与自下而上的策略构建了化学多样性数据集进行训练，从而能精确描述复杂有机构型、反应中间体及弱分子间相互作用。ORION在保持大规模分子动力学模拟所需效率的同时，达到了接近密度泛函理论的精度。在测试集上，其原子力预测精度显著高于ReaxFF，且在相同硬件条件下运行速度提升215.5倍，使得数百纳秒尺度的模拟易于实现。该模型在键断裂与形成、芳香环生长、氢键、范德华相互作用及π-堆积等方面实现了平衡描述，在反应性与非反应性体系中均展现出强大的可迁移性。这些成果确立了ORION作为化学与材料科学领域预测性模拟的实用通用力场，并为构建兼具高精度与广适用性的通用机器学习力场提供了有效路径。

摘要 (Abstract)

Empirical force fields remain the primary tool for large-scale molecular simulation, yet their limited flexibility and transferability often hinder predictive modeling in chemically complex condensed-phase systems. Here we present ORION, a universal machine-learning force field for C, H, O, N, S, and P systems developed within the Neuroevolution Potential (NEP) framework. To enhance transferability across diverse chemical environments, ORION was trained on a chemically rich dataset constructed through an integrated top-down and bottom-up strategy, enabling accurate descriptions of complex organic configurations, reactive intermediates, and weak intermolecular interactions. ORION achieves near-density-functional-theory accuracy while retaining the efficiency required for large-scale molecular dynamics simulations. On the test set, it predicts atomic forces with substantially higher accuracy than ReaxFF while running 215.5 times faster under identical hardware conditions, making simulations on the hundreds-of-nanoseconds timescale readily accessible. The model provides a balanced description of bond breaking and formation, aromatic growth, hydrogen bonding, van der Waals interactions, and π-stacking, demonstrating strong transferability across both reactive and nonreactive systems. These results establish ORION as a practical and general force field for predictive simulations in chemistry and materials science, and provide an effective route toward universal machine-learning force fields with both high accuracy and broad applicability.

关键词: machine-learning force field, molecular simulation, Neuroevolution Potential (NEP), chemical space sampling, organic force field, molecular dynamics, density-functional-theory accuracy, transferability

332. ❌ Does the total energy difference method for modelling core level photoemission fail for bigger molecules?

作者: Marta Berholts, Tanel Käämbre, Arvo Tõnisoo, Rainer Pärna, Vambola Kisand, Juhan Matthias Kahk 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05735v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是计算化学中的ΔSCF方法在计算分子核心电子结合能方面的性能评估，属于传统计算化学和实验物理化学领域。论文内容完全不涉及大模型、深度学习、人工智能或任何机器学习技术，所有关键词均与大模型技术原理、应用或相关AI方法相关，因此所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文通过重新计算和实验测量，验证了ΔSCF方法在计算中型分子（含10-40个原子）核心电子结合能时仍然保持良好精度（平均绝对误差0.19 eV），推翻了之前认为该方法不适用于较大分子的观点。

摘要翻译

Δ-自洽场（ΔSCF）方法能够以适中的计算成本计算材料与分子的核心电子结合能。然而，已有研究指出，该方法虽然适用于小分子体系，但在处理较大体系时其精度显著下降。特别是在由25个原子组成的蒽酮分子中，曾报道存在较大误差。本研究通过计算与实验手段重新审视了蒽酮的气相光电子能谱。测量得到的蒽酮C 1s结合能与先前发表的数据存在显著差异，而新的实验结果与基于SCAN泛函的ΔSCF计算结果高度吻合。此外，本研究在一个包含44个核心电子结合能的数据集上评估了ΔSCF方法的性能，该数据集涵盖原子数在10至40之间的中等尺寸分子。计算得到的平均绝对误差为0.19 eV，与既往计算基准测试的结果相当。总体而言，这些结果及一般理论分析表明，ΔSCF方法适用于模拟小分子与大分子中的局域激发，且在其他扩展体系中的应用前景广阔。

摘要 (Abstract)

The $Δ$-Self-Consistent-Field ($Δ$SCF) method permits calculations of core electron binding energies in materials and molecules at a modest computational cost. However, it has been reported that whilst this method works well for small molecules, its accuracy drops off dramatically when larger systems are considered. Particularly large errors have been reported for the anthrone molecule, which consists of 25 atoms. In this work, the gas-phase photoelectron spectrum of anthrone is revisited both computationally and experimentally. The measured C 1s binding energies in anthrone differ markedly from previously published values, and the new experimental results are in good agreement with $Δ$SCF calculations based on the SCAN functional. In addition, the performance of the $Δ$SCF method is evaluated for a dataset of 44 core electron binding energies from medium sized molecules containing between 10 and 40 atoms. The mean absolute error for this dataset - 0.19 eV - is comparable to the results of previous computational benchmarks. Overall, these results and general theoretical considerations indicate that the $Δ$SCF method is suitable for modelling localized excitations in both small and large molecules, and applications to other extended systems are also promising.

关键词: ΔSCF method, core electron binding energies, anthrone molecule, photoelectron spectroscopy, computational chemistry, SCAN functional, molecular systems, localized excitations

333. ❌ Accessing the performance of CC2 for excited state dynamics: a benchmark study with pyrazine

作者: Rui-Hao Bi, Chongxiao Zhao, Ruixin Sun, Wenjie Dou 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05734v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究RI-CC2方法在吡嗪分子激发态动力学中的性能评估，属于计算化学领域，与绝大多数大模型/深度学习技术关键词完全无关。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文使用了人工神经网络模型加速计算，并生成了机器学习开发的数据集，但这只是辅助工具而非核心研究内容，因此给予5分（有一定关联）。其他所有关键词均未涉及，得0分。

!!! tip deepseek-chat TL;DR

该研究评估了RI-CC2方法在吡嗪分子超快内转换过程中的性能，通过振动耦合模型和全维动力学模拟揭示了暗态A1u的参与作用，并生成了高质量数据集供未来机器学习开发使用。

摘要翻译

本研究以吡嗪为基准体系，评估了RI-CC2方法在超快内转换过程中的性能。我们在Q-Chem软件包中实现了RI-CC2的解析梯度与非绝热耦合矢量，并将其应用于两种互补的研究方法：降维电子-振动耦合模型与全维从头算实时轨迹面跳跃模拟。为加速实时动力学计算，我们采用基于RI-CC2数据训练的绝热人工神经网络模型。电子-振动耦合模型与全维动力学模拟均表明，暗态$A_\text{1u}$在內转换过程中发挥着重要作用。RI-CC2方法识别出$Q_\text{9a}$和$Q_\text{8a}$振动模式是驱动$A_\text{1u}$与$B_\text{3u}$态间相干布居转移的关键因素。实时动力学模拟重现了实验观测的$B_\text{2u}$态布居衰减时间26 fs，与实测值$22\pm3$ fs相符。本研究生成的高质量能量、力及非绝热耦合数据集为未来机器学习发展提供了宝贵资源，而随机变体sRI-CC2方法有望将此类动力学模拟拓展至更大分子体系。

摘要 (Abstract)

In this work, we access the performance of RI-CC2 for ultrafast internal conversion using pyrazine as a benchmark system. We implement analytical gradients and nonadiabatic coupling vectors for RI-CC2 in the Q-Chem package and employ them in two complementary approaches: a reduced-dimensionality vibronic coupling (VC) model and full-dimensional ab initio on-the-fly trajectory surface hopping simulations. To accelerate the on-the-fly dynamics, we employ a diabatic artificial neural network model trained on RI-CC2 data. Both the VC model and the full-dimensional dynamics reveal that the dark $A_\text{1u}$ state actively participates in the internal conversion process. RI-CC2 identifies the $Q_\text{9a}$ and $Q_\text{8a}$ vibrational modes as key drivers of the coherent population transfer between the $A_\text{1u}$ and $B_\text{3u}$. The on-the-fly dynamics reproduce the experimental $B_\text{2u}$ population decay time of 26 fs, consistent with the measured value of $22\pm3$ fs. The high-quality dataset of energies, forces, and nonadiabatic couplings generated here provides a valuable resource for future machine-learning developments, while the stochastic variant sRI-CC2 promises to extend such dynamics to larger molecular systems.

关键词: RI-CC2, excited state dynamics, pyrazine, internal conversion, nonadiabatic coupling, trajectory surface hopping, artificial neural network, vibronic coupling model

334. ❌ Two-colour coherent control of nuclear and electron dynamics in photoionization of molecular hydrogen with FEL pulses

作者: Fabian Holzmeier, Alberto Gonzalez-Castrillo, Thomas M. Baumann, Roger Y. Bello, Carlo Callegari, Michele Di Fraia, Matteo Lucchini, Michael Meyer, Oksana Plekan, Kevin C. Prince, Eleonore Roussel, Rene Wagner, Fernando Martin, Alicia Palacios, Danielle Dowek 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05666v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究分子氢在自由电子激光脉冲下的双色相干控制电离动力学，属于实验物理化学领域，与深度学习、大模型技术完全无关。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，但论文未使用任何AI方法，仅涉及理论计算辅助分析，因此给予5分（有一定关联）。其余关键词均与论文内容无直接联系，评分为0。

!!! tip deepseek-chat TL;DR

该研究利用自由电子激光的双色相干控制方案，揭示了分子氢电离过程中电子与核动力学的耦合机制，并通过理论计算解释了光电子能量依赖的相位跃变现象。

摘要翻译

将近年来在自由电子激光器（FEL）中实现的相干$ω$-$2ω$控制方案拓展至分子系统，为在电子时间尺度上控制化学动力学提供了新的机遇，有望引导化学反应沿此前无法触及的路径进行。我们在种子型FERMI自由电子激光器上实施了此类方案，以获取氢分子中单光子（频率$2ω$）与双光子（频率$ω$）电离路径之间的相对相位，并将其表示为光电子能量和发射角的函数。XUV脉冲的窄带宽使得双光子电离路径中中性中间态H$_2$的振动能级得以被选择性激发。本文重点研究H$_2(X,^{1}Σ_g^{+},,v=0)$通过中间态H$_2(B,^{1}Σ_u^{+},,v’=6)$电离至基态H$_2^{+}(X,^{2}Σ_g^{+},,v_f)$的$ω$-$2ω$过程。干涉光电离振幅中$ω$与$2ω$的相对相位表现出对光电子能量（即H$_2^{+}$阳离子终态振动能级$v_f$）的强烈依赖性。借助精确的理论计算，观测到的相位跃变被归因于双光子过程中耦合的电子与核动力学效应，该过程显著受到H$_2(^{1}Σ_g^{+}$与$^{1}Π_g)$自电离态的影响，并涉及H$_2(B,^{1}Σ_u^{+},,v’=6)$中间态核波函数向H$_2^{+}(X,^{2}Σ_g^{+})$终态振动能级的映射。本研究奠定了利用自由电子激光装置现有$ω$-$2ω$相干控制方案探测分子中电子-核耦合动力学的基本理论基础。

摘要 (Abstract)

The extension of coherent $ω$-$2ω$ control schemes, recently implemented in free-electron lasers (FELs), to molecular systems offers new opportunities to control chemical dynamics on the electronic timescale, potentially allowing for the steering of reactions along previously inaccessible pathways. We have implemented such a scheme at the seeded FERMI FEL to retrieve the relative phases between one-photon (frequency $2ω$) and two-photon (frequency $ω$) ionization paths in the hydrogen molecule as a function of photoelectron energy and emission angle. The narrow bandwidth of the XUV pulses enables selective excitation of vibrational levels of neutral intermediate H$_2$ states in the two-photon ionization path. Here we focus on $ω$–$2ω$ ionization of H$_2(X,^{1}Σ_g^{+},,v=0)$ into the H$_2^{+}(X,^{2}Σ_g^{+},,v_f)$ ground state involving the H$_2(B,^{1}Σ_u^{+},,v’=6)$ intermediate state. The relative phases of the $ω$ and $2ω$ interfering photoionization amplitudes exhibit a strong dependence on photoelectron energy, i.e.\ on the final vibrational state $v_f$ in the H$_2^{+}$ cation. With the help of accurate theoretical calculations, the observed phase jumps are assigned to the coupled electronic and nuclear dynamics at play in the two-photon process, significantly influenced by H$_2(^{1}Σ_g^{+}$ and $^{1}Π_g)$ autoionizing states and the mapping of the H$_2(B,^{1}Σ_u^{+},,v’=6)$ intermediate-state nuclear wavefunction into the final vibrational states of H$_2^{+}(X,^{2}Σ_g^{+})$. The present work establishes the fundamental concepts required to access coupled electron–nuclear dynamics in molecules using $ω$–$2ω$ coherent control schemes currently available at free-electron laser facilities.

关键词: coherent control, photoionization, molecular hydrogen, free-electron laser, electron-nuclear dynamics, two-photon ionization, vibrational states, autoionizing states

335. ❌ Rationalizing defect formation energies in metals and semiconductors with semilocal density functionals

作者: Jorge Vega Bazantes, Timo Lebeda, Akilan Ramasamy, Kanun Pokharel, Ruiqi Zhang, John Perdew, Jianwei Sun 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05385v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究材料科学中的缺陷形成能计算，使用密度泛函理论（DFT）方法评估不同泛函近似（如LDA、PBE、SCAN、r2SCAN、LAK、HSE）在金属和半导体系统中的性能。论文主题属于计算材料科学和物理化学领域，与所有大模型、深度学习、AI技术原理关键词完全无关（评分为0）。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于计算科学在材料研究中的应用，但论文未使用AI/机器学习方法，而是基于第一性原理计算，因此给予较低相关性评分5分（有一定关联但非核心）。

!!! tip deepseek-chat TL;DR

该论文通过比较多种密度泛函近似方法，评估了它们在计算金属和半导体中缺陷形成能的性能，发现LDA在金属中表现更好，而LAK meta-GGA在硅中精度最高，接近计算成本更高的量子蒙特卡洛方法。

摘要翻译

材料缺陷研究对于技术应用及新材料设计至关重要。本研究分析了密度泛函近似在两类典型缺陷体系中的表现：八种面心立方金属中的单空位，以及金刚石结构半导体硅中的间隙原子。具体而言，我们采用局域密度近似、Perdew-Burke-Ernzerhof广义梯度近似、强约束适当定域化（SCAN）及其正则化版本（r2SCAN）的meta-广义梯度近似（meta-GGA）、Lebeda-Aschebrock-Kümmel（LAK）meta-GGA以及Heyd-Scuseria-Ernzerhof屏蔽杂化泛函计算了缺陷形成能。对于金属体系，局域密度近似相较于其他近似表现出更优的性能；而对于硅半导体，Lebeda-Aschebrock-Kümmel meta-广义梯度近似展现出卓越的精确度，其表现超越杂化泛函，并接近计算量更大的量子蒙特卡洛方法的结果。为理解不同泛函的性能差异，我们研究了完整结构与缺陷结构中的半局域成分rs、s和α。通过识别关键区域，我们揭示了缺陷形成能变化趋势的内在机理，为改进密度泛函近似提供了理论路径。

摘要 (Abstract)

The study of defects in materials is of utmost importance for technological applications and the design of new materials. In this work, we analyze the performance of density functional approximations on two prototypical sets of defective systems: monovacancies in eight fcc metals, and interstitials in the semiconductor Si-diamond. Specifically, we compute defect formation energies using the local density approximation, the Perdew-Burke-Ernzerhof generalized gradient approximation, the meta-generalized gradient approximations (meta-GGAs) strongly constrained and appropriately normed (SCAN), its regularized version (r2SCAN), the Lebeda-Aschebrock-Kummel (LAK) meta-GGA, and the Heyd-Scuseria-Ernzerhof screened hybrid functional. For metals, the local density approximation shows better performance compared to the other approximations, whereas for silicon, the meta-generalized gradient approximation Lebeda-Aschebrock-Kummel yields outstand- ing accuracy, surpassing the hybrid functional and approaching the results of more computationally demanding Quantum Monte Carlo methods. To rationalize the different performances, we study the semilocal ingredients rs, s and α in both the pristine and defective structures. We identify critical regions that indicate the observed trends of the defect formation energies and pave the way for improving density functional approximations.

关键词: defect formation energies, density functional approximations, metals, semiconductors, monovacancies, interstitials, SCAN, LAK meta-GGA

336. ❌ Molecular Excited States using Quantum Subspace Methods: Accuracy, Resource Reduction, and Error-Mitigated Hardware Implementation of q-sc-EOM

作者: Srivathsan Poyyapakkam Sundar, Prince Frederick Kwao, Alexey Galda, Ayush Asthana 期刊/来源: arxiv 发布日期: 2026-04-07 arXiv链接: http://arxiv.org/abs/2604.05380v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于量子化学模拟，特别是激发态势能面的计算，使用量子子空间方法（q-sc-EOM）和ADAPT-VQE/LUCJ算法，并在量子硬件上实现误差缓解。论文内容与绝大多数关键词（涉及大模型、深度学习技术、训练方法、推理优化、智能体等）完全无关。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于量子计算在化学（科学计算）领域的应用，与’AI for Science’有一定关联，但论文核心是量子算法而非传统AI/深度学习，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该研究通过结合ADAPT-VQE/LUCJ和q-sc-EOM量子算法，在量子硬件上实现了准确且可扩展的分子激发态势能面计算，并通过Davidson算法、基旋转分组及误差缓解策略将测量复杂度从O(N^12)降至O(N^5)，同时识别出门噪声是主要误差源。

摘要翻译

量子化学模拟中的问题，尤其是实现精确的激发态势能面，是实现量子效用潜力的主要应用方向之一。在当前近期的量子硬件上，变分量子本征求解器算法的变体是化学模拟的主要选择。本研究结合了针对一般激发态的领先基态与激发态量子算法——即ADAPT-VQE/LUCJ与q-sc-EOM——用于计算具有挑战性的断键场景中的精确激发态势能面，并与经典的标度化方程耦合簇单双激发方法进行比较。本工作探讨了利用q-sc-EOM方法实现激发态量子化学中量子效用潜力的途径。我们评估其精度，同时通过戴维森算法和基旋转分组缓解主要标度瓶颈，将测量标度从O(N$^{12}$)降低至O(N$^{5}$)，并在量子硬件上实施该方法，结合多种误差缓解策略以减少激发态中的门误差和测量误差。q-sc-EOM算法的硬件实现，辅以M3读出误差缓解和对称性投影，能够产生合理精确的激发态能量，其中门噪声被确定为误差的主要来源。这为开发精确、可标度化且普遍适用的量子激发态方法铺平了道路，该方法具有实现量子效用的潜力，同时也指明了需要突破的关键问题。

摘要 (Abstract)

Problems in quantum chemical simulations, especially achieving accurate excited-state potential energy surfaces, are among the primary applications to achieve quantum utility. On near-term quantum hardware, variants of the variational quantum eigensolver (VQE) algorithms are the primary choice for chemistry simulation. In this study, a combination of leading ground and excited state quantum algorithms for general excited states, namely, ADAPT-VQE/LUCJ and q-sc-EOM, are utilized to calculate accurate excited state potential energy surfaces in challenging bond-breaking scenarios and compared with the classical scalable EOM-CCSD method. This work investigates avenues toward quantum utility in excited-state quantum chemistry using the q-sc-EOM approach. We assess its accuracy while mitigating major scaling bottlenecks through the Davidson algorithm and basis rotation grouping, reducing the measurement scaling from O(N$^{12}$) to O(N$^{5}$), and implementing the method on quantum hardware with various error mitigation strategies to reduce gate and measurement errors in excited states. The hardware implementation of the q-sc-EOM algorithm, augmented by mitigation of M3 readout error and symmetry projection, produces reasonably accurate excited-state energies with gate noise identified as the predominant source of error. This paves the way for accurate and scalable, generally applicable quantum excited-state methods with potential for quantum utility while identifying critical problems that require advancements.

关键词: quantum chemistry, excited states, q-sc-EOM, quantum algorithms, error mitigation, quantum hardware, potential energy surfaces, scalability

337. ❌ Nonlinear signal enhancement of strongly-coupled molecules in pump-probe experiments

作者: Alexander M. McKillop, Marissa L. Weichman 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05261v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究非线性光谱学中强耦合分子的信号增强问题，属于物理化学实验方法研究，与所有评分关键词（均涉及大模型、深度学习技术原理及应用）完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该研究通过模拟实验量化了不同共振与非共振泵浦-探测配置下强耦合分子与未耦合分子的信号贡献，发现共振方案对强耦合分子信号选择性最高，而非共振方案在保持高灵敏度的同时更不易受光学伪影影响。

摘要翻译

非线性光谱学被广泛用于研究强光-物质耦合下分子的瞬态动力学，但迄今尚不清楚未耦合的腔内分子在多大程度上会干扰目标强耦合物种的信号。在强耦合光谱区域共振的泵浦或探测场会优先与腔耦合分子相互作用，但由于腔内的波干涉效应，可能产生严重的光学伪影。另一方面，当泵浦或探测场波长处于腔镜高透射波段时，这些非共振场将以行波形式沿腔轴传播，同时与耦合及未耦合的腔内分子发生相互作用。本文通过模拟实验量化了不同共振与非共振泵浦-探测构型下强耦合与未耦合分子群体对信号的贡献。研究发现，虽然共振方案能最大化对强耦合分子信号的选择性，但非共振方案在保持较低光学伪影敏感性的同时，仍对这些信号表现出惊人的高灵敏度。

摘要 (Abstract)

Nonlinear spectroscopy is widely used to study the transient dynamics of molecules under strong light-matter coupling, though it remains unclear to what extent uncoupled intracavity molecules obscure signals from the strongly-coupled species of interest. Pump or probe fields resonant in the strongly-coupled spectral region will preferentially interact with cavity-coupled molecules, but can exhibit severe optical artifacts due to wave interference in the cavity. On the other hand, non-resonant pump or probe fields having wavelengths at which the cavity mirrors are highly transmissive propagate as traveling waves along the cavity axis, interacting with both coupled and uncoupled intracavity molecules. Here, we quantify the contributions of signals from strongly-coupled and uncoupled populations in simulated experiments with different resonant and non-resonant pump-probe configurations. We find that while resonant schemes maximize selectivity for the signals of strongly-coupled molecules, non-resonant schemes retain surprisingly high sensitivity to these signals while remaining less susceptible to optical artifacts.

关键词: nonlinear spectroscopy, strong light-matter coupling, pump-probe experiments, cavity-coupled molecules, signal selectivity, optical artifacts, resonant schemes, non-resonant schemes

338. ❌ Information Entropy is a General-Purpose Collective Variable for Enhanced Sampling

作者: Xiangrui Li, Daniel Schwalbe-Koda 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.05239v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究分子和凝聚相系统中基于信息熵的增强采样方法，属于计算化学/物理领域，与绝大多数关键词（涉及大模型技术、训练方法、推理优化、对齐、智能体等）完全无关。仅与最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联，因为该方法可应用于生物信息学或化学信息学相关的分子模拟，但论文本身并非直接关于AI模型在这些领域的应用，而是提出一种物理模拟方法，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出信息熵可作为通用集体变量，用于分子和凝聚相系统的增强采样，以无监督方式发现传统方法难以捕捉的亚稳态和反应路径。

摘要翻译

增强采样方法通常需要预定义的集体变量（CVs），这些变量以已知反应坐标作为前提，从而限制了未预期的过渡机制或中间体的发现。本文提出，原子系统中局部信息熵的度量可作为适用于分子与凝聚相系统的通用集体变量，用于稀有事件采样。该方法遵循良构元动力学（well-tempered metadynamics）思路，将模拟偏向于熵值变化的构型，从而在探索新颖性与热力学可达性之间取得平衡。通过对势能面的无预设探索，该方法能够无监督地发现亚稳态能谷与反应路径，包括传统序参量无法触及的竞争性过渡通道。我们在涵盖构象采样、均相成核、玻璃形成及固态相变等五个体系中验证了该方法的普适性。

摘要 (Abstract)

Enhanced sampling methods typically require predefined collective variables (CVs) that presuppose knowledge of reaction coordinates, restricting the discovery of unanticipated transition mechanisms or intermediates. Here, we show that a local measure of information entropy in atomistic systems is a general-purpose CV for rare event sampling across molecular and condensed-phase systems. The method biases simulations toward entropy-changing configurations following a well-tempered metadynamics approach, thus balancing novelty and thermodynamic accessibility. Blind exploration of potential energy surfaces enables unsupervised discovery of metastable basins and reaction pathways, including competing transition channels inaccessible to conventional order parameters. We demonstrate the generality of the method across five systems spanning conformational sampling, homogeneous nucleation, glass formation, and solid-state phase transformations.

关键词: information entropy, collective variable, enhanced sampling, metadynamics, rare event sampling, molecular systems, condensed-phase systems, unsupervised discovery

339. ❌ Neural-network quantum states for solving few-body problems: application to Efimov physics

作者: Sora Yokoi, Shimpei Endo, Hiroki Saito 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04435v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究神经网络量子态在连续空间中强相互作用少体问题（特别是Efimov物理）的应用，属于AI for Science（科学AI）领域，与关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（评5分），因为它是深度学习在科学计算（量子物理）中的应用。然而，论文未涉及大语言模型（LLMs）、模型架构（如MoE）、训练技术（如预训练、微调、对齐）、推理优化、智能体系统或其他大模型相关技术，因此其他所有关键词评0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于神经网络量子态的方法，用于求解连续空间中强相互作用的少体量子问题，并成功计算了三至六体玻色子系统和质量不平衡费米子系统的Efimov态及相关束缚态，验证了方法的有效性。

摘要翻译

神经网络量子态作为一种求解量子多体问题的新兴方法，近年来在晶格体系中取得了显著成功。本文将该方法拓展至连续空间中的强相互作用少体问题，并通过计算埃菲莫夫态及相关少体束缚态展示了其能力。我们采用以雅可比坐标为输入的全连接前馈神经网络，结合投影方法，计算了幺正条件下三至六全同玻色子体系的基态与第一激发态，以及一个由两个全同费米子与第三个粒子组成的质量不平衡费米体系。所得基态与第一激发态能量与已有研究结果高度吻合。此外，该方法还成功复现了埃菲莫夫态的关键特征，包括离散标度不变性、波函数的特征几何结构，以及质量不平衡费米体系中的临界质量行为。本方法可广泛应用于连续空间中各类强关联少体问题的研究。

摘要 (Abstract)

Neural-network quantum states have recently emerged as a powerful method for solving quantum many-body problems, with notable successes in lattice systems. Here, we extend this approach to strongly interacting few-body problems in continuous space, and demonstrate its capability by computing the Efimov states and associated few-body bound states. Using a fully connected feedforward neural network with Jacobi coordinates as inputs, combined with a projection method, we compute the ground and first excited states for three- to six-body systems of identical bosons at unitarity, as well as a mass-imbalanced fermionic system consisting of two identical fermions and a third particle. The obtained energies of the ground and first excited states agree well with previously reported results. Furthermore, the proposed approach also reproduces key features of Efimov states, including the discrete scale invariance, the characteristic geometric structure of the wave function, and the critical-mass behavior in mass-imbalanced fermionic systems. Our method can be readily applied to a broad class of strongly correlated few-body problems in continuous space.

关键词: neural-network quantum states, few-body problems, Efimov physics, continuous space, strongly interacting systems, bosons, fermions, wave function

340. ❌ Weak Solutions to the Bloch Equations with Distant Dipolar Field

作者: Louis-S. Bouchard 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04909v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是磁共振成像（MRI）物理中的Bloch方程与远距离偶极场（DDF）的数学建模和数值求解问题，属于计算物理和医学成像领域。所有评分关键词均涉及大模型、深度学习及相关技术，而本文完全不涉及这些内容，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了在有界域上包含远距离偶极场的Bloch方程的弱解形式，提出了有限元弱公式化和数值求解方案，并验证了其在复杂几何边界下的有效性。

摘要翻译

远距离偶极场（Distant Dipolar Field, DDF）是一种对液态自旋动力学的长程非局域贡献，源于分子间偶极耦合，能够产生多量子相干及新颖的磁共振成像对比度。其符号变化的核函数使得布洛赫-DDF动力学强烈依赖于几何构型，而基于快速傅里叶变换的偶极卷积通常假定周期性或填充的笛卡尔域，而非具有反射性扩散边界的有限样本。本研究探讨了在齐次诺伊曼扩散边界条件下，有限域上包含DDF的布洛赫方程。我们推导了一种有限元弱形式，该形式支持空间变化的扩散与弛豫参数，并对久期DDF核函数进行了长度为a>0的短程正则化处理。针对固定的a，我们证明了DDF算子的有界性，建立了一个L2能量平衡关系，其中进动项为中性，而扩散与横向弛豫项为耗散项，并获得了依赖于数据的局部适定性，以及在能量中性输运条件下的全局存在性。对于伽辽金半离散化格式，我们展示了一个与连续估计相对应的离散能量恒等式。在计算方面，我们采用无矩阵的近场/远场方案在实空间计算DDF，并使用二阶IMEX分裂方法进行时间推进，其中扩散与弛豫项隐式处理，进动项显式处理。显式阶段在DDF积分点上应用罗德里格斯旋转后进行L2投影，从而实现了稳定的多周期实验室坐标系模拟。我们通过三个闭合形式基准测试进行了验证，并通过在球形诺伊曼本征模衰减问题上比较映射有限元方法与体素掩膜有限差分基线，量化了弯曲边界效应。这些结果为复杂几何有限域上的布洛赫-DDF动力学提供了一条可分析且可复现的研究路径。

摘要 (Abstract)

The distant dipolar field (DDF) is a long-range, nonlocal contribution to liquid-state spin dynamics that arises from intermolecular dipolar couplings and can generate multiple-quantum coherences and novel MRI contrast. Its sign-changing kernel makes Bloch-DDF dynamics strongly geometry dependent, and FFT-based dipolar convolutions naturally assume periodic or padded Cartesian domains rather than bounded samples with reflective diffusion boundaries. We study the Bloch equations with the DDF on bounded domains under homogeneous Neumann diffusion conditions. We derive a finite-element weak formulation that supports spatially varying diffusion and relaxation parameters and uses a short-distance regularization of the secular DDF kernel with length a>0. For fixed a we prove boundedness of the DDF operator, establish an L2 energy balance in which precession is neutral while diffusion and transverse relaxation are dissipative, and obtain local well-posedness with continuous dependence on the data, with global existence under energy-neutral transport. For the Galerkin semi-discretization we show a discrete energy identity mirroring the continuum estimate. For computation, we evaluate the DDF in real space with a matrix-free near/far scheme and advance in time using a second-order IMEX splitting method that treats diffusion and relaxation implicitly and precession explicitly. The explicit stage applies a Rodrigues rotation at DDF quadrature points followed by an L2 projection, enabling stable multi-cycle lab-frame simulations. We validate against three closed-form benchmarks and quantify curved-boundary effects by comparing mapped finite elements with a voxel-mask finite-difference baseline on spherical Neumann eigenmode decay. These results provide an analyzable and reproducible route for Bloch-DDF dynamics on bounded domains with complex geometry.

关键词: Bloch equations, distant dipolar field, finite-element method, MRI contrast, weak formulation, Neumann diffusion, numerical simulation, spin dynamics

341. ❌ Assessing the impact of nodal surface optimization in fixed-node diffusion Monte Carlo on non-covalent interactions

作者: Kousuke Nakano, Benjamin X. Shi, Dario Alfè, Andrea Zen 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04329v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究量子蒙特卡洛方法中节点表面优化对非共价相互作用预测的影响，属于计算化学领域。所有关键词均与大模型、深度学习技术原理或AI应用无关，仅最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’与科学计算有一定关联，但论文未使用AI方法，因此给5分（有一定关联）。其他关键词完全无关，均给0分。

!!! tip deepseek-chat TL;DR

该研究评估了固定节点扩散蒙特卡洛方法中节点表面优化对12种非共价相互作用化合物预测的影响，发现优化能改善氢键系统与CCSD(T)的一致性，但对色散主导系统影响可忽略。

摘要翻译

扩散量子蒙特卡洛（DMC）与耦合簇理论[CCSD(T)]是广泛用于非共价相互作用（NCIs）的基准方法。然而，近期研究报道了在若干氢键主导和色散主导的体系中存在显著差异，这引发了对两种方法所依赖近似之准确性的质疑。在DMC中，主要误差预计源于固定节点近似，其节点表面通常取自密度泛函理论或哈特里-福克计算产生的单个斯莱特行列式。本研究采用近期提出的基于自然轨道的反对称化双粒子波函数拟设，评估了节点表面优化对12种涵盖不同类型NCIs的化合物DMC预测结果的影响。我们发现，对于氢键主导体系，优化后与CCSD(T)的一致性得到改善，而对色散主导体系的影响可忽略不计。这些结果为解决氢键相互作用中的差异提供了一条实用且计算高效的路径，同时为理解色散主导体系中尚存的差异提供了见解。

摘要 (Abstract)

Diffusion quantum Monte Carlo (DMC) and coupled cluster theory [CCSD(T)] are widely-employed benchmark methods for noncovalent interactions (NCIs). However, recent studies have reported notable discrepancies across several hydrogen-bonded and dispersion-dominated systems, raising questions on the accuracy of the approximations underlying each approach. In DMC, the dominant error is expected to stem from the fixed-node approximation, where the nodal surface is typically taken from a single Slater determinant derived from a density functional theory or Hartree-Fock calculation. In this work, we assess the impact of nodal surface optimization on DMC predictions for 12 compounds spanning diverse NCIs, using a recently proposed antisymmetrized geminal power ansatz with natural orbitals. We find improved agreement with CCSD(T) for hydrogen-bonded systems, while having negligible effect for dispersion-dominated systems. These results provide a practical and computationally efficient route to resolving discrepancies in hydrogen-bonded interactions, while offering insight into the remaining differences in dispersion-dominated systems.

关键词: Diffusion Monte Carlo, fixed-node approximation, nodal surface optimization, non-covalent interactions, hydrogen-bonded systems, dispersion-dominated systems, CCSD(T), antisymmetrized geminal power

342. ❌ Elucidating Au-C Bonding via Laser Spectroscopy of Gold Monocarbide

作者: Rory M. Weldon, Danielle M. Darling, Nicole M. Albright, Kendall L. Rice, Phaedra L. Salerno, K. Cooper Stuntz, Benjamin L. Augenbraun 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04322v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文是实验物理化学研究，使用激光光谱学首次观测并表征金单碳化物（AuC）分子，研究其电子结构、振动特性、自旋轨道耦合和键解离能，作为相对论理论的基准。所有评分关键词均涉及大模型、深度学习及相关技术，而该论文完全不涉及任何人工智能、机器学习或计算模型内容，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该研究首次通过激光光谱学观测并表征了金单碳化物（AuC）分子，确定了其电子激发态、振动结构、自旋轨道耦合和Au-C键解离能，为相对论理论提供了重要实验基准。

摘要翻译

单碳化金（AuC）已通过激光光谱技术成功制备并表征，这是首次报道的AuC观测结果。我们在400纳米至700纳米范围内记录了气相AuC的光谱，将$\mathrm{X},^2Π_{1/2}( (2σ)^2 (2π)^1 )$基态到$(2σ)^2 (3σ^\ast)^1$与$(2σ)^1 (2π)^2$电子组态所产生激发态的跃迁进行了归属。通过色散荧光光谱研究了基态的振动与自旋-轨道结构、激发态的辐射分支比与辐射寿命，以及Au–C键的解离能。分子轨道图被用于阐释AuC低能电子态的性质。这些数据为相对论理论提供了重要基准，并对冷分子量子科学与精密测量研究具有参考价值。

摘要 (Abstract)

Gold monocarbide (AuC) has been produced and characterized using laser spectroscopy, representing the first reported observation of AuC. We recorded the optical spectrum of gas-phase AuC between 400 nm and 700 nm, assigning excitations from the $\mathrm{X},^2Π_{1/2}( (2σ)^2 (2π)^1 )$ ground state to states arising from the $(2σ)^2 (3σ^\ast)^1 $ and $(2σ)^1 (2π)^2 $ configurations. Dispersed-fluorescence spectra are used to study the vibrational and spin-orbit structure of the ground state, branching ratios and radiative lifetimes of the excited states, and the Au–C bond dissociation energy. A molecular orbital diagram is used to rationalize the nature of AuC’s low-lying electronic states. The data serve as valuable benchmarks of relativistic theory and are relevant to quantum science and precision measurements with cold molecules.

关键词: gold monocarbide, laser spectroscopy, electronic states, vibrational structure, spin-orbit coupling, bond dissociation energy, relativistic theory, cold molecules

343. ❌ Amplification at Equilibrium: Structural and Thermodynamic Limitations, and Implementation

作者: Hamidreza Akef, Chia-Yu Sung, Aneesh Vanguri, David Soloveichik 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.04285v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究生物化学系统中的平衡态信号放大，属于生物信息学/化学信息学领域，与"AI for Science OR Bioinformatics OR Cheminformatics"有一定关联（评5分），但论文未涉及大模型、深度学习或任何AI技术，而是专注于分子系统的理论分析和实验验证，因此其他所有关键词均完全无关（评0分）。

!!! tip deepseek-chat TL;DR

该论文研究了平衡态分子信号放大的基本限制，证明了二聚体网络无法实现平衡态放大，提出了三聚体放大器设计并验证了其可行性，同时推导了适用于任何平衡网络的普适热力学界限。

摘要翻译

在自然与人工生物化学系统中，放大微弱分子信号至关重要。尽管大多数放大方案依赖于非平衡态运行，借助动力学势垒和燃料驱动的级联反应，但通过添加分析物来改变能量景观，在热力学平衡状态下实现放大也是可能的。平衡放大具有吸引力，因为原则上它可以无限期保持在未触发状态。本研究建立了基于平衡放大的基本结构与热力学极限。我们首先证明二聚化网络——即仅限于最多由两个单体组成的复合物系统——本质上无法实现平衡放大。这一“不可行定理”解释了先前欠互补性“链置换”设计中缺乏放大能力的原因。随后，我们证明允许三聚体复合物可以突破这一限制。我们提出了一种基于等构三聚体的放大器，其输出能保持输入信号的尺寸，从而实现模块化组合，并通过实验验证了该设计，实现了接近理论值$2\times$的放大倍数。最后，我们推导出适用于任意平衡网络的普适热力学界限：最大放大倍数与分析物和放大器组分之间的相互作用自由能呈线性关系。对于核酸系统而言，这意味着分析物长度必须随目标放大倍数线性增长，且在分析物固定的情况下，模块化放大器的组合会产生收益递减效应。这些结果共同界定了平衡放大的结构与能量边界，并严格论证了实现高增益必须采用非平衡方法的必要性。

摘要 (Abstract)

Amplifying weak molecular signals is essential in both natural and engineered biochemical systems. While most amplification schemes operate out of equilibrium, relying on kinetic barriers and fuel-driven cascades, it is also possible to amplify at thermodynamic equilibrium by shifting the energy landscape upon addition of an analyte. Equilibrium amplification is appealing because, in principle, it can remain indefinitely in the untriggered state. In this work, we establish fundamental structural and thermodynamic limits on equilibrium-based amplification. We first prove that dimerization networks–systems restricted to complexes of at most two monomers–are inherently incapable of equilibrium amplification. This no-go theorem explains the absence of amplification in prior undercomplementary “strand commutation” designs. We then show that allowing trimeric complexes breaks this barrier. We propose an isometric trimer-based amplifier whose output preserves the size of the input, enabling modular composition, and validate it experimentally, achieving an amplification factor close to the expected $2\times$. Finally, we derive universal thermodynamic bounds applicable to any equilibrium network regardless of complex size: the maximum amplification factor scales linearly with the free energy of interaction between the analyte and the amplifier components. For nucleic acid systems, this implies that the analyte length must grow linearly with the desired amplification factor, and that composing modular amplifiers yields diminishing returns for a fixed analyte. Together, these results delineate the structural and energetic boundaries of equilibrium amplification and rigorously justify the necessity of out-of-equilibrium approaches for achieving high gain.

关键词: equilibrium amplification, thermodynamic limits, dimerization networks, trimeric complexes, nucleic acid systems, amplification factor, free energy, modular composition

344. ❌ Physics of the droplet-to-ion transition in electrosprays of highly conducting liquids

作者: Manel Caballero-Pérez, Manuel Gamero-Castaño 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.04242v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究高导电液体电喷雾中从液滴主导到离子主导的物理机制，属于实验物理学和流体动力学领域。所有关键词均涉及大模型、深度学习、AI技术及其应用，而论文内容完全不涉及任何人工智能、机器学习或计算模型技术，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文研究了高导电液体电喷雾中从液滴主导到离子主导连续转变的物理机制，通过实验和建模揭示了离子溶剂化能量、中性质量损失和离解极限等关键因素，并推导出电喷雾推进器最大比冲的解析表达式。

摘要翻译

本研究探讨了高电导率液体电喷雾中从液滴主导区向离子主导区连续转变的物理机制。我们采用飞行时间谱法和直接流量测量技术，对四种离子液体的电喷雾进行了表征。在液滴主导区，射流破碎过程表现出自相似的对数正态质荷分布，且变异系数恒定。在混合区与离子主导区，发射离子的平均溶剂化状态随流量降低而减弱，这与主要离子发射区向温度更低的锥射流颈部移动的趋势一致。通过对破碎后液滴群体中的离子蒸发过程进行建模，我们估算出离子溶剂化能约为 $ΔG_0 \gtrsim 1.9$~eV，该数值难以与泰勒锥尖端的无射流离子发射机制相协调。此外，我们确定了高电导率电喷雾在接近最小流量时性能的两个基本限制：由小液滴蒸发驱动的大量中性质量损失，以及体相液体中有限自由离子分数所施加的解离极限。该解离极限推导出了电喷雾推力器最大比冲的解析表达式，其与多种推进剂和电喷雾源的实验数据高度吻合。

摘要 (Abstract)

We investigate the physical mechanisms governing the continuous transition from the droplet-dominated to the ion-dominated regime in electrosprays of highly conducting liquids. We characterize electrosprays of four ionic liquids using time-of-flight spectrometry and direct flow rate measurements. In the droplet regime, the jet breakup process exhibits self-similar lognormal mass-to-charge distributions with a constant coefficient of variation. In the mixed and ionic regimes, the average solvation state of the emitted ions decreases with decreasing flow rate, consistent with a shift of the primary ion emission zone toward the cooler cone-jet neck. Modeling ion evaporation from the post-breakup droplet population yields an estimate for the ion solvation energy of $ΔG_0 \gtrsim 1.9$~eV, a value difficult to reconcile with jet-less ion emission from a Taylor cone tip. Furthermore, we identify two fundamental limits on the performance of highly conducting electrosprays near minimum flow rate: substantial neutral mass losses driven by the evaporation of small droplets, and a dissociation limit imposed by the finite fraction of free ions in the bulk liquid. The dissociation limit yields an analytical expression for the maximum specific impulse of electrospray thrusters, showing excellent agreement with experimental data across multiple propellants and electrospray sources.

关键词: electrospray, ionic liquids, droplet-to-ion transition, ion evaporation, solvation energy, specific impulse, thruster performance, time-of-flight spectrometry

345. ❌ PyGSC: A Python tool for correcting Kohn-Sham orbital energies by mitigating the delocalization error of density functional approximations

作者: Zipeng An, Xiaolong Yang, Xiao Zheng, Weitao Yang 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.04076v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文专注于密度泛函理论（DFT）的改进，特别是通过修正交换相关势的微扰表达式来减轻离域误差，并开发了PyGSC工具进行实现和验证。所有关键词均与大语言模型、深度学习技术原理或AI应用直接相关，但论文内容属于计算化学领域，仅与最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（AI在科学计算中的应用），其他关键词完全无关。

!!! tip deepseek-chat TL;DR

该论文解决了密度泛函近似中的离域误差问题，通过改进准粒子能量密度泛函理论方法，开发了PyGSC工具，在原子和分子基准测试中显著提高了电子亲和能、电离势和准粒子能量的预测准确性。

摘要翻译

密度泛函近似（DFAs）存在离域误差，这限制了其在预测电子亲和能（EAs）、电离势（IPs）以及准粒子能量方面的准确性。本研究通过对交换关联势的微扰表达式进行改进，对密度泛函理论准粒子能量（QE-DFT）方法进行了理论优化，从而实现了对分子体系更一致性的描述。我们进一步开发了一个基于PySCF库的开源Python程序——PyGSC，该程序实现了改进的QE-DFT框架。在主族原子和G2/97分子数据集上的基准测试表明，改进的QE-DFT方法优于原始DFAs，其三阶修正对EA和IP预测的平均绝对偏差可降至0.3电子伏特以下。在DNA/RNA核碱基的偶极束缚态上的应用进一步验证了QE-DFT方法相对于原始DFAs的优越性，为预测大分子体系的电子性质提供了一种高效且精确的途径。

摘要 (Abstract)

Density functional approximations (DFAs) suffer from delocalization error, which limits their accuracy in predicting electron affinities (EAs), ionization potentials (IPs), and quasiparticle energies. In this work, we present a theoretical refinement of the quasiparticle energies from density functional theory (QE-DFT) method by improving the perturbative expression for the exchange-correlation potential, leading to a more consistent description of molecular systems. We further develop an open-source Python program, PyGSC, built upon the PySCF library, which implements the modified QE-DFT framework. Benchmark tests on main-group atoms and G2/97 molecules demonstrate that the modified QE-DFT method outperforms the original DFAs, with third-order corrections achieving mean absolute deviations below 0.3 eV for EA and IP predictions. Application to dipole-bound states of DNA/RNA nucleobases further validates the superiority of the QE-DFT approach over original DFAs, offering an efficient and accurate approach for predicting electronic properties in large molecular systems.

关键词: Density functional approximations, Delocalization error, Quasiparticle energies, PyGSC, Electronic properties, Molecular systems, Benchmark tests, Python tool

Token 消耗统计

总计: 1,080,005 tokens（输入 736,986 / 输出 343,019）

模型	输入	输出	合计
deepseek-chat	618,279	343,019	961,298
glm-4.7	118,707	0	118,707

📊 ArXiv 研究报告 (2026-04-09)

📌 配置信息

关键词列表（共 27 个，总权重 27.0）

评分设置

📈 论文统计

⭐ 及格论文详细分析

1. UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learni

2. Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small

3. Improving Sparse Memory Finetuning

4. BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detect

5. Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation

6. MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language

7. HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

8. Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

9. Mechanistic Circuit-Based Knowledge Editing in Large Language Models

10. The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

11. From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails i

📋 所有论文列表

1. ✅ UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning

2. ✅ Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models

3. ✅ Improving Sparse Memory Finetuning

4. ✅ BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection

5. ✅ Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation

6. ✅ MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models

7. ✅ HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

8. ✅ Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

9. ✅ Mechanistic Circuit-Based Knowledge Editing in Large Language Models

10. ✅ The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

11. ✅ From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI

12. ❌ A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

13. ❌ How LLMs Follow Instructions: Skillful Coordination, Not a Universal Mechanism

14. ❌ Exclusive Unlearning

15. ❌ FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

16. ❌ Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

17. ❌ On the Role of Fault Localization Context for LLM-Based Program Repair

18. ❌ Disentangling MLP Neuron Weights in Vocabulary Space

19. ❌ Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset

20. ❌ Target Policy Optimization

21. ❌ FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation–Full Version

22. ❌ Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

23. ❌ Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation

24. ❌ SignalClaw: LLM-Guided Evolutionary Synthesis of Interpretable Traffic Signal Control Skills

25. ❌ Content Fuzzing for Escaping Information Cocoons on Digital Social Media

26. ❌ Turbulence-like 5/3 spectral scaling in contextual representations of language as a complex system

27. ❌ Channel-wise Retrieval for Multivariate Time Series Forecasting

28. ❌ Effective Dynamics and Transition Pathways from Koopman-Inspired Neural Learning of Collective Variables

29. ❌ MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

30. ❌ DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models

31. ❌ In-Place Test-Time Training

32. ❌ Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

33. ❌ Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

34. ❌ Who Governs the Machine? A Machine Identity Governance Taxonomy (MIGT) for AI Systems Operating Across Enterprise and Geopolitical Boundaries

35. ❌ Shot-Based Quantum Encoding: A Data-Loading Paradigm for Quantum Neural Networks

36. ❌ PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

37. ❌ Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

38. ❌ ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

39. ❌ Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

40. ❌ Gym-Anything: Turn any Software into an Agent Environment

41. ❌ Artificial Intelligence and the Structure of Mathematics

42. ❌ LLM4CodeRE: Generative AI for Code Decompilation Analysis and Reverse Engineering

43. ❌ Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

44. ❌ LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

45. ❌ Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors

46. ❌ Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

47. ❌ Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

48. ❌ CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments

49. ❌ Governance and Regulation of Artificial Intelligence in Developing Countries: A Case Study of Nigeria

50. ❌ Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

51. ❌ The Model Agreed, But Didn’t Learn: Diagnosing Surface Compliance in Large Language Models

52. ❌ Flowr – Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

53. ❌ A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms

54. ❌ Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

55. ❌ Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

56. ❌ Polynomial-Time Algorithm for Thiele Voting Rules with Voter Interval Preferences

57. ❌ Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration

58. ❌ MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

59. ❌ Context-Value-Action Architecture for Value-Driven Large Language Model Agents

60. ❌ Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning

61. ❌ “I See What You Did There”: Can Large Vision-Language Models Understand Multimodal Puns?

62. ❌ ReLU Networks for Exact Generation of Similar Graphs