📊 ArXiv Research Report (2026-04-15)

Generated: 2026-04-15 09:31:39
Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
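
The passing-score arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name, parameter names, and the `max_total` bound are illustrative assumptions, not part of the report's actual pipeline.

```python
# Illustrative check of the passing-score formula; all names here are assumptions.
def passing_score(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Return base + slope * total_weight, the formula stated in the report."""
    return base + slope * total_weight


weights = [1.0] * 27           # 27 keywords, each with weight 1.0
total_weight = sum(weights)    # 27.0
threshold = passing_score(total_weight)
max_total = 15 * len(weights)  # upper bound if every keyword hit the per-keyword maximum

print(round(threshold, 1), max_total)
```

With 27 keywords of weight 1.0 the threshold rounds to 26.6, matching the value reported above.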

📈 Paper statistics

Total fetched: 322 papers
Passing papers: 12 (3.7%)
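
The pass rate follows directly from the two counts. The sketch below uses synthetic scores that reproduce only the reported counts (12 of 322); the `>=` comparison is an assumption about how "passing" is decided, and `pass_rate` is a hypothetical helper.

```python
# Illustrative filter-and-count; the >= comparison is an assumed passing rule.
def pass_rate(scores: list[float], threshold: float) -> tuple[int, float]:
    """Return (number of passing papers, percentage of all papers)."""
    passed = sum(1 for s in scores if s >= threshold)
    return passed, 100.0 * passed / len(scores)


# Synthetic scores: 12 above the threshold and 310 below it.
scores = [30.0] * 12 + [10.0] * 310
passed, rate = pass_rate(scores, threshold=26.6)
print(passed, f"{rate:.1f}%")
```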

⭐ Detailed analysis of passing papers

1. OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Authors: Junfu Pu, Yuxin Chen, Teng Wang, Ying Shan
Journal/Source: arxiv
Published: 2026-04-13
arXiv link: http://arxiv.org/abs/2604.11102v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

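Scoring rationale: The paper proposes OmniScript, an 8B-parameter multimodal large language model (MLLM) focused on long-form video understanding and script generation. Core relevant keywords: 1) “Large Language Models” (10 points): the paper explicitly builds an MLLM, which falls within the scope of large models; 2) “Post-training” (10 points): training uses supervised fine-tuning (SFT); 3) “RLHF” (10 points): the model is optimized with reinforcement learning; 4) “Chain of Thought” (10 points): CoT is used for plot and character reasoning. Other relevant keywords: “Pre-training” (5 points): the work involves the model training pipeline; “Context Window Extension” (5 points): processing long videos requires long-context capability; “System 2 Thinking” (5 points): CoT reasoning involves deliberate thinking. The remaining keywords have no direct connection to the paper's audio-visual script generation task and score 0.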
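
The total of 55.0 is consistent with summing the per-keyword scores in the table. The sketch below assumes each keyword's score is weight × relevance, capped at the per-keyword maximum of 15; that rule is inferred from the tables rather than documented, and `total_score` is a hypothetical helper.

```python
MAX_PER_KEYWORD = 15.0  # "maximum score per keyword" from the scoring settings


def total_score(relevance: dict[str, float], weights: dict[str, float]) -> float:
    """Sum weight * relevance over keywords, capping each term (assumed rule)."""
    return sum(
        min(weights.get(keyword, 1.0) * value, MAX_PER_KEYWORD)
        for keyword, value in relevance.items()
    )


# Non-zero relevance values from the OmniScript table; all other keywords are 0.
omniscript = {
    "Large Language Models": 10.0,
    "Pre-training": 5.0,
    "Post-training": 10.0,
    "RLHF": 10.0,
    "Context Window Extension": 5.0,
    "Chain of Thought": 10.0,
    "System 2 Thinking": 5.0,
}
weights = {keyword: 1.0 for keyword in omniscript}

score = total_score(omniscript, weights)
print(score, score >= 26.6)
```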

!!! tip deepseek-chat TL;DR

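The paper targets long-form video understanding and script generation and proposes OmniScript, a multimodal large language model. Trained with chain-of-thought supervised fine-tuning followed by reinforcement learning, it achieves performance comparable to state-of-the-art proprietary models while remaining parameter-efficient.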

Abstract

Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.

Keywords: multimodal large language models, video-to-script generation, long-form video understanding, chain-of-thought reasoning, supervised fine-tuning, reinforcement learning, temporal localization, audio-visual language model

2. UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Authors: Yijuan Liang, Xinghao Chen, Yifan Ge, Ziyi Wu, Hao Wu, Changyu Zeng, Wei Xing, Xiaoyu Shen
Journal/Source: arxiv
Published: 2026-04-13
arXiv link: http://arxiv.org/abs/2604.11557v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

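Scoring rationale: The core of UniToolCall is a unified framework for tool learning in LLM agents, covering toolset construction, dataset generation, and evaluation. It is therefore highly relevant to “Large Language Models”, “LLM Agents”, and “Tool Use” (10 points each), since the paper directly studies the tool-calling capability of LLM agents. It is relevant to “Post-training” (10 points) because the paper improves performance by fine-tuning Qwen3-8B. It is partially relevant to “Chain of Thought” and “System 2 Thinking” (5 points each), because the paper involves multi-turn reasoning and introduces the Anchor Linkage mechanism to strengthen cross-turn dependencies, which falls under multi-step reasoning. Other keywords such as MoE, quantization, RAG, and alignment are not addressed in the paper and score 0.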

!!! tip deepseek-chat TL;DR

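The paper addresses three problems in tool learning for LLM agents: inconsistent interaction representations, neglect of the structural distribution of trajectories, and incompatible evaluation benchmarks. It proposes UniToolCall, a unified framework that builds a large tool pool and a hybrid training corpus and introduces the Anchor Linkage mechanism to strengthen multi-turn reasoning. Experiments show that a model fine-tuned within this framework significantly outperforms commercial models, including GPT, on tool-use performance.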

Abstract

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query–Action–Observation–Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.
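
To make the Query–Action–Observation–Answer idea concrete, here is one possible way to model such a record in Python. This is a hedged sketch only: the abstract does not give the schema, so the class names, field names, the `get_weather` tool, and the rule that "multi-hop" means more than one sequential step are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Action:
    tool: str                  # hypothetical tool name
    arguments: dict[str, Any]  # arguments passed to the tool


@dataclass
class Step:
    # Assumption: actions within one step run in parallel; successive steps run serially.
    actions: list[Action]
    observations: list[str] = field(default_factory=list)


@dataclass
class Turn:
    query: str
    steps: list[Step]
    answer: str

    def is_multi_hop(self) -> bool:
        """Assumed definition: more than one sequential step."""
        return len(self.steps) > 1

    def has_parallel_calls(self) -> bool:
        """True if any step issues more than one call at once."""
        return any(len(step.actions) > 1 for step in self.steps)


turn = Turn(
    query="What is the weather in Paris and in Rome?",
    steps=[
        Step(
            actions=[
                Action("get_weather", {"city": "Paris"}),
                Action("get_weather", {"city": "Rome"}),
            ],
            observations=["18°C, cloudy", "24°C, sunny"],
        )
    ],
    answer="Paris is 18°C and cloudy; Rome is 24°C and sunny.",
)
print(turn.is_multi_hop(), turn.has_parallel_calls())
```

A multi-turn conversation would then be a list of `Turn` objects, which is one way function-call, turn, and conversation level evaluation could each get a natural unit to score.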

Keywords: LLM Agents, Tool Use, Function Calling, Multi-turn Reasoning, Fine-tuning, Unified Framework, Evaluation Benchmark, Anchor Linkage

3. Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retriev

Authors: Dzenan Hamzic, Florian Skopik, Max Landauer, Markus Wurzenberger, Andreas Rauber
Journal/Source: arxiv
Published: 2026-04-13
arXiv link: http://arxiv.org/abs/2604.11419v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	15.0/10	15.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

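Scoring rationale: The paper studies the application and improvement of RAG systems for cyber threat intelligence (CTI), so it is highly relevant to “Retrieval-Augmented Generation” (15 points). It explicitly builds on language models, making it relevant to “Large Language Models” (10 points). It proposes and evaluates an “agentic variant”, which relates to “LLM Agents” (10 points). It handles complex queries that require reasoning, giving some relevance to “Chain of Thought” and “System 2 Thinking” (5 points each). Other keywords such as MoE, quantization, and alignment are not addressed in the paper and score 0.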

!!! tip deepseek-chat TL;DR

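The paper systematically evaluates four RAG architectures for cyber threat intelligence analysis and finds that a hybrid retrieval approach grounded in a knowledge graph improves answer quality on multi-hop reasoning questions by up to 35% compared with conventional vector retrieval.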

Abstract

Cyber threat intelligence (CTI) analysts must answer complex questions over large collections of narrative security reports. Retrieval-augmented generation (RAG) systems help language models access external knowledge, but traditional vector retrieval often struggles with queries that require reasoning over relationships between entities such as threat actors, malware, and vulnerabilities. This limitation arises because relevant evidence is often distributed across multiple text fragments and documents. Knowledge graphs address this challenge by enabling structured multi-hop reasoning through explicit representations of entities and relationships. However, multiple retrieval paradigms, including graph-based, agentic, and hybrid approaches, have emerged with different assumptions and failure modes. It remains unclear how these approaches compare in realistic CTI settings and when graph grounding improves performance. We present a systematic evaluation of four RAG architectures for CTI analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic variant that repairs failed graph queries, and a hybrid approach combining graph queries with text retrieval. We evaluate these systems on 3,300 CTI question-answer pairs spanning factual lookups, multi-hop relational queries, analyst-style synthesis questions, and unanswerable cases. Results show that graph grounding improves performance on structured factual queries. The hybrid graph-text approach improves answer quality by up to 35 percent on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.
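
The hybrid idea of combining graph queries with text retrieval can be illustrated with a toy example. This is a sketch under strong assumptions, not the paper's system: the graph is a plain dictionary, word overlap stands in for vector search, and every entity, relation, and text chunk below is made up.

```python
Graph = dict[tuple[str, str], list[str]]


def graph_lookup(graph: Graph, entity: str, relation: str) -> list[str]:
    """Return explicit facts for (entity, relation) as short sentences."""
    return [f"{entity} {relation} {obj}" for obj in graph.get((entity, relation), [])]


def text_retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (a stand-in for vector search)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(c.lower().split())), c) for c in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for overlap, chunk in ranked[:k] if overlap > 0]


def hybrid_evidence(graph: Graph, chunks: list[str], entity: str,
                    relation: str, query: str) -> list[str]:
    """Combine graph facts with retrieved text; either source may be empty."""
    return graph_lookup(graph, entity, relation) + text_retrieve(chunks, query)


# Entirely made-up toy data.
graph = {("ActorX", "uses"): ["MalwareY"], ("MalwareY", "exploits"): ["CVE-0000-0000"]}
chunks = [
    "ActorX was observed targeting energy providers.",
    "MalwareY spreads through phishing attachments.",
    "Unrelated note about patch schedules.",
]
evidence = hybrid_evidence(graph, chunks, "ActorX", "uses", "Which malware does ActorX use?")
print(evidence)
```

In this toy run the graph supplies the relational fact that word overlap alone would miss, while text retrieval adds narrative context, which mirrors the motivation the abstract gives for the hybrid approach.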

Keywords: Retrieval-Augmented Generation, RAG, Cyber Threat Intelligence, Knowledge Graphs, Multi-hop Reasoning, Agentic Retrieval, Graph-based Retrieval, Hybrid Retrieval

4. Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language

Authors: Peijie Wang, Ming-Liang Zhang, Jun Cao, Chao Deng, Dekang Ran, Hongda Sun, Pi Bu, Xuan Zhang, Yingyao Wang, Jun Song, Bo Zheng, Fei Yin, Cheng-Lin Liu
Journal/Source: arxiv
Published: 2026-04-13
arXiv link: http://arxiv.org/abs/2604.11600v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

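Scoring rationale: The paper studies multimodal large language models (MLLMs) applied to geometric reasoning, so it is highly relevant to “Large Language Models” (10 points). Methodologically it explicitly uses “Supervised Fine-tuning” (10 points) and “Reinforcement Learning via Verifiable Rewards” (related to RLHF/RLAIF, 5 points). The work involves geometric reasoning, giving some relevance to “Chain of Thought” and “System 2 Thinking” (5 points each). Geometry, as part of the sciences, relates to “AI for Science” (5 points). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not addressed in the paper and score 0.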

!!! tip deepseek-chat TL;DR

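The paper addresses the perception bottleneck of multimodal large language models in geometric reasoning. It proposes a formal language that unifies plane and solid geometry and, with a training paradigm combining supervised fine-tuning and reinforcement learning, achieves state-of-the-art geometry parsing performance and significantly improves MLLMs on downstream geometric reasoning tasks.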

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to the perception bottleneck regarding fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry, which requires spatial understanding, remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. To ensure syntactic correctness and geometric consistency, we propose a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs’ capabilities for downstream geometry reasoning tasks. Our data and code are available at Geoparsing.

Keywords: Multimodal Large Language Models, Geometric Reasoning, Formal Language, Supervised Fine-Tuning, Reinforcement Learning, GDP-29K Dataset, Plane and Solid Geometry, Parsing Performance

5. Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

Authors: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He
Journal/Source: arxiv
Published: 2026-04-13
arXiv link: http://arxiv.org/abs/2604.11088v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

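Scoring rationale: The paper studies how the performance of AI coding agents (LLM agents) is affected by natural-language rule files (e.g., .cursorrules), which is applied research on large models. Highly relevant keywords: “Large Language Models” (the paper builds on a state-of-the-art coding agent, which is essentially an LLM application) and “LLM Agents” (it directly studies coding agents). Moderately relevant keywords: “Instruction Tuning” (rule files resemble instruction tuning), “Tool Use” (coding agents use tools), and “In-context Learning” (rules influence the agent through in-context prompting). The remaining keywords are unrelated to the paper's technical details (e.g., MoE, quantization, inference acceleration) or domain (e.g., bioinformatics).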

!!! tip deepseek-chat TL;DR

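In the first large-scale empirical evaluation of its kind, the study finds that rule files for AI coding agents (e.g., .cursorrules) improve performance by 7-14 percentage points through context priming rather than specific instruction, and that negative constraints help while positive directives hurt, exposing a hidden reliability risk in such rules.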

Abstract

Developers increasingly guide AI coding agents through natural language instruction files (e.g., CLAUDE.md, .cursorrules), yet no controlled study has measured whether these rules actually improve agent performance or which properties make a rule beneficial. We scrape 679 such files (25,532 rules) from GitHub and conduct the first large-scale empirical evaluation, running over 5,000 agent runs with a state-of-the-art coding agent on SWE-bench Verified. Rules improve performance by 7–14 percentage points, but random rules help as much as expert-curated ones – suggesting rules work through context priming rather than specific instruction. Negative constraints (“do not refactor unrelated code”) are the only individually beneficial rule type, while positive directives (“follow code style”) actively hurt – a pattern we analyze through the lens of potential-based reward shaping (PBRS). Moreover, individual rules are mostly harmful in isolation yet collectively helpful, with no degradation up to 50 rules. These findings expose a hidden reliability risk – well-intentioned rules routinely degrade agent performance – and provide a clear principle for safe agent configuration: constrain what agents must not do, rather than prescribing what they should.
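
For readers unfamiliar with potential-based reward shaping (PBRS), its standard definition from the reinforcement learning literature (Ng, Harada, and Russell, 1999) is sketched below. The abstract does not say how rules are mapped onto a potential function, so this shows only the general form; the symbols $\Phi$ (potential), $\gamma$ (discount factor), and $R$ (original reward) are the usual textbook notation, not necessarily the paper's.

```latex
F(s, a, s') = \gamma\,\Phi(s') - \Phi(s),
\qquad
R'(s, a, s') = R(s, a, s') + F(s, a, s')
```

Shaping terms of this form leave the set of optimal policies unchanged, whereas an arbitrary reward bonus can change which behavior is optimal.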

Keywords: AI coding agents, natural language rules, context priming, negative constraints, potential-based reward shaping, agent performance, SWE-bench, reliability risk

6. METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Authors: Pengfeng Li, Chen Huang, Chaoqun Hao, Hongyao Chen, Xiao-Yong Wei, Wenqiang Lei, See-Kiong Ng
Journal/Source: arxiv
Published: 2026-04-13
arXiv link: http://arxiv.org/abs/2604.11502v1

Score: 34.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

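Scoring rationale: The paper studies the contextual causal reasoning ability of large language models (LLMs), so it is highly relevant to “Large Language Models” (10 points). It analyzes multi-step and in-depth reasoning, which relates to “Chain of Thought” and “System 2 Thinking” (8 points each). Its mechanistic analysis via error pattern identification and internal information flow tracing relates to “Mechanistic Interpretability” (8 points). The paper does not involve the specific techniques or applications of the other keywords.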

!!! tip deepseek-chat TL;DR

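The paper proposes the METER benchmark to systematically evaluate the multi-level causal reasoning of large language models under a unified context setting. It finds that model ability declines significantly at higher levels of the causal hierarchy and, through mechanistic analysis, identifies two primary failure modes.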

Abstract

Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower levels of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to reduced performance. We believe our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER.

Keywords: Large Language Models, Contextual Causal Reasoning, Causal Hierarchy, Benchmark Evaluation, Mechanistic Analysis, Error Pattern Identification, Internal Information Flow Tracing, Reasoning Capabilities

7. CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

Authors: Qixian Huang, Hongqiang Lin, Tong Fu, Yingsen Wang, Zhenghui Fu, Qirui Wang, Yiding Sun, Dongxu Zhang
Journal/Source: arxiv
Published: 2026-04-13
arXiv link: http://arxiv.org/abs/2604.10973v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

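Scoring rationale: The paper proposes the CFMS framework, which combines multimodal large language models (MLLMs) for visual perception with a symbolic reasoning engine for tabular reasoning, an application-level innovation of large models on a specific task (tabular reasoning). Keyword relevance: 1) “Large Language Models” (8 points): the paper explicitly uses MLLMs as a core component; 2) “Chain of Thought” (10 points): the paper directly cites and improves on CoT methods, which form its core theoretical basis; 3) “System 2 Thinking” (8 points): the two-stage design reflects deliberate, step-by-step reasoning; 4) “Small Language Models” (5 points): the paper notes that the framework is robust with smaller backbone models, giving some relevance. Other keywords such as MoE, Scaling Laws, and RLHF are not addressed in the paper and score 0.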

!!! tip deepseek-chat TL;DR

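For reasoning over tabular data, the paper proposes CFMS, a two-stage multimodal synthesis framework that moves from coarse-grained visual perception to fine-grained symbolic reasoning. It achieves competitive accuracy on the WikiTQ and TabFact benchmarks and is shown to be robust on large tables and with small backbone models.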

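Abstract (translated)

Reasoning over tabular data is a key capability for tasks such as question answering and fact verification, as it requires models to understand both free-form natural-language questions and semi-structured tables. However, although methods such as Chain-of-Thought introduce reasoning chains, purely symbolic methods are inherently limited by their blindness to holistic visual patterns. To address this, we propose the Coarse-to-Fine Multimodal Synthesis framework, a novel two-stage paradigm that hierarchically decouples high-level visual perception from fine-grained symbolic reasoning. In the coarse stage, CFMS uses a multimodal large language model to synthesize a multi-perspective knowledge tuple in a single pass. This tuple then serves as a dynamic reasoning map that guides the fine stage, in which a symbolic engine executes a targeted and efficient sequence of iterative operations over the table. Extensive experiments on the WikiTQ and TabFact benchmarks show that CFMS achieves competitive accuracy. The framework is particularly robust when handling large tables and when instantiated with smaller backbone models, validating its effectiveness and generalizability.

Abstract (original)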

Reasoning over tabular data is a crucial capability for tasks like question answering and fact verification, as it requires models to comprehend both free-form questions and semi-structured tables. However, while methods like Chain-of-Thought (CoT) introduce reasoning chains, purely symbolic methods are inherently limited by their blindness to holistic visual patterns. To address this, we propose the Coarse-to-Fine Multimodal Synthesis framework (CFMS), a novel two-stage paradigm that hierarchically decouples high-level visual perception from granular symbolic reasoning. In the Coarse Stage, CFMS leverages Multimodal Large Language Models (MLLMs) to perform a one-time synthesis of a multi-perspective knowledge tuple. This tuple subsequently serves as a dynamic reasoning map to guide the Fine Stage, where a symbolic engine executes a targeted and efficient sequence of iterative operations over the table. Extensive experiments on the WikiTQ and TabFact benchmarks demonstrate that CFMS achieves competitive accuracy. The framework exhibits particular robustness when handling large tables and when instantiated with smaller backbone models, validating its effectiveness and generalizability.

Keywords: Tabular Reasoning, Multimodal Synthesis, Chain-of-Thought, Multimodal Large Language Models, Coarse-to-Fine Framework, Symbolic Reasoning, Visual Perception, Knowledge Tuple
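
The CFMS abstract describes a fine stage in which a symbolic engine runs a sequence of operations over the table, guided by output from the coarse stage. The sketch below is only a rough illustration of that division of labour, not the paper's engine: the operation names (`filter`, `select`, `count`), the plan format, and the sample table are all assumptions, and the plan is hard-coded where the paper would use MLLM-produced guidance.

```python
Table = list[dict[str, str]]


def run_plan(table: Table, plan: list[tuple]) -> object:
    """Apply ("filter", column, value), ("select", column) and ("count",) steps in order."""
    result: object = table
    for step in plan:
        op = step[0]
        if op == "filter":
            _, column, value = step
            result = [row for row in result if row[column] == value]
        elif op == "select":
            _, column = step
            result = [row[column] for row in result]
        elif op == "count":
            result = len(result)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


# Made-up table and plan for illustration; in the paper the guidance comes
# from the coarse stage rather than being written by hand.
table = [
    {"year": "2019", "city": "Oslo", "medal": "gold"},
    {"year": "2020", "city": "Rome", "medal": "silver"},
    {"year": "2021", "city": "Oslo", "medal": "gold"},
]
plan = [("filter", "city", "Oslo"), ("filter", "medal", "gold"), ("count",)]
print(run_plan(table, plan))
```

Keeping the plan as plain data separates the one-time planning step from the deterministic table operations, which is the decoupling the abstract emphasises.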

8. CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Expl

Authors: WonJin Yoon, Kangyu Zhu, Ian Bulovic, Autumn Sehy, Yanjun Gao, Dmitriy Dligach, Majid Afshar, Timothy A. Miller Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11801v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

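Scoring rationale: The paper's core contribution is CLSGen, a fine-tuning framework for LLMs on binary classification tasks that addresses the unreliable probability estimates and loss of explanation ability caused by conventional fine-tuning. It is therefore highly relevant (10 points each) to “Large Language Models” (LLM application), “Post-training” (fine-tuning method), and “Mechanistic Interpretability” (explanation generation). Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0.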
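
The total in the score table above can be checked by hand. The sketch below assumes that a paper's total is the sum of weight × relevance over all keywords (which matches the per-row scores shown) and uses the passing-score formula from the report configuration, 5.0 + 0.8 × total weight. The function and variable names are illustrative and are not taken from the report generator.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum weight * relevance over (weight, relevance) rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# CLSGen: three keywords rated 10.0 and the other 24 rated 0.0, all with weight 1.0.
clsgen_rows = [(1.0, 10.0)] * 3 + [(1.0, 0.0)] * 24

threshold = passing_score(total_weight=27.0)
score = total_score(clsgen_rows)
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With 27 keywords of weight 1.0 the threshold works out to 26.6, so a paper needs roughly three strongly matching keywords to pass.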

!!! tip deepseek-chat TL;DR

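The paper proposes the CLSGen framework, which uses dual-head fine-tuning to address unreliable probability estimates and the loss of explanation ability when LLMs are applied to binary classification. Experiments show that it outperforms baselines on classification metrics, with explanations that align well with the predicted labels and are highly readable.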

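Abstract (translated)

With recent advances in large language models (LLMs), there is growing interest in applying them to complex and challenging problems. Modern LLMs can process long contexts and generate verbalized explanations, showing great potential for real-world applications. However, a key obstacle to deploying LLMs for practical decision-making is their inability to provide reliable quantitative probabilities. Although task-specific fine-tuning of LLMs with traditional discriminative objectives (similar to encoder-only models) can yield probability estimates, it often leads to catastrophic forgetting and linguistic collapse. As a result, the model loses its ability to generate explanations, which severely undermines its interpretability and usability. To address this challenge, we propose CLSGen, a novel LLM fine-tuning framework designed for binary classification tasks. The framework comprises a new model architecture, training methodology, and data construction strategy, and aims to deliver robust probability estimates without sacrificing the model's inherent explanation-generation ability. Experiments on multiple benchmark datasets show that models fine-tuned with CLSGen outperform existing baselines on classification metrics (AUROC and F1 score). On the explanation side, the results show strong agreement between predicted labels and generated justifications, along with high readability.

Abstract (original)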

With the recent progress of Large Language Models (LLMs), there is a growing interest in applying these models to solve complex and challenging problems. Modern LLMs, capable of processing long contexts and generating verbalized explanations, offer significant potential in addressing real-world applications. However, a critical hurdle in deploying LLMs for practical decision-making is their inability to provide reliable, quantitative probabilities. While task-specific fine-tuning of LLMs using traditional discriminative objectives (similar to encoder-only models) can yield probability estimates, this often leads to catastrophic forgetting and linguistic collapse. Consequently, the model loses its ability to generate explanations, severely undermining its interpretability and usability. To address this challenge, we propose CLSGen, a novel LLM fine-tuning framework designed for binary classification tasks. The CLSGen framework encompasses a new model architecture, training methodology, and data construction strategy to enable robust probability estimation without sacrificing the model’s inherent explanation-generation capabilities. Experimental results across multiple benchmark datasets demonstrate that models fine-tuned with CLSGen outperform existing baselines in classification metrics (AUROC and F1-score). Regarding explanation, the results showed strong alignment between predicted labels and generated justifications, as well as high readability.

Keywords: Large Language Models, fine-tuning, binary classification, probability estimation, explanation generation, interpretability, CLSGen, dual-head architecture
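
The CLSGen abstract reports AUROC and F1, and AUROC in particular depends on the model emitting a usable probability rather than only a label. As a reminder of what the two metrics measure, here is a small pure-Python sketch using the standard definitions. The labels and scores are made up for illustration and are not results from the paper.

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1(labels: list[int], scores: list[float], threshold: float = 0.5) -> float:
    """F1 of the hard predictions obtained by thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Made-up example: two positives, two negatives.
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7]
print(auroc(labels, scores), f1(labels, scores))
```

AUROC is computed from the ranking of the scores, while F1 only sees the thresholded labels, which is why a model that loses calibrated probabilities can still look acceptable on F1 alone.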

9. Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks

Authors: Yuqing Yang, Tengxiao Liu, Wang Bill Zhu, Taiwei Shi, Linxin Song, Robin Jia Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11610v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

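Scoring rationale: The paper studies memory extraction and self-evolving optimization for LLMs across heterogeneous tasks. It is highly relevant to “Large Language Models” (10 points) because the entire work centers on LLM assistants. It is highly relevant to “Self-Correction” OR “Self-Improvement” OR “Self-Reflection” (10 points) because it proposes a self-evolving strategy for optimizing the extraction prompt. It is highly relevant to “LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow” (10 points) because the work involves LLM-based assistants and agentic tasks. Other keywords such as MoE, SFT, and RAG are not mentioned in or relevant to the paper and score 0.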

!!! tip deepseek-chat TL;DR

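The paper studies how LLM assistants can effectively extract and retain conversational memory across heterogeneous tasks, and proposes CluE, a cluster-based self-evolving strategy that generalizes better than existing methods on the BEHEMOTH benchmark.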

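Abstract (translated)

As assistants based on large language models (LLMs) become increasingly persistent and personalized, they must extract and retain useful information from past conversations as memory. However, the kinds of information worth remembering differ considerably across tasks. We formalize the heterogeneous memory extraction task and introduce the BEHEMOTH benchmark, which repurposes 18 existing datasets spanning personalization, problem-solving, and agentic tasks and evaluates them systematically with a downstream utility-driven metric. Our empirical analysis confirms that no single static extraction prompt dominates across all task categories, and that existing self-evolving prompt optimization frameworks, originally designed for homogeneous distributions, degrade when the training tasks are heterogeneous. To address this, we propose CluE, a cluster-based self-evolving strategy that groups training examples into clusters by extraction scenario, analyzes each cluster independently, and synthesizes cross-cluster insights to update the extraction prompt. Experiments on BEHEMOTH show that CluE generalizes effectively across heterogeneous tasks (+9.04% relative gain) and consistently outperforms prior self-evolving frameworks.

Abstract (original)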

As LLM-based assistants become persistent and personalized, they must extract and retain useful information from past conversations as memory. However, the types of information worth remembering vary considerably across tasks. We formalize the heterogeneous memory extraction task and introduce BEHEMOTH, a benchmark that repurposes 18 existing datasets spanning personalization, problem-solving, and agentic tasks, using a downstream utility-driven metric for systematic evaluation. Our empirical analysis confirms that no single static extraction prompt dominates across all task categories, and that existing self-evolving prompt optimization frameworks, originally designed for homogeneous distributions, degrade when training tasks are heterogeneous. To address this, we propose CluE, a cluster-based self-evolving strategy that groups training examples into clusters by extraction scenarios, analyzes each cluster independently, and synthesizes cross-cluster insights to update the extraction prompt. Experiments on BEHEMOTH show that CluE generalizes effectively across heterogeneous tasks (+9.04% relative gain), consistently outperforming prior self-evolving frameworks.

Keywords: LLM memory extraction, heterogeneous tasks, self-evolving strategy, BEHEMOTH benchmark, CluE, agentic tasks, prompt optimization, downstream utility
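
The CluE abstract outlines three steps: group training examples into clusters by extraction scenario, analyze each cluster independently, and synthesize the cross-cluster insights into an updated extraction prompt. The sketch below shows only that control flow. The function `evolve_prompt`, its callable parameters, and the toy data are all hypothetical. In a real system the `analyze` and `synthesize` steps would presumably be LLM calls, and the abstract does not say how clusters are assigned.

```python
from collections import defaultdict
from typing import Callable


def evolve_prompt(
    prompt: str,
    examples: list[dict],
    assign_cluster: Callable[[dict], str],
    analyze: Callable[[str, list[dict]], str],
    synthesize: Callable[[str, dict[str, str]], str],
) -> str:
    """One update step: cluster, analyze each cluster, merge insights into the prompt."""
    clusters: dict[str, list[dict]] = defaultdict(list)
    for example in examples:
        clusters[assign_cluster(example)].append(example)
    insights = {name: analyze(prompt, members) for name, members in clusters.items()}
    return synthesize(prompt, insights)


# Toy stand-ins so the sketch runs without a model; none of this is from the paper.
examples = [
    {"task": "personalization", "text": "I prefer vegetarian food."},
    {"task": "problem-solving", "text": "The bug was a missing index."},
    {"task": "personalization", "text": "Call me Sam."},
]
new_prompt = evolve_prompt(
    "Extract useful memories.",
    examples,
    assign_cluster=lambda ex: ex["task"],
    analyze=lambda p, members: f"{len(members)} example(s) reviewed",
    synthesize=lambda p, ins: p + " " + "; ".join(f"[{k}] {v}" for k, v in sorted(ins.items())),
)
print(new_prompt)
```

Analyzing clusters separately keeps feedback from one task type from drowning out another, which is the failure mode the abstract attributes to frameworks designed for homogeneous data.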

10. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

Authors: Yiran Qin, Jiahua Ma, Li Kang, Wenzhan Li, Yihang Jiao, Xin Wen, Xiufeng Song, Heng Zhou, Jiwen Yu, Zhenfei Yin, Xihui Liu, Philip Torr, Yilun Du, Ruimao Zhang Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11386v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

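Scoring rationale: The paper proposes a hybrid method called Compositional Simulation, which combines classical and neural simulation to generate robot training data. Its core aim is to address the challenge of obtaining large-scale, high-quality data in robotics. Keyword relevance: 1) Some connection to “Large Language Models” and “Foundation Models” (5 points): the abstract notes that foundation models (including large language models) have advanced robot capabilities, but the paper itself does not study these models. 2) Some connection to “Scaling Laws” AND “Data Quality” (5 points): the paper focuses on generating large-scale, high-quality data but does not explicitly discuss scaling laws. 3) Some connection to “Pre-training” OR “Continual Pre-training” OR “Domain Adaptation” (5 points): the paper involves sim2real domain adaptation but does not discuss pre-training techniques specifically. 4) Highly relevant to “World Models” AND “General World Models” (10 points): the abstract explicitly names world models as one of the foundation models driving progress in robotics, and the paper's neural simulator can be viewed as an application of a world model. 5) Some connection to “AI for Science” (5 points): robotics can be viewed as an application of AI in science and engineering. The other keywords have no direct relation to the paper's themes of robot data generation, simulation, and policy training, and all score 0.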

!!! tip deepseek-chat TL;DR

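The paper proposes Compositional Simulation, a hybrid simulation method that combines classical and neural simulation to generate large-scale, high-quality robot training data, reducing the sim2real domain gap and raising the success rate of real-world policy models.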

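Abstract (translated)

Recent advances in foundation models, such as large language models and world models, have significantly improved robotics capabilities, enabling robots to carry out complex tasks autonomously. However, acquiring large-scale, high-quality robot training data remains challenging, because it usually requires substantial manual effort and struggles to cover diverse real-world environments. To address this, we propose a novel hybrid approach called Compositional Simulation, which combines classical and neural simulation to generate accurate action-video pairs while maintaining real-world consistency. Our approach uses a closed-loop real-sim-real data augmentation pipeline that leverages a small amount of real-world data to generate diverse, large-scale training datasets covering a broader range of real-world scenarios. We train a neural simulator to convert classical simulation videos into real-world representations, improving the accuracy of policy models trained in real-world environments. Through extensive experiments, we show that the method significantly narrows the sim2real domain gap, yielding higher success rates when training real-world policy models. Our work offers a scalable solution for generating robust training data and bridging the gap between simulated and real-world robotics.

Abstract (original)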

Recent advancements in foundational models, such as large language models and world models, have greatly enhanced the capabilities of robotics, enabling robots to autonomously perform complex tasks. However, acquiring large-scale, high-quality training data for robotics remains a challenge, as it often requires substantial manual effort and is limited in its coverage of diverse real-world environments. To address this, we propose a novel hybrid approach called Compositional Simulation, which combines classical simulation and neural simulation to generate accurate action-video pairs while maintaining real-world consistency. Our approach utilizes a closed-loop real-sim-real data augmentation pipeline, leveraging a small amount of real-world data to generate diverse, large-scale training datasets that cover a broader spectrum of real-world scenarios. We train a neural simulator to transform classical simulation videos into real-world representations, improving the accuracy of policy models trained in real-world environments. Through extensive experiments, we demonstrate that our method significantly reduces the sim2real domain gap, resulting in higher success rates in real-world policy model training. Our approach offers a scalable solution for generating robust training data and bridging the gap between simulated and real-world robotics.

Keywords: Compositional Simulation, robotics, data generation, neural simulator, sim2real, policy training, real-world consistency, closed-loop pipeline

11. DreamKG: A KG-Augmented Conversational System for People Experiencing Homelessness

Authors: Javad M Alizadeh, Genhui Zheng, Chiu C Tan, Yuzhou Chen, Omar Martinez, Philip McCallion, Ying Ding, Chenguang Yang, AnneMarie Tomosky, Huanmei Wu Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11703v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

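Scoring rationale: The paper presents DreamKG, a conversational system with a hybrid architecture that combines a knowledge graph (Neo4j) with large language models (LLMs) to give people experiencing homelessness reliable community service information grounded in verified data. The system uses LLMs directly (hence relevant to the first keyword, 8 points) and augments generation with a knowledge graph, which is in essence a retrieval-augmented generation (RAG) approach, so it is highly relevant to “Retrieval-Augmented Generation” (10 points). The paper also explicitly targets the hallucination problem of standard LLMs by supplying reliable data through the knowledge graph, so it is highly relevant to “Hallucination Mitigation” (10 points). The other keywords mainly concern the technical underpinnings of large models (such as MoE, Scaling Laws, training methods, inference optimization, and agents) or applications in specific scientific fields (such as bioinformatics). This paper instead focuses on a hybrid system architecture for a specific social application (services for people experiencing homelessness) and does not explore those technical details or cross-domain scientific applications, so they score 0.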

!!! tip deepseek-chat TL;DR

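To address the difficulty people experiencing homelessness face in obtaining accurate community service information, the study develops DreamKG, a conversational system that combines a knowledge graph with large language models. It reduces hallucinations through retrieval-augmented generation grounded in verified data, and a preliminary evaluation shows it outperforming Google Search AI.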

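Abstract (translated)

People experiencing homelessness (PEH) face significant barriers to obtaining timely, accurate information about community services. DreamKG addresses this with a knowledge graph-augmented conversational system that grounds its responses in verified, up-to-date data on Philadelphia organizations, services, locations, and opening hours. Unlike standard large language models (LLMs), which are prone to hallucination, DreamKG combines a Neo4j knowledge graph with structured query understanding to handle location-aware and time-sensitive queries reliably. The system performs spatial reasoning to provide distance-based recommendations and temporal filtering to handle operating hours. A preliminary evaluation shows 59% superiority over Google Search AI on relevant queries and an 84% rejection rate for irrelevant queries. This case demonstrates the potential of hybrid architectures that combine LLM flexibility with knowledge graph reliability to improve service accessibility for vulnerable populations.

Abstract (original)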

People experiencing homelessness (PEH) face substantial barriers to accessing timely, accurate information about community services. DreamKG addresses this through a knowledge graph-augmented conversational system that grounds responses in verified, up-to-date data about Philadelphia organizations, services, locations, and hours. Unlike standard large language models (LLMs) prone to hallucinations, DreamKG combines Neo4j knowledge graphs with structured query understanding to handle location-aware and time-sensitive queries reliably. The system performs spatial reasoning for distance-based recommendations and temporal filtering for operating hours. Preliminary evaluation shows 59% superiority over Google Search AI on relevant queries and 84% rejection of irrelevant queries. This demonstration highlights the potential of hybrid architectures that combine LLM flexibility with knowledge graph reliability to improve service accessibility for vulnerable populations effectively.

Keywords: knowledge graph, conversational system, large language models, hallucination mitigation, retrieval-augmented generation, homelessness, community services, spatial reasoning
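
The DreamKG abstract mentions spatial reasoning for distance-based recommendations and temporal filtering for operating hours. The sketch below illustrates those two filters in plain Python, using the haversine formula for distance. It is not the paper's implementation, which is built on a Neo4j knowledge graph: the `Service` record, the `open_nearby` helper, and the sample entries are all made up for illustration.

```python
import math
from dataclasses import dataclass
from datetime import time


@dataclass
class Service:
    name: str
    lat: float
    lon: float
    opens: time
    closes: time


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def open_nearby(services: list[Service], lat: float, lon: float, now: time, max_km: float) -> list[str]:
    """Names of services open at `now` within `max_km`, nearest first.

    Assumes same-day hours (opens < closes); overnight schedules are not handled.
    """
    hits = []
    for s in services:
        if s.opens <= now < s.closes:
            distance = haversine_km(lat, lon, s.lat, s.lon)
            if distance <= max_km:
                hits.append((distance, s.name))
    return [name for _, name in sorted(hits)]


# Made-up records for illustration only; not data from the paper.
services = [
    Service("Shelter A", 39.9526, -75.1652, time(8, 0), time(20, 0)),
    Service("Food Bank B", 39.9600, -75.1700, time(9, 0), time(12, 0)),
    Service("Clinic C", 40.0500, -75.0500, time(0, 0), time(23, 59)),
]
print(open_nearby(services, 39.9530, -75.1650, time(14, 30), max_km=5.0))
```

Because both filters run over verified records rather than generated text, an answer can only name services that actually exist and are open, which is the reliability argument the abstract makes for grounding responses in a knowledge graph.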

12. Efficient Training for Cross-lingual Speech Language Models

Authors: Yan Zhou, Qingkai Fang, Yun Hong, Yang Feng Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11096v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

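Scoring rationale: The paper's core topic is the Cross-lingual Speech Language Model (CSLM), which directly involves LLMs (10 points) and continual pre-training (10 points), and it mentions instruction fine-tuning to strengthen modal alignment (8 points). Other keywords such as MoE, SLMs, SFT, RAG, reasoning methods, agents, and compression are not mentioned in or relevant to the abstract and score 0.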

!!! tip deepseek-chat TL;DR

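The paper proposes CSLM, an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. A novel alignment strategy combined with instruction fine-tuning achieves cross-modal and cross-lingual alignment, improving generation quality and reducing latency.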

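Abstract (translated)

At present, large language models focus mainly on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging because data is limited and extending to more languages is difficult. This paper presents the Cross-lingual Speech Language Model, an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We design a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By performing instruction fine-tuning that follows a speech-text interleaved chain-of-modality generation process, we strengthen modal alignment at a finer granularity, improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without requiring massive speech data, and therefore shows good language scalability. Evaluations on cross-modal tasks, monolingual conversational tasks, and cross-lingual conversational tasks show that CSLM has strong cross-modal alignment and general task abilities. (Code is available at: https://github.com/ictnlp/CSLM)

Abstract (original)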

Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech data, thus exhibiting good language scalability. Evaluations on cross-modal tasks, mono-lingual conversational tasks, and cross-lingual conversational tasks demonstrate CSLM’s strong cross-modal alignment capabilities and general task abilities. (Code is available at: https://github.com/ictnlp/CSLM)

Keywords: speech language models, cross-lingual, discrete speech tokens, alignment strategy, continual pre-training, instruction fine-tuning, modal alignment, language scalability

📋 All papers

1. ✅ OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Authors: Junfu Pu, Yuxin Chen, Teng Wang, Ying Shan Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11102v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

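For long-form video understanding and script generation, the paper proposes OmniScript, a multimodal large language model trained with chain-of-thought supervised fine-tuning followed by reinforcement learning, achieving performance comparable to state-of-the-art proprietary models while remaining parameter-efficient.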

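Abstract (translated)

Current multimodal large language models (MLLMs) have shown remarkable capabilities in short-form video understanding, but turning long-form cinematic videos into detailed, temporally grounded scripts remains a major challenge. This paper introduces the novel video-to-script (V2S) task, which aims to generate hierarchical, scene-by-scene scripts covering character actions, dialogue, expressions, and audio cues. To this end, we construct the first human-annotated benchmark dataset and propose a temporally-aware hierarchical evaluation framework. We also present OmniScript, an 8-billion-parameter omni-modal (audio-visual) language model designed for long-form narrative comprehension. OmniScript is trained with a progressive pipeline: chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning with temporally segmented rewards. Extensive experiments show that, despite its parameter efficiency, OmniScript significantly outperforms larger open-source models in both temporal localization and multi-field semantic accuracy and performs on par with state-of-the-art proprietary models, including Gemini 3-Pro.

Abstract (original)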

Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.

2. ✅ UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

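Scoring rationale: The core of UniToolCall is a unified framework for tool learning in LLM agents, covering toolset construction, dataset generation, and evaluation. It is therefore highly relevant to “Large Language Models”, “LLM Agents”, and “Tool Use” (10 points each), because the paper directly studies the tool-calling ability of LLM agents. It is relevant to “Post-training” (10 points) because the paper improves performance by fine-tuning Qwen3-8B. It has some connection to “Chain of Thought” and “System 2 Thinking” (5 points each) because the paper involves multi-turn reasoning and introduces the Anchor Linkage mechanism to strengthen cross-turn dependencies, which falls under multi-step reasoning. Other keywords such as MoE, quantization, RAG, and alignment are not covered in the paper and score 0.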

!!! tip deepseek-chat TL;DR

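The paper addresses three problems in tool learning for LLM agents: inconsistent interaction representations, neglect of the structural distribution of trajectories, and incompatible evaluation benchmarks. It proposes UniToolCall, a unified framework that builds a large tool pool and a hybrid training corpus and introduces the Anchor Linkage mechanism to strengthen multi-turn reasoning. Experiments show that a model fine-tuned with the framework outperforms commercial models, including GPT, on tool-use performance.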

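Abstract (translated)

Tool-use capability is a core function of LLM agents, allowing them to interact with external systems through structured function calls. However, existing research suffers from inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on mutually incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of more than 22,000 tools and builds a hybrid training corpus of more than 390,000 instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop versus multi-hop and single-turn versus multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce the Anchor Linkage mechanism, which enforces cross-turn dependencies. In addition, we convert 7 public benchmarks into a unified Query–Action–Observation–Answer representation and perform fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves its tool-use performance. Under the distractor-heavy Hybrid-20 setting, the model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.

Abstract (original)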

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query–Action–Observation–Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.

关键词: LLM Agents, Tool Use, Function Calling, Multi-turn Reasoning, Fine-tuning, Unified Framework, Evaluation Benchmark, Anchor Linkage
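
The abstract names a unified Query–Action–Observation–Answer (QAOA) representation with evaluation at the function-call, turn, and conversation levels, but does not give its schema. The sketch below is one possible way to model such a record in Python. Every class name, field name, and tool name here is hypothetical and chosen only for illustration; it is not the UniToolCall format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """A single structured function call (hypothetical schema)."""
    name: str
    arguments: dict[str, Any]


@dataclass
class Step:
    """One hop. Calls inside a step are treated as parallel;
    successive steps within a turn are treated as serial."""
    actions: list[ToolCall]
    observations: list[str]


@dataclass
class Turn:
    """One user query, the action/observation steps taken, and the final answer."""
    query: str
    steps: list[Step] = field(default_factory=list)
    answer: str = ""

    @property
    def is_multi_hop(self) -> bool:
        return len(self.steps) > 1


@dataclass
class Conversation:
    turns: list[Turn]

    @property
    def is_multi_turn(self) -> bool:
        return len(self.turns) > 1


# Usage: a single-turn, multi-hop example whose first hop makes two parallel calls.
# The tool names are made up.
turn = Turn(
    query="Is it warmer in Paris or Rome today, and by how much?",
    steps=[
        Step(
            actions=[
                ToolCall("get_weather", {"city": "Paris"}),
                ToolCall("get_weather", {"city": "Rome"}),
            ],
            observations=["18", "24"],
        ),
        Step(
            actions=[ToolCall("subtract", {"a": 24, "b": 18})],
            observations=["6"],
        ),
    ],
    answer="Rome is warmer by 6 degrees.",
)
conversation = Conversation(turns=[turn])
```

Keeping calls, turns, and conversations as separate objects mirrors the three evaluation levels mentioned in the abstract: a metric can be computed per `ToolCall`, per `Turn`, or per `Conversation` without reshaping the data.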

3. ✅ Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval

Authors: Dzenan Hamzic, Florian Skopik, Max Landauer, Markus Wurzenberger, Andreas Rauber Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11419v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	15.0/10	15.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
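
The total in the score line can be reproduced from the table. The sketch below sums the non-zero rows and compares the result with the passing score from the report's configuration section (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). It assumes the total is simply the sum of the per-keyword score column. Since every weight is 1.0, the table cannot show how weights would enter otherwise. The function names are illustrative, not part of the report generator.

```python
# Non-zero rows of the table above: (keyword group, weight, per-keyword score).
rows = [
    ("Large Language Models", 1.0, 10.0),
    ("Retrieval-Augmented Generation", 1.0, 15.0),
    ("Chain of Thought", 1.0, 5.0),
    ("System 2 Thinking", 1.0, 5.0),
    ("LLM Agents", 1.0, 10.0),
]

TOTAL_WEIGHT = 27.0  # 27 keywords, each with weight 1.0


def passing_score(total_weight: float, base: float = 5.0, per_weight: float = 0.8) -> float:
    """Passing score as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return round(base + per_weight * total_weight, 1)


def total_score(score_rows) -> float:
    """Assumption: the paper's total is the sum of its per-keyword scores."""
    return sum(score for _, _, score in score_rows)


total = total_score(rows)
threshold = passing_score(TOTAL_WEIGHT)
passed = total >= threshold
print(f"{total} / {threshold} -> {'pass' if passed else 'fail'}")
```

The same check applies to each score table in this report by swapping in that paper's non-zero rows.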

!!! tip deepseek-chat TL;DR

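This paper systematically evaluates four RAG architectures for cyber threat intelligence analysis and finds that a hybrid graph-text retrieval approach improves answer quality on multi-hop questions by up to 35% over conventional vector retrieval.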

Abstract (Chinese translation)

网络威胁情报分析师必须基于大量叙述性安全报告回答复杂问题。检索增强生成系统能够帮助语言模型获取外部知识，但传统向量检索在处理需要推理威胁行为体、恶意软件与漏洞等实体间关系的问题时往往表现不佳。这种局限性源于相关证据通常分散在多个文本片段和文档中。知识图谱通过实体与关系的显式表征支持结构化多跳推理，从而应对这一挑战。然而，当前已出现包括基于图谱、智能体驱动及混合方法在内的多种检索范式，这些方法具有不同的前提假设与失效模式。在实际网络威胁情报场景中，这些方法的比较效果以及图谱基座何时能提升性能仍不明确。本研究系统评估了四种用于网络威胁情报分析的检索增强生成架构：标准向量检索、基于网络威胁情报知识图谱的图谱检索、能够修复失败图谱查询的智能体变体，以及结合图谱查询与文本检索的混合方法。我们在涵盖事实查找、多跳关系查询、分析师风格的综合问题及不可回答案例的3,300组网络威胁情报问答对上评估了这些系统。结果表明，图谱基座能提升结构化事实查询的性能。与向量检索增强生成系统相比，混合图谱-文本方法在多跳问题上的回答质量提升高达35%，同时比纯图谱系统保持更可靠的性能表现。

Abstract

Cyber threat intelligence (CTI) analysts must answer complex questions over large collections of narrative security reports. Retrieval-augmented generation (RAG) systems help language models access external knowledge, but traditional vector retrieval often struggles with queries that require reasoning over relationships between entities such as threat actors, malware, and vulnerabilities. This limitation arises because relevant evidence is often distributed across multiple text fragments and documents. Knowledge graphs address this challenge by enabling structured multi-hop reasoning through explicit representations of entities and relationships. However, multiple retrieval paradigms, including graph-based, agentic, and hybrid approaches, have emerged with different assumptions and failure modes. It remains unclear how these approaches compare in realistic CTI settings and when graph grounding improves performance. We present a systematic evaluation of four RAG architectures for CTI analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic variant that repairs failed graph queries, and a hybrid approach combining graph queries with text retrieval. We evaluate these systems on 3,300 CTI question-answer pairs spanning factual lookups, multi-hop relational queries, analyst-style synthesis questions, and unanswerable cases. Results show that graph grounding improves performance on structured factual queries. The hybrid graph-text approach improves answer quality by up to 35 percent on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.

Keywords: Retrieval-Augmented Generation, RAG, Cyber Threat Intelligence, Knowledge Graphs, Multi-hop Reasoning, Agentic Retrieval, Graph-based Retrieval, Hybrid Retrieval
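
To make the "hybrid graph-text" idea concrete, here is a toy sketch that answers a two-hop question by combining a graph traversal with text retrieval. It is not the paper's system. The triples, text chunks, entity names, and the CVE identifier are all made up, and word overlap stands in for vector similarity so that the example runs with the standard library only.

```python
# Toy knowledge graph as (subject, relation, object) triples. All values are fictional.
TRIPLES = [
    ("ActorA", "uses", "MalwareX"),
    ("MalwareX", "exploits", "CVE-0000-0001"),
]

# Toy report fragments. In a real system these would be embedded and searched by vector.
CHUNKS = [
    "ActorA was observed deploying MalwareX against energy firms.",
    "MalwareX gains a foothold by exploiting CVE-0000-0001 in a VPN appliance.",
    "Unrelated report about phishing trends.",
]


def graph_two_hop(start: str, rel1: str, rel2: str) -> list[list[tuple[str, str, str]]]:
    """Follow rel1 from `start`, then rel2 from the intermediate entity."""
    paths = []
    for s1, r1, o1 in TRIPLES:
        if s1 == start and r1 == rel1:
            for s2, r2, o2 in TRIPLES:
                if s2 == o1 and r2 == rel2:
                    paths.append([(s1, r1, o1), (s2, r2, o2)])
    return paths


def text_retrieve(query: str, k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (a stand-in for vector search)."""
    q_words = set(query.lower().split())

    def overlap(chunk: str) -> int:
        return len(q_words & set(chunk.lower().replace(".", "").split()))

    return sorted(CHUNKS, key=overlap, reverse=True)[:k]


def hybrid_context(query: str, start: str, rel1: str, rel2: str) -> dict[str, list[str]]:
    """Combine structured graph facts with retrieved passages into one context."""
    facts = [" ".join(t) for path in graph_two_hop(start, rel1, rel2) for t in path]
    return {"graph_facts": facts, "passages": text_retrieve(query)}


ctx = hybrid_context(
    "Which vulnerability does the malware used by ActorA exploit",
    start="ActorA", rel1="uses", rel2="exploits",
)
```

The graph side supplies the explicit two-hop chain, while the passages supply narrative evidence. If the graph query returns nothing, the passages are still available, which is one plausible reading of why the abstract reports the hybrid approach as more reliable than graph-only systems.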

4. ✅ Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

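This paper targets the perception bottleneck that limits multimodal large language models (MLLMs) in geometric reasoning. It proposes a unified formal language covering plane and solid geometry and a training paradigm that combines supervised fine-tuning with reinforcement learning, achieving state-of-the-art geometry parsing performance and significantly improving MLLMs on downstream geometric reasoning tasks.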

Abstract (Chinese translation)

多模态大语言模型（MLLMs）已取得显著进展，但在几何推理方面仍面临挑战，主要源于对细粒度视觉元素的感知瓶颈。尽管形式化语言已辅助平面几何理解，但需要空间认知的立体几何领域仍很大程度上未被探索。本文通过设计一种统一的形式化语言来应对这一挑战，该语言整合了平面与立体几何，全面覆盖几何结构与语义关系。我们构建了GDP-29K大规模数据集，包含从多样现实来源收集的2万个平面几何样本和9千个立体几何样本，每个样本均配有真实的形式化描述。为确保语法正确性与几何一致性，我们提出了一种结合监督微调与基于可验证奖励的强化学习的训练范式。实验表明，我们的方法实现了最先进的解析性能。此外，我们证明了所解析的形式化描述可作为关键的认知支架，显著增强多模态大语言模型在下游几何推理任务中的能力。我们的数据与代码公开于Geoparsing平台。

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to the perception bottleneck regarding fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry, which requires spatial understanding, remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. To ensure syntactic correctness and geometric consistency, we propose a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs’ capabilities for downstream geometry reasoning tasks. Our data and code are available at Geoparsing.

Keywords: Multimodal Large Language Models, Geometric Reasoning, Formal Language, Supervised Fine-Tuning, Reinforcement Learning, GDP-29K Dataset, Plane and Solid Geometry, Parsing Performance

5. ✅ Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

Authors: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11088v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

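This study presents the first large-scale empirical evaluation of rule files for AI coding agents (e.g., .cursorrules). It finds that rules improve performance by 7 to 14 percentage points through context priming rather than specific instruction, that negative constraints help while positive directives hurt, and that well-intentioned rules can therefore carry a hidden reliability risk.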

Abstract (Chinese translation)

开发者日益倾向于通过自然语言指令文件（如CLAUDE.md、.cursorrules）来指导AI编程助手，但目前尚无受控研究评估这些规则是否真正提升了助手性能，亦未明确何种规则特性能够产生积极影响。我们从GitHub收集了679份此类文件（共25,532条规则），并进行了首次大规模实证评估：在SWE-bench Verified基准上使用前沿的编程助手运行了超过5,000次测试。结果显示，规则能将性能提升7-14个百分点，但随机规则与专家精心设计的规则效果相当——这表明规则主要通过上下文启动而非具体指令发挥作用。负面约束（如“不要重构无关代码”）是唯一具有独立增益效果的规则类型，而正面指令（如“遵循代码风格”）反而会损害性能——我们通过基于势函数的奖励塑形理论对这一现象进行了分析。此外，单条规则在独立使用时大多有害，但集体应用时却产生积极效果，且规则数量增至50条时仍未出现性能衰减。这些发现揭示了一个潜在的可靠性风险：善意的规则往往会降低助手性能，同时为安全配置助手提供了明确原则：应约束助手“不应做什么”，而非规定其“必须做什么”。

Abstract

Developers increasingly guide AI coding agents through natural language instruction files (e.g., CLAUDE.md, .cursorrules), yet no controlled study has measured whether these rules actually improve agent performance or which properties make a rule beneficial. We scrape 679 such files (25,532 rules) from GitHub and conduct the first large-scale empirical evaluation, running over 5,000 agent runs with a state-of-the-art coding agent on SWE-bench Verified. Rules improve performance by 7–14 percentage points, but random rules help as much as expert-curated ones – suggesting rules work through context priming rather than specific instruction. Negative constraints (“do not refactor unrelated code”) are the only individually beneficial rule type, while positive directives (“follow code style”) actively hurt – a pattern we analyze through the lens of potential-based reward shaping (PBRS). Moreover, individual rules are mostly harmful in isolation yet collectively helpful, with no degradation up to 50 rules. These findings expose a hidden reliability risk – well-intentioned rules routinely degrade agent performance – and provide a clear principle for safe agent configuration: constrain what agents must not do, rather than prescribing what they should.

Keywords: AI coding agents, natural language rules, context priming, negative constraints, potential-based reward shaping, agent performance, SWE-bench, reliability risk
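
The abstract analyzes the constraint-versus-directive pattern through potential-based reward shaping (PBRS) without giving the formulation. For orientation, the standard definition from the reinforcement learning literature is shown below. How the paper maps rule types onto a potential function is not stated in the abstract, so nothing here should be read as the authors' construction.

$$
R'(s, a, s') = R(s, a, s') + F(s, a, s'), \qquad F(s, a, s') = \gamma\,\Phi(s') - \Phi(s)
$$

Here $R$ is the original reward, $\Phi$ is a real-valued potential over states, and $\gamma$ is the discount factor. A shaping term of this form changes how quickly a policy is learned but leaves the set of optimal policies unchanged, which is why it is a natural lens for asking whether an added rule merely guides an agent or distorts what it optimizes.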

6. ✅ METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Score: 34.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

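This paper proposes the METER benchmark to systematically evaluate the multi-level causal reasoning of large language models under a unified context setting. It finds that model proficiency declines significantly as tasks ascend the causal hierarchy, and its mechanistic analysis reveals two primary failure modes.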

Abstract (Chinese translation)

情境因果推理是大语言模型（LLM）一项关键而具有挑战性的能力。然而，现有基准测试通常在碎片化的场景中评估此项技能，未能确保情境一致性或覆盖完整的因果层级。为解决此问题，我们开创性地提出了METER，在一个统一的情境设置下，系统性地对大语言模型在因果阶梯的所有三个层级上进行基准测试。我们对多种大语言模型的广泛评估表明，随着任务在因果阶梯上攀升，模型的能力显著下降。为诊断这种性能退化，我们通过错误模式识别和内部信息流追踪进行了深入的机制分析。我们的分析揭示了两种主要的失败模式：（1）在较低的因果层级上，大语言模型容易受到因果无关但事实正确的信息干扰；（2）随着任务在因果阶梯上攀升，模型对给定情境的忠实度下降，导致性能降低。我们相信，我们的工作推进了我们对大语言模型情境因果推理背后机制的理解，并为未来研究奠定了关键基础。我们的代码和数据集可在 https://github.com/SCUNLP/METER 获取。

Abstract

Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower levels of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to reduced performance. We believe our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER.
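
The abstract refers to "all three levels of the causal ladder" without listing them. Assuming it means Pearl's ladder of causation, which is the usual referent of that phrase, the three levels correspond to three kinds of query of increasing difficulty:

$$
\begin{aligned}
\text{Level 1 (association):} \quad & P(y \mid x) \\
\text{Level 2 (intervention):} \quad & P\big(y \mid \mathrm{do}(x)\big) \\
\text{Level 3 (counterfactual):} \quad & P\big(y_{x} \mid x', y'\big)
\end{aligned}
$$

Level 1 asks what is observed together, Level 2 asks what happens if $x$ is set by intervention, and Level 3 asks what $y$ would have been under $x$ given that $x'$ and $y'$ actually occurred. The reported decline in proficiency "as tasks ascend the causal hierarchy" then reads as a drop from Level 1 to Level 3.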

7. ✅ CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

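For tabular reasoning, this paper proposes CFMS, a two-stage multimodal synthesis framework that moves from coarse-grained visual perception to fine-grained symbolic reasoning. It achieves competitive accuracy on the WikiTQ and TabFact benchmarks and is shown to be robust on large tables and with smaller backbone models.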

Abstract (Chinese translation)

基于表格数据的推理是问答与事实核查等任务的关键能力，这要求模型能够同时理解自由形式的自然语言问题与半结构化表格。然而，尽管思维链等方法引入了推理链条，纯符号化方法本质上受限于其对整体视觉模式的盲视性。为解决这一问题，我们提出了从粗到精的多模态合成框架，这是一种新颖的两阶段范式，能够将高层视觉感知与细粒度符号推理进行层次化解耦。在粗粒度阶段，CFMS利用多模态大语言模型一次性合成一个多视角知识元组。该元组随后作为动态推理图谱来指导精粒度阶段，在此阶段中，一个符号化引擎在表格上执行一系列有针对性且高效的迭代操作。在WikiTQ和TabFact基准测试上的大量实验表明，CFMS取得了具有竞争力的准确率。该框架在处理大型表格以及采用较小骨干模型实例化时表现出特别的鲁棒性，验证了其有效性与泛化能力。

Abstract

Reasoning over tabular data is a crucial capability for tasks like question answering and fact verification, as it requires models to comprehend both free-form questions and semi-structured tables. However, while methods like Chain-of-Thought (CoT) introduce reasoning chains, purely symbolic methods are inherently limited by their blindness to holistic visual patterns. To address this, we propose the Coarse-to-Fine Multimodal Synthesis framework (CFMS), a novel two-stage paradigm that hierarchically decouples high-level visual perception from granular symbolic reasoning. In the Coarse Stage, CFMS leverages Multimodal Large Language Models (MLLMs) to perform a one-time synthesis of a multi-perspective knowledge tuple. This tuple subsequently serves as a dynamic reasoning map to guide the Fine Stage, where a symbolic engine executes a targeted and efficient sequence of iterative operations over the table. Extensive experiments on the WikiTQ and TabFact benchmarks demonstrate that CFMS achieves competitive accuracy. The framework exhibits particular robustness when handling large tables and when instantiated with smaller backbone models, validating its effectiveness and generalizability.

Keywords: Tabular Reasoning, Multimodal Synthesis, Chain-of-Thought, Multimodal Large Language Models, Coarse-to-Fine Framework, Symbolic Reasoning, Visual Perception, Knowledge Tuple

8. ✅ CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

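This paper proposes CLSGen, a dual-head fine-tuning framework that addresses unreliable probability estimates and the loss of explanation ability when LLMs are fine-tuned for binary classification. Experiments show that it outperforms baselines on classification metrics (AUROC and F1-score) while producing readable explanations that align with the predicted labels.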

Abstract (Chinese translation)

随着大语言模型（LLM）的最新进展，学界对其应用于解决复杂挑战性问题的兴趣日益增长。现代大语言模型能够处理长上下文并生成语言化解释，在应对现实世界应用方面展现出巨大潜力。然而，在将大语言模型部署于实际决策时，一个关键障碍在于其无法提供可靠的定量概率。虽然使用传统判别式目标（类似于仅编码器模型）对大语言模型进行任务特定的微调可以获得概率估计，但这通常会导致灾难性遗忘和语言能力崩溃。其结果是模型丧失生成解释的能力，严重损害了其可解释性与可用性。为应对这一挑战，我们提出了CLSGen——一种专为二元分类任务设计的新型大语言模型微调框架。该框架包含新颖的模型架构、训练方法和数据构建策略，旨在实现稳健的概率估计，同时不牺牲模型固有的解释生成能力。在多个基准数据集上的实验结果表明，经CLSGen微调的模型在分类指标（AUROC和F1分数）上优于现有基线方法。在解释性方面，结果显示预测标签与生成的理由之间具有高度一致性，同时文本可读性表现优异。

Abstract

With the recent progress of Large Language Models (LLMs), there is a growing interest in applying these models to solve complex and challenging problems. Modern LLMs, capable of processing long contexts and generating verbalized explanations, offer significant potential in addressing real-world applications. However, a critical hurdle in deploying LLMs for practical decision-making is their inability to provide reliable, quantitative probabilities. While task-specific fine-tuning of LLMs using traditional discriminative objectives (similar to encoder-only models) can yield probability estimates, this often leads to catastrophic forgetting and linguistic collapse. Consequently, the model loses its ability to generate explanations, severely undermining its interpretability and usability. To address this challenge, we propose CLSGen, a novel LLM fine-tuning framework designed for binary classification tasks. The CLSGen framework encompasses a new model architecture, training methodology, and data construction strategy to enable robust probability estimation without sacrificing the model’s inherent explanation-generation capabilities. Experimental results across multiple benchmark datasets demonstrate that models fine-tuned with CLSGen outperform existing baselines in classification metrics (AUROC and F1-score). Regarding explanation, the results showed strong alignment between predicted labels and generated justifications, as well as high readability.

Keywords: Large Language Models, fine-tuning, binary classification, probability estimation, explanation generation, interpretability, CLSGen, dual-head architecture
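
The abstract does not spell out the CLSGen training objective. A common way to train a model with both a classification head and a generation head is a weighted sum of a binary cross-entropy term and a token-level language-modeling term. The sketch below shows that generic multi-task formulation in plain Python. It is an assumption made for illustration, not the paper's loss, and the function names and `cls_weight` parameter are hypothetical.

```python
import math

EPS = 1e-12  # clamp to avoid log(0)


def token_nll(token_probs: list[float]) -> float:
    """Mean negative log-likelihood the generation head assigns to the
    reference explanation tokens."""
    return -sum(math.log(max(p, EPS)) for p in token_probs) / len(token_probs)


def binary_cross_entropy(p_positive: float, label: int) -> float:
    """BCE between the classification head's positive-class probability and a 0/1 label."""
    p = min(max(p_positive, EPS), 1.0 - EPS)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def joint_loss(token_probs: list[float], p_positive: float, label: int,
               cls_weight: float = 1.0) -> float:
    """Generic dual-head objective: explanation NLL plus weighted classification BCE."""
    return token_nll(token_probs) + cls_weight * binary_cross_entropy(p_positive, label)


# Usage: same explanation quality, better-calibrated probability for a positive example.
confident = joint_loss([0.9, 0.8], p_positive=0.9, label=1)
hesitant = joint_loss([0.9, 0.8], p_positive=0.6, label=1)
```

Keeping the language-modeling term in the objective is one way to counter the "catastrophic forgetting and linguistic collapse" that the abstract attributes to purely discriminative fine-tuning, since the model is still rewarded for producing the explanation text.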

9. ✅ Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks

Authors: Yuqing Yang, Tengxiao Liu, Wang Bill Zhu, Taiwei Shi, Linxin Song, Robin Jia Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11610v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

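This paper studies how LLM assistants can effectively extract and retain conversational memory across heterogeneous tasks. It proposes CluE, a cluster-based self-evolving strategy that generalizes better than existing methods on the BEHEMOTH benchmark.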

Abstract (Chinese translation)

随着基于大语言模型（LLM）的助手日益具备持久性与个性化能力，其必须从历史对话中提取并保留有用信息作为记忆。然而，值得记忆的信息类型在不同任务间差异显著。我们形式化了异构记忆提取任务，并提出了BEHEMOTH基准。该基准通过下游效用驱动指标，重新整合了涵盖个性化、问题解决与智能体任务的18个现有数据集，以进行系统性评估。我们的实证分析证实：不存在单一静态提取提示能在所有任务类别中均占主导地位；且现有自演化提示优化框架（最初为同质分布设计）在训练任务异构时性能会下降。为此，我们提出CluE——一种基于聚类的自演化策略。该方法按提取场景将训练样本分组为聚类，独立分析每个聚类，并综合跨聚类洞察以更新提取提示。在BEHEMOTH上的实验表明，CluE能有效泛化至异构任务（相对增益提升+9.04%），持续优于先前的自演化框架。

Abstract

As LLM-based assistants become persistent and personalized, they must extract and retain useful information from past conversations as memory. However, the types of information worth remembering vary considerably across tasks. We formalize the heterogeneous memory extraction task and introduce BEHEMOTH, a benchmark that repurposes 18 existing datasets spanning personalization, problem-solving, and agentic tasks, using a downstream utility-driven metric for systematic evaluation. Our empirical analysis confirms that no single static extraction prompt dominates across all task categories, and that existing self-evolving prompt optimization frameworks, originally designed for homogeneous distributions, degrade when training tasks are heterogeneous. To address this, we propose CluE, a cluster-based self-evolving strategy that groups training examples into clusters by extraction scenarios, analyzes each cluster independently, and synthesizes cross-cluster insights to update the extraction prompt. Experiments on BEHEMOTH show that CluE generalizes effectively across heterogeneous tasks (+9.04% relative gain), consistently outperforming prior self-evolving frameworks.

Keywords: LLM memory extraction, heterogeneous tasks, self-evolving strategy, BEHEMOTH benchmark, CluE, agentic tasks, prompt optimization, downstream utility

10. ✅ ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

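This paper proposes Compositional Simulation, a hybrid approach that combines classical and neural simulation to generate large-scale, high-quality robot training data, reducing the sim2real domain gap and raising the success rate of real-world policy models.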

Abstract (Chinese translation)

近期，基础模型（如大语言模型与世界模型）的进展显著提升了机器人学的能力，使机器人能够自主执行复杂任务。然而，获取大规模、高质量的机器人训练数据仍面临挑战，因为这通常需要大量人工投入，且难以覆盖多样化的真实世界环境。为解决这一问题，我们提出了一种名为“组合式仿真”的新型混合方法，该方法结合了经典仿真与神经仿真，在保持真实世界一致性的同时生成精确的动作-视频对。我们的方法采用闭环式“真实-仿真-真实”数据增强流程，利用少量真实世界数据生成覆盖更广泛真实场景的多样化、大规模训练数据集。我们训练了一个神经仿真器，将经典仿真视频转换为真实世界表征，从而提升在真实环境中训练的策略模型的准确性。通过大量实验，我们证明该方法显著缩小了仿真到真实的领域差距，使真实世界策略模型训练获得更高的成功率。我们的研究为生成鲁棒训练数据、弥合仿真与真实机器人学之间的差距提供了一种可扩展的解决方案。

Abstract

Recent advancements in foundational models, such as large language models and world models, have greatly enhanced the capabilities of robotics, enabling robots to autonomously perform complex tasks. However, acquiring large-scale, high-quality training data for robotics remains a challenge, as it often requires substantial manual effort and is limited in its coverage of diverse real-world environments. To address this, we propose a novel hybrid approach called Compositional Simulation, which combines classical simulation and neural simulation to generate accurate action-video pairs while maintaining real-world consistency. Our approach utilizes a closed-loop real-sim-real data augmentation pipeline, leveraging a small amount of real-world data to generate diverse, large-scale training datasets that cover a broader spectrum of real-world scenarios. We train a neural simulator to transform classical simulation videos into real-world representations, improving the accuracy of policy models trained in real-world environments. Through extensive experiments, we demonstrate that our method significantly reduces the sim2real domain gap, resulting in higher success rates in real-world policy model training. Our approach offers a scalable solution for generating robust training data and bridging the gap between simulated and real-world robotics.

Keywords: Compositional Simulation, robotics, data generation, neural simulator, sim2real, policy training, real-world consistency, closed-loop pipeline

11. ✅ DreamKG: A KG-Augmented Conversational System for People Experiencing Homelessness

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

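To address the difficulty people experiencing homelessness have in obtaining accurate information about community services, this work builds DreamKG, a conversational system that combines a knowledge graph with large language models. Retrieval-augmented generation grounded in verified data reduces hallucinations, and a preliminary evaluation shows better performance than Google Search AI.

Abstract (translated)

People experiencing homelessness (PEH) face significant barriers to obtaining timely, accurate information about community services. DreamKG addresses this with a knowledge-graph-augmented conversational system that grounds its answers in verified, up-to-date data about Philadelphia organizations, services, locations, and opening hours. Unlike standard large language models (LLMs), which are prone to hallucination, DreamKG combines a Neo4j knowledge graph with structured query understanding to handle location-aware and time-sensitive queries reliably. The system performs spatial reasoning to give distance-based recommendations and temporal filtering to account for operating hours. A preliminary evaluation shows a 59% advantage over Google Search AI on relevant queries and an 84% rejection rate for irrelevant queries. This case demonstrates the potential of hybrid architectures that pair LLM flexibility with knowledge-graph reliability to improve access to services for vulnerable populations.

Abstract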

People experiencing homelessness (PEH) face substantial barriers to accessing timely, accurate information about community services. DreamKG addresses this through a knowledge graph-augmented conversational system that grounds responses in verified, up-to-date data about Philadelphia organizations, services, locations, and hours. Unlike standard large language models (LLMs) prone to hallucinations, DreamKG combines Neo4j knowledge graphs with structured query understanding to handle location-aware and time-sensitive queries reliably. The system performs spatial reasoning for distance-based recommendations and temporal filtering for operating hours. Preliminary evaluation shows 59% superiority over Google Search AI on relevant queries and 84% rejection of irrelevant queries. This demonstration highlights the potential of hybrid architectures that combine LLM flexibility with knowledge graph reliability to improve service accessibility for vulnerable populations effectively.

Keywords: knowledge graph, conversational system, large language models, hallucination mitigation, retrieval-augmented generation, homelessness, community services, spatial reasoning
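
The location-aware and time-sensitive filtering that the DreamKG abstract describes can be illustrated with a small, self-contained Python sketch. It does not reproduce DreamKG's Neo4j queries: the `Service` record, the sample entries, and the coordinates are hypothetical, and great-circle (haversine) distance is an assumed stand-in for whatever spatial reasoning the system actually performs. Opening hours that wrap past midnight are not handled.

```python
import math
from dataclasses import dataclass
from datetime import time


@dataclass
class Service:
    name: str
    lat: float
    lon: float
    opens: time
    closes: time


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def open_nearby(services, lat, lon, now, max_km):
    """Names of services open at `now` and within `max_km`, nearest first."""
    hits = [
        (haversine_km(lat, lon, s.lat, s.lon), s)
        for s in services
        if s.opens <= now < s.closes  # temporal filter (same-day hours only)
    ]
    return [s.name for d, s in sorted(hits, key=lambda h: h[0]) if d <= max_km]


# Hypothetical records, not real DreamKG data.
services = [
    Service("Shelter A", 39.9526, -75.1652, time(8, 0), time(20, 0)),
    Service("Food Bank B", 39.9600, -75.1700, time(9, 0), time(12, 0)),
    Service("Clinic C", 40.0500, -75.0500, time(0, 0), time(23, 59)),
]

print(open_nearby(services, 39.9530, -75.1650, time(14, 30), max_km=5.0))
```

Filtering on verified records before any text is generated is what lets a system of this kind avoid inventing opening hours or addresses; the language model would only phrase the answer.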

12. ✅ Efficient Training for Cross-lingual Speech Language Models

Authors: Yan Zhou, Qingkai Fang, Yun Hong, Yang Feng | Journal/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11096v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

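The paper proposes CSLM, an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. A novel alignment strategy and instruction fine-tuning achieve cross-modal and cross-lingual alignment, improving generation quality and reducing latency.

Abstract (translated)

Large language models currently focus mainly on the text modality. Speech LLMs are emerging to enable more natural human-AI interaction, but building effective end-to-end speech LLMs remains challenging because data is limited and extending to more languages is difficult. This paper presents the Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We design a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By performing instruction fine-tuning that follows a speech-text interleaved chain-of-modality generation process, we strengthen modal alignment at a finer granularity, which improves generation quality and lowers latency. CSLM aligns different modalities and languages simultaneously without requiring massive speech data, and therefore shows good language scalability. Evaluations on cross-modal tasks, monolingual conversational tasks, and cross-lingual conversational tasks show that CSLM has strong cross-modal alignment and general task abilities. (Code is available at: https://github.com/ictnlp/CSLM)

Abstract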

Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech data, thus exhibiting good language scalability. Evaluations on cross-modal tasks, mono-lingual conversational tasks, and cross-lingual conversational tasks demonstrate CSLM’s strong cross-modal alignment capabilities and general task abilities. (Code is available at: https://github.com/ictnlp/CSLM)

Keywords: speech language models, cross-lingual, discrete speech tokens, alignment strategy, continual pre-training, instruction fine-tuning, modal alignment, language scalability

13. ❌ Dynamic Summary Generation for Interpretable Multimodal Depression Detection

Authors: Shiyu Teng, Jiaqing Liu, Hao Sun, Yu Li, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-Wei Chen | Journal/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11334v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

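Scoring rationale: The paper explicitly uses LLMs for depression detection, an application of large models in a scientific (medical) domain, so "Large Language Models" scores 10. Its emphasis on interpretable detection and transparent reasoning relates to "Mechanistic Interpretability", which scores 8. Depression detection is a biomedical AI application and relates to "AI for Science", which scores 8. Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and are unrelated to the paper, so they score 0.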
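
The totals in these tables follow from simple arithmetic: each keyword contributes weight × relevance, and the sum is compared with the pass threshold. The sketch below assumes the threshold formula stated in the report's configuration section (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The function names are illustrative and not taken from the report generator, and treating a score equal to the threshold as a pass is an assumption.

```python
def pass_threshold(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass threshold as stated in the report configuration."""
    return base + factor * total_weight


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum weight * relevance over (weight, relevance) rows."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero rows for paper 13; the other 24 keywords contribute 0.0.
paper_13 = [(1.0, 10.0), (1.0, 8.0), (1.0, 8.0)]

threshold = pass_threshold(total_weight=27.0)
score = total_score(paper_13)
passed = score >= threshold  # assumption: ties count as a pass
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With three matching keywords at most, a paper needs near-maximal relevance on each to clear 26.6, which is why entries with only two strong matches land at 20.0.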

!!! tip deepseek-chat TL;DR

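The study proposes a framework for multimodal depression detection that uses large language models (LLMs) to generate progressively richer clinical summaries guiding multimodal feature fusion, outperforming existing baselines in both accuracy and interpretability.

Abstract (translated)

Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that uses large language models (LLMs) for accurate and interpretable detection. The pipeline performs binary screening, five-level severity classification, and continuous regression in sequence. At each stage, the LLM generates progressively richer clinical summaries that guide a multimodal fusion module combining text, audio, and video features, producing predictions accompanied by a transparent rationale. The system then consolidates all summaries into a concise, easy-to-read assessment report. Experiments on the E-DAIC and CMDC datasets show that the system significantly outperforms existing state-of-the-art baselines in both accuracy and interpretability.

Abstract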

Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that leverages large language models (LLMs) for accurate and interpretable detection. The pipeline performs binary screening, five-class severity classification, and continuous regression. At each stage, an LLM produces progressively richer clinical summaries that guide a multimodal fusion module integrating text, audio, and video features, yielding predictions with transparent rationale. The system then consolidates all summaries into a concise, human-readable assessment report. Experiments on the E-DAIC and CMDC datasets show significant improvements over state-of-the-art baselines in both accuracy and interpretability.

Keywords: Large Language Models, Multimodal Depression Detection, Interpretable Detection, Clinical Summaries, Multimodal Fusion, E-DAIC Dataset, CMDC Dataset, Assessment Report

14. ❌ Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

Authors: Zhuolun Dong, Junyu Cao | Journal/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11001v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	8.0/10	8.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

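Scoring rationale: The paper focuses on optimizing LLM inference systems, in particular on KV cache memory overflow and system instability. It is therefore highly relevant to "Large Language Models" (10), since it studies LLM inference directly. It relates to "KV Cache Compression" (8) because it concerns optimizing KV cache utilization, and to "Speculative Decoding" (8) because it targets inference acceleration and latency. Other keywords, such as model training, alignment, and scientific applications, are unrelated to the paper and score 0.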

!!! tip deepseek-chat TL;DR

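The paper proposes a flow-control scheduling framework to address the KV cache memory overflow and system instability caused by unknown decode lengths in LLM inference. Experiments show higher throughput, lower latency, and more stable cache utilization.

Abstract (translated)

Large language models (LLMs) have been widely adopted thanks to their excellent performance across a broad range of applications. ChatGPT and Gemini now serve hundreds of millions of active users and process billions of user requests per day, which puts the optimization of LLM inference in the spotlight. A key challenge in LLM inference is that decode lengths are unknown. Each request's memory usage grows with the number of generated tokens, which can cause overflow and system instability. To address this, we propose a simple flow-control framework that controls the rate at which prompts join the active set. We derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which our algorithm is guaranteed to be stable. Experiments show that, compared with strategies commonly used in practice, our method achieves higher token and request throughput, lower average and tail latency, and more stable KV cache utilization.

Abstract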

Large language models (LLMs) have been widely adopted due to their great performance across a wide range of applications. ChatGPT and Gemini now serve hundreds of millions of active users and handle billions of user requests per day, which puts optimizing LLM inference into the spotlight. A key challenge in LLM inference is that decode lengths are unknown. The memory usage for each request grows with generated tokens, which may lead to overflow and cause system instability. To address this concern, we propose a simple flow-control framework that controls the rate at which prompts join the active set. We derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which our algorithm provably achieves stability. Experiments show that, compared to commonly used strategies in practice, our approach achieves higher token and request throughput, lower average and tail latency, and more stable KV cache utilization.

Keywords: LLM inference, flow control, KV cache, system stability, throughput, latency, scheduling
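
The idea of controlling the rate at which prompts join the active set can be made concrete with a toy simulation. This is not the authors' algorithm and carries none of their stability guarantees: the gating rule (`admit_per_step` plus a fixed `headroom`), the workload, and all names are assumptions chosen for illustration. The sketch only measures peak cache occupancy; it does not evict or preempt requests.

```python
from collections import deque


def simulate(prompts, capacity, admit_per_step, headroom):
    """Toy decode loop with gated admission; returns (steps, peak_cache_tokens).

    prompts: list of (prompt_len, decode_len). decode_len is unknown to the
    scheduler and is used only to decide when a request finishes.
    """
    waiting = deque(prompts)
    active = []  # each entry: [tokens_in_cache, remaining_decode_tokens]
    steps = peak = 0
    while waiting or active:
        used = sum(r[0] for r in active)
        admitted = 0
        # Admit at most `admit_per_step` prompts, and only while `headroom`
        # tokens of cache would stay free after admission.
        while waiting and admitted < admit_per_step:
            prompt_len, decode_len = waiting[0]
            if used + prompt_len > capacity - headroom:
                break
            waiting.popleft()
            active.append([prompt_len, decode_len])
            used += prompt_len
            admitted += 1
        if not active:
            raise ValueError("next prompt can never fit in the cache")
        # Every active request decodes one token, growing its cache entry.
        for r in active:
            r[0] += 1
            r[1] -= 1
        peak = max(peak, sum(r[0] for r in active))
        active = [r for r in active if r[1] > 0]
        steps += 1
    return steps, peak


workload = [(4, 3)] * 4  # hypothetical (prompt_len, decode_len) pairs
gated = simulate(workload, capacity=20, admit_per_step=1, headroom=4)
ungated = simulate(workload, capacity=20, admit_per_step=4, headroom=0)
print("gated   (steps, peak):", gated)
print("ungated (steps, peak):", ungated)
```

On this workload the gated run takes more steps but keeps its peak within the cache capacity, while admitting everything at once lets decode growth push the peak past it. That is the overflow risk the abstract attributes to unknown decode lengths.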

15. ❌ Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems

Authors: Deeksha Prahlad, Daniel Fan, Hokeun Kim | Journal/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11705v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

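Scoring rationale: The paper explicitly describes using foundation models (including LLMs) to build agentic AI in human-in-the-loop cyber-physical systems (HITL CPS), so it is highly relevant to "Large Language Models" and "LLM Agents" (10 each). It focuses on nondeterminism in such systems and does not cover the technical details or applications behind the other keywords, which therefore score 0.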

!!! tip deepseek-chat TL;DR

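The paper targets the nondeterminism caused by unpredictable AI agent behavior in human-in-the-loop cyber-physical systems built on foundation models such as LLMs. It proposes a solution based on the reactor model of computation, implemented with the open-source Lingua Franca framework, and evaluates it in a driving coach case study.

Abstract (translated)

Foundation models, including large language models (LLMs), are increasingly used in human-in-the-loop (HITL) cyber-physical systems (CPS), because AI agents built on foundation models can potentially interact with both the physical environment and human users. However, the unpredictable behavior of human users and AI agents, together with dynamically changing physical environments, leads to uncontrollable nondeterminism. To address this urgent challenge in enabling agentic AI-powered HITL CPS, we propose an approach based on the reactor model of computation (MoC), realized with the open-source Lingua Franca (LF) framework. We also carry out a concrete case study using an agentic driving coach as an HITL CPS application. By evaluating the LF-based agentic HITL CPS, we identify the practical challenges of reintroducing determinism into such systems and present pathways to address them.

Abstract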

Foundation models, including large language models (LLMs), are increasingly used for human-in-the-loop (HITL) cyber-physical systems (CPS) because foundation model-based AI agents can potentially interact with both the physical environments and human users. However, the unpredictable behavior of human users and AI agents, in addition to the dynamically changing physical environments, leads to uncontrollable nondeterminism. To address this urgent challenge of enabling agentic AI-powered HITL CPS, we propose a reactor-model-of-computation (MoC)-based approach, realized by the open-source Lingua Franca (LF) framework. We also carry out a concrete case study using the agentic driving coach as an application of HITL CPS. By evaluating the LF-based agentic HITL CPS, we identify practical challenges in reintroducing determinism into such agentic HITL CPS and present pathways to address them.

Keywords: Foundation models, Large language models, AI agents, Human-in-the-loop, Cyber-physical systems, Determinism, Lingua Franca, Driving coach

16. ❌ Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Authors: Jiashu Yao, Heyan Huang, Chuwei Luo, Daiqing Wu, Zeming Liu, Yuhang Guo, Yangyang Kang | Journal/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11510v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

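Scoring rationale: The paper's core topic is reinforcement learning (RL) for LLMs: it proposes the Policy Split paradigm, which encourages diverse exploration through dual-mode entropy regularization. It is therefore highly relevant to "Large Language Models" and to RL-related keywords such as "RLHF" (10 each), since RLHF is one of the mainstream RL methods for LLMs and the paper's RL framework relates to it directly. Other keywords, such as MoE, SFT, RAG, and inference acceleration, are not covered and score 0.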

!!! tip deepseek-chat TL;DR

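The paper addresses the balance between exploration and accuracy in reinforcement learning for large language models (LLMs) and proposes Policy Split, a new paradigm with dual-mode entropy regularization. Experiments show that it outperforms existing baselines across a range of tasks and model sizes and effectively promotes diverse exploration.

Abstract (translated)

To promote diverse exploration in reinforcement learning (RL) for large language models (LLMs) while preserving accuracy, we propose Policy Split, a new paradigm that uses a high-entropy prompt to split the policy into a normal mode and a high-entropy mode. The two modes share model parameters but undergo collaborative dual-mode entropy regularization tailored to different objectives. Specifically, the normal mode optimizes for task correctness, the high-entropy mode incorporates a preference for exploration, and the two learn collaboratively. Extensive experiments show that, on both general and creative tasks, our method consistently outperforms existing entropy-guided RL baselines across model sizes. Further analysis shows that Policy Split promotes dual-mode exploration: the high-entropy mode produces behavioral patterns distinct from those of the normal mode and thereby provides unique learning signals.

Abstract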

To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates distinct behavioral patterns to the normal mode, providing unique learning signals.

Keywords: Large Language Models, Reinforcement Learning, Policy Split, Dual-mode Entropy Regularization, Exploration, High-entropy Mode, LLM RL, Entropy-guided RL

17. ❌ When Meaning Isn’t Literal: Exploring Idiomatic Meaning Across Languages and Modalities

Authors: Sarmistha Das, Shreyas Guha, Suvrayan Bandyopadhyay, Salisa Phosit, Kitsuchart Pasupa, Sriparna Saha | Journal/Source: arXiv | Published: 2026-04-12 | arXiv link: http://arxiv.org/abs/2604.10787v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

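Scoring rationale: The paper's core topic is the limitations of large language models (LLMs) in understanding multilingual, multimodal idioms, and it proposes a new dataset and framework to improve on them. It is therefore highly relevant to "Large Language Models" (10). It involves improving the reasoning process (the HIDE framework uses iterative reasoning refinement), which relates to some extent to "Chain of Thought" and "System 2 Thinking" (5 each). Other keywords, such as MoE, SLMs, training techniques, optimization methods, and agent systems, are not mentioned in the abstract and score 0.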

!!! tip deepseek-chat TL;DR

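The paper studies the systematic failures of current large language models and vision-language models in understanding multilingual, multimodal idioms. By creating the Mediom dataset and proposing the HIDE framework, it establishes a test bed and methodology for culturally grounded, multimodal idiom understanding in next-generation AI systems.

Abstract (translated)

Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress leans toward surface-level lexical and semantic cues. Consider the Bengali idiom আঙ্গুর ফল টক (angur fol tok, "grapes are sour"): it encodes denial-driven rationalization, yet naive models fixate on the literal fox-and-grapes imagery. To address this oversight, we present "Mediom", a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with a high-quality explanation, cross-lingual translations, and carefully aligned text-image representations. We benchmark large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, revealing systematic deficiencies in metaphor comprehension. To close these gaps, we propose "HIDE" (a Hinting-based Idiom Explanation framework), which uses error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Together, Mediom and HIDE establish a rigorous test bed and methodology for next-generation AI systems, aiming at culturally grounded, multimodal idiom understanding with embedded reasoning hints.

Abstract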

Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress skews toward surface-level lexical and semantic cues. For instance, the Bengali idiom আঙ্গুর ফল টক (angur fol tok, "grapes are sour"): it encodes denial-driven rationalization, yet naive models latch onto the literal fox-and-grape imagery. Addressing this oversight, we present "Mediom," a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with gold-standard explanations, cross-lingual translations, and carefully aligned text–image representations. We benchmark both large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, exposing systematic failures in metaphor comprehension. To mitigate these gaps, we propose "HIDE," a Hinting-based Idiom Explanation framework that leverages error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Collectively, Mediom and HIDE establish a rigorous test bed and methodology for culturally grounded, multimodal idiom understanding embedded with reasoning hints in next-generation AI systems.

Keywords: idiomatic reasoning, multilingual idiom corpus, large language models, vision-language models, metaphor comprehension, HIDE framework, multimodal understanding, cultural grounding

18. ❌ Generating Multiple-Choice Knowledge Questions with Interpretable Difficulty Estimation using Knowledge Graphs and Large Language Models

Authors: Mehmet Can Şakiroğlu, H. Altay Güvenir, Kamer Kaya | Journal/Source: arXiv | Published: 2026-04-12 | arXiv link: http://arxiv.org/abs/2604.10748v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

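Scoring rationale: The paper's core is the application of large language models (LLMs) to automatic multiple-choice question generation in education, so it is highly relevant to "Large Language Models" (10). The method involves knowledge graphs and difficulty estimation and relates to some extent to "Explainable AI" (5), because the difficulty estimation is described as interpretable. The work is an application of AI in education and relates broadly to "AI for Science" (5), although it is not bioinformatics or cheminformatics in the strict sense. The other keywords mainly concern the technical principles of large models, training methods, inference optimization, and agent systems, none of which the paper covers, so they score 0.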

!!! tip deepseek-chat TL;DR

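The study proposes a method that combines knowledge graphs and large language models to automatically generate multiple-choice questions with interpretable difficulty estimates from input documents. Experiments show that it produces high-quality questions whose difficulty estimates match human perception.

Abstract (translated)

Automatically generating multiple-choice questions (MCQs) with difficulty estimates remains a challenge in adaptive, AI-assisted education. This study proposes a novel method that uses knowledge graphs (KGs) and large language models (LLMs) to generate MCQs with difficulty estimates from input documents. Our method uses an LLM to build a knowledge graph from the input documents and then systematically generates MCQs from that graph. Each MCQ is generated by selecting a node from the knowledge graph as the key, sampling a related triple or quintuple (optionally augmented with one extra triple), and prompting the LLM to generate the corresponding stem from these graph components. Distractors are selected from the knowledge graph. For each MCQ we compute nine difficulty signals and combine them into a unified difficulty score with a data-driven approach. Experimental results show that the method generates high-quality MCQs whose difficulty estimates are interpretable and consistent with human perception. By combining structured knowledge representations with LLMs and a data-driven difficulty estimation model, this work advances automated MCQ generation.

Abstract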

Generating multiple-choice questions (MCQs) with difficulty estimation remains challenging in automated MCQ-generation systems used in adaptive, AI-assisted education. This study proposes a novel methodology for generating MCQs with difficulty estimation from the input documents by utilizing knowledge graphs (KGs) and large language models (LLMs). Our approach uses an LLM to construct a KG from input documents, from which MCQs are then systematically generated. Each MCQ is generated by selecting a node from the KG as the key, sampling a related triple or quintuple – optionally augmented with an extra triple – and prompting an LLM to generate a corresponding stem from these graph components. Distractors are then selected from the KG. For each MCQ, nine difficulty signals are computed and combined into a unified difficulty score using a data-driven approach. Experimental results demonstrate that our method generates high-quality MCQs whose difficulty estimation is interpretable and aligns with human perceptions. Our approach improves automated MCQ generation by integrating structured knowledge representations with LLMs and a data-driven difficulty estimation model.

Keywords: Multiple-Choice Questions, Knowledge Graphs, Large Language Models, Difficulty Estimation, Automated Question Generation, AI-assisted Education, Interpretable AI
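
The generation loop outlined in the abstract (pick a key node, sample a related triple, write a stem, draw distractors from the graph) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's pipeline: the toy triples are hypothetical, a string template stands in for the LLM call that would write the stem, distractors are chosen by a simple same-relation rule, and quintuples and the nine difficulty signals are omitted.

```python
import random

# Hypothetical toy knowledge graph as (subject, relation, object) triples.
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Madrid", "capital_of", "Spain"),
    ("Rome", "capital_of", "Italy"),
    ("Seine", "flows_through", "Paris"),
]


def make_mcq(triples, key, rng, n_distractors=3):
    """Build one MCQ whose correct answer is the node `key`."""
    # Sample a triple in which the key node is the subject.
    candidates = [t for t in triples if t[0] == key]
    _, relation, obj = rng.choice(candidates)
    # A template stands in for the LLM that would generate the stem.
    stem = f"Which entity has the relation '{relation}' to {obj}?"
    # Distractors: other subjects of the same relation, so they are plausible.
    pool = sorted({s for s, r, _ in triples if r == relation and s != key})
    distractors = rng.sample(pool, min(n_distractors, len(pool)))
    options = distractors + [key]
    rng.shuffle(options)
    return {"stem": stem, "options": options, "answer": key}


mcq = make_mcq(TRIPLES, key="Paris", rng=random.Random(0))
print(mcq["stem"])
print(mcq["options"])
```

Drawing distractors from nodes that share the key's relation keeps the wrong options type-consistent, which is one plausible reason to select them from the graph rather than generate them freely.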

19. ❌ BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Authors: Zekun Qian, Ruize Han, Wei Feng | Venue/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11136v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

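Scoring rationale: The paper proposes BoxTuning, a fine-tuning method for multimodal large language models (MLLMs), and thus falls under large-model applications. Its core contribution addresses the modality mismatch in object-level spatial-temporal understanding for video question answering: object information is injected directly through visual prompts (colored bounding boxes and trajectory trails), which sharply reduces text token cost while preserving full temporal resolution. The method directly involves large language models (LLMs) and fine-tuning (SFT), so these two keywords are rated highly relevant (10 points each). Other keywords, such as MoE, quantization, inference acceleration, and AI for Science, have no direct connection to the paper and score 0.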

!!! tip deepseek-chat TL;DR

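The paper proposes BoxTuning, which injects object spatial-temporal information into multimodal large language models directly through visual prompts. This addresses the modality mismatch and high token cost of text-coordinate methods in video question answering. Across five benchmarks, the method surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy drop on reasoning-centric tasks.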

Abstract (Chinese translation)

物体级时空理解对于视频问答至关重要，然而现有的多模态大语言模型（MLLMs）以整体方式编码视频帧，缺乏细粒度物体定位的显式机制。近期研究通过将边界框坐标序列化为文本标记来解决此问题，但这种文本-坐标范式存在根本性的模态失配：物体信息本质上是视觉的，但将其编码为文本会带来高昂的标记成本，迫使模型进行激进的时间下采样。我们提出BoxTuning方法，通过将物体时空信息直接注入视觉模态来解决这一失配问题。彩色的边界框和轨迹轨迹被作为视觉提示渲染到视频帧上，仅保留简洁的颜色-物体对应图例作为文本。这显著降低了文本标记成本，在实践中实现了87-93%的文本标记减少。该方法还保持了完整的时间分辨率，其中轨迹轨迹进一步编码了每个关键帧内的帧间运动方向和速度，恢复了文本-坐标方法被迫丢弃的细粒度动态信息。在五个视频问答基准（CLEVRER、Perception Test、STAR、NExT-QA、IntentQA）上的实验结果表明，BoxTuning在空间导向任务上超越了文本-坐标基线，并几乎消除了在以推理为中心的任务上观察到的准确率下降，从而确立了视觉提示作为一种更自然、更高效的向视频MLLMs传递物体信息的范式。

Abstract

Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token reduction in practice. It also preserves full temporal resolution, where the trajectory trails further encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.
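The core idea of moving object information from text into pixels can be illustrated with a small sketch that draws colored box outlines on a frame and keeps only a short color-to-object legend as text. This is an assumption-laden toy, not the BoxTuning code: the frame is a nested list of RGB tuples, the function names are hypothetical, and trajectory trails are omitted.

```python
def draw_box(frame, box, color):
    """Draw a 1-pixel rectangle outline on `frame` (rows of RGB tuples) in place."""
    x0, y0, x1, y1 = box  # inclusive pixel coordinates
    for x in range(x0, x1 + 1):
        frame[y0][x] = color
        frame[y1][x] = color
    for y in range(y0, y1 + 1):
        frame[y][x0] = color
        frame[y][x1] = color


def render_prompt(frame, objects):
    """Overlay each object's box and return the short color-to-object legend."""
    legend = []
    for name, (box, color_name, rgb) in objects.items():
        draw_box(frame, box, rgb)
        legend.append(f"{color_name}: {name}")
    return "; ".join(legend)


HEIGHT, WIDTH = 8, 8
frame = [[(0, 0, 0)] * WIDTH for _ in range(HEIGHT)]
objects = {
    "ball": ((1, 1, 3, 3), "red", (255, 0, 0)),
    "cube": ((4, 4, 6, 6), "green", (0, 255, 0)),
}
legend = render_prompt(frame, objects)
```

The legend is the only text the model would receive about the objects, which is where the token saving reported in the abstract comes from; per-frame coordinates never enter the prompt.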

Keywords: BoxTuning, multimodal large language models, video question answering, object spatial-temporal information, visual prompting, fine-tuning, token reduction, trajectory trails

20. ❌ Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

Authors: Chenhao Fang, Jordi Mola, Mark Harman, Jason Nawrocki, Vaibhav Shrivastava, Yue Cheng, Jay Minesh Shah, Katayoun Zand, Mansi Tripathi, Arya Pudota, Matthew Becker, Hervé Robert, Abhishek Gulati | Venue/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11141v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

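Scoring rationale: The paper studies hallucination mitigation for LLMs in high-stakes enterprise workflows and proposes the HUMBR framework, so it is highly relevant to both "Large Language Models" and "Hallucination Mitigation" (10 points each). Other keywords, covering MoE, SLMs, training methods, reasoning techniques, agent systems, and model compression, are neither mentioned in the abstract nor related to it, so they score 0.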

!!! tip deepseek-chat TL;DR

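The paper targets LLM hallucination in high-stakes enterprise workflows and proposes the HUMBR framework, which frames hallucination mitigation as a Minimum Bayes Risk problem. On TruthfulQA, LegalBench, and Meta production data it significantly outperforms standard Universal Self-Consistency, and critical recall failures are virtually eliminated.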

Abstract (Chinese translation)

尽管大型语言模型驱动着自动化进程，但确保对涉及法律事务、风险管理和隐私合规等高风险企业工作流程给予充分考量至关重要。对于Meta及类似我们这样的组织而言，在此类高风险流程中出现单个幻觉条款即可能引发实质性后果。我们证明，通过将幻觉缓解问题构建为最小贝叶斯风险（Minimum Bayes Risk，MBR）问题，能够显著降低此类风险。具体而言，我们提出了一种混合效用MBR（Hybrid Utility MBR，HUMBR）框架，该框架综合语义嵌入相似性与词汇精确度，在无需真实参考标注的情况下识别共识，并为此推导了严格的误差边界。我们通过广泛使用的公共基准测试集（TruthfulQA与LegalBench）以及Meta实际生产部署中的真实数据，对此理论分析进行了全面的实证评估。实证研究结果表明，MBR方法显著优于标准通用自一致性方法。值得注意的是，该流程81%的建议方案优于人工撰写的真实标注，且关键召回失误几乎被完全消除。

Abstract

Although LLMs drive automation, it is critical to ensure immense consideration for high-stakes enterprise workflows such as those involving legal matters, risk management, and privacy compliance. For Meta, and other organizations like ours, a single hallucinated clause in such high stakes workflows risks material consequences. We show that by framing hallucination mitigation as a Minimum Bayes Risk (MBR) problem, we can dramatically reduce this risk. Specifically, we introduce a Hybrid Utility MBR (HUMBR) framework that synthesizes semantic embedding similarity with lexical precision to identify consensus without ground-truth references, for which we derive rigorous error bounds. We complement this theoretical analysis with a comprehensive empirical evaluation on widely-used public benchmark suites (TruthfulQA and LegalBench) and also real world data from Meta production deployment. The results from our empirical study show that MBR significantly outperforms standard Universal Self-Consistency. Notably, 81% of the pipeline’s suggestions were preferred over human-crafted ground truth, and critical recall failures were virtually eliminated.
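Minimum Bayes Risk selection without references treats the other sampled outputs as pseudo-references and returns the sample with the highest average utility against them. The sketch below shows that selection step with a hybrid utility that mixes a semantic term and a lexical-precision term. It is not the HUMBR implementation: the bag-of-words `embed` function stands in for a real sentence encoder, and the mixing weight `alpha` and all function names are assumptions.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words counts (a real system would use a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def lexical_precision(candidate, reference):
    """Fraction of candidate tokens that also appear in the pseudo-reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(t in ref for t in cand) / len(cand) if cand else 0.0


def hybrid_utility(candidate, reference, alpha=0.5):
    """Weighted mix of semantic similarity and lexical precision (alpha is assumed)."""
    sem = cosine(embed(candidate), embed(reference))
    lex = lexical_precision(candidate, reference)
    return alpha * sem + (1 - alpha) * lex


def mbr_select(samples, alpha=0.5):
    """Return the sample with the highest mean utility against all other samples.

    Requires at least two samples.
    """
    def expected_utility(i):
        others = [s for j, s in enumerate(samples) if j != i]
        return sum(hybrid_utility(samples[i], o, alpha) for o in others) / len(others)

    best = max(range(len(samples)), key=expected_utility)
    return samples[best]


samples = [
    "the contract terminates after 30 days notice",
    "the contract terminates after 30 days written notice",
    "the contract renews automatically every year",
]
choice = mbr_select(samples)
```

With these three samples the outlier about automatic renewal has low utility against the other two, so the selection falls on one of the two mutually consistent answers.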

Keywords: Hallucination Mitigation, Large Language Models, Minimum Bayes Risk, Enterprise AI Workflows, Hybrid Utility MBR, TruthfulQA, LegalBench, Meta Production

21. ❌ S$^3$: Structured Sparsity Specification

Authors: Ayoub Ghriss | Venue/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11315v1

Score: 18.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

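Scoring rationale: The paper "S³: Structured Sparsity Specification" proposes an algebraic framework for defining, composing, and implementing structured sparsity patterns. The work is mainly about model compression and sparsification. It is rated highly relevant to "Mixture of Experts" OR "MoE" OR "Sparse Models" (10 points): MoE is in essence a sparsely activated model, and the paper focuses on structured sparsity patterns. It is related to "Quantization" OR "Model Compression" OR "Low-bit Weights" (8 points) because structured sparsity is a form of model compression, although the paper does not cover quantization or low-bit weights. The remaining keywords concern the training, alignment, reasoning, and application of large language models and have no direct connection to this general-purpose sparsification framework, so they score 0.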

!!! tip deepseek-chat TL;DR

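The paper proposes S³, an algebraic framework for precisely specifying and implementing diverse structured sparsity patterns, and shows experimentally that structured OBS and OBD implementations built on it surpass established second-order heuristics on output reconstruction.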

Abstract (Chinese translation)

本文提出结构化稀疏规范（Structured Sparsity Specification，简称S$^3$），这是一个用于定义、组合与实现结构化稀疏模式的代数框架。S$^3$通过三个组成部分定义稀疏性：通过布局组合重塑张量的视图（View）、定义原子剪枝单元的块（Block）规范，以及稀疏决策的作用域（Scope）。块与作用域均支持跨张量的耦合（Coupling），以实现协同稀疏化。S$^3$能够精确描述从细粒度N:M模式到粗粒度通道剪枝在内的多种稀疏结构，并可无缝集成最优脑损伤（Optimal Brain Damage，OBD）与最优脑外科医生（Optimal Brain Surgeon，OBS）算法。我们对本框架进行了数学形式化，通过典型稀疏模式展示了其表达能力，并基于完全构建于S$^3$之上的结构化OBS与OBD实现进行了实验验证：在多种常见配置的输出重建任务中，该方法超越了当前成熟的二阶启发式算法。

Abstract

We introduce the Structured Sparsity Specification (S$^3$), an algebraic framework for defining, composing, and implementing structured sparse patterns. S$^3$ specifies sparsity through three components: a View that reshapes the tensor via layout composition, a Block specification that defines the atomic pruning unit, and the sparsity decision Scope. Both Block and Scope support Coupling across tensors for coordinated sparsification. S$^3$ enables precise specification of diverse sparsity structures, from fine-grained N:M patterns to coarse channel pruning, and integrates seamlessly with Optimal Brain Damage (OBD) and Surgeon (OBS). We formalize the framework mathematically, demonstrate its expressiveness on canonical patterns, and validate it experimentally via structured OBS and OBD implementations built entirely on S$^3$, which surpasses well-established second order heuristics on output reconstruction across common configurations.
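As a concrete instance of the fine-grained N:M patterns mentioned in the abstract, the sketch below applies 2:4 sparsity: in every group of four consecutive weights, the two with the largest magnitude are kept. The comments map the steps loosely onto the View, Block, and Scope vocabulary; that mapping is an interpretation, not the S³ API, and plain magnitude is swapped in for the OBD/OBS saliency scores the paper uses.

```python
def prune_n_m(row, n=2, m=4):
    """Keep the n largest-magnitude entries in every group of m consecutive weights."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):  # Scope: one group of m weights
        group = row[start:start + m]
        # Block: a single weight is the atomic pruning unit here.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def prune_matrix(weights, n=2, m=4):
    """View: treat each matrix row as a flat sequence split into groups of m."""
    return [prune_n_m(row, n, m) for row in weights]


W = [[0.1, -0.9, 0.4, 0.05, 1.2, -0.3, 0.0, 0.7]]
W_sparse = prune_matrix(W)
```

Coarser patterns such as channel pruning would change the block from a single weight to a whole row or column while keeping the same select-then-zero structure.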

Keywords: Structured Sparsity, Sparse Models, Model Compression, Optimal Brain Damage, Optimal Brain Surgeon, Pruning, Tensor Sparsification, Algebraic Framework

22. ❌ CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

Authors: Sohwi Lim, Lee Hyoseok, Jungjoon Park, Tae-Hyun Oh | Venue/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11539v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

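Scoring rationale: CLAY proposes an adaptive similarity computation method built on pretrained vision-language models (VLMs) and belongs mainly to the vision-language multimodal domain. Keyword relevance is as follows: 1) it is related to "Large Language Models" and "Pre-training" (5 points each) because it uses pretrained VLMs, which typically contain a large language model component; 2) it is related to "Retrieval-Augmented Generation" (5 points) because its core task is image retrieval, an application related to retrieval-augmented generation; 3) the other keywords target text-only language models, training techniques, inference optimization, agent systems, and so on, and are unrelated to the paper's focus on vision-language retrieval, so they score 0.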

!!! tip deepseek-chat TL;DR

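CLAY is an adaptive image retrieval method that requires no additional training: it reframes the embedding space of pretrained vision-language models as a text-conditional similarity space, enabling efficient multi-condition retrieval. The authors also build the CLAY-EVAL dataset to evaluate it.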

Abstract (Chinese translation)

人类对视觉相似性的感知本质上是适应性和主观的，取决于用户的兴趣与关注点。然而，大多数图像检索系统未能反映这种灵活性，它们依赖固定、单一的度量标准，无法同时纳入多种条件。为解决此问题，我们提出了CLAY，一种自适应相似度计算方法。该方法将预训练视觉-语言模型（Vision-Language Models, VLMs）的嵌入空间重新构建为文本条件相似性空间，而无需额外训练。此设计将文本条件处理过程与视觉特征提取分离，从而能够利用固定的视觉嵌入实现高效且多条件的检索。我们还构建了一个合成评估数据集CLAY-EVAL，用于在不同条件检索设置下进行全面评估。在标准数据集及我们提出的数据集上的实验表明，与先前工作相比，CLAY实现了较高的检索精度和显著的计算效率。

Abstract

Human perception of visual similarity is inherently adaptive and subjective, depending on the users’ interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous works.

Keywords: Vision-Language Models, Image Retrieval, Adaptive Similarity, Conditional Retrieval, Embedding Space, Computational Efficiency, Multi-conditioned Retrieval, CLAY-EVAL

23. ❌ Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics

Authors: Yuanhao Ding, Meimingwei Li, Esteban Garces Arias, Matthias Aßenmacher, Christian Heumann, Chongsheng Zhang | Venue/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11012v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

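Scoring rationale: The paper studies decoding-time sampling strategies for text generation with large language models (LLMs) and proposes a new method, Min-k Sampling, so it is highly relevant to "Large Language Models" (10 points). It reports experiments on several reasoning benchmarks, which gives it some connection to "Chain of Thought" (5 points). It does not address the other keywords, such as MoE, SLMs, training methods, alignment, inference acceleration, or AI for Science, which all score 0.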

!!! tip deepseek-chat TL;DR

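The paper addresses the excessive temperature sensitivity of LLM decoding sampling strategies and proposes Min-k Sampling, a dynamic truncation method based on the local shape of the sorted logit distribution. The method is formally proven to be strictly temperature-invariant and improves text quality in experiments.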

Abstract (Chinese translation)

大型语言模型生成文本的质量关键取决于解码采样策略。虽然主流方法如Top-$k$、Top-$p$和Min-$p$通过概率空间截断在多样性与准确性之间取得平衡，但它们存在一个共同的固有局限：对温度参数极度敏感。近期基于对数空间的Top-$nσ$等方法实现了温度不变性，但依赖的全局统计量易受长尾噪声干扰，无法捕捉头部候选词之间细粒度的置信度结构。我们提出\textbf{Min-$k$采样}，这是一种新颖的动态截断策略，通过分析排序后对数概率分布的局部形态来识别“语义悬崖”——即从高置信度核心词符到不确定长尾词符的急剧过渡。通过计算位置加权的相对衰减率，Min-$k$能在每个生成步骤动态确定截断边界。我们严格证明了Min-$k$具有完全的温度不变性，并通过实验验证其对超参数选择的低敏感性。在多项推理基准测试、创意写作任务和人工评估中的实验表明，Min-$k$能持续提升文本质量，即使在基于概率的方法失效的极端温度设置下仍保持稳健性能。我们已公开代码、模型及分析工具。

Abstract

The quality of text generated by large language models depends critically on the decoding sampling strategy. While mainstream methods such as Top-$k$, Top-$p$, and Min-$p$ achieve a balance between diversity and accuracy through probability-space truncation, they share an inherent limitation: extreme sensitivity to the temperature parameter. Recent logit-space approaches like Top-$n\sigma$ achieve temperature invariance but rely on global statistics that are susceptible to long-tail noise, failing to capture fine-grained confidence structures among top candidates. We propose Min-$k$ Sampling, a novel dynamic truncation strategy that analyzes the local shape of the sorted logit distribution to identify “semantic cliffs”: sharp transitions from high-confidence core tokens to uncertain long-tail tokens. By computing a position-weighted relative decay rate, Min-$k$ dynamically determines truncation boundaries at each generation step. We formally prove that Min-$k$ achieves strict temperature invariance and empirically demonstrate its low sensitivity to hyperparameter choices. Experiments on multiple reasoning benchmarks, creative writing tasks, and human evaluation show that Min-$k$ consistently improves text quality, maintaining robust performance even under extreme temperature settings where probability-based methods collapse. We make our code, models, and analysis tools publicly available.
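Why a rule built on relative logit differences is temperature-invariant can be seen in a small sketch. Dividing all logits by a temperature T scales every gap between sorted logits by 1/T, so any ratio of gaps is unchanged. The abstract does not give the exact decay formula, so the scoring rule below (each gap divided by the total logit spread, weighted by 1/position) is a hypothetical stand-in chosen only to show the invariance, not the Min-$k$ definition.

```python
def truncate_by_relative_decay(logits, temperature=1.0):
    """Return indices of kept tokens; the cut sits at the largest weighted relative gap.

    Expects a non-empty list of logits.
    """
    scaled = [l / temperature for l in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    ranked = [scaled[i] for i in order]
    spread = ranked[0] - ranked[-1]
    if spread == 0:
        return order  # flat distribution: nothing to cut
    best_pos, best_score = 0, float("-inf")
    for pos in range(len(ranked) - 1):
        # Ratio of two logit differences: the temperature cancels here.
        relative_gap = (ranked[pos] - ranked[pos + 1]) / spread
        score = relative_gap / (pos + 1)  # assumed position weight
        if score > best_score:
            best_pos, best_score = pos, score
    return order[: best_pos + 1]


logits = [5.0, 4.8, 4.5, 1.0, 0.5, 0.2]
kept = {t: truncate_by_relative_decay(logits, t) for t in (0.1, 1.0, 10.0)}
```

In this example the sharp drop after the third logit plays the role of the "semantic cliff", and the same three tokens survive at every temperature. A probability-space rule such as Top-$p$ would keep a different number of tokens as the temperature changes.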

Keywords: Large Language Models, Decoding Sampling, Min-k Sampling, Temperature Invariance, Logit Distribution, Text Generation, Reasoning Benchmarks, Dynamic Truncation

24. ❌ Hierarchical Textual Knowledge for Enhanced Image Clustering

Authors: Yijie Zhong, Yunfan Gao, Weipeng Jiang, Haofen Wang | Venue/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11144v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

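Scoring rationale: The paper's core innovation is using large language models (LLMs) to build hierarchical concept-attribute structured knowledge that enhances image clustering. The abstract explicitly states "with the help of large language models (LLMs)", so the first keyword is rated highly relevant (10 points). The paper focuses on applying LLMs to a computer vision task (image clustering), which is an application of large models in another field. It does not examine the technical principles of LLMs themselves (MoE, scaling laws, training methods, inference optimization, agents, and so on), nor does it involve domain-specific scientific applications such as bioinformatics, so all other keywords are rated irrelevant (0 points).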

!!! tip deepseek-chat TL;DR

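The paper proposes a method that uses large language models to build hierarchical concept-attribute knowledge for enhancing image clustering. Across 20 datasets it improves on existing methods that use textual knowledge, and it addresses the problem that naive use of textual knowledge can harm clustering performance.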

Abstract (Chinese translation)

图像聚类旨在以无监督方式对图像进行分组。传统方法主要关注视觉空间的知识，难以区分视觉相似但语义不同的类别。视觉语言模型的最新进展使得利用文本知识增强图像聚类成为可能。然而，现有方法大多依赖粗糙的类别标签或简单名词，忽视了文本空间中丰富的概念级与属性级语义。本文提出一种知识增强聚类方法，该方法借助大语言模型构建层次化的概念-属性结构化知识以指导聚类。具体而言，我们首先将冗余的文本标签凝练为抽象概念，随后通过结构化提示词引导大语言模型自动为每个独立概念及相似概念对提取区分性属性。这些知识针对每个输入图像进行实例化，从而获得知识增强特征。知识增强特征与原始视觉特征可适配于多种下游聚类算法。我们在20个多样化数据集上评估了该方法，结果显示其在现有利用文本知识的方法基础上实现了持续的性能提升。未经训练的KEC在20个数据集中有14个超越了零样本CLIP的性能。此外，简单的文本知识使用可能损害聚类性能，而KEC在保证准确性的同时提供了更强的鲁棒性。

Abstract

Image clustering aims to group images in an unsupervised fashion. Traditional methods focus on knowledge from visual space, making it difficult to distinguish between visually similar but semantically different classes. Recent advances in vision-language models enable the use of textual knowledge to enhance image clustering. However, most existing methods rely on coarse class labels or simple nouns, overlooking the rich conceptual and attribute-level semantics embedded in textual space. In this paper, we propose a knowledge-enhanced clustering (KEC) method that constructs a hierarchical concept-attribute structured knowledge with the help of large language models (LLMs) to guide clustering. Specifically, we first condense redundant textual labels into abstract concepts and then automatically extract discriminative attributes for each single concept and similar concept pairs, via structured prompts to LLMs. This knowledge is instantiated for each input image to achieve the knowledge-enhanced features. The knowledge-enhanced features with original visual features are adapted to various downstream clustering algorithms. We evaluate KEC on 20 diverse datasets, showing consistent improvements across existing methods using additional textual knowledge. KEC without training outperforms zero-shot CLIP on 14 out of 20 datasets. Furthermore, the naive use of textual knowledge may harm clustering performance, while KEC provides both accuracy and robustness.

Keywords: Image Clustering, Large Language Models, Hierarchical Knowledge, Concept-Attribute Structure, Vision-Language Models, Unsupervised Learning, Textual Knowledge Enhancement, Zero-shot CLIP

25. ❌ Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation

Authors: Chen Huang, Zitan Jiang, Changyi Zou, Wenqiang Lei, See-Kiong Ng | Venue/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11077v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

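Scoring rationale: The paper studies proactive information probing for customer service chatbots, which falls under large-model applications. It has some connection to "Large Language Models" (5 points) because chatbots are typically built on LLMs, and some connection to "LLM Agents" (5 points) because the PROCHATIP framework can be viewed as a kind of autonomous agent system. Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and are unrelated to the paper's core technique (0 points).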

!!! tip deepseek-chat TL;DR

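The paper introduces the Proactive Information Probing task and the PROCHATIP framework, which lets customer service chatbots proactively collect high-value information. Experiments show that it significantly outperforms baselines.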

Abstract (Chinese translation)

客户服务聊天机器人正日益被期待不仅作为用户被动支持工具，更应成为获取高价值信息和商业智能的战略性界面。为此，我们做出三项主要贡献。1）我们引入并定义了一项名为“主动信息探查”（Proactive Information Probing）的新任务，该任务旨在优化探查用户以获取预设目标信息的时机，同时最小化对话轮次和用户摩擦。2）我们提出了PROCHATIP，这是一个主动式聊天机器人框架，其核心是一个经过专门训练的对话策略模块，能够精准掌握探查的最佳时机。3）实验表明，PROCHATIP显著优于基线模型，在信息探查和服务质量方面均展现出卓越能力。我们相信，这项工作有效重新定义了聊天机器人的商业效用，将其定位为可扩展、高性价比的主动商业智能引擎。我们的代码发布于https://github.com/SCUNLP/PROCHATIP。

Abstract

Customer service chatbots are increasingly expected to serve not merely as reactive support tools for users, but as strategic interfaces for harvesting high-value information and business intelligence. In response, we make three main contributions. 1) We introduce and define a novel task of Proactive Information Probing, which optimizes when to probe users for pre-specified target information while minimizing conversation turns and user friction. 2) We propose PROCHATIP, a proactive chatbot framework featuring a specialized conversation strategy module trained to master the delicate timing of probes. 3) Experiments demonstrate that PROCHATIP significantly outperforms baselines, exhibiting superior capability in both information probing and service quality. We believe that our work effectively redefines the commercial utility of chatbots, positioning them as scalable, cost-effective engines for proactive business intelligence. Our code is available at https://github.com/SCUNLP/PROCHATIP.

Keywords: customer service chatbots, proactive information probing, conversation strategy, business intelligence, PROCHATIP, information harvesting, user friction

26. ❌ Towards Situation-aware State Modeling for Air Traffic Flow Prediction

Authors: Anqi Liu, Bin Wang, Jiangtao Zhao, Dechuan Ma, Guiyuan Jiang, Feng Hong, Yanwei Yu, Tianrui Li | Venue/Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11198v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

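Scoring rationale: The paper "Towards Situation-aware State Modeling for Air Traffic Flow Prediction" focuses on air traffic flow prediction and proposes AeroSense, a direct state-to-flow modeling framework. The work belongs to traffic forecasting: it uses a deep learning model (specifically an attention-based architecture) to process microscopic aircraft state data and predict macroscopic traffic flow. It is unrelated to the vast majority of keywords (LLMs, MoE, Scaling Laws, training and alignment techniques, reasoning methods, agent systems, model compression, and so on), which refer specifically to large language models and related techniques. Only two keywords are weakly related: 1) “Mechanistic Interpretability” OR “Explainable AI”: the paper says it provides meaningful interpretability through attention-based visualizations, so it scores 5 (some relevance). 2) “AI for Science” OR “Bioinformatics” OR “Cheminformatics”: the paper applies AI to air traffic management, which can be viewed as an application of AI in science and engineering, so it scores 5 (some relevance). All other keywords are not covered and score 0. The weighted total is 10.0 (5.0 + 5.0), well below the dynamic passing score of 26.6.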
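The arithmetic behind the total and the passing score can be sketched as follows. This is a minimal illustration that assumes the settings stated in the report's configuration (27 keywords at weight 1.0, passing score 5.0 + 0.8 × total weight); the function names are hypothetical, and treating "pass" as `total >= threshold` is an assumption, since the report does not say whether the comparison is strict.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing score as stated in the report: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight

def weighted_total(rows):
    """rows: iterable of (weight, relevance) pairs, relevance on a 0-10 scale."""
    return sum(weight * relevance for weight, relevance in rows)

# 27 keywords at weight 1.0: two scored 5.0 for this paper, the rest 0.0.
rows = [(1.0, 0.0)] * 25 + [(1.0, 5.0)] * 2

total = weighted_total(rows)
threshold = passing_score(sum(weight for weight, _ in rows))
passed = total >= threshold  # assumption: non-strict comparison

print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With these inputs the total is 10.0 against a threshold of 26.6, matching the figures quoted in the rationale.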

!!! tip deepseek-chat TL;DR

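This paper addresses traffic flow prediction in the terminal airspace and proposes AeroSense, a framework that models microscopic aircraft states directly; using a situation-aware state representation and masked self-attention, it achieves higher prediction accuracy and robustness than traditional time-series methods.

Abstract (Chinese translation)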

终端空域（TA）内的精准空中交通预测对于前瞻性空中交通管理（ATM）至关重要。然而，现有数据驱动方法主要依赖于基于时间序列的预测范式，这类方法本质上忽略了关键的航空器状态信息，例如实时运动学状态以及与空域边界的接近程度。为应对这一局限，我们提出了AeroSense——一种用于空中交通预测的直接状态-流量建模框架。与经典的基于时间序列的方法（其首先将航空器轨迹聚合为宏观流量序列再进行建模）不同，AeroSense将实时空域态势明确表示为一个动态的航空器状态集合，从而能够直接处理可变数量的航空器作为输入，而非时间序列。具体而言，我们引入了一种态势感知的状态表征方法，使AeroSense能够直接从微观航空器状态感知瞬时终端空域态势。此外，我们设计了一个模型架构，该架构结合了掩码自注意力机制以捕捉航空器间的交互，并配备两个解耦的预测头，用以分别建模终端空域两个关键功能区域内的异质流量动态。基于大规模真实机场数据集的大量实验表明，AeroSense持续实现了最先进的性能，验证了直接对微观航空器状态进行建模相比基于时间序列的基线方法能带来显著更高的预测保真度。此外，所提框架在高峰交通时段展现出卓越的鲁棒性，在分时段多目标评估中实现了帕累托最优性能，并通过基于注意力的可视化提供了有意义的可解释性。

Abstract

Accurate air traffic prediction in the terminal airspace (TA) is pivotal for proactive air traffic management (ATM). However, existing data-driven approaches predominantly rely on time series-based forecasting paradigms, which inherently overlook critical aircraft state information, such as real-time kinematics and proximity to airspace boundaries. To address this limitation, we propose AeroSense, a direct state-to-flow modeling framework for air traffic prediction. Unlike classical time series-based methods that first aggregate aircraft trajectories into macroscopic flow sequences before modeling, AeroSense explicitly represents the real-time airspace situation as a dynamic set of aircraft states, enabling the direct processing of a variable number of aircraft instead of time series as inputs. Specifically, we introduce a situation-aware state representation that enables AeroSense to sense the instantaneous terminal airspace situation directly from microscopic aircraft states. Furthermore, we design a model architecture that incorporates masked self-attention to capture inter-aircraft interactions, together with two decoupled prediction heads to model heterogeneous flow dynamics across two key functional areas of the TA. Extensive experiments on a large-scale real-world airport dataset demonstrate that AeroSense consistently achieves state-of-the-art performance, validating that direct modeling of microscopic aircraft states yields substantially higher predictive fidelity than time series-based baselines. Moreover, the proposed framework exhibits superior robustness during peak traffic periods, achieves Pareto-optimal performance under dayparting multi-object evaluation, and provides meaningful interpretability through attention-based visualizations.
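To make the idea of feeding a variable number of aircraft through masked self-attention concrete, here is a minimal NumPy sketch. It is not the AeroSense implementation: the single attention head, the state dimension, the padding scheme, and the mean pooling are all assumptions chosen for illustration. Aircraft states are padded to a fixed number of slots, and a boolean mask keeps padded slots from receiving any attention.

```python
import numpy as np

def masked_self_attention(x, valid, wq, wk, wv):
    """Single-head self-attention over a padded set of states.

    x:     (n, d) array, one row per aircraft slot (padded rows allowed)
    valid: (n,) boolean array, True for real aircraft, False for padding
    Assumes at least one valid row.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Padded slots must never be attended to, so their keys get -inf.
    scores = np.where(valid[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = (weights @ v) * valid[:, None]  # zero the padded query rows
    return out, weights

def pool_valid(out, valid):
    """Mean over real aircraft gives a fixed-size summary for any set size."""
    return out[valid].mean(axis=0)

rng = np.random.default_rng(0)
d = 4  # hypothetical state dimension
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(6, d))  # 6 slots, only the first 4 hold aircraft
valid = np.array([True, True, True, True, False, False])

out, weights = masked_self_attention(x, valid, wq, wk, wv)
summary = pool_valid(out, valid)
```

Because padded keys receive zero weight, the outputs for real aircraft do not depend on whatever values sit in the padded rows, which is what lets one model handle sets of different sizes.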

Keywords: air traffic prediction, state-to-flow modeling, situation-aware state representation, masked self-attention, terminal airspace, aircraft states, predictive fidelity, attention-based visualizations

27. ❌ Fairness is Not Flat: Geometric Phase Transitions Against Shortcut Learning

Authors: Nicolas Rodriguez-Alvarez, Fernando Rodriguez-Merino Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11704v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

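Scoring rationale: The paper studies shortcut learning in deep neural networks, proposes a geometric a priori method to mitigate it, and examines its relationship to fairness and robustness. All of the keywords concern either large language models (LLMs) or specific applications of deep learning in science (such as bioinformatics), whereas the paper's core topic is the fairness and robustness of general deep neural networks (DNNs), not LLM techniques. Every keyword except “Mechanistic Interpretability” OR “Explainable AI” (weight 1.0) is therefore irrelevant. That keyword scores 5 because the paper analyzes features and gradients geometrically with the aim of improving interpretability and fairness, which bears some relation to explainable AI but is not the core focus.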

!!! tip deepseek-chat TL;DR

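This paper proposes a geometric a priori method to mitigate shortcut learning in deep neural networks: it mathematically isolates the features that monopolize the gradient and forces the network to use higher geometric capacity, reducing demographic bias and improving fairness.

Abstract (Chinese translation)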

深度神经网络极易陷入捷径学习，往往记忆低维的虚假相关性而非底层因果机制。这一现象不仅会降低分布外鲁棒性，还在敏感应用中引发严重的群体偏见。本文提出一种几何先验方法以缓解捷径学习。通过部署零隐藏层（$N=1$）的拓扑审计器，我们无需人工干预即可从数学上隔离那些垄断梯度的特征。我们通过实验证明了容量相变的存在：一旦线性捷径被剪除，网络将被迫利用更高的几何容量（$N \geq 16$）来弯曲决策边界并学习具有伦理意义的表征。本方法优于L1正则化——后者会坍缩为群体偏见——且计算成本仅为后处理方法（如Just Train Twice，JTT）的一小部分，成功将反事实性别脆弱性从21.18%降低至7.66%。

Abstract

Deep Neural Networks are highly susceptible to shortcut learning, frequently memorizing low-dimensional spurious correlations instead of underlying causal mechanisms. This phenomenon not only degrades out-of-distribution robustness but also induces severe demographic biases in sensitive applications. In this paper, we propose a geometric a priori methodology to mitigate shortcut learning. By deploying a zero-hidden-layer ($N=1$) Topological Auditor, we mathematically isolate features that monopolize the gradient without human intervention. We empirically demonstrate a Capacity Phase Transition: once linear shortcuts are pruned, networks are forced to utilize higher geometric capacity ($N \geq 16$) to curve the decision boundary and learn ethical representations. Our approach outperforms L1 Regularization – which collapses into demographic bias – and operates at a fraction of the computational cost of post-hoc methods like Just Train Twice (JTT), successfully reducing counterfactual gender vulnerability from 21.18% to 7.66%.

Keywords: shortcut learning, deep neural networks, geometric methodology, fairness, demographic bias, capacity phase transition, topological auditor, ethical representations

28. ❌ A Deep Equilibrium Network for Hyperspectral Unmixing

Authors: Chentong Wang, Jincheng Gao, Fei Zhu, Jie Chen Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11279v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

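Scoring rationale: The paper "A Deep Equilibrium Network for Hyperspectral Unmixing" focuses on a deep learning model for hyperspectral image unmixing (DEQ-Unmix), in the computer vision and remote sensing domain. All of the keywords concern large language models (LLMs) and related techniques (training, alignment, inference optimization, agent systems, and so on) or specific scientific AI applications (such as bioinformatics). The paper involves no LLM techniques and does not explicitly mention bioinformatics or cheminformatics, so every keyword except “AI for Science” scores 0. Hyperspectral unmixing can be regarded as a scientific computing application, but the paper does not emphasize the broader “AI for Science” category, so it receives 5 (some relevance). The weighted total is 5.0.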

!!! tip deepseek-chat TL;DR

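To address the limited modeling capacity of traditional methods and the lack of physical interpretability of deep learning methods in hyperspectral image unmixing, this paper proposes DEQ-Unmix, a method based on deep equilibrium models that uses implicit differentiation for efficient constant-memory training and achieves superior unmixing performance on synthetic and real datasets.

Abstract (Chinese translation)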

高光谱解混（Hyperspectral Unmixing，HU）对于分析高光谱图像至关重要，但实现精确解混仍具挑战性。传统方法难以有效建模复杂的光谱-空间特征，而深度学习方法则常缺乏物理解释性。基于展开（unrolling）的方法虽能提供网络可解释性，却未能充分利用光谱-空间信息，且在反向传播过程中存在高内存消耗和数值精度问题。为应对这些局限，我们提出DEQ-Unmix方法，将丰度估计重构为深度均衡模型，通过隐式微分实现高效且内存恒定的训练。该方法将数据重建项的梯度算子替换为可训练的卷积网络，以捕捉光谱-空间信息。借助隐式微分，DEQ-Unmix能够实现高效且内存恒定的反向传播。在合成数据集和两个真实数据集上的实验表明，DEQ-Unmix在保持恒定内存消耗的同时，取得了优越的解混性能。

Abstract

Hyperspectral unmixing (HU) is crucial for analyzing hyperspectral imagery, yet achieving accurate unmixing remains challenging. While traditional methods struggle to effectively model complex spectral-spatial features, deep learning approaches often lack physical interpretability. Unrolling-based methods, despite offering network interpretability, inadequately exploit spectral-spatial information and incur high memory costs and numerical precision issues during backpropagation. To address these limitations, we propose DEQ-Unmix, which reformulates abundance estimation as a deep equilibrium model, enabling efficient constant-memory training via implicit differentiation. It replaces the gradient operator of the data reconstruction term with a trainable convolutional network to capture spectral-spatial information. By leveraging implicit differentiation, DEQ-Unmix enables efficient and constant-memory backpropagation. Experiments on synthetic and two real-world datasets demonstrate that DEQ-Unmix achieves superior unmixing performance while maintaining constant memory cost.
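The constant-memory claim rests on a general property of deep equilibrium models: the output is a fixed point z* = f(z*, x), and its gradient can be obtained from z* alone via the implicit function theorem, without storing the solver's intermediate iterates. The sketch below illustrates this on a toy map f(z, x) = tanh(Wz + Ux). It is not DEQ-Unmix: the map, the sizes, and the plain fixed-point solver are all assumptions for illustration.

```python
import numpy as np

def f(z, x, W, U):
    """Toy equilibrium layer (an assumption, not the paper's network)."""
    return np.tanh(W @ z + U @ x)

def solve_fixed_point(x, W, U, tol=1e-12, max_iter=1000):
    """Forward pass: iterate z <- f(z, x) until it stops changing."""
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = f(z, x, W, U)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

def implicit_jacobian(z_star, x, W, U):
    """dz*/dx = (I - df/dz)^{-1} df/dx, evaluated only at the equilibrium."""
    d = 1.0 - f(z_star, x, W, U) ** 2          # tanh derivative at z*
    A = np.eye(W.shape[0]) - d[:, None] * W    # I - df/dz
    B = d[:, None] * U                         # df/dx
    return np.linalg.solve(A, B)

rng = np.random.default_rng(0)
n, m = 5, 3
W = rng.normal(size=(n, n))
W *= 0.5 / np.linalg.norm(W, 2)  # spectral norm 0.5 makes f a contraction in z
U = rng.normal(size=(n, m))
x = rng.normal(size=m)

z_star = solve_fixed_point(x, W, U)
J = implicit_jacobian(z_star, x, W, U)
```

The backward computation touches only `z_star`, so its memory cost does not grow with the number of solver iterations, whereas an unrolled network would have to keep every intermediate state for backpropagation.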

Keywords: Hyperspectral unmixing, Deep equilibrium model, Implicit differentiation, Constant-memory training, Spectral-spatial information, Abundance estimation, Convolutional network, Data reconstruction

29. ❌ Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems

Authors: Mohammed Ezzaldin Babiker Abdullah Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11807v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

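Scoring rationale: The paper focuses on a deep learning model for solar forecasting and proposes the Thermodynamic Liquid Manifold Network, which integrates meteorological variables, Koopman linearization, a Riemannian manifold, and physical constraints. It is unrelated to the keywords on large models, training methods, inference optimization, alignment, agents, and other core LLM techniques. It has some relevance only to “AI for Science”, since it applies AI to a scientific domain (renewable energy) but does not involve bioinformatics or cheminformatics, hence a relevance rating of 5.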

!!! tip deepseek-chat TL;DR

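This study tackles two problems of deep learning solar forecasting models in off-grid photovoltaic systems, temporal phase lag and physically implausible nocturnal generation, by proposing the Thermodynamic Liquid Manifold Network, which achieves high-accuracy prediction with zero-lag synchronization and zero nocturnal error.

Abstract (Chinese translation)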

离网光伏系统的稳定运行依赖于遵循大气热力学原理的太阳能预测算法。当前深度学习模型普遍存在关键异常，主要表现为云层瞬变过程中的严重时间相位滞后以及物理上不可能发生的夜间发电现象。为解决数据驱动建模与确定性天体力学之间的这种偏差，本研究提出了热力学液态流形网络。该方法将15个气象与几何变量映射至库普曼线性化的黎曼流形，以系统化描述复杂气候动力学。该架构集成了谱校准单元与乘性热力学阿尔法门控机制，通过融合实时大气不透明度数据与理论晴空边界模型，在结构上强制遵循严格的天体几何约束。这完全消除了虚假夜间发电现象，同时在天气快速变化期间保持零滞后同步。在严酷半干旱气候环境下经过五年严格测试验证，该框架实现了18.31 Wh/m²的均方根误差和0.988的皮尔逊相关系数。模型在所有1826个测试日中严格保持零量级夜间误差，并在高频瞬变过程中表现出低于30分钟的相位响应。该超轻量级设计仅包含63,458个可训练参数，为边缘可部署微电网控制器建立了稳健且热力学自洽的新标准。

Abstract

The stable operation of autonomous off-grid photovoltaic systems dictates reliance on solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data-driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The proposed methodology projects 15 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha-Gate. This system synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts. Validated against a rigorous five-year testing horizon in a severe semi-arid climate, the framework achieves an RMSE of 18.31 Wh/m2 and a Pearson correlation of 0.988. The model strictly maintains a zero-magnitude nocturnal error across all 1826 testing days and exhibits a sub-30-minute phase response during high-frequency transients. Comprising exactly 63,458 trainable parameters, this ultra-lightweight design establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers.
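One way a multiplicative gate can structurally rule out nocturnal generation is to multiply a bounded network output by a clear-sky envelope that is exactly zero when the sun is below the horizon. The sketch below shows that idea only. The abstract does not specify the form of the Thermodynamic Alpha-Gate, so the sigmoid gate, the sine-of-elevation envelope, and all function names here are assumptions, not the paper's design.

```python
import numpy as np

SOLAR_CONSTANT = 1361.0  # W/m^2, approximate top-of-atmosphere irradiance

def clear_sky_bound(elevation_deg):
    """Toy upper envelope: zero whenever the sun is at or below the horizon."""
    s = np.sin(np.radians(np.asarray(elevation_deg, dtype=float)))
    return SOLAR_CONSTANT * np.clip(s, 0.0, None)

def gated_forecast(raw_logit, elevation_deg):
    """Scale the envelope by a gate in (0, 1) derived from the raw model output."""
    alpha = 1.0 / (1.0 + np.exp(-np.asarray(raw_logit, dtype=float)))
    return alpha * clear_sky_bound(elevation_deg)

elevation = np.array([-30.0, -1.0, 0.0, 20.0, 60.0])  # degrees above horizon
raw = np.array([5.0, -2.0, 3.0, 0.5, 10.0])           # hypothetical model outputs
forecast = gated_forecast(raw, elevation)
```

Whatever the raw output is, the forecast is exactly zero at night and can never exceed the envelope during the day, so the constraint holds by construction rather than by training.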

Keywords: solar irradiance forecasting, off-grid systems, thermodynamic modeling, Koopman-linearized manifold, edge-deployable, phase lag elimination, physical consistency, ultra-lightweight model

30. ❌ Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Authors: Mihir Prabhudesai, Aryan Satpathy, Yangmin Li, Zheyang Qin, Nikash Bhardwaj, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11805v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

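Scoring rationale: The paper's core topic is applying large language models (LLMs) to physical reasoning: it trains LLMs with reinforcement learning (RL) on synthetic data generated by physics simulators, which places it in AI for Science. Highly relevant keywords are LLMs (the core model), RLHF/DPO (training with reinforcement learning), Chain of Thought/System 2 Thinking (physical reasoning requires multi-step, in-depth reasoning), and AI for Science (an application to physical science). Scaling Laws AND Data Quality gets 5 because the paper discusses the limited scale of internet QA data and proposes simulators as a scalable data source. Other keywords such as MoE, SLMs, PEFT, and RAG are unrelated to the paper.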

!!! tip deepseek-chat TL;DR

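This paper proposes generating synthetic data with physics simulators and training large language models for physical reasoning via reinforcement learning; without large-scale real QA data, this improves zero-shot performance on International Physics Olympiad problems by 5-10 percentage points across model sizes.

Abstract (Chinese translation)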

随着DeepSeek-R1的出现，我们见证了大型语言模型（LLM）推理能力的显著进步。然而，这一进展很大程度上得益于互联网上大量问答对（QA pairs）的支撑，而这正成为未来发展的主要瓶颈，因为此类数据规模有限且主要集中在数学等领域。相比之下，物理学等其他科学领域缺乏大规模问答数据集来有效训练具备推理能力的模型。本研究证明，物理模拟器可作为训练LLM进行物理推理的强大替代监督源。我们通过在物理引擎中生成随机场景，从模拟交互中创建合成问答对，并利用强化学习在此合成数据上训练LLM。我们的模型展现出对真实世界物理基准测试的零样本模拟到现实迁移能力：例如，仅使用合成模拟数据进行训练，即可在不同规模模型上将国际物理奥林匹克竞赛（IPhO）问题的解决性能提升5-10个百分点。这些结果表明，物理模拟器能够作为可扩展的数据生成器，使LLM获得超越互联网规模问答数据局限的深度物理推理能力。代码发布于：https://sim2reason.github.io/。

Abstract

We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.
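The pipeline described in the abstract (random scene, simulated interaction, synthetic question-answer pair, reinforcement learning signal) can be illustrated with a deliberately tiny stand-in. The sketch below is not the authors' code: the drag-free projectile scene, the hand-written integrator, the question template, and the tolerance-based reward are all assumptions standing in for a real physics engine and a real RL setup.

```python
import math
import random

G = 9.81  # m/s^2

def simulate_range(speed, angle_deg, dt=1e-4):
    """Step a drag-free projectile forward in time until it lands."""
    angle = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = speed * math.cos(angle), speed * math.sin(angle)
    while True:
        vy -= G * dt
        x += vx * dt
        y += vy * dt
        if y <= 0.0:
            return x

def make_qa(rng):
    """Sample a random scene and turn the simulated outcome into a QA pair."""
    speed = round(rng.uniform(5.0, 30.0), 1)
    angle = round(rng.uniform(20.0, 70.0), 1)
    question = (
        f"A ball is launched from level ground at {speed} m/s, "
        f"{angle} degrees above the horizontal. Ignoring air resistance "
        f"and taking g = {G} m/s^2, how far away does it land (in metres)?"
    )
    return question, simulate_range(speed, angle)

def reward(predicted, target, rel_tol=0.02):
    """Verifiable reward: 1.0 if the numeric answer is within tolerance."""
    return 1.0 if abs(predicted - target) <= rel_tol * abs(target) else 0.0

rng = random.Random(0)
dataset = [make_qa(rng) for _ in range(3)]
```

The simulator supplies the ground-truth answer, so no human-written solutions are needed, and the reward can be checked automatically for any number of generated scenes.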

Keywords: Large Language Models, Reinforcement Learning, Physics Simulators, Physical Reasoning, Synthetic Data, Zero-shot Transfer, Physics Olympiad, AI for Science

31. ❌ Detecting Safety Violations Across Many Agent Traces

Authors: Adam Stein, Davis Brown, Hamed Hassani, Mayur Naik, Eric Wong Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11806v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

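Scoring rationale: The paper studies the detection of safety violations by AI agents and is highly relevant to “LLM Agents” (10), since it focuses on analyzing agent traces and detecting safety problems such as misuse and misalignment. It has some relevance to “Large Language Models” and “Alignment” (5 each), because the Meerkat system takes violations specified in natural language and covers misalignment detection. Other keywords, such as MoE, SLMs, training methods, inference optimization, and scientific AI applications, have no direct relation to the paper and score 0.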

!!! tip deepseek-chat TL;DR

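This paper presents Meerkat, a system that combines clustering with agentic search to detect safety violations across many agent traces, and significantly improves violation detection in several test settings.

Abstract (Chinese translation)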

为识别安全违规行为，审计人员通常需对大量智能体轨迹进行搜索。这一搜索过程极具挑战性，因为故障往往罕见且复杂，有时甚至被对抗性隐藏，仅在同时分析多条轨迹时才可被察觉。此类挑战广泛存在于滥用攻击、隐蔽破坏、奖励破解及提示注入等多种场景中。现有方法在此类问题上存在若干局限：基于单条轨迹的判定器会遗漏仅跨轨迹可见的故障；简单的智能体审计难以扩展至大规模轨迹集合；而固定监测器对未预见的异常行为表现脆弱。本文提出Meerkat系统，该系统将聚类技术与智能体搜索相结合，以自然语言描述的违规行为为目标进行探测。通过结构化搜索及对潜在风险区域的自适应调查，Meerkat能够在无需种子场景、固定工作流程或穷举枚举的情况下发现稀疏分布的故障。在滥用行为、目标错位和任务博弈等多种场景的测试中，Meerkat相较于基线监测器显著提升了安全违规的检测能力：该系统在顶级智能体基准测试中发现了普遍存在的开发者作弊行为，并在CyBench基准上发现的奖励破解案例数量达到以往审计结果的近4倍。

Abstract

To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.

Keywords: safety violations, agent traces, agentic search, misuse detection, misalignment, reward hacking, auditing, Meerkat

32. ❌ A Mechanistic Analysis of Looped Reasoning Language Models

Authors: Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael M. Bronstein, Xiaowen Dong Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11791v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

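Scoring rationale: The paper's core topic is a mechanistic analysis of the internals of looped reasoning language models, which is highly relevant to LLMs and reasoning. It mainly involves: 1) LLMs (the core object of study); 2) reasoning (Chain of Thought/System 2 Thinking); 3) Mechanistic Interpretability. Other keywords such as MoE, SFT, and RAG are not covered in the paper.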

!!! tip deepseek-chat TL;DR

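Through mechanistic analysis, this paper studies the internal dynamics of looped reasoning language models and finds that their recurrent blocks learn stages of inference similar to those of standard feedforward models and repeat them at every iteration, offering practical guidance for architectural design.

Abstract (Chinese translation)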

推理已成为大型语言模型的核心能力。近期研究表明，通过在潜在维度中对LLM层进行循环迭代可提升推理性能，从而催生了循环推理语言模型。尽管成果显著，但很少有研究探讨其内部动态机制与标准前馈模型的差异。本文对循环语言模型的潜在状态进行机制分析，重点关注前馈模型中观察到的推理阶段与循环模型中的对应阶段如何比较。为此，我们分析了循环递归现象，发现多数被研究模型的循环层会收敛至不同的不动点；因此，循环模块在潜在空间中遵循一致的周期性轨迹。我们证明当达到这些不动点时，注意力头行为趋于稳定，导致跨循环迭代的行为保持恒定。实证研究表明，循环模块学习的推理阶段与前馈模型高度相似，并在每次迭代中沿深度方向重复这些阶段。我们探究了循环模块尺寸、输入注入和归一化操作如何影响这些循环不动点的形成与稳定性。我们相信这些发现有助于将机制性洞见转化为架构设计的实践指导。

Abstract

Reasoning has become a central capability in large language models. Recent research has shown that reasoning performance can be improved by looping an LLM’s layers in the latent dimension, resulting in looped reasoning language models. Despite promising results, few works have investigated how their internal dynamics differ from those of standard feedforward models. In this paper, we conduct a mechanistic analysis of the latent states in looped language models, focusing in particular on how the stages of inference observed in feedforward models compare to those observed in looped ones. To this end, we analyze cyclic recurrence and show that for many of the studied models each layer in the cycle converges to a distinct fixed point; consequently, the recurrent block follows a consistent cyclic trajectory in the latent space. We provide evidence that as these fixed points are reached, attention-head behavior stabilizes, leading to constant behavior across recurrences. Empirically, we discover that recurrent blocks learn stages of inference that closely mirror those of feedforward models, repeating these stages in depth with each iteration. We study how recurrent block size, input injection, and normalization influence the emergence and stability of these cyclic fixed points. We believe these findings help translate mechanistic insights into practical guidance for architectural design.
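The notion of each layer in the cycle converging to its own fixed point can be pictured with a toy looped block. The sketch below is an illustration only, under strong assumptions: two small tanh layers with input injection, weights scaled so every layer is a contraction (which guarantees convergence here but is not something real looped transformers are known to satisfy), and hypothetical names throughout.

```python
import numpy as np

def run_loop(layers, x, n_loops):
    """Apply the same block of layers n_loops times with input injection.

    layers: list of (W, b) pairs forming one recurrent block
    Returns an array of shape (n_loops, len(layers), d) holding the state
    after every layer of every recurrence.
    """
    h = np.zeros_like(x)
    trace = []
    for _ in range(n_loops):
        per_layer = []
        for W, b in layers:
            h = np.tanh(W @ h + b + x)  # x is re-injected at every layer
            per_layer.append(h.copy())
        trace.append(per_layer)
    return np.array(trace)

rng = np.random.default_rng(0)
d = 8
layers = []
for _ in range(2):
    W = rng.normal(size=(d, d))
    W *= 0.5 / np.linalg.norm(W, 2)  # contraction, so the loop must converge
    layers.append((W, rng.normal(size=d)))
x = rng.normal(size=d)

trace = run_loop(layers, x, n_loops=60)
# Change of each layer's state between consecutive recurrences.
step = np.linalg.norm(trace[1:] - trace[:-1], axis=-1)
# Distance between the two layers' converged states.
gap = np.linalg.norm(trace[-1, 0] - trace[-1, 1])
```

In this toy setting `step` shrinks toward zero for both layers while `gap` stays well away from zero: each position in the block settles on a different point, so the block as a whole traces a stable cycle through the latent space.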

Keywords: looped reasoning language models, mechanistic analysis, latent states, cyclic recurrence, fixed points, attention-head behavior, stages of inference, architectural design

33. ❌ C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

Authors: Chenxi Qing, Junxi Wu, Zheng Liu, Yixiang Qiu, Hongyao Yu, Bin Chen, Hao Wu, Shu-Tao Xia Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11796v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

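Scoring rationale: The paper studies AI-generated text detection, in particular the construction of a detection benchmark for Chinese text. It directly concerns the use of large language models (LLMs) to generate text, so it is highly relevant to “Large Language Models OR LLMs OR Foundation Models” (10). It does not touch the other keywords: model architecture (MoE, SLMs), training techniques (pre-training, fine-tuning, alignment, RLHF, PEFT), inference optimization (RAG, context extension, attention optimization, quantization, speculative decoding), reasoning (CoT, System 2 thinking, MCTS, self-correction), agents (LLM agents, tool use, multi-agent), interpretability, model merging, in-context learning, or scientific AI applications, so these keywords score 0.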

!!! tip deepseek-chat TL;DR

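To address limited model diversity and data homogeneity in Chinese AI-generated text detection, this paper proposes C-ReD, a comprehensive benchmark; experiments show that it enables reliable in-domain detection and generalizes well to external datasets and unseen LLMs.

Abstract (Chinese translation)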

近年来，大规模语言模型（LLMs）已能够生成高度流畅的文本内容。尽管它们为人类带来了显著的便利，但也引发了诸如网络钓鱼和学术不端等多种风险。大量研究工作致力于开发检测AI生成文本的算法并构建相关数据集。然而，在中文语料领域，仍存在模型多样性不足和数据同质化等挑战。为解决这些问题，我们提出了C-ReD：一个全面的中文真实提示AI生成检测基准。实验表明，C-ReD不仅能够实现可靠的领域内检测，还能对未见过的LLMs及外部中文数据集展现出强大的泛化能力——这弥补了先前中文检测基准在模型多样性、领域覆盖和提示真实性方面的关键不足。我们的资源已发布于https://github.com/HeraldofLight/C-ReD。

Abstract

Recently, large language models (LLMs) are capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets, addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.

Keywords: AI-generated text detection, Chinese benchmark, large language models (LLMs), real-world prompts, model diversity, data homogeneity, generalization, C-ReD

34. ❌ Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net

Authors: Ricardo Coimbra Brioso, Lorenzo Mondo, Damiano Dei, Nicola Lambri, Pietro Mancosu, Marta Scorsetti, Daniele Loiacono Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11798v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

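Scoring rationale: The paper focuses on a quality assurance framework for medical image segmentation (radiotherapy target delineation), using nnU-Net and uncertainty quantification. Its content is unrelated to the vast majority of the keywords, which concern large-model principles, training methods, inference optimization, agents, and similar topics. The only loosely related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI in a biomedical (radiotherapy) setting. Its contribution, however, is not a core innovation in large models or deep-learning principles but an application of existing methods to a specific clinical problem, so it receives 5 points (some relevance). All other keywords score 0.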

!!! tip deepseek-chat TL;DR

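To address quality assurance for automatic radiotherapy target segmentation, the study proposes a budget-aware, uncertainty-driven QA framework built on nnU-Net. It combines uncertainty quantification with post-hoc calibration to produce voxel-wise uncertainty maps that guide targeted manual review, and shows on TMLI cases that calibration and ensembling improve uncertainty-error alignment.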

Abstract

Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty–error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that highlight more consistently regions requiring manual edits. Overall, integrating calibration with efficient ensembling seems a promising strategy to implement a budget-aware QA workflow for radiotherapy segmentation.

Keywords: radiotherapy segmentation, uncertainty quantification, quality assurance, nnU-Net, calibration, ensemble methods, clinical target volume, predictive entropy
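
The abstract above describes voxel-wise uncertainty maps based on predictive entropy, temperature scaling for calibration, and a review budget restricted to the most uncertain voxels. The following is a minimal plain-Python sketch of those three ideas, not the paper's implementation. The function names, the toy logits, and the temperature value are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over class logits after dividing by a temperature (T > 1 softens)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_uncertain(entropies, budget_fraction):
    """Indices of the top `budget_fraction` most uncertain voxels (at least one)."""
    k = max(1, int(len(entropies) * budget_fraction))
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return order[:k]

# Hypothetical two-class logits (background, target) for four voxels.
voxel_logits = [[4.0, -4.0], [0.2, 0.1], [-3.0, 3.0], [1.0, 0.5]]
T = 2.0  # assumed temperature; in practice it would be fitted on held-out data

entropy_map = [predictive_entropy(softmax(z, T)) for z in voxel_logits]
review = most_uncertain(entropy_map, budget_fraction=0.25)
```

With a 25% budget over four voxels, `review` holds the single voxel whose calibrated prediction is closest to uniform, which is the one a reviewer would inspect first.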

35. ❌ ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

Authors: Wei Zhao, Zhe Li, Peixin Zhang, Jun Sun Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11790v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

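Scoring rationale: The paper's core topic is a security defense framework for tool-augmented LLM agents, so it is highly relevant to "LLM Agents", "Tool Use", and "Large Language Models" (10 points each), which are its direct objects of study. It has some relevance to "Instruction Tuning OR Alignment OR Value Alignment" and "Hallucination Mitigation OR Factuality OR Truthfulness" (5 points each), because it touches on safety alignment and on reliability (defending against malicious instruction injection). Other keywords such as MoE, Scaling Laws, Pre-training, and RAG are unrelated to the paper (0 points).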

!!! tip deepseek-chat TL;DR

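The paper targets the vulnerability of tool-augmented LLM agents to indirect prompt injection and proposes ClawGuard, a runtime security framework. Before any tool call is executed, it automatically derives task-specific access constraints and enforces a user-confirmed rule set, intercepting malicious tool calls and providing robust protection against the three main injection channels without modifying the model or the infrastructure.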

Abstract

Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user’s stated objective prior to any external tool invocation, ClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that ClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.

Keywords: LLM agents, tool-augmented agents, indirect prompt injection, runtime security framework, tool-call boundary enforcement, adversarial tool calls, access constraints, deterministic defense
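
The abstract describes enforcing a user-confirmed rule set at every tool-call boundary so that adversarial calls are blocked before they have any effect. The sketch below illustrates that general pattern only. The `Rule` and `ToolCallGuard` names, the wildcard rule format, and the example tools are hypothetical and are not ClawGuard's actual API.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

@dataclass(frozen=True)
class Rule:
    """One user-confirmed permission: a tool name plus an allowed target pattern."""
    tool: str
    target_pattern: str  # shell-style wildcard, e.g. "https://example.com/*"

class ToolCallGuard:
    """Deterministic allow-list check applied before any tool is executed."""

    def __init__(self, rules):
        self.rules = list(rules)
        self.audit_log = []  # (tool, target, allowed) tuples for later review

    def is_allowed(self, tool, target):
        return any(r.tool == tool and fnmatchcase(target, r.target_pattern)
                   for r in self.rules)

    def call(self, tool, target, execute):
        allowed = self.is_allowed(tool, target)
        self.audit_log.append((tool, target, allowed))
        if not allowed:
            # The tool is never invoked, so the call has no real-world effect.
            raise PermissionError(f"blocked tool call: {tool}({target!r})")
        return execute(target)

# Hypothetical rule set for the task "summarize the docs on example.com".
guard = ToolCallGuard([Rule("http_get", "https://example.com/*")])

fetched = guard.call("http_get", "https://example.com/docs", lambda url: f"GET {url}")

try:
    # A call that injected content might request; it matches no confirmed rule.
    guard.call("send_email", "attacker@evil.test", lambda addr: f"mail {addr}")
    blocked = False
except PermissionError:
    blocked = True
```

Because the check depends only on the rule set and the call arguments, the same call always gets the same verdict, and every decision is recorded in `audit_log`.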

36. ❌ GenTac: Generative Modeling and Forecasting of Soccer Tactics

Authors: Jiayuan Rao, Tianlin Gui, Haoning Wu, Yanfeng Wang, Weidi Xie Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11786v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

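Scoring rationale: GenTac focuses on sports tactics modeling, using a diffusion model to generate soccer player trajectories and tactical events. It sits at the intersection of computer vision, generative modeling, and sports analytics, and is unrelated to the vast majority of the keywords (large-model principles, training methods, inference optimization, alignment, and similar topics). It has some relevance only to "Multi-agent Systems OR Agent Coordination" (5 points), because it predicts the coordinated motion of multiple players (multiple agents), although it relies on diffusion-based generative modeling rather than traditional multi-agent system methods. No other keyword is addressed, so the rest score 0.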

!!! tip deepseek-chat TL;DR

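The paper proposes GenTac, a diffusion-based generative framework for modeling and forecasting open-play soccer tactics. It learns the distribution of player movements from historical tracking data, generates diverse, plausible, long-horizon future trajectories, supports rich contextual conditioning, and generalizes to other dynamic team sports.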

Abstract

Modeling open-play soccer tactics is a formidable challenge due to the stochastic, multi-agent nature of the game. Existing computational approaches typically produce single, deterministic trajectory forecasts or focus on highly structured set-pieces, fundamentally failing to capture the inherent variance and branching possibilities of real-world match evolution. Here, we introduce GenTac, a diffusion-based generative framework that conceptualizes soccer tactics as a stochastic process over continuous multi-player trajectories and discrete semantic events. By learning the underlying distribution of player movements from historical tracking data, GenTac samples diverse, plausible, long-horizon future trajectories. The framework supports rich contextual conditioning, including opponent behavior, specific team or league playing styles, and strategic objectives, while grounding continuous spatial dynamics into a 15-class tactical event space. Extensive evaluations on our proposed benchmark, TacBench, demonstrate four key capabilities: (1) GenTac achieves high geometric accuracy while strictly preserving the collective structural consistency of the team; (2) it accurately simulates stylistic nuances, distinguishing between specific teams (e.g., Auckland FC) and leagues (e.g., A-League versus German leagues); (3) it enables controllable counterfactual simulations, demonstrably altering spatial control and expected threat metrics based on offensive or defensive guidance; and (4) it reliably anticipates future tactical outcomes directly from generated rollouts. Finally, we demonstrate that GenTac can be successfully trained to generalize to other dynamic team sports, including basketball, American football, and ice hockey.

Keywords: generative modeling, soccer tactics, diffusion models, multi-agent trajectories, tactical forecasting, sports analytics, stochastic process, team sports

37. ❌ ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

Authors: Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11784v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

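Scoring rationale: The paper studies a framework for training, evaluating, and deploying GUI (graphical user interface) agents, which falls under autonomous agents, so it is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points). It does not address the other keywords, such as large-model principles, training methods (pre-training, fine-tuning, alignment), inference optimization, model compression, or AI-for-science applications, so those keywords score 0.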

!!! tip deepseek-chat TL;DR

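The paper proposes the ClawGUI framework to address unstable environments in GUI agent training, inconsistent evaluation protocols, and difficult deployment. Its ClawGUI-2B model achieves a 17.1% success rate on the MobileWorld GUI-Only benchmark, 6.0% higher than the same-scale baseline.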

Abstract

GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present ClawGUI, an open-source framework addressing these three gaps within a single harness. ClawGUI-RL provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. ClawGUI-Eval enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. ClawGUI-Agent brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, ClawGUI-2B achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.

Keywords: GUI agents, RL training, evaluation framework, deployment infrastructure, ClawGUI, mobile applications, autonomous agents, benchmark reproduction

38. ❌ General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Authors: Junlin Liu, Shengnan An, Shuang Zhou, Dan Ma, Shixiong Luo, Ying Xie, Yuan Zhang, Wenling Yuan, Yifan Zhou, Xiaoyu Li, Ziwen Wang, Xuezhi Cao, Xunliang Cai Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11778v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

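Scoring rationale: The paper evaluates the general reasoning ability of large language models (LLMs), so it is highly relevant to "Large Language Models" (10 points). Because it is about reasoning evaluation, it has some relevance to "Chain of Thought" and "System 2 Thinking" (8 points each): general reasoning involves multi-step, in-depth thinking, although the paper does not explicitly use these techniques. The other keywords (MoE, SFT, RAG, and so on) concern specific model architectures, training methods, or application techniques that the paper does not address, so they score 0.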

!!! tip deepseek-chat TL;DR

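The paper introduces the General365 benchmark and uses it to evaluate 26 leading large language models on general reasoning tasks. Even the best model reaches only 62.8% accuracy, indicating that current LLM reasoning is heavily domain-dependent and leaves significant room for improvement in general settings.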

Abstract

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts–often termed general reasoning–remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io

Keywords: Large Language Models, General Reasoning, Benchmark, Evaluation, Domain-specific Reasoning, General365, Reasoning Capabilities, LLM Performance

39. ❌ Efficient KernelSHAP Explanations for Patch-based 3D Medical Image Segmentation

Authors: Ricardo Coimbra Brioso, Giulio Sichili, Damiano Dei, Nicola Lambri, Pietro Mancosu, Marta Scorsetti, Daniele Loiacono Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11775v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

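Scoring rationale: The paper focuses on an explainability method (KernelSHAP) for medical image segmentation and is unrelated to most of the large-model keywords (LLMs, MoE, RLHF, and so on). The only relevant keywords are "Mechanistic Interpretability OR Explainable AI" (10 points), because the paper's core is a model explanation method, and "AI for Science OR Bioinformatics OR Cheminformatics" (8 points), because it is applied to biomedical CT image analysis, which falls under AI for Science. No other keyword is addressed.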

!!! tip deepseek-chat TL;DR

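The paper proposes an efficient KernelSHAP framework for explaining 3D medical CT image segmentation. It cuts computational cost through region restriction and caching, and compares how different feature abstraction methods trade off in terms of clinical interpretability.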

Abstract

Perturbation-based explainability methods such as KernelSHAP provide model-agnostic attributions but are typically impractical for patch-based 3D medical image segmentation due to the large number of coalition evaluations and the high cost of sliding-window inference. We present an efficient KernelSHAP framework for volumetric CT segmentation that restricts computation to a user-defined region of interest and its receptive-field support, and accelerates inference via patch logit caching, reusing baseline predictions for unaffected patches while preserving nnU-Net’s fusion scheme. To enable clinically meaningful attributions, we compare three automatically generated feature abstractions within the receptive-field crop: whole-organ units, regular FCC supervoxels, and hybrid organ-aware supervoxels, and we study multiple aggregation/value functions targeting stabilizing evidence (TP/Dice/Soft Dice) or false-positive behavior. Experiments on whole-body CT segmentations show that caching substantially reduces redundant computation (with computational savings ranging from 15% to 30%) and that faithfulness and interpretability exhibit clear trade-offs: regular supervoxels often maximize perturbation-based metrics but lack anatomical alignment, whereas organ-aware units yield more clinically interpretable explanations and are particularly effective for highlighting false-positive drivers under normalized metrics.

Keywords: KernelSHAP, 3D medical image segmentation, explainability, CT segmentation, patch-based, receptive-field, supervoxels, faithfulness
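
The abstract mentions patch logit caching: when a coalition is evaluated, patches that no perturbed feature touches reuse their baseline prediction instead of being re-run. The toy sketch below shows only that reuse rule. The data layout, the function names, and the stand-in model are assumptions, and the nnU-Net fusion step that the paper preserves is omitted.

```python
def predict_with_cache(coalition, patch_features, baseline_logits, run_model):
    """Evaluate one coalition, re-running only patches that see a perturbed feature.

    coalition       -- set of feature ids kept "on"; all others are perturbed
    patch_features  -- for each patch, the set of feature ids it overlaps
    baseline_logits -- cached per-patch outputs for the unperturbed input
    run_model       -- callable(patch_index, coalition) -> output for that patch
    """
    logits, recomputed = [], 0
    for idx, feats in enumerate(patch_features):
        if feats <= coalition:
            # Every feature this patch overlaps is unperturbed: reuse the cache.
            logits.append(baseline_logits[idx])
        else:
            logits.append(run_model(idx, coalition))
            recomputed += 1
    return logits, recomputed

# Hypothetical layout: four patches overlapping three features (e.g. supervoxels).
patch_features = [{0}, {0, 1}, {1}, {2}]

def toy_model(idx, coalition):
    """Stand-in for patch inference: counts the unperturbed features in the patch."""
    return float(len(patch_features[idx] & coalition))

all_features = {0, 1, 2}
baseline_logits = [toy_model(i, all_features) for i in range(len(patch_features))]

# Perturb feature 1 only: just the two patches that overlap it are re-run.
logits, recomputed = predict_with_cache({0, 2}, patch_features, baseline_logits, toy_model)
```

In this toy case half of the patch evaluations are skipped, and the cached result matches a full recomputation. The real saving depends on how many patches each perturbed region overlaps.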

40. ❌ Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

Authors: Federico Bottino, Carlo Ferrero, Nicholas Dosio, Pierfrancesco Beneventano Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11759v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

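Scoring rationale: The paper studies a knowledge representation and reasoning framework for organizational AI and is related to RAG (retrieval-augmented generation) and AI agents. It explicitly mentions a RAG condition and discusses the organizational knowledge used by AI agents, so these two keywords score high. Large models are involved in the organizational AI application but are not the technical core, so the LLM keyword receives 5 points. Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper and score 0.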

!!! tip deepseek-chat TL;DR

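The paper proposes the OIDA framework, which uses structured Knowledge Objects and a Knowledge Gravity Engine to address the lack of epistemic structure in the knowledge used by organizational AI. It represents commitment strength, contradiction status, and organizational ignorance, and evaluates the approach in a controlled comparison under a RAG condition.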

Abstract

Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. We argue that the ceiling on organizational AI is not retrieval fidelity but *epistemic* fidelity: the system’s ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties. We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proved convergence guarantees (sufficient condition: max degree $< 7$; empirically robust to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does *not* know with increasing urgency, a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with explicit circularity analysis. In a controlled comparison ($n = 10$ response pairs), OIDA’s RAG condition (3,868 tokens) achieves EQS 0.530 vs. 0.848 for a full-context baseline (108,687 tokens); the $28.1\times$ token budget difference is the primary confound. The QUESTION mechanism is statistically validated (Fisher $p = 0.0325$, OR $= 21.0$). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run.

Keywords: organizational AI, epistemic infrastructure, knowledge objects, retrieval-augmented generation, AI agents, epistemic fidelity, contradiction detection, knowledge representation
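
The abstract mentions importance scores with class-specific decay, an inverse-decay QUESTION primitive, and a 28.1× token budget gap. The sketch below illustrates those ideas only; the exponential form, the class names `DECISION` and `HYPOTHESIS`, and every rate value are assumptions made for illustration, not details taken from the paper.

```python
import math

# Hypothetical per-class decay rates (per day); NOT taken from the paper.
# A negative rate models "inverse decay": importance grows as the item ages.
DECAY_RATE = {"DECISION": 0.01, "HYPOTHESIS": 0.10, "QUESTION": -0.05}


def importance(initial: float, epistemic_class: str, age_days: float) -> float:
    """Assumed exponential decay with a class-specific rate, clipped to [0, 1]."""
    rate = DECAY_RATE[epistemic_class]
    return max(0.0, min(1.0, initial * math.exp(-rate * age_days)))


def token_budget_ratio(baseline_tokens: int, rag_tokens: int) -> float:
    """How many times larger the baseline context is than the RAG context."""
    return baseline_tokens / rag_tokens


if __name__ == "__main__":
    for cls in DECAY_RATE:
        print(cls, round(importance(0.5, cls, 10.0), 3))
    # Token counts reported in the abstract: 108,687 vs. 3,868.
    print(round(token_budget_ratio(108_687, 3_868), 1))
```

Under these assumed rates, an unanswered question gains importance over time while an untested hypothesis fades faster than a decision, which is the qualitative behavior the abstract describes.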

41. ❌ StarVLA-$α$: Reducing Complexity in Vision-Language-Action Systems

Authors: Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng, Zixuan Wang, Yuxin Chen, Pengguang Chen, Yilun Chen, Shu Liu, Jiaya Jia Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11757v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

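Scoring rationale: The paper studies Vision-Language-Action (VLA) models for robotic agents and is highly relevant to "LLM Agents" (8 points), since VLA models are a form of embodied agent. It is moderately related to "Large Language Models", "Pre-training", and "Post-training" (5 points each), because it uses a VLM (vision-language model) as the backbone and involves pre-training and fine-tuning. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, and RAG are neither mentioned in nor relevant to the abstract, so they receive 0 points.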

!!! tip deepseek-chat TL;DR

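To address the architectural complexity and fragmented design of Vision-Language-Action (VLA) models in robotics, the paper proposes StarVLA-$α$, a simplified baseline. Unified multi-benchmark training shows that it remains highly competitive, and the single generalist model outperforms $π_{0.5}$ by 20% on the public real-world RoboChallenge benchmark.

Abstract translation (Chinese)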

视觉-语言-动作（Vision-Language-Action，VLA）模型近年来已成为构建通用机器人智能体的一种前景广阔的范式。然而，当前VLA领域的研究格局仍高度碎片化且复杂：现有方法在架构、训练数据、具身配置以及面向特定基准的工程实现上存在显著差异。本研究提出了StarVLA-$α$，一个旨在受控条件下探究VLA设计选择的简洁而强大的基线模型。StarVLA-$α$有意最小化架构和流程的复杂性，以减少实验干扰因素，并支持系统性分析。具体而言，我们重新评估了若干关键设计维度，包括动作建模策略、机器人专用预训练以及接口工程。通过在LIBERO、SimplerEnv、RoboTwin和RoboCasa数据集上进行统一的多基准训练，这一相同的简洁基线模型始终保持高度竞争力，这表明一个强大的视觉语言模型（VLM）主干结合极简设计，已足以实现强劲性能，而无需依赖额外的架构复杂性或工程技巧。值得注意的是，我们的单一通用模型在公开的真实世界RoboChallenge基准上，性能超越了$π_{0.5}$模型20%。我们期望StarVLA-$α$能为未来VLA领域的研究提供一个坚实的起点。代码将在https://github.com/starVLA/starVLA 发布。

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex: as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$α$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$α$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $π_{0.5}$ by 20% on the public real-world RoboChallenge benchmark. We expect StarVLA-$α$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.

Keywords: Vision-Language-Action, VLA, robotic agents, general-purpose, pretraining, fine-tuning, multi-benchmark, simplified baseline

42. ❌ Grounded World Model for Semantically Generalizable Planning

Authors: Quanyi Li, Lan Feng, Haonan Zhang, Wuyang Li, Letian Wang, Alexandre Alahi, Harold Soh Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11751v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

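Scoring rationale: The paper proposes a Grounded World Model (GWM), which falls under world-model research and is highly relevant to the keyword "World Models AND General World Models" (10 points). It studies a world model in a vision-language-aligned latent space for Model Predictive Control (MPC) and semantically generalizable planning. It does not cover large language models (LLMs), MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, inference acceleration, hallucination mitigation, explainable AI, model merging, in-context learning, or AI for science, so all other keywords receive 0 points.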

!!! tip deepseek-chat TL;DR

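The paper addresses two problems in model predictive control: goal images are hard to obtain in advance, and image goals offer limited interactivity. It proposes a Grounded World Model (GWM) learned in a vision-language-aligned latent space, which scores each proposed action by the similarity between its predicted outcome and the task instruction. On the WISER benchmark, GWM-MPC reaches an 87% test success rate, compared with an average of 22% for traditional VLAs.

Abstract translation (Chinese)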

在模型预测控制（MPC）中，世界模型用于预测不同行动方案对应的未来结果，随后通过评分机制引导最优行动的选择。对于视觉运动MPC，评分函数基于预测图像与目标图像在预训练视觉编码器（如DINO和JEPA）潜在空间中的距离度量。然而，在任务执行前获取目标图像具有挑战性，尤其是在新环境中。此外，与自然语言相比，通过图像传达目标的方式交互性有限。本研究提出在视觉-语言对齐的潜在空间中学习一个具身世界模型（Grounded World Model, GWM）。由此，每个行动方案的评分取决于其未来结果与任务指令的接近程度，这一程度通过嵌入向量的相似性反映。该方法将视觉运动MPC转化为一种视觉语言行动器（VLA），其在语义泛化能力上超越了基于视觉语言模型（VLM）的VLA。在所提出的WISER基准测试中，GWM-MPC在包含288项任务的测试集上取得了87%的成功率，这些任务具有训练时未见的视觉信号和指代表达，但仍可通过训练中演示的动作解决。相比之下，传统VLA在训练集上虽以90%的成功率过拟合，但在测试集上的平均成功率仅为22%。

Abstract

In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder like DINO and JEPA. However, it is challenging to obtain the goal image in advance of the task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, reflected by the similarity of embeddings. This approach transforms the visuomotor MPC to a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves a 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.

Keywords: Grounded World Model, Model Predictive Control, visuomotor planning, vision-language alignment, semantic generalization, WISER benchmark, VLA, embedding similarity
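
The scoring step described in the abstract (rank each action proposal by how close its predicted outcome embedding is to the instruction embedding) can be sketched as follows. This is a minimal illustration that assumes cosine similarity and plain Python lists as embeddings; the function names are hypothetical, and the world model and encoders that would produce the vectors are out of scope here.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_action(predicted_outcomes: Sequence[Vector], instruction: Vector) -> int:
    """Return the index of the proposal whose predicted outcome best matches the instruction."""
    scores = [cosine_similarity(z, instruction) for z in predicted_outcomes]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    instruction_embedding = [1.0, 0.0]  # made-up toy embedding
    outcome_embeddings = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]  # one per action proposal
    print(select_action(outcome_embeddings, instruction_embedding))
```

Because the goal is expressed as a text embedding rather than a goal image, the same selection loop works without obtaining a goal image before execution.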

43. ❌ Discourse Diversity in Multi-Turn Empathic Dialogue

Authors: Hongli Zhan, Emma S. Gueorguieva, Javier Hernandez, Jina Suh, Desmond C. Ong, Junyi Jessy Li Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11742v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

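Scoring rationale: The paper's core topic is the discourse diversity of LLMs in empathic dialogue, which directly matches the LLMs keyword (10 points). The proposed MINT framework uses reinforcement learning to optimize diversity, which relates to RLHF/RLAIF/DPO (10 points). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are neither mentioned in nor relevant to the abstract, so they receive 0 points.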

!!! tip deepseek-chat TL;DR

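The paper finds that LLMs repeat discourse tactics at a high rate in multi-turn empathic dialogue and proposes MINT, a reinforcement learning framework that optimizes discourse move diversity. The best variant improves aggregate empathy by 25.3% over vanilla models while reducing cross-turn discourse move repetition by 26.3% on the 4B model.

Abstract translation (Chinese)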

大型语言模型（LLMs）在单轮对话中被评价为具有高度共情能力（Ayers等人，2023；Lee等人，2024），但它们同时也被认为是公式化的生成器，会在不同任务中重复使用相同的词汇模式、句法模板和话语结构（Jiang等人，2025；Shaib等人，2024；Namuduri等人，2025）。然而，这种公式化特征是否延伸至话语行为层面——即回应为对话对象所执行的功能——则较少受到关注。这一问题对于共情对话尤为重要，因为有效的支持不仅需要单次对话中的善意回应，更需要在对话展开过程中采用多样化的策略（Stiles等人，1998）。事实上，已有研究表明，在单轮对话设置中，LLMs比人类支持者更频繁地重复使用相同的话术序列（Gueorguieva等人，2026）。我们将此分析扩展至多轮对话，发现其僵化性进一步加剧：一旦某种话术出现在支持者的回合中，LLMs在下一回合中重复使用该话术的概率几乎是人类的两倍（0.50-0.56对比0.27）。这一模式在作为真实情感支持对话中支持者的各类LLMs中均成立，且无法被标准相似性度量所察觉。为弥补这一不足，我们提出了MINT（多轮话术间新颖性训练），这是首个通过强化学习框架来优化多轮共情对话中话语行为多样性的方法。最佳的MINT变体结合了共情质量奖励与跨轮次话术新颖性信号，在17亿和40亿参数模型上，其整体共情能力比基础模型提升了25.3%，同时在40亿参数模型上将跨轮次话语行为重复率降低了26.3%，在两项指标上均超越了包括仅优化质量的方法和词元级多样性方法在内的所有基线。这些结果表明，当前模型所缺乏的并非共情能力本身，而是在对话过程中灵活调整其话语行为的能力。

Abstract

Large language models (LLMs) produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formulaicity extends to the level of discourse moves, i.e., what a response does for the person it is addressing. This question is especially consequential for empathic dialogue, where effective support demands not just a kind response at one moment but varied strategies as a conversation unfolds (Stiles et al., 1998). Indeed, prior work shows that LLMs reuse the same tactic sequences more than human supporters in single-turn settings (Gueorguieva et al., 2026). We extend this analysis to multi-turn conversations and find that the rigidity compounds: once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). This pattern holds across LLMs serving as supporters in real emotional support conversations, and is invisible to standard similarity metrics. To address this gap, we introduce MINT (Multi-turn Inter-tactic Novelty Training), the first reinforcement learning framework to optimize discourse move diversity across multi-turn empathic dialogue. The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures. These results suggest that what current models lack is not empathy itself, but the ability to vary their discourse moves across a conversation.

Keywords: Large Language Models, Empathic Dialogue, Discourse Diversity, Multi-turn Conversations, Reinforcement Learning, MINT Framework, Tactic Repetition, Empathy Quality
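
The abstract compares how often a tactic used in one supporter turn reappears in the next turn (0.50-0.56 for LLMs vs. 0.27 for humans). One plausible way to compute such a rate is sketched below; the exact definition used in the paper is not given in the abstract, so this formula and the tactic labels are assumptions.

```python
def cross_turn_repetition_rate(turns: list[set[str]]) -> float:
    """Fraction of tactics in a supporter turn that reappear in the next supporter turn.

    `turns` holds, per supporter turn, the set of tactic labels assigned to it.
    """
    reused = 0
    total = 0
    for current, following in zip(turns, turns[1:]):
        total += len(current)               # tactics that could be repeated
        reused += len(current & following)  # tactics that actually were repeated
    return reused / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical tactic labels for three consecutive supporter turns.
    conversation = [
        {"validation", "question"},
        {"validation", "advice"},
        {"advice"},
    ]
    print(cross_turn_repetition_rate(conversation))
```

A reward that penalizes this quantity, combined with an empathy quality reward, would correspond in spirit to the cross-turn novelty signal the abstract describes.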

44. ❌ Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

Authors: Haojie Bai, Aimin Li, Ruoyu Yao, Xiongwei Zhao, Tingting Zhang, Xing Zhang, Lin Gao, and Jun Ma Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11734v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

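Scoring rationale: The paper studies multi-agent diffusion planning for cooperative driving; its core is diffusion-model pre-training followed by online reinforcement post-training. Relevance by keyword: (1) "Pre-training" and "Post-training" are highly relevant (10 points each), since the paper explicitly proposes scene-conditioned diffusion pre-training and online reinforcement post-training; (2) "Multi-agent Systems" is highly relevant (10 points), since the paper focuses on multi-agent trajectory planning; (3) "AI for Science" is moderately related (5 points), as an application of AI to transportation science; (4) other keywords such as LLMs, MoE, and RLHF have no direct connection to the paper's diffusion models and reinforcement-learning optimization and receive 0 points.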

!!! tip deepseek-chat TL;DR

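The paper proposes Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training to address multimodal diversity and closed-loop objective alignment in multi-agent cooperative driving. On the WOMD closed-loop benchmark it lowers the collision rate (2.04% to 1.89%) and the off-road rate (1.68% to 1.36%) while raising average speed (8.36 to 8.61 m/s) relative to the pre-trained planner.

Abstract translation (Chinese)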

闭环协同驾驶需要规划器能够生成真实的多模态多智能体轨迹，同时提升安全性与交通效率。现有的扩散规划器能够从演示数据中建模多模态行为，但其场景一致性通常较弱，且与闭环目标的契合度不足；此外，在反应式多智能体环境中进行稳定的在线后训练仍存在困难。本文提出Multi-ORFT方法，该方法将场景条件化扩散预训练与稳定的在线强化后训练相结合。在预训练阶段，规划器通过智能体间自注意力机制、交叉注意力机制以及基于AdaLN-Zero的场景条件化技术，提升联合轨迹的场景一致性与道路遵循能力。在后训练阶段，我们构建了一个双层马尔可夫决策过程（MDP），显式提供逐步反向核似然以供在线优化，并结合密集的轨迹级奖励与方差门控的群组相对策略优化（VG-GRPO）以稳定训练过程。在WOMD闭环基准测试中，相较于预训练规划器，Multi-ORFT将碰撞率从2.04%降至1.89%，脱轨率从1.68%降至1.36%，同时将平均速度从8.36提升至8.61米/秒；在核心安全与效率指标上，其表现优于包括SMART-large、SMART-tiny-CLSFT和VBD在内的多个强开源基线模型。这些结果表明，将场景一致性去噪与稳定的在线扩散策略优化相结合，能够有效提升闭环协同驾驶的可靠性。

Abstract

Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories. In post-training, we formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization, and combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, Multi-ORFT reduces collision rate from 2.04% to 1.89% and off-road rate from 1.68% to 1.36%, while increasing average speed from 8.36 to 8.61 m/s relative to the pre-trained planner, and it outperforms strong open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics. These results show that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving.

Keywords: multi-agent diffusion planning, cooperative driving, online reinforcement fine-tuning, scene-conditioned diffusion, closed-loop planning, trajectory generation, safety and efficiency, WOMD benchmark
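
The abstract names variance-gated group-relative policy optimization (VG-GRPO) without spelling out the gate. The sketch below shows the standard group-relative advantage (reward minus group mean, divided by group standard deviation) together with one possible gate: groups whose reward spread falls below a threshold contribute zero advantage. The gating rule and the `min_std` threshold are assumptions for illustration and may differ from the paper's formulation.

```python
import statistics


def group_relative_advantages(rewards: list[float], min_std: float = 1e-3) -> list[float]:
    """Normalize rewards within one rollout group; gate out near-constant groups.

    Assumed gate: if the population standard deviation is below `min_std`,
    every advantage is set to 0.0 so the group produces no gradient signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < min_std:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    print(group_relative_advantages([0.0, 2.0]))       # informative group
    print(group_relative_advantages([1.0, 1.0, 1.0]))  # gated: no reward spread
```

Dividing by a tiny standard deviation would amplify noise in nearly identical rollouts, which is one reason a variance gate can help stabilize training.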

45. ❌ Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

Authors: Keyang Zhong, Junlin Xie, Hefeng Wu, Haofeng Li, Guanbin Li Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11741v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

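Scoring rationale: The paper focuses on a multimodal multi-agent reasoning framework whose core involves multi-agent systems, multi-step reasoning, and agentic workflows, so it is highly relevant to "LLM Agents / Autonomous Agents / Agentic Workflow", "Multi-agent Systems / Agent Coordination", and "Chain of Thought / CoT Reasoning / Multi-step Reasoning" (10 points each). It uses VLMs (which can be regarded as a kind of large model) and involves supervised fine-tuning (SFT) and reasoning behavior, so it is related to "Large Language Models / LLMs / Foundation Models" and "Post-training / Supervised Fine-tuning / SFT" (8 points each). Its emphasis on deep reasoning and uncertainty modeling makes it partially relevant to "System 2 Thinking / Slow Thinking / In-depth Reasoning" (8 points). Other keywords such as MoE, quantization, and RAG do not appear in the abstract and receive 0 points.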

!!! tip deepseek-chat TL;DR

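The paper targets the degradation of vision-language models in multi-hop reasoning under imperfect information in multiplayer games such as Murder Mystery Games. It proposes a collaborative multi-agent framework and a two-stage agent-monitored training strategy, and reports significant gains in narrative reasoning, hidden fact extraction, and deception-resilient understanding.

Abstract translation (Chinese)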

视觉语言模型（VLMs）在感知任务中展现出卓越能力，但在包含不完善与欺骗性信息的多玩家游戏场景中，其复杂多跳推理性能会显著下降。本文研究了一个典型的多玩家推理任务——谋杀之谜游戏，该任务要求玩家基于不同意图角色提供的局部线索推断隐藏真相。为应对这一挑战，我们提出了一种协作式多智能体框架，用于评估和生成高质量、角色驱动的多玩家游戏剧本，实现基于角色身份（即凶手与无辜者）的细粒度交互模式。该系统通过智能体间的协同交互，生成丰富的多模态上下文，包括角色背景故事、视觉与文本线索以及多跳推理链。我们设计了一种两阶段智能体监督训练策略以增强视觉语言模型的推理能力：（1）在建模不确定性与欺骗性的精选数据集与合成数据集上进行基于思维链的微调；（2）采用智能体监督奖励塑形的基于GRPO的强化学习，激励模型发展角色特定的推理行为及有效的多模态多跳推理能力。大量实验表明，我们的方法显著提升了视觉语言模型在叙事推理、隐藏事实提取和抗欺骗理解方面的性能。本研究的贡献为在不确定、对抗性及社会复杂性条件下训练与评估视觉语言模型提供了可扩展的解决方案，为不完善信息下的多模态多跳推理未来基准构建奠定了基础。

Abstract

Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains, through coordinated agent interactions. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, encouraging the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.

Keywords: multi-agent systems, vision-language models, multi-hop reasoning, imperfect information, murder mystery games, chain-of-thought, agent-monitored training, deception-resilient understanding

46. ❌ Endogenous Information in Routing Games: Memory-Constrained Equilibria, Recall Braess Paradoxes, and Memory Design

Authors: Saad Alqithami Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11733v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

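Scoring rationale: The paper studies endogenous information in traffic routing games, focusing on equilibrium analysis under finite memory states, algorithm design, and paradox phenomena. Its content belongs to operations research, game theory, and traffic network theory, and it does not involve large models, deep learning, the technical principles of AI, or applications of AI in science. All scoring keywords concern large-model techniques, AI methods, or AI applications, whereas this is purely theoretical computer science / operations research, so every keyword has a relevance of 0.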

!!! tip deepseek-chat TL;DR

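The paper studies routing games in which travelers choose among routes held in finite memory or surfaced to them. It develops the Forgetful Wardrop Equilibrium, derives constructive algorithms for memory (salience) design, and identifies a Recall Braess Paradox in which improving recall can increase equilibrium delay.

Abstract translation (Chinese)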

本文研究一类路径选择博弈，其中出行者并非基于固定外生行动集进行优化，而是在其记忆或系统呈现的路径中进行选择。论文首先构建了一个可处理的内生记忆设计理论，进而将其与显式的有限记忆微观模型相联系。在微观层面，每位出行者携带有限记忆状态，接收系统呈现的备选路径，通过Logit规则进行选择，并依据特定策略（如LRU，最近最少使用）更新记忆。该模型导出了一个稳定的遗忘沃德罗普均衡（Forgetful Wardrop Equilibrium, FWE）；在温和正则性条件下证明了其存在性，并在约化定点映射的压缩区域内证明了唯一性。论文的核心设计层是一个稳态显著性模型，该模型将持久的记忆与界面效应概括为路径特定的权重。显著性加权的随机用户均衡是一个严格凸势函数的唯一最小化点，由此产生了一套简洁的优化与可实施性理论。在此层面，我们刻画了在比例预算与仿射关联约束下的受控可实施性，并在并联及串并联网络上推导了构造性算法。两个模型层次之间的桥梁在最后选择记忆（B=1）情形下是精确的：此时微观模型等价于显著性模型，因此任何内部显著性向量均可通过适当的呈现策略实现。对于更大的记忆容量，我们开发了一条从LRU到TTL（生存时间）再到显著性的显式近似流程，并添加了基于压缩性的误差界，将代理映射误差转化为定点误差与福利误差。最后，我们定义了一种“回忆布雷斯悖论”（Recall Braess Paradox），即在不改变物理通行能力的情况下，改善记忆反而会增加均衡延误，并证明该悖论可在任何拥有至少两条不同s-t路径的双端点网络上出现。针对性实验支持了近似机制的有效性、受控设计预测的准确性以及约化模型层的计算优势。

Abstract

We study routing games in which travelers optimize over routes that are remembered or surfaced, rather than over a fixed exogenous action set. The paper develops a tractable design theory for endogenous recall and then connects it back to an explicit finite-memory micro model. At the micro level, each traveler carries a finite memory state, receives surfaced alternatives, chooses via a logit rule, and updates memory under a policy such as LRU. This yields a stationary Forgetful Wardrop Equilibrium (FWE); existence is proved under mild regularity, and uniqueness follows in a contraction regime for the reduced fixed-point map. The paper’s main design layer is a stationary salience model that summarizes persistent memory and interface effects as route-specific weights. Salience-weighted stochastic user equilibrium is the unique minimizer of a strictly convex potential, which yields a clean optimization and implementability theory. In this layer we characterize governed implementability under ratio budgets and affine tying constraints, and derive constructive algorithms on parallel and series-parallel networks. The bridge between layers is exact for last-choice memory (B=1): the micro model is then equivalent to the salience model, so any interior salience vector can be realized by an appropriate surfacing policy. For larger memories, we develop an explicit LRU-to-TTL-to-salience approximation pipeline and add contraction-based bounds that translate surrogate-map error into fixed-point and welfare error. Finally, we define a Recall Braess Paradox, in which improving recall increases equilibrium delay without changing physical capacity, and show that it can arise on every two-terminal network with at least two distinct s-t paths. Targeted experiments support the approximation regime, governed-design predictions, and the computational advantages of the reduced layer.

Keywords: routing games, endogenous information, finite memory, Forgetful Wardrop Equilibrium, salience model, Recall Braess Paradox, memory design, stationary equilibrium
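
Two ingredients of the micro model in the abstract, a logit choice rule and an LRU memory update, can be sketched as follows. The specific functional form (choice probability proportional to salience times `exp(-theta * cost)`), the parameter `theta`, and the route names are assumptions for illustration; the paper's exact definitions are not stated in the abstract.

```python
import math
from collections import OrderedDict


def salience_logit(costs: list[float], salience: list[float], theta: float = 1.0) -> list[float]:
    """Choice probabilities proportional to salience * exp(-theta * cost) (assumed form)."""
    weights = [w * math.exp(-theta * c) for c, w in zip(costs, salience)]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one route needs positive salience")
    return [w / total for w in weights]


def lru_update(memory: OrderedDict, route: str, capacity: int) -> None:
    """Record `route` as most recently used; evict the least recently used on overflow."""
    if route in memory:
        memory.move_to_end(route)
    else:
        memory[route] = True
        if len(memory) > capacity:
            memory.popitem(last=False)


if __name__ == "__main__":
    # Equal travel costs: probabilities then follow the salience weights alone.
    print(salience_logit([1.0, 1.0], [1.0, 3.0]))
    memory = OrderedDict()
    for chosen in ["A", "B", "A", "C"]:
        lru_update(memory, chosen, capacity=2)
    print(list(memory))
```

With `capacity=1` the memory holds only the last chosen route, which corresponds to the last-choice case (B=1) that the abstract singles out as exactly equivalent to the salience model.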

Authors: Ryan Faulkner, Anushka Deshpande, David Guzman Piedrahita, Joel Z. Leibo, Zhijing Jin Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11721v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
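
The pass mark of 26.6 in the score line follows the formula stated in the report's configuration header, 5.0 + 0.8 × total weight, with a total weight of 27.0. A minimal sketch of that check follows; the function names are hypothetical, and summing the per-keyword scores to obtain a paper's total is an assumption, since the report does not spell out the aggregation.

```python
def pass_mark(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Pass mark as given in the report header: base + slope * total weight."""
    return base + slope * total_weight


def evaluate(row_scores, total_weight):
    """Return (total score, passed), assuming the total is the sum of row scores."""
    total = sum(row_scores)
    return total, total >= pass_mark(total_weight)


# 27 keywords, each with a row score of 0.0 as in the table above.
total, passed = evaluate([0.0] * 27, total_weight=27.0)
```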

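Scoring rationale: The paper studies cooperation and leadership mechanisms among LLMs in multi-agent systems, so it is highly relevant (10 points each) to "Large Language Models", "LLM Agents", and "Multi-agent Systems", which are its experimental subjects and core framework. The remaining keywords, such as MoE, quantization, inference acceleration, and AI for Science, are not covered and receive 0.

!!! tip deepseek-chat TL;DR

Using multi-agent simulation, the study investigates whether elected leadership improves social welfare and cooperation in groups of LLMs. It reports that elected leadership raises social welfare scores by 55.4% and survival time by 128.6%.

Abstract (Chinese translation)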

治理公共资源需要智能体通过合作与自我治理发展持久策略，以避免集体失败。尽管基础模型已在此类场景中展现出合作潜力，但现有多智能体研究未能深入揭示结构化领导与选举机制是否能改善集体决策。缺乏这种人类社会中普遍存在的关键组织特征，构成了当前方法的显著缺陷。本研究旨在通过基于大语言模型的多智能体模拟，直接探究领导机制与选举制度能否促进社会福利与合作水平的提升。我们提出了一个开源框架，通过选举产生的角色人格与候选人驱动的议程来模拟领导过程，并在受控治理条件下对大语言模型展开实证研究。实验表明，在多种高性能大语言模型中，实行选举领导机制可使社会福利评分提升55.4%，系统存续时间延长128.6%。通过构建智能体社会关系图，我们计算中心性指标以评估领导者角色的社会影响力，并通过对领导者话语的情感分析，揭示其修辞策略与合作倾向。本研究为在多智能体系统中进一步探索选举机制、应对复杂社会困境奠定了理论基础。

Abstract

Governing common-pool resources requires agents to develop enduring strategies through cooperation and self-governance to avoid collective failure. While foundation models have shown potential for cooperation in these settings, existing multi-agent research provides little insight into whether structured leadership and election mechanisms can improve collective decision making. The lack of such a critical organizational feature ubiquitous in human society presents a significant shortcoming of the current methods. In this work we aim to directly address whether leadership and elections can support improved social welfare and cooperation through multi-agent simulation with LLMs. We present our open-source framework that simulates leadership through elected personas and candidate-driven agendas and carry out an empirical study of LLMs under controlled governance conditions. Our experiments demonstrate that having elected leadership improves social welfare scores by 55.4% and survival time by 128.6% across a range of high performing LLMs. Through the construction of an agent social graph we compute centrality metrics to assess the social influence of leader personas and also analyze rhetorical and cooperative tendencies revealed through a sentiment analysis on leader utterances. This work lays the foundation for further study of election mechanisms in multi-agent systems toward navigating complex social dilemmas.
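
The abstract does not say which centrality metrics are computed on the agent social graph. As one possible reading, the sketch below computes normalized degree centrality on a small undirected interaction graph. The choice of degree centrality, the edge list, and the agent names are all illustrative assumptions.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # Divide by the maximum possible number of neighbors, n - 1.
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


# Hypothetical interaction graph: who addressed whom during a simulated round.
edges = [("leader", "a1"), ("leader", "a2"), ("leader", "a3"), ("a1", "a2")]
centrality = degree_centrality(edges)
most_central = max(centrality, key=centrality.get)
```

In this toy graph the node named `leader` is connected to every other agent and therefore has the highest centrality.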

Keywords: LLM, multi-agent systems, cooperation, leadership, elections, social welfare, agent simulation, social dilemmas

48. ❌ On the Robustness of Watermarking for Autoregressive Image Generation

Authors: Andreas Müller, Denis Lukovnikov, Shingo Kodama, Minh Pham, Anubhav Jain, Jonathan Petit, Niv Cohen, Asja Fischer Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11720v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

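Scoring rationale: The paper studies the security of watermarking for autoregressive image generation models, including watermark removal and forgery attacks. All keywords focus on large language models (LLMs) and related techniques (training methods, inference optimization, applications, and so on), whereas this paper targets image generation models (specifically autoregressive image generators), which belong to computer vision and have no direct connection to LLM techniques. Although both fall under generative AI, the specific techniques, model architectures, and application scenarios are entirely different, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper finds that watermarking schemes for autoregressive image generation models have serious security weaknesses: an attacker with a single watermarked image can mount effective removal and forgery attacks, so the watermarks cannot be relied on for synthetic content detection or dataset filtering.

Abstract (Chinese translation)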

自回归图像生成器的扩散要求对其输出进行可靠的检测与溯源，以遏制错误信息传播，并从训练数据中过滤合成图像以防止模型崩溃。为应对这一需求，专为自回归模型设计的水印技术在生成时嵌入隐蔽信号，使其能通过相应的水印检测器进行下游验证。本研究深入分析了此类水印方案，并证明其易受水印移除与伪造攻击。我们评估了现有攻击方法，并进一步提出三种新型攻击：（一）基于向量量化的再生移除攻击，（二）基于对抗优化的攻击，以及（三）频率注入攻击。评估结果表明，仅凭单张带水印的参考图像且无需原始模型参数或水印密钥，移除与伪造攻击即可有效实施。我们的研究显示，当前针对自回归图像生成的水印方案无法为数据集过滤提供可靠的合成内容检测支持。此外，这些方案可能催生“水印模仿”行为——攻击者可篡改真实图像以模仿特定生成器的水印特征，从而触发误检测机制，阻碍真实图像被纳入后续模型训练。

Abstract

The proliferation of autoregressive (AR) image generators demands reliable detection and attribution of their outputs to mitigate misinformation, and to filter synthetic images from training data to prevent model collapse. To address this need, watermarking techniques, specifically designed for AR models, embed a subtle signal at generation time, enabling downstream verification through a corresponding watermark detector. In this work, we study these schemes and demonstrate their vulnerability to both watermark removal and forgery attacks. We assess existing attacks and further introduce three new attacks: (i) a vector-quantized regeneration removal attack, (ii) adversarial optimization-based attack, and (iii) a frequency injection attack. Our evaluation reveals that removal and forgery attacks can be effective with access to a single watermarked reference image and without access to original model parameters or watermarking secrets. Our findings indicate that existing watermarking schemes for AR image generation do not reliably support synthetic content detection for dataset filtering. Moreover, they enable Watermark Mimicry, whereby authentic images can be manipulated to imitate a generator’s watermark and trigger false detection to prevent their inclusion in future model training.

Keywords: autoregressive image generation, watermarking, watermark removal, watermark forgery, synthetic content detection, dataset filtering, adversarial attacks, model collapse

49. ❌ SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

Authors: Shuquan Lian, Juncheng Liu, Yazhe Chen, Yuhong Chen, Hui Li Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11716v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

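Scoring rationale: SWE-AGILE is an LLM agent framework for software engineering whose core contribution is balancing deep reasoning against context management. Highly relevant keywords: LLMs (it uses 7B-8B models), Chain of Thought (extended CoT reasoning), System 2 Thinking (explicit System-2 reasoning), and LLM Agents (a software agent framework). Small Language Models receives 8 points because the 7B-8B models used are relatively small LLMs. The remaining keywords have no direct connection to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes SWE-AGILE, a software agent framework whose Dynamic Reasoning Context strategy addresses the tension between deep reasoning and context explosion, setting a new standard for 7B-8B models on SWE-Bench-Verified.

Abstract (Chinese translation)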

先前具有代表性的ReAct式自主软件工程方法通常缺乏进行深度分析和处理复杂边缘情况所需的显式系统二推理能力。尽管近期推理模型展现了扩展思维链的潜力，但将其应用于多轮次软件工程任务时会产生一个根本性困境：保留完整推理历史会导致上下文爆炸和“中间迷失”性能衰减，而丢弃历史则会迫使智能体在每一步进行冗余的重复推理。为应对这些挑战，我们提出SWE-AGILE——一个旨在弥合推理深度、效率与上下文限制之间鸿沟的新型软件智能体框架。SWE-AGILE引入了动态推理上下文策略，通过维护详细推理的“滑动窗口”来保证即时连续性以避免冗余重复分析，同时将历史推理内容压缩为精炼的推理摘要。实证研究表明，SWE-AGILE仅使用2.2k条轨迹数据和896项任务，就在SWE-Bench-Verified基准上为7B-8B参数规模的模型设立了新标准。代码发布于https://github.com/KDEGroup/SWE-AGILE。

Abstract

Prior representative ReAct-style approaches in autonomous Software Engineering (SWE) typically lack the explicit System-2 reasoning required for deep analysis and handling complex edge cases. While recent reasoning models demonstrate the potential of extended Chain-of-Thought (CoT), applying them to the multi-turn SWE task creates a fundamental dilemma: retaining full reasoning history leads to context explosion and "Lost-in-the-Middle" degradation, while discarding it would force the agent to redundantly re-reason at every step. To address these challenges, we propose SWE-AGILE, a novel software agent framework designed to bridge the gap between reasoning depth, efficiency, and context constraints. SWE-AGILE introduces a Dynamic Reasoning Context strategy, maintaining a "sliding window" of detailed reasoning for immediate continuity to prevent redundant re-analyzing, while compressing historical reasoning content into concise Reasoning Digests. Empirically, SWE-AGILE sets a new standard for 7B-8B models on SWE-Bench-Verified using only 2.2k trajectories and 896 tasks. Code is available at https://github.com/KDEGroup/SWE-AGILE.
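
The sliding-window-plus-digest idea in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the SWE-AGILE implementation: the class `ReasoningContext`, its fields, and the `first_sentence` summarizer are hypothetical, and a real system would presumably produce digests with a model rather than by truncation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReasoningContext:
    """Keeps the latest reasoning steps in full and older ones as short digests."""
    window_size: int
    summarize: Callable[[str], str]
    digests: List[str] = field(default_factory=list)
    window: List[str] = field(default_factory=list)

    def add_step(self, reasoning: str) -> None:
        self.window.append(reasoning)
        # Steps that fall out of the window are compressed, not discarded.
        while len(self.window) > self.window_size:
            oldest = self.window.pop(0)
            self.digests.append(self.summarize(oldest))

    def render(self) -> str:
        parts = [f"[digest] {d}" for d in self.digests]
        parts += [f"[full] {r}" for r in self.window]
        return "\n".join(parts)


def first_sentence(text: str) -> str:
    """Placeholder summarizer: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


ctx = ReasoningContext(window_size=2, summarize=first_sentence)
for step in [
    "Read the traceback. It points at parser.py line 40.",
    "Opened parser.py. The off-by-one is in the loop bound.",
    "Patched the loop bound. Tests now pass locally.",
]:
    ctx.add_step(step)
```

After three steps with a window of two, the first step survives only as a digest while the last two remain in full, so the rendered context grows by roughly one digest per turn instead of one full reasoning trace.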

Keywords: Software Agent Framework, Dynamic Reasoning Context, Chain-of-Thought, System-2 Reasoning, Context Explosion, Lost-in-the-Middle, Reasoning Digests, SWE-Bench

50. ❌ A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment

Authors: Wanli Ma, Sivasakthy Selvakumaran, Dain G. Farrimond, Adam A. Dennis, Samuel E. Rigby Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11709v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

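Scoring rationale: The paper proposes a Mamba-based multimodal network for multiscale assessment of blast-induced structural damage. It is at heart a computer vision and remote sensing application that uses the Mamba architecture (a state space model) to process imagery and integrates physical characteristics of blast loading. All keywords relate directly to large language models (LLMs) or the underlying techniques of large models, and this paper does not touch on LLMs, prompt engineering, alignment, reasoning, agents, or efficiency optimization. The only point of contact is "AI for Science": the study applies AI to engineering and disaster management, which falls under AI for Science in a broad sense but is not a core bioinformatics or cheminformatics application, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The study proposes a Mamba-based multimodal network that fuses multi-scale blast-loading information with optical remote sensing imagery, significantly improving the accuracy and speed of post-blast structural damage assessment and outperforming existing methods on the 2020 Beirut explosion case.

Abstract (Chinese translation)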

准确快速的结构损伤评估（Structural Damage Assessment, SDA）对于灾后管理至关重要，有助于救援人员优先配置资源、规划救援行动并支持恢复工作。传统的现场勘查方法虽然精确，但受限于可及性、安全风险和时间约束，尤其是在大规模爆炸发生后。基于遥感的机器学习已成为实现快速SDA的可扩展解决方案，其中基于Mamba架构的网络已取得最先进的性能。然而，这些方法通常需要大量训练和大规模数据集，限制了其在实际场景中的应用。此外，现有方法未能有效结合爆炸荷载的关键物理特征进行SDA。为应对这些挑战，我们提出了一种基于Mamba的多模态网络，用于快速SDA，该网络将多尺度爆炸荷载信息与光学遥感图像相融合。在2020年贝鲁特爆炸事件数据集上的评估表明，我们的方法相较于现有最优方法显著提升了性能。代码公开于：https://github.com/IMPACTSquad/Blast-Mamba

Abstract

Accurate and rapid structural damage assessment (SDA) is crucial for post-disaster management, helping responders prioritise resources, plan rescues, and support recovery. Traditional field inspections, though precise, are limited by accessibility, safety risks, and time constraints, especially after large explosions. Machine learning with remote sensing has emerged as a scalable solution for rapid SDA, with Mamba-based networks achieving state-of-the-art performance. However, these methods often require extensive training and large datasets, limiting real-world applicability. Moreover, they fail to incorporate key physical characteristics of blast loading for SDA. To overcome these challenges, we propose a Mamba-based multimodal network for rapid SDA that integrates multi-scale blast-loading information with optical remote sensing images. Evaluated on the 2020 Beirut explosion, our method significantly improves performance over state-of-the-art approaches. Code is available at: https://github.com/IMPACTSquad/Blast-Mamba

Keywords: Mamba, multimodal network, structural damage assessment, blast loading, remote sensing, explosion, machine learning, Beirut explosion

51. ❌ Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning

Authors: Jieying Xue, Phuong Minh Nguyen, Ha Thanh Nguyen, May Myo Zin, Ken Satoh Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11699v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

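Scoring rationale: The paper uses LLMs for legal reasoning, improving the conversion of legal cases into logical formulas through retrieval-augmented generation (RAG) and in-context learning (ICL). It is therefore highly relevant (10 points each) to "Large Language Models", "Retrieval-Augmented Generation", and "In-context Learning". As an application of AI to the legal domain, it has some relevance to "AI for Science" (5 points), though it is not bioinformatics or cheminformatics. Other keywords such as MoE, SFT, and RLHF are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes Legal2LogicICL, an LLM-based retrieval-augmented generation framework that uses few-shot learning balancing diversity and similarity to significantly improve the accuracy, stability, and generalization of converting natural-language legal cases into logical formulas.

Abstract (Chinese translation)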

本研究旨在通过将自然语言处理（NLP）的最新进展与基于大型语言模型（LLMs）的法律领域自适应少样本学习技术相结合，提升基于逻辑的法律推理系统的泛化能力。现有的基于逻辑的法律推理流程通常依赖于微调模型，将自然语言描述的法律案例映射为逻辑公式，再将其输入符号推理器。然而，此类方法严重受限于高质量标注训练数据的稀缺性。为应对这一局限，我们提出了一种新颖的基于LLM的法律推理框架，该框架通过检索增强生成实现有效的上下文学习。具体而言，我们引入了Legal2LogicICL，这是一个少样本检索框架，在潜在语义表征层面和法律文本结构层面同时兼顾了示例的多样性与相似性。此外，我们的方法通过缓解法律文本中由实体引发的检索偏差，显式地考虑了法律结构特征——在法律文本中，冗长且高度具体的实体提及常常主导语义表征，并掩盖了具有法律意义的推理模式。我们的Legal2LogicICL能够构建信息丰富且鲁棒的少样本演示示例，从而在不需额外训练的情况下，实现准确且稳定的逻辑规则生成。同时，我们构建了一个名为Legal2Proleg的新数据集，该数据集标注了法律案例与PROLEG逻辑公式之间的对应关系，以支持法律语义解析的评估。在开源和专有LLMs上的实验结果表明，我们的方法在将自然语言法律案例描述转化为逻辑表征方面，显著提升了准确性、稳定性和泛化能力，凸显了其在实现可解释且可靠的法律推理方面的有效性。我们的代码发布于https://github.com/yingjie7/Legal2LogicICL。

Abstract

This work aims to improve the generalization of logic-based legal reasoning systems by integrating recent advances in NLP with legal-domain adaptive few-shot learning techniques using LLMs. Existing logic-based legal reasoning pipelines typically rely on fine-tuned models to map natural-language legal cases into logical formulas before forwarding them to a symbolic reasoner. However, such approaches are heavily constrained by the scarcity of high-quality annotated training data. To address this limitation, we propose a novel LLM-based legal reasoning framework that enables effective in-context learning through retrieval-augmented generation. Specifically, we introduce Legal2LogicICL, a few-shot retrieval framework that balances diversity and similarity of exemplars at both the latent semantic representation level and the legal text structure level. In addition, our method explicitly accounts for legal structure by mitigating entity-induced retrieval bias in legal texts, where lengthy and highly specific entity mentions often dominate semantic representations and obscure legally meaningful reasoning patterns. Our Legal2LogicICL constructs informative and robust few-shot demonstrations, leading to accurate and stable logical rule generation without requiring additional training. In addition, we construct a new dataset, named Legal2Proleg, which is annotated with alignments between legal cases and PROLEG logical formulas to support the evaluation of legal semantic parsing. Experimental results on both open-source and proprietary LLMs demonstrate that our approach significantly improves accuracy, stability, and generalization in transforming natural-language legal case descriptions into logical representations, highlighting its effectiveness for interpretable and reliable legal reasoning. Our code is available at https://github.com/yingjie7/Legal2LogicICL.
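
The abstract does not specify how exemplar diversity and similarity are balanced. As a stand-in, the sketch below uses greedy maximal marginal relevance (MMR), a common retrieval heuristic that trades similarity to the query against redundancy among the exemplars already chosen. MMR itself, the toy two-dimensional embeddings, and the function names are assumptions, not the Legal2LogicICL method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Greedy MMR: return indices of k exemplars.

    lam = 1.0 ranks purely by similarity to the query; lower values
    penalize candidates that resemble exemplars already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: candidates 0 and 1 are near-duplicates, candidate 2 differs.
query = [1.0, 0.2]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
balanced = mmr_select(query, candidates, k=2, lam=0.5)
similarity_only = mmr_select(query, candidates, k=2, lam=1.0)
```

With `lam=1.0` the two near-duplicate exemplars are returned, while `lam=0.5` swaps the second one for the more distinct candidate, which is the kind of trade-off a diversity-aware few-shot retriever has to make.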

Keywords: Legal reasoning, Large Language Models, Retrieval-augmented generation, In-context learning, Few-shot learning, Logical formulas, Legal semantic parsing, Generalization

52. ❌ AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

Authors: Mingyang Li, Haofan Xu, Haowen Sun, Xinzhe Chen, Sihua Ren, Liqi Huang, Xinyang Sui, Chenyang Miao, Qiongjie Cui, Zeyang Liu, Xingyu Chen, Xuguang Lan Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11674v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

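Scoring rationale: The paper focuses on robotic manipulation and proposes AffordSim, a simulation data generation framework that combines a 3D vision-language model (VLM) with 3D Gaussian reconstruction to generate manipulation trajectories that are aware of object affordance regions. Its core techniques are the VoxAfford model (an open-vocabulary 3D affordance detector) and DA3-based 3D Gaussian reconstruction, placing it at the intersection of computer vision, robotics, and simulation. Although the paper mentions "VLM-powered task generation", the VLM (Vision-Language Model) there is used for task generation and is not the core LLM technology under study. The paper is unrelated to the great majority of keywords, which concern LLM architecture, training, inference, alignment, agents, and similar core techniques. The only weak link is "AI for Science OR Bioinformatics OR Cheminformatics", since robotics can be loosely regarded as an application of AI to engineering or science. The paper is not typical AI for Science (such as bioinformatics or cheminformatics), so it receives only 5 points (some relevance).

!!! tip deepseek-chat TL;DR

To address the lack of object affordance (functional region) information in existing simulation data generation platforms for robotic manipulation, the paper proposes the AffordSim framework, which integrates open-vocabulary 3D affordance prediction to generate semantically correct manipulation trajectories. Its benchmark shows that current imitation learning methods still struggle on tasks requiring precise affordance interaction, such as pouring and hanging.

Abstract (Chinese translation)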

基于仿真的数据生成已成为训练机器人操作策略的主流范式，但现有平台未将物体可供性信息纳入轨迹生成过程。因此，需要与特定功能区域进行精确交互的任务——如抓握杯柄、从杯缘倾倒或将杯子挂到挂钩上——无法自动生成语义正确的轨迹。本文提出AffordSim，这是首个将开放词汇3D可供性预测集成到操作数据生成流程的仿真框架。AffordSim采用我们提出的VoxAfford模型（一种开放词汇3D可供性检测器，通过多尺度几何特征增强多模态大语言模型输出标记），在物体点云上预测可供性图谱，从而将抓握姿态估计引导至任务相关的功能区域。该框架基于NVIDIA Isaac Sim构建，具备跨 embodiment 支持（Franka FR3、Panda、UR5e、Kinova）、基于视觉语言模型的任务生成能力，以及通过基于DA3的真实照片3D高斯重建实现的新型域随机化技术，能够自动化、可扩展地生成可供性感知的操作数据。我们建立了涵盖7个类别（抓握、放置、堆叠、推/拉、倾倒、挂杯、长时程复合任务）的50项任务基准，并评估了4种模仿学习基线方法（行为克隆、扩散策略、ACT、Pi 0.5）。实验结果表明：虽然抓握任务已基本解决（成功率53-93%），但对可供性敏感的任务——如向窄口容器倾倒（成功率1-43%）和挂杯任务（成功率0-47%）——对当前模仿学习方法仍极具挑战，这凸显了可供性感知数据生成的必要性。在真实Franka FR3机器人上进行的零样本仿真到现实实验验证了生成数据的可迁移性。

Abstract

Simulation-based data generation has become a dominant paradigm for training robotic manipulation policies, yet existing platforms do not incorporate object affordance information into trajectory generation. As a result, tasks requiring precise interaction with specific functional regions–grasping a mug by its handle, pouring from a cup’s rim, or hanging a mug on a hook–cannot be automatically generated with semantically correct trajectories. We introduce AffordSim, the first simulation framework that integrates open-vocabulary 3D affordance prediction into the manipulation data generation pipeline. AffordSim uses our VoxAfford model, an open-vocabulary 3D affordance detector that enhances MLLM output tokens with multi-scale geometric features, to predict affordance maps on object point clouds, guiding grasp pose estimation toward task-relevant functional regions. Built on NVIDIA Isaac Sim with cross-embodiment support (Franka FR3, Panda, UR5e, Kinova), VLM-powered task generation, and novel domain randomization using DA3-based 3D Gaussian reconstruction from real photographs, AffordSim enables automated, scalable generation of affordance-aware manipulation data. We establish a benchmark of 50 tasks across 7 categories (grasping, placing, stacking, pushing/pulling, pouring, mug hanging, long-horizon composite) and evaluate 4 imitation learning baselines (BC, Diffusion Policy, ACT, Pi 0.5). Our results reveal that while grasping is largely solved (53-93% success), affordance-demanding tasks such as pouring into narrow containers (1-43%) and mug hanging (0-47%) remain significantly more challenging for current imitation learning methods, highlighting the need for affordance-aware data generation. Zero-shot sim-to-real experiments on a real Franka FR3 validate the transferability of the generated data.

Keywords: Robotic Manipulation, Simulation Data Generation, Affordance Awareness, 3D Affordance Prediction, Open-vocabulary Detection, Imitation Learning Benchmark, Sim-to-real Transfer, VLM-powered Task Generation

53. ❌ NetworkNet: A Deep Neural Network Approach for Random Networks with Sparse Nodal Attributes and Complex Nodal Heterogeneity

Authors: Zhaoyu Xing, Xiufan Yu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11673v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

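Scoring rationale: NetworkNet addresses a statistical modeling problem in network science, proposing a deep neural network approach for random networks with high-dimensional nodal attributes. Although the paper uses deep neural networks (DNNs), its content is unrelated to every scoring keyword, all of which concern large-model/LLM techniques, training methods, inference optimization, alignment, and applications. Its core topics are network modeling, nodal heterogeneity estimation, and attribute selection, which combine traditional network analysis with deep learning rather than large-model research.

!!! tip deepseek-chat TL;DR

The paper proposes NetworkNet, a deep neural network approach that models complex nodal heterogeneity in random networks with high-dimensional nodal attributes while performing data-driven attribute selection, and it performs well in both simulations and a real author-citation network application.

Abstract (Chinese translation)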

具有丰富节点信息的异质网络数据在多学科研究中日益普遍，然而如何准确建模复杂的节点异质性并同时筛选具有影响力的节点属性，仍是一个悬而未决的挑战。当节点异质性与高维个体特征共同深刻影响网络形成时，这一问题成为经济学和社会学众多应用的核心。本文提出一种基于统计理论的统一深度神经网络方法，用于建模具有高维节点属性的随机网络中的节点异质性，即“NetworkNet”。NetworkNet的核心创新在于其定制的神经架构：该架构显式参数化了属性驱动的异质性，同时嵌入可扩展的属性筛选机制。该方法能够一致估计两类潜在异质性函数（即节点扩张性与受欢迎度），并同步执行数据驱动的属性筛选以提取关键节点属性。通过将经典统计网络建模与深度学习相融合，NetworkNet在保持深度神经网络强大表达能力的同时，兼具方法可解释性、算法可扩展性及统计严谨性，并提供了非渐近近似误差界。模拟实验表明，该方法在异质性估计与高维属性筛选中均表现出优越性能。我们进一步将NetworkNet应用于统计学领域的大规模作者引用网络，揭示了研究领域动态演变与学术影响力的新洞见。

Abstract

Heterogeneous network data with rich nodal information become increasingly prevalent across multidisciplinary research, yet accurately modeling complex nodal heterogeneity and simultaneously selecting influential nodal attributes remains an open challenge. This problem is central to many applications in economics and sociology, when both nodal heterogeneity and high-dimensional individual characteristics highly affect network formation. We propose a statistically grounded, unified deep neural network approach for modeling nodal heterogeneity in random networks with high-dimensional nodal attributes, namely "NetworkNet". A key innovation of NetworkNet lies in a tailored neural architecture that explicitly parameterizes attribute-driven heterogeneity, and at the same time, embeds a scalable attribute selection mechanism. NetworkNet consistently estimates two types of latent heterogeneity functions, i.e., nodal expansiveness and popularity, while simultaneously performing data-driven attribute selection to extract influential nodal attributes. By unifying classical statistical network modeling with deep learning, NetworkNet delivers the expressive power of DNNs with methodological interpretability, algorithmic scalability, and statistical rigor with a non-asymptotic approximation error bound. Empirically, simulations demonstrate strong performance in both heterogeneity estimation and high-dimensional attribute selection. We further apply NetworkNet to a large-scale author-citation network among statisticians, revealing new insights into the dynamic evolution of research fields and scholarly impact.

Keywords: heterogeneous network, nodal heterogeneity, deep neural network, attribute selection, random networks, statistical network modeling, node expansiveness, node popularity

54. ❌ Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Authors: Hanqi Xiao, Vaidehi Patil, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11666v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

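Scoring rationale: The paper studies the theory-of-mind (ToM) abilities of large language models (LLMs) in adversarial dialogue, using reinforcement learning to train AI Double Agents that fool an attacker. Highly relevant keywords are LLMs (the paper explicitly studies LLMs), RLHF/RLAIF/DPO (training uses reinforcement learning), LLM Agents (it trains AI Double Agents), and Multi-agent Systems (the defender and attacker interact as multiple agents). Other keywords such as MoE, SLMs, Scaling Laws, and PEFT are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a theory-of-mind challenge, ToM-SB, that studies how LLMs acting as Double Agents can be trained with reinforcement learning to fool attackers who hold partial prior knowledge. Experiments find a bidirectionally emergent relationship between ToM and fooling ability, and rewards that combine the two yield the strongest performance.

Abstract (Chinese translation)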

随着大语言模型（LLM）成为对话系统背后的引擎，它们对对话伙伴意图和状态进行推理的能力（即形成并运用心理理论，Theory of Mind，简称ToM）对于与潜在对抗性伙伴进行安全交互变得日益关键。我们提出了一项新颖的以隐私为主题的心理理论挑战——用于引导信念的心理理论（ToM-SB）。在此挑战中，防御者必须扮演双重间谍的角色，在一个共享的语境中，引导一位具备部分先验知识的攻击者的信念。要在ToM-SB上取得成功，防御者必须与攻击者互动并形成对其的心理理论，其目标是诱使攻击者相信其已成功提取敏感信息。我们发现，像Gemini3-Pro和GPT-5.4这样的前沿强大模型在ToM-SB上表现不佳，即使在提示其对攻击者信念进行推理（ToM提示）的情况下，也常常难以在攻击者具备部分先验知识的困难场景中成功欺骗攻击者。为弥补这一差距，我们使用强化学习在ToM-SB上训练模型，使其扮演AI双重间谍，并测试了欺骗奖励和ToM奖励。值得注意的是，我们发现ToM与欺骗攻击者之间存在双向涌现关系：仅奖励欺骗成功即可提升ToM能力，而仅奖励ToM也能提升欺骗效果。通过对四种不同强度的攻击者、六种防御者方法，以及在分布内和分布外（OOD）评估中的测试，我们发现ToM能力的提升与欺骗攻击者的成功率高度相关，这凸显了信念建模是ToM-SB成功的关键驱动力。结合了ToM奖励和欺骗奖励的AI双重间谍，在困难场景中取得了最强的欺骗效果和ToM性能，超越了采用ToM提示的Gemini3-Pro和GPT-5.4。我们还展示了ToM-SB和AI双重间谍可以扩展到更强的攻击者，证明了其向OOD设置的泛化能力以及我们任务的可升级性。

Abstract

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker’s beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.

Keywords: Large Language Models, Theory of Mind, Double Agent, Reinforcement Learning, Adversarial Dialogue, Belief Steering, Multi-agent Systems, Privacy

55. ❌ Beyond LLMs, Sparse Distributed Memory, and Neuromorphics <A Hyper-Dimensional SRAM-CAM “VaCoAl” for Ultra-High Speed, Ultra-Low Power, and Low Cost>

Authors: Hiroyuki Chuma, Kanji Otsuka, Yoichi Sato Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11665v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	3.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	3.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	3.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

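Scoring rationale: The paper proposes VaCoAl, a new AI architecture based on hyperdimensional computing (HDC) and Sparse Distributed Memory that focuses on reversible multi-hop reasoning and compositional generalization. Relevance by keyword: (1) "Sparse Models" is highly relevant (8) because the core builds on Sparse Distributed Memory; (2) the reasoning keywords (Chain of Thought, System 2 Thinking) each score 8 because the paper centers on multi-generation reasoning (57 generations) and deep reasoning; (3) "Large Language Models" scores 5 because the paper mentions complementing LLMs, but they are not its core; (4) "Explainable AI" scores 5 for the transparent reliability metric; (5) "Model Compression" and "Inference Acceleration" each score 3 for the mentions of low-load deployment and a high-speed architecture; (6) "Small Language Models" scores 3 for low-power deployment; (7) the remaining keywords (training methods, alignment, RAG, and so on) score 0 because the paper does not cover them.

!!! tip deepseek-chat TL;DR

The paper proposes VaCoAl, a new AI architecture based on hyperdimensional computing and Sparse Distributed Memory. It addresses catastrophic forgetting, learning stagnation, and the Binding Problem in modern AI, achieves compositional generalization through reversible multi-hop reasoning, and validates its reasoning ability on about 470k mentor-student relations from Wikidata.

Abstract (Chinese translation)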

本文报告了一项意外发现：在基于伽罗瓦域代数的确定性超维计算架构中，涌现出一种路径依赖的语义选择机制，该机制等效于脉冲时序依赖可塑性，其强度可通过与大规模测量结果匹配的闭式表达式进行先验预测。这一发现从代数层面解决了现代人工智能的若干局限，包括灾难性遗忘、学习停滞和绑定问题。我们提出了模糊重合算法及其Python实现PyVaCoAl，将超高维记忆与确定性逻辑相结合。该方法植根于稀疏分布式记忆，通过伽罗瓦域扩散解决了高维二元空间中的正交化与检索问题，实现了低负载部署。VaCoAl是一种以记忆为中心的架构，优先考虑检索与关联，支持可逆组合的同时保持元素独立性，并通过透明可靠性度量指标实现组合泛化。我们在Wikidata约47万组师生关系数据上评估了多跳推理能力，追溯了最多57代传承关系。利用基于CR评分的去噪技术和超维计算绑定/解绑操作，我们量化了概念在有向无环图中的传播。研究结果重新诠释了牛顿-莱布尼茨争议，揭示了从稀疏收敛到后莱布尼茨"超级高速公路"的相变现象，其中涌现的结构性指标支持库恩范式转移理论。碰撞容忍机制进一步诱导出基于路径的剪枝过程，优先选择直接路径，从而产生等效于STDP的涌现语义选择机制。VaCoAl由此定义了第三种范式——超维计算人工智能，通过可逆多跳推理能力与大型语言模型形成互补。

Abstract

This paper reports an unexpected finding: in a deterministic hyperdimensional computing (HDC) architecture based on Galois-field algebra, a path-dependent semantic selection mechanism emerges, equivalent to spike-timing-dependent plasticity (STDP), with magnitude predictable a priori by a closed-form expression matching large-scale measurements. This addresses limitations of modern AI including catastrophic forgetting, learning stagnation, and the Binding Problem at an algebraic level. We propose VaCoAl (Vague Coincident Algorithm) and its Python implementation PyVaCoAl, combining ultra-high-dimensional memory with deterministic logic. Rooted in Sparse Distributed Memory, it resolves orthogonalisation and retrieval in high-dimensional binary spaces via Galois-field diffusion, enabling low-load deployment. VaCoAl is a memory-centric architecture prioritising retrieval and association, enabling reversible composition while preserving element independence and supporting compositional generalisation with a transparent reliability metric (CR score). We evaluated multi-hop reasoning on about 470k mentor-student relations from Wikidata, tracing up to 57 generations (over 25.5M paths). Using HDC bundling and unbinding with CR-based denoising, we quantify concept propagation over DAGs. Results show a reinterpretation of the Newton-Leibniz dispute and a phase transition from sparse convergence to a post-Leibniz “superhighway”, from which structural indicators emerge supporting a Kuhnian paradigm shift. Collision-tolerance mechanisms further induce path-based pruning that favors direct paths, yielding emergent semantic selection equivalent to STDP. VaCoAl thus defines a third paradigm, HDC-AI, complementing LLMs with reversible multi-hop reasoning.

Keywords: Hyperdimensional Computing, Sparse Distributed Memory, Multi-hop Reasoning, Compositional Generalization, Galois-field Algebra, VaCoAl, HDC-AI, Reversible Reasoning
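
The abstract above refers to HDC unbinding and reversible composition. As a rough illustration only, the sketch below uses the common binary-hypervector convention in which binding is elementwise XOR, so binding with the same key a second time undoes it. This is not PyVaCoAl's API (the abstract does not describe it) and it does not model the Galois-field diffusion; the dimension, seed, and all function names are assumptions.

```python
import random

DIM = 10_000              # assumed hypervector dimensionality
rng = random.Random(0)    # fixed seed so the sketch is reproducible


def random_hv() -> int:
    """Draw a random binary hypervector, stored as a Python integer bit field."""
    return rng.getrandbits(DIM)


def bind(a: int, b: int) -> int:
    """Bind two hypervectors with XOR; XOR is its own inverse."""
    return a ^ b


def similarity(a: int, b: int) -> float:
    """Return 1 minus the normalized Hamming distance between two hypervectors."""
    return 1.0 - bin(a ^ b).count("1") / DIM


mentor, student = random_hv(), random_hv()
pair = bind(mentor, student)      # encode a hypothetical mentor-student relation
recovered = bind(pair, mentor)    # unbinding with the mentor key yields the student
```

Because XOR binding is exactly invertible, `recovered` equals `student`, while two independently drawn hypervectors have similarity close to 0.5. That near-orthogonality of random vectors is what lets many bound pairs coexist in one high-dimensional space.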

56. ❌ Why Do Large Language Models Generate Harmful Content?

Authors: Rajesh Ganguli, Raha Moraffah Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11663v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

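Scoring rationale: The paper investigates why LLMs generate harmful content, using causal mediation analysis at multiple granularities, so it is highly relevant to "Large Language Models" (10). The work touches on model alignment and safety, giving some relevance to "Instruction Tuning OR Alignment OR Value Alignment" (5). It aims to reduce harmful generation, which relates to "Hallucination Mitigation OR Factuality OR Truthfulness" (8). It analyzes layers, modules, and neurons to understand the mechanism, so it is highly relevant to "Mechanistic Interpretability OR Explainable AI" (10). Other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, Context Window, reasoning methods, agents, compression, acceleration, and science applications, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the root causes of harmful content generation in LLMs. Using causal mediation analysis, it finds that harmful generation arises mainly in the later layers of the model, stems from failures in MLP blocks, and is associated with neurons that act as a gating mechanism for harmful generation.

Abstract (Chinese translation)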

大型语言模型（LLM）已被证实会生成有害内容。然而，此类行为背后的根本原因仍未得到充分探究。我们提出了一种基于因果中介分析的方法，以识别导致有害内容生成的原因因素。我们的方法在模型层级、模块（MLP和注意力模块）以及单个神经元层面进行了多粒度分析。在先进大型语言模型上进行的大量实验表明，有害内容的生成主要出现在模型的深层，其根源更多在于MLP模块而非注意力模块的失效，并且与一组充当有害内容生成门控机制的神经元相关。研究结果表明，模型的浅层用于对提示中的有害性进行上下文理解，随后这一理解在模型中传播，导致深层生成有害内容，同时通过MLP模块传递有害性信号。该信号进一步传播至模型的最后一层，尤其是一组稀疏的神经元，这些神经元接收信号并据此决定是否生成有害内容。

Abstract

Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain under explored. We propose a causal mediation analysis-based approach to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate that harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The results indicate that the early layers in the model are used for a contextual understanding of harmfulness in a prompt, which is then propagated through the model, to generate harmfulness in the late layers, as well as a signal indicating harmfulness through MLP blocks. This is then further propagated to the last layer of the model, specifically to a sparse set of neurons, which receives the signal and determines the generation of harmful content accordingly.

Keywords: Large Language Models, harmful content generation, causal mediation analysis, model layers, MLP blocks, attention blocks, neurons, gating mechanism

57. ❌ Towards Autonomous Mechanistic Reasoning in Virtual Cells

Authors: Yunhui Jang, Lu Zhu, Jake Fawkes, Alisandra Kaye Denton, Dominique Beaini, Emmanuel Noutahi Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11661v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

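Scoring rationale: The paper studies LLMs applied to virtual cells in biology and proposes VCR-Agent, a multi-agent framework. It involves retrieval-augmented generation (RAG), multi-step reasoning (CoT), in-depth reasoning (System 2), multi-agent systems, explainable AI (mechanistic explanations), and AI for Science (bioinformatics), and is highly relevant to these keywords (8-10). The paper mentions that training improves a downstream task, giving some relevance to supervised fine-tuning (SFT) (5). Other keywords such as MoE, quantization, and world models are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper targets the lack of factually grounded, actionable explanations when LLMs are applied to virtual cells in biology. It proposes an autonomous mechanistic reasoning approach built on a multi-agent framework with verification, and shows experimentally that it improves factual precision and downstream gene expression prediction.

Abstract (Chinese translation)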

大型语言模型（LLMs）作为加速科学发现的一种前景广阔的方法，近期获得了广泛关注。然而，其在生物学等开放性科学领域的应用仍然有限，主要原因是缺乏基于事实且可操作的机制解释。为解决这一问题，我们为虚拟细胞引入了一种结构化的解释形式化方法，将生物学推理表示为机制作用图，从而支持系统性的验证与证伪。在此基础上，我们提出了VCR-Agent——一个多智能体框架，该框架将基于生物学知识的检索与基于验证器的过滤方法相结合，以自主生成并验证机制推理。利用此框架，我们发布了VC-TRACES数据集，该数据集包含从Tahoe-100M图谱中提取并经过验证的机制解释。实证研究表明，使用这些解释进行训练能提高事实准确性，并为下游基因表达预测任务提供更有效的监督信号。这些结果凸显了通过多智能体协同与严格验证实现可靠机制推理对于虚拟细胞研究的重要性。

Abstract

Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent and rigorous verification.

Keywords: Large Language Models, Mechanistic Reasoning, Virtual Cells, Multi-agent Framework, Knowledge Retrieval, Verification, Bioinformatics, Gene Expression Prediction

58. ❌ RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

Authors: Riccardo Rosati, Edoardo Colucci, Massimiliano Bolognini, Adriano Mancini, Paolo Sernani Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11655v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

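Scoring rationale: The paper's core is RPA-Check, a framework for evaluating LLM-based Role-Playing Agents (RPAs), so it is highly relevant to LLM Agents and to Chain of Thought, which is used for verification scoring (10). The paper explicitly mentions quantized local models and smaller, adequately instruction-tuned models, so it relates to Quantization, Small Language Models, and Instruction Tuning (8). It evaluates reasoning depth, which relates to System 2 Thinking (8). Other keywords such as MoE, Scaling Laws, and Pre-training are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes RPA-Check, a multi-stage automated framework for evaluating dynamic LLM-based role-playing agents, and finds that smaller instruction-tuned models can behave more consistently than larger models in certain scenarios.

Abstract (Chinese translation)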

大型语言模型（LLM）在交互系统中的快速应用催生了动态、开放式的角色扮演智能体（Role-Playing Agents, RPAs）。然而，评估此类智能体仍面临重大挑战，因为标准自然语言处理（NLP）指标难以捕捉角色遵循度、逻辑一致性与长期叙事稳定性等细微特征。本文提出RPA-Check——一个多阶段自动化评估框架，旨在客观评估基于LLM的RPAs在复杂、强约束环境中的表现。我们的方法基于四步流程：（1）维度定义：建立高层次定性行为准则；（2）指标扩展：将这些要求细化为颗粒化的布尔型检查清单指标；（3）语义过滤：确保指标客观性、无冗余且保持智能体独立性；（4）LLM即评委评估：采用思维链验证机制对智能体保真度进行评分。我们通过将该框架应用于“LLM法庭”（一个采用多个量化本地模型的法证训练严肃游戏）进行验证。在五种不同法律场景中的实验结果表明，该框架能够识别模型规模、推理深度与运行稳定性之间的微妙权衡。值得注意的是，研究发现参数量级与程序一致性存在反向关联：经过充分指令微调的小规模模型（80-90亿参数）在表现上可能优于易受用户对齐偏差或谄媚倾向影响的大型架构。因此，RPA-Check为专业领域生成式智能体评估的未来研究提供了标准化、可复现的度量标准。

Abstract

The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraints-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, no redundancy and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework’s ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.

Keywords: Large Language Models, Role-Playing Agents, Automated Evaluation Framework, Chain-of-Thought, Quantized Models, Instruction Tuning, Reasoning Depth, Model Size
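
Steps (2) and (4) of the pipeline described in the abstract turn qualitative dimensions into boolean checklist indicators that a judge model then verifies. The abstract does not state how verdicts are aggregated, so the sketch below simply assumes a per-dimension pass rate. The function name, the tuple layout, and the example indicators are all hypothetical.

```python
from collections import defaultdict


def fidelity_by_dimension(verdicts):
    """Aggregate boolean checklist verdicts into a pass rate per dimension.

    verdicts: iterable of (dimension, indicator, passed) tuples, where
    `passed` is the judge's yes/no answer for that indicator.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for dimension, _indicator, ok in verdicts:
        total[dimension] += 1
        passed[dimension] += bool(ok)
    return {dimension: passed[dimension] / total[dimension] for dimension in total}


# Hypothetical judge verdicts for one role-playing agent.
example = [
    ("role adherence", "stays in the judge persona", True),
    ("role adherence", "never addresses the user out of character", False),
    ("logical consistency", "does not contradict earlier rulings", True),
]
scores = fidelity_by_dimension(example)
```

Keeping each indicator boolean makes every judgment easy to audit, and the aggregation rule can be swapped (for example, weighted by indicator) without touching the checklist itself.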

59. ❌ CodeTracer: Towards Traceable Agent States

Authors: Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, Jiaheng Liu Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11641v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

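Scoring rationale: The paper studies debugging and tracing of code agents. It is highly relevant to LLM Agents and Tool Use (10), because code agents are typically built on LLMs and rely on tool calls. It has some relevance to Chain of Thought, System 2 Thinking, Self-Correction, Multi-agent Systems, and Explainable AI (5), because it deals with agent state transitions, error-propagation analysis, and interpretability. It is indirectly related to Large Language Models (5), because code agents may use LLMs. Other keywords such as MoE, quantization, and RAG are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

Addressing the difficulty of observing state transitions and error propagation when code agents run complex tasks, the paper proposes CodeTracer, a tracing architecture that reconstructs the state transition history and localizes where failures originate. Experiments show it substantially outperforms baselines and can recover failed runs.

Abstract (Chinese translation)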

代码智能体正快速发展，但其调试难度日益增加。由于框架在复杂任务中协调并行工具调用与多阶段工作流，导致智能体的状态转换与错误传播难以观测。在这些运行过程中，早期的微小失误可能使智能体陷入无效循环，甚至引发根本性错误，形成难以察觉的错误链条，使得开发者难以判断智能体何时偏离正轨及其原因。现有的智能体追踪分析方法要么局限于简单交互，要么依赖小规模人工检查，这限制了其在真实编码工作流中的可扩展性与实用性。我们提出CodeTracer——一种追踪架构，它通过动态演进的提取器解析异构运行产物，将完整的状态转换历史重建为具有持久化记忆的层次化追踪树，并执行故障起始定位以精确定位故障源头及其下游传播链。为支持系统性评估，我们从四大主流代码智能体框架在多样化代码任务（如缺陷修复、代码重构和终端交互）中生成的大规模执行轨迹中构建了CodeTraceBench，并在阶段和步骤级别提供故障定位监督。实验表明，CodeTracer显著优于直接提示法与轻量级基线方法，且在其诊断信号指导下重放运行能在相同资源条件下持续恢复原本失败的任务。我们的代码与数据已公开。

Abstract

Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, making the agent’s state transitions and error propagation hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interaction or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.

Keywords: code agents, agent tracing, state transition, error propagation, failure localization, trace tree, debugging, tool calls
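
The abstract describes a hierarchical trace tree and failure onset localization. The sketch below shows only the tree-walking idea: given a stage/step tree whose leaves already carry a success flag, it returns the path to the earliest failing step. How CodeTracer actually decides that a step failed is not described in the abstract, so the `ok` flag, the `TraceNode` class, and the example run are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TraceNode:
    """One stage or step in a run; leaves are individual steps."""
    name: str
    ok: bool = True  # assumed to be supplied by some upstream check
    children: List["TraceNode"] = field(default_factory=list)


def failure_onset(node: TraceNode, path: Tuple[str, ...] = ()) -> Optional[Tuple[str, ...]]:
    """Return the path to the first failing leaf in execution order, or None."""
    here = path + (node.name,)
    if not node.children:
        return None if node.ok else here
    for child in node.children:
        found = failure_onset(child, here)
        if found is not None:
            return found
    return None


# Hypothetical run: the patch step fails first, and the test step fails downstream.
run = TraceNode("run", children=[
    TraceNode("plan", children=[TraceNode("read issue")]),
    TraceNode("edit", children=[
        TraceNode("apply patch", ok=False),
        TraceNode("run tests", ok=False),
    ]),
])
onset = failure_onset(run)
```

Here `onset` is `("run", "edit", "apply patch")`: the walk stops at the earliest failing step, so the later `run tests` failure is treated as downstream of it.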

60. ❌ RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Authors: Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11626v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

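Scoring rationale: The paper studies reward models for visual generation and proposes the RationalRewards model and the PARROT framework. Its core contribution is extending reward models from a single score to multi-dimensional, interpretable rationales. Relevance by keyword: (1) some relevance to RL-based alignment techniques such as RLHF/DPO (5), because the paper uses rationales as RL rewards; (2) high relevance to Chain of Thought, System 2 Thinking, Self-Correction, and Explainable AI (8), because the paper emphasizes structured reasoning, multi-step critique, a self-improvement loop, and interpretability; (3) the other keywords mainly concern LLM-specific techniques or particular application areas and are unrelated to this paper's visual-generation focus (0).

!!! tip deepseek-chat TL;DR

The paper addresses reward models for visual generation that give only a single score with no explanation. It proposes RationalRewards, which produces multi-dimensional rationales to improve generators: fine-grained RL rewards at training time, and a Generate-Critique-Refine loop at test time that improves output quality without parameter updates.

Abstract (Chinese translation)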

当前大多数视觉生成奖励模型将丰富的人类判断简化为单一的未解释分数，丢弃了偏好背后的推理过程。我们证明，教导奖励模型在评分前生成明确的多维度评述，能将其从被动评估工具转化为主动优化工具，通过两种互补方式改进生成器：在训练阶段，结构化原理为强化学习提供可解释的细粒度奖励；在测试阶段，“生成-评述-优化”循环将评述转化为针对性提示词修订，无需参数更新即可提升输出质量。为在不依赖昂贵原理标注的情况下训练此类奖励模型，我们提出偏好锚定合理化（PARROT）框架，该框架通过锚定生成、一致性过滤和蒸馏技术，从易得的偏好数据中还原高质量原理。由此得到的模型RationalRewards（8B）在开源奖励模型中实现了最优的偏好预测性能，与Gemini-2.5-Pro相当，而训练数据量比同类基线少10-20倍。作为强化学习奖励，它在文本到图像和图像编辑生成任务中持续优于标量奖励模型。最显著的是，其测试阶段的评述优化循环在多个基准测试中达到甚至超越基于强化学习的微调效果，这表明结构化推理能够激发现有生成器中因次优提示而未能释放的潜在能力。

Abstract

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.

Keywords: reward models, visual generation, rationales, reinforcement learning, Preference-Anchored Rationalization, critique-and-refine, interpretable rewards, text-to-image
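
The test-time Generate-Critique-Refine loop from the abstract can be sketched as a plain control loop. In the paper the critique comes from the reward model and the revision rewrites the prompt; here all three callables are placeholders, and the round limit, the stopping threshold, and the toy stand-ins are assumptions made for illustration.

```python
def generate_critique_refine(prompt, generate, critique, revise,
                             max_rounds=3, target=0.9):
    """Iteratively revise the prompt using critiques; no model parameters change.

    generate(prompt) -> output
    critique(prompt, output) -> (score, rationale)
    revise(prompt, rationale) -> new prompt
    """
    best_output, best_score = None, float("-inf")
    for _ in range(max_rounds):
        output = generate(prompt)
        score, rationale = critique(prompt, output)
        if score > best_score:
            best_output, best_score = output, score
        if score >= target:
            break
        prompt = revise(prompt, rationale)  # fold the critique back into the prompt
    return best_output, best_score


# Toy stand-ins so the loop can run without any model.
def toy_generate(prompt):
    return f"image<{prompt}>"


def toy_critique(prompt, output):
    score = 1.0 if "lighting" in output else 0.5
    return score, "specify the lighting"


def toy_revise(prompt, rationale):
    return prompt + ", soft lighting"


result = generate_critique_refine("a cat", toy_generate, toy_critique, toy_revise)
```

With these stand-ins the first round scores 0.5, the prompt is revised once, and the second round reaches the threshold. The loop keeps the best-scoring output seen so far, so a revision that makes things worse does not discard an earlier result.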

61. ❌ SCNO: Spiking Compositional Neural Operator – Towards a Neuromorphic Foundation Model for Nuclear PDE Solving

Authors: Samrendra Roy, Souvik Chakraborty, Rizwan-uddin, Syed Bahauddin Alam Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11625v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

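Scoring rationale: The paper develops a neural operator for solving partial differential equations (PDEs), in particular the neutron diffusion equation from nuclear engineering. Although it applies AI to science, its core topics are neural operators, spiking neural networks, and modular architectures, which are unrelated to most of the keywords (LLMs, MoE, alignment, reasoning, and so on). It relates only to 'AI for Science OR Bioinformatics OR Cheminformatics', because it applies AI to the scientific problem of nuclear PDE solving; since the work is neither bioinformatics nor cheminformatics, it receives 8 points.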

!!! tip deepseek-chat TL;DR

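The paper proposes SCNO, a modular spiking neural operator for solving PDEs, including the neutron diffusion equation from nuclear engineering. With its correction network, SCNO achieves the lowest relative $L^2$ error on four of five coupled PDEs against monolithic spiking and ANN DeepONet baselines, while using fewer trainable parameters (95K vs. 462K) and supporting zero-forgetting modular expansion.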

Abstract (Chinese translation)

神经算子已成为偏微分方程求解器的强大替代模型，但通常作为针对单个偏微分方程的整体模型进行训练，需要高能耗的GPU硬件，且当出现新物理过程时必须从头重新训练。我们提出脉冲组合神经算子，这是一种结合脉冲与传统组件的模块化架构，可同时解决上述三个局限。SCNO维护一个由小型脉冲神经算子块组成的库，每个块均在单一基本微分算子（对流、扩散、反应）上训练，并通过一个轻量级的输入条件聚合器将它们组合起来，以求解在块训练阶段未见的耦合偏微分方程。一个小型校正网络学习交叉耦合残差，同时保持所有块和聚合器冻结，从而通过构造实现零遗忘的模块化扩展。我们在八个偏微分方程族上评估SCNO，包括五个耦合系统和一个与核反应相关的单群中子扩散方程。经过校正的SCNO在五个耦合偏微分方程中的四个上取得了最低的相对$L^2$误差，优于整体式脉冲DeepONet（最高提升62%，三次随机种子平均）和标准人工神经网络DeepONet（最高提升65%），而仅需95K可训练参数，整体基线模型则需要462K。据我们所知，这是首个组合式脉冲神经算子，也是首个具备内置无遗忘扩展能力的模块化神经形态偏微分方程求解的概念验证。

Abstract

Neural operators have emerged as powerful surrogates for partial differential equation (PDE) solvers, yet they are typically trained as monolithic models for individual PDEs, require energy-intensive GPU hardware, and must be retrained from scratch when new physics emerge. We introduce the Spiking Compositional Neural Operator (SCNO), a modular architecture combining spiking and conventional components that addresses all three limitations. SCNO maintains a library of small spiking neural operator blocks, each trained on a single elementary differential operator (convection, diffusion, reaction), and composes them through a lightweight input-conditioned aggregator to solve coupled PDEs not seen during block training. A small correction network learns cross-coupling residuals while keeping all blocks and the aggregator frozen, preserving zero-forgetting modular expansion by construction. We evaluate SCNO on eight PDE families including five coupled systems and a nuclear-relevant 1-group neutron diffusion equation. SCNO with correction achieves the lowest relative $L^2$ error on four of five coupled PDEs, outperforming both a monolithic spiking DeepONet (by up to 62%, mean over 3 seeds) and a standard ANN DeepONet (by up to 65%), while requiring only 95K trainable parameters versus 462K for the monolithic baseline. To our knowledge, this is the first compositional spiking neural operator and the first proof-of-concept for modular neuromorphic PDE solving with built-in forgetting-free expansion.

Keywords: Spiking Neural Operator, Partial Differential Equations, Modular Architecture, Nuclear PDE Solving, Neuromorphic Computing, Compositional Learning, Parameter Efficiency, Zero-forgetting Expansion
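The abstract above reports results as relative $L^2$ error. A minimal sketch of that metric in its standard form follows; the function name is illustrative, and the paper's exact discretization and averaging over samples are not given in the abstract.

```python
import math
from typing import Sequence


def relative_l2_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Standard relative L2 error: ||pred - true||_2 / ||true||_2."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same length")
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    if den == 0.0:
        raise ValueError("reference solution has zero norm")
    return num / den
```

A value of 0 means the surrogate reproduces the reference solution exactly; predicting all zeros gives a value of 1.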

62. ❌ Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agentic AI Systems

Authors: Charafeddine Mouzouni Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11623v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

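Scoring rationale: The paper proposes the Context Kubernetes architecture, which focuses on orchestration, governance, and permission management of enterprise knowledge in agentic AI systems. Keyword relevance breaks down as follows: 1. Highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' (10 points): the paper's core subject is the architecture and governance of agentic AI systems. 2. Some relevance to 'Retrieval-Augmented Generation/RAG/Retrieval-Generation' (5 points): it involves knowledge retrieval and content delivery. 3. Some relevance to 'Tool Use/Function Calling/API Tool Use' (5 points): it covers governance of agent permissions and tool use. 4. Some relevance to 'Multi-agent Systems/Agent Coordination' (5 points): it covers coordination and permission management across multiple agents. 5. Some relevance to 'Large Language Models/LLMs/Foundation Models' (5 points): the work is applied research on large models in agentic systems. Other keywords, such as MoE, quantization, inference acceleration, and alignment training, are unrelated to the paper's technical content and receive 0 points.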

!!! tip deepseek-chat TL;DR

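The paper proposes the Context Kubernetes architecture for orchestration, governance, and permission management of enterprise knowledge in agentic AI systems. Its experiments indicate that the architecture prevents data leakage, detects stale content, and blocks all five tested attack scenarios with the three-tier permission model, whereas none of the four surveyed enterprise platforms architecturally isolates agent approval channels.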

Abstract (Chinese translation)

本文提出Context Kubernetes架构，用于在智能体AI系统中编排企业知识，并提供了原型实现与八组实验。核心观点在于：在整个组织范围内，将正确的知识以恰当的权限与时效性传递给指定的智能体——这一过程在结构上类似于Kubernetes十年前解决的容器编排问题。我们形式化定义了六项核心抽象、基于YAML的声明式知识架构即代码清单、协调循环机制，以及三层智能体权限模型（其中智能体权限始终严格隶属于人类权限）。三项价值实验表明：（1）缺乏治理时，智能体在26.5%的查询中会从已删除源返回幻象内容并泄露跨域数据；（2）缺乏时效监控时，陈旧内容会被无提示返回——而通过协调机制可在1毫秒内检测到内容过期；（3）在五种攻击场景中，扁平权限模型可阻挡0/5攻击，基础RBAC模型阻挡4/5，三层模型则阻挡5/5攻击。五项正确性实验证实了零次未授权交付、零次约束违反，以及通过架构强制实现的带外审批隔离机制——这是现有企业平台均未提供的功能。对四大主流平台（Microsoft、Salesforce、AWS、Google）的调研显示，无一在架构层面隔离智能体审批通道。我们总结了使知识上下文编排比容器编排更复杂的四项特性，并论证这些特性使得本解决方案更具价值。

Abstract

We introduce Context Kubernetes, an architecture for orchestrating enterprise knowledge in agentic AI systems, with a prototype implementation and eight experiments. The core observation is that delivering the right knowledge, to the right agent, with the right permissions, at the right freshness – across an entire organization – is structurally analogous to the container orchestration problem Kubernetes solved a decade ago. We formalize six core abstractions, a YAML-based declarative manifest for knowledge-architecture-as-code, a reconciliation loop, and a three-tier agent permission model where agent authority is always a strict subset of human authority. Three value experiments show: (1) without governance, agents serve phantom content from deleted sources and leak cross-domain data in 26.5% of queries; (2) without freshness monitoring, stale content is served silently – with reconciliation, staleness is detected in under 1ms; (3) in five attack scenarios, flat permissions block 0/5 attacks, basic RBAC blocks 4/5, and the three-tier model blocks 5/5. Five correctness experiments confirm zero unauthorized deliveries, zero invariant violations, and architectural enforcement of out-of-band approval isolation that no surveyed enterprise platform provides. A survey of four major platforms (Microsoft, Salesforce, AWS, Google) documents that none architecturally isolates agent approval channels. We identify four properties that make context orchestration harder than container orchestration, and argue that these make the solution more valuable.

Keywords: Context Kubernetes, agentic AI systems, enterprise knowledge orchestration, agent permission model, knowledge-architecture-as-code, reconciliation loop, governance, multi-agent coordination
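The abstract above states one invariant of the permission model: agent authority is always a strict subset of human authority. A minimal sketch of checking that invariant with Python sets follows. The function names and permission strings are hypothetical, and the paper's three tiers and manifest format are not reproduced here.

```python
def is_valid_grant(agent_perms: frozenset[str], human_perms: frozenset[str]) -> bool:
    """True only if the agent's permissions are a proper (strict) subset of the human's."""
    return agent_perms < human_perms  # '<' on sets is the proper-subset test


def agent_may(action: str, agent_perms: frozenset[str], human_perms: frozenset[str]) -> bool:
    """Allow an action only under a valid grant that actually includes the action."""
    return is_valid_grant(agent_perms, human_perms) and action in agent_perms
```

Under this check, a grant that equals the human's full permission set is rejected, because a strict subset must leave at least one permission with the human only.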

63. ❌ CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead

Authors: Jinpeng Ye, Chongxi Wang, Wenqing Li, Bin Yuan, Shiyi Wang, Fenglu Zhang, Junyu Yue, Jianan Xie, Yunhao Ye, Haoyu Deng, Yingkun Zhou, Xin Cheng, Fuxin Zhang, Jian Wang Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11615v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

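Scoring rationale: The paper focuses on CPU hardware architecture design, specifically a matrix extension unit for accelerating AI workloads. Although it reports evaluations on AI models such as BERT and Llama3, its core contribution is hardware architecture innovation (a decoupled design, a configurable matrix unit, an asynchronous abstraction), not innovation in large-model or deep-learning techniques. All of the keywords concern software- and algorithm-level topics (large-model techniques, training methods, inference optimization, application domains), so none of them matches the paper's hardware architecture focus.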

!!! tip deepseek-chat TL;DR

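The paper proposes a unified, configurable CPU matrix extension architecture. By decoupling the matrix unit from the CPU pipeline, supporting mixed-precision operations, and exposing an asynchronous matrix multiplication abstraction, it achieves low-overhead integration and over 90% matrix unit utilization on GEMM across four open-source CPU platforms. When configured comparably to Intel AMX, it reports speedups of 1.57x, 1.57x, and 2.31x on ResNet, BERT, and Llama3.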

Abstract (Chinese translation)

矩阵扩展已成为现代CPU应对人工智能工作负载激增需求的关键特性。然而，现有设计通常伴随着显著的硬件与软件设计开销。与CPU流水线的紧密耦合使得跨不同CPU的集成变得复杂，而细粒度的同步指令则阻碍了高性能内核的开发。本文提出一种统一且可配置的CPU矩阵扩展架构。通过将矩阵单元与CPU流水线解耦，该设计在保持与现有计算及内存资源紧密协同的同时，实现了低开销的集成。可配置的矩阵单元支持混合精度运算，并能适应不同的计算需求与内存带宽限制。一种具有灵活粒度的异步矩阵乘法抽象隐藏了硬件细节，简化了矩阵-向量重叠执行，并支持统一的软件栈。该架构被集成到四个开源CPU RTL平台中，并在代表性AI模型上进行了评估。在GEMM工作负载下，所有平台的矩阵单元利用率均超过90%。当配置的计算吞吐量和内存带宽与英特尔AMX相当时，我们的设计在ResNet、BERT和Llama3上分别实现了1.57倍、1.57倍和2.31倍的加速，其中超过30%的性能增益归功于重叠的矩阵-向量执行。一个在14纳米CMOS工艺下运行于2GHz、算力为4 TOPS的矩阵单元仅占用0.53平方毫米面积。这些结果展现了强大的跨平台适应能力以及有效的软硬件协同优化，为开源社区提供了一个实用的矩阵扩展方案。

Abstract

Matrix extensions have emerged as an essential feature in modern CPUs to address the surging demands of AI workloads. However, existing designs often incur substantial hardware and software design overhead. Tight coupling with the CPU pipeline complicates integration across diverse CPUs, while fine-grained synchronous instructions hinder the development of high-performance kernels. This paper proposes a unified and configurable CPU matrix extension architecture. By decoupling matrix units from the CPU pipeline, the design enables low-overhead integration while maintaining close coordination with existing compute and memory resources. The configurable matrix unit supports mixed-precision operations and adapts to diverse compute demands and memory bandwidth constraints. An asynchronous matrix multiplication abstraction with flexible granularity conceals hardware details, simplifies matrix-vector overlap, and supports a unified software stack. The architecture is integrated into four open-source CPU RTL platforms and evaluated on representative AI models. Matrix unit utilization under GEMM workloads exceeds 90% across all platforms. When configured with compute throughput and memory bandwidth comparable to Intel AMX, our design achieves speedups of 1.57x, 1.57x, and 2.31x on ResNet, BERT, and Llama3, with over 30% of the gains attributed to overlapped matrix-vector execution. A 4 TOPS@2GHz matrix unit occupies only 0.53 mm$^2$ in 14nm CMOS. These results demonstrate strong cross-platform adaptability and effective hardware-software co-optimization, offering a practical matrix extension for the open-source community.

Keywords: CPU matrix extension, hardware-software co-optimization, matrix unit, asynchronous matrix multiplication, AI workloads, open-source CPU, GEMM, mixed-precision operations
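The throughput and area figures quoted in the abstract above imply two derived quantities, worked out in the sketch below. It assumes TOPS means $10^{12}$ operations per second. The abstract does not state the operand precision or whether a multiply-accumulate counts as one operation or two, so the per-cycle figure is in whatever unit the authors count. The function names are illustrative.

```python
def ops_per_cycle(tops: float, freq_hz: float) -> float:
    """Operations per clock cycle implied by a peak TOPS rating at a given frequency."""
    return tops * 1e12 / freq_hz


def tops_per_mm2(tops: float, area_mm2: float) -> float:
    """Area efficiency in TOPS per square millimetre."""
    return tops / area_mm2


# Figures from the abstract: 4 TOPS at 2 GHz in 0.53 mm^2 (14nm CMOS).
per_cycle = ops_per_cycle(4.0, 2.0e9)   # 2000.0 operations per cycle
density = tops_per_mm2(4.0, 0.53)       # about 7.55 TOPS/mm^2
```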

64. ❌ Layerwise Dynamics for In-Context Classification in Transformers

Authors: Patrick Lutz, Themistoklis Haris, Arjun Chandra, Aditya Gangrade, Venkatesh Saligrama Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11613v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

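Scoring rationale: The paper studies how transformers perform classification in context; its core contribution is extracting an interpretable update rule from inside the transformer. It is highly relevant to 'In-context Learning OR Many-shot Learning' (10 points) because it directly studies few-shot in-context classification. It is relevant to 'Mechanistic Interpretability OR Explainable AI' (8 points) because it makes the computation identifiable, and hence interpretable, by enforcing feature- and label-permutation equivariance. It has some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points) because the transformer is the base architecture of LLMs, although the paper does not study LLMs directly. The other keywords are unrelated to the paper (0 points).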

!!! tip deepseek-chat TL;DR

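The paper studies the internal mechanism by which transformers perform in-context classification from a few labeled examples. By enforcing feature- and label-permutation equivariance, it extracts what the authors describe as the first end-to-end identified, depth-indexed recursive update rule, which provably amplifies class separation and yields robust expected class alignment.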

Abstract (Chinese translation)

Transformer能够通过少量标注样本进行上下文分类，但其推理时的计算机制仍不透明。本文研究硬边界无间隔情况下的多类线性分类问题，通过强制每一层保持特征与标签的排列等变性，使计算过程可识别。该方法在保持功能等价的同时实现了模型可解释性，并得到高度结构化的权重参数。从这些模型中，我们提取出一个显式的深度索引递归规则：一种在softmax Transformer内部端到端可识别、自涌现的更新机制，据我们所知，这是此类规则的首例发现。由混合特征-标签Gram结构形成的注意力矩阵驱动训练样本、标签及测试探针的耦合更新。由此产生的动力学实现了一种几何驱动的算法模式，该模式可被证明能够增强类别分离度，并产生稳健的预期类别对齐效果。

Abstract

Transformers can perform in-context classification from a few labeled examples, yet the inference-time algorithm remains opaque. We study multi-class linear classification in the hard no-margin regime and make the computation identifiable by enforcing feature- and label-permutation equivariance at every layer. This enables interpretability while maintaining functional equivalence and yields highly structured weights. From these models we extract an explicit depth-indexed recursion: an end-to-end identified, emergent update rule inside a softmax transformer, to our knowledge the first of its kind. Attention matrices formed from mixed feature-label Gram structure drive coupled updates of training points, labels, and the test probe. The resulting dynamics implement a geometry-driven algorithmic motif, which can provably amplify class separation and yields robust expected class alignment.

Keywords: Transformers, In-context Learning, Few-shot Classification, Mechanistic Interpretability, Layerwise Dynamics, Feature-label Permutation Equivariance, Algorithmic Motif, Class Separation
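To make "feature-permutation equivariance" concrete, the toy sketch below checks the property for a single softmax attention update driven purely by the Gram matrix of its inputs. This is a generic illustration under simplifying assumptions (no learned weights, no labels, no test probe), not the architecture or the recursion extracted in the paper. `attention_update` and `permute_features` are hypothetical helper names.

```python
import math


def attention_update(X: list[list[float]]) -> list[list[float]]:
    """One update softmax(X X^T) X, with attention scores taken from the Gram matrix."""
    n, d = len(X), len(X[0])
    out = []
    for i in range(n):
        scores = [sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
        m = max(scores)                      # subtract max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] / z * X[j][k] for j in range(n)) for k in range(d)])
    return out


def permute_features(X: list[list[float]], perm: list[int]) -> list[list[float]]:
    """Reorder the feature (column) axis of every point."""
    return [[row[p] for p in perm] for row in X]
```

Because the Gram matrix is unchanged when features are reordered, permuting the input features and then updating gives the same result as updating and then permuting, up to floating-point rounding.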

65. ❌ Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models

Authors: Benjamin Maltbie, Shivam Raval Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11609v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

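Scoring rationale: The paper studies sycophancy in large language models (LLMs), that is, models validating incorrect user beliefs in order to appear agreeable, so it is directly about LLMs (10 points). It examines how models adjust sycophantic behavior according to perceived user demographics (race, age, gender), which bears on value alignment because sycophancy reflects the model's value orientation in interaction (8 points). Sycophancy is in essence an untruthful response, so it is highly relevant to factuality and truthfulness: the model sacrifices factual accuracy when it validates an incorrect belief (10 points). The study probes model behavior through adversarial conversations, which helps explain how models reach their responses and relates to explainable AI (8 points). The paper does not cover the technical specifics behind the other keywords (MoE, SLMs, training techniques, reasoning methods, agent systems, compression and acceleration), so those keywords are scored 0.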

!!! tip deepseek-chat TL;DR

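The paper investigates whether LLM sycophancy varies systematically with perceived user demographics (race, age, gender). It finds that GPT-5-nano is more sycophantic than Claude Haiku 4.5 and that, for GPT-5-nano, philosophy elicits 41% more sycophancy than mathematics and Hispanic personas receive the most sycophancy. The authors conclude that safety evaluations should incorporate identity-aware testing.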

Abstract (Chinese translation)

大型语言模型表现出谄媚倾向，即通过认可用户错误观点以显得迎合。本研究系统探讨了该行为是否随感知用户人口统计特征发生系统性变化，测试了种族、年龄、性别及表达自信程度的组合是否会产生差异化的错误认可率。受法学交叉性理论启发，我们采用Anthropic的Petri评估框架进行了768轮对抗性多轮对话，在数学、哲学和阴谋论领域对GPT-5-nano和Claude Haiku 4.5模型进行了128种人物组合测试。总体而言，GPT-5-nano的谄媚程度显著高于Claude Haiku 4.5（$\bar{x}=2.96$ vs. $1.74$，$p < 10^{-32}$，Wilcoxon符号秩检验）。针对GPT-5-nano，我们发现哲学领域引发的谄媚行为比数学领域多41%，且在跨种族比较中西班牙裔人物获得的谄媚度最高。谄媚度得分最高的人物组合（一位自信的23岁西班牙裔女性）平均得分达5.33/10。Claude Haiku 4.5则表现出一致较低的谄媚度，且无显著人口统计差异。这些结果表明谄媚行为在用户间并非均匀分布，安全评估应当纳入身份感知测试。

Abstract

Large language models exhibit sycophantic tendencies: validating incorrect user beliefs to appear agreeable. We investigate whether this behavior varies systematically with perceived user demographics, testing whether combinations of race, age, gender, and expressed confidence level produce differential false validation rates. Inspired by the legal concept of intersectionality, we conduct 768 multi-turn adversarial conversations using Anthropic’s Petri evaluation framework, probing GPT-5-nano and Claude Haiku 4.5 across 128 persona combinations in mathematics, philosophy, and conspiracy theory domains. GPT-5-nano is significantly more sycophantic than Claude Haiku 4.5 overall ($\bar{x}=2.96$ vs. $1.74$, $p < 10^{-32}$, Wilcoxon signed-rank). For GPT-5-nano, we find that philosophy elicits 41% more sycophancy than mathematics and that Hispanic personas receive the highest sycophancy across races. The worst-scoring persona, a confident, 23-year-old Hispanic woman, averages 5.33/10 on sycophancy. Claude Haiku 4.5 exhibits uniformly low sycophancy with no significant demographic variation. These results demonstrate that sycophancy is not uniformly distributed across users and that safety evaluations should incorporate identity-aware testing.

Keywords: Large Language Models, Sycophancy, User Demographics, False Validation, Safety Evaluation, Intersectionality, GPT-5-nano, Claude Haiku 4.5
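The conversation count in the abstract above is consistent with one conversation per persona, domain, and model. This factorization is inferred from the reported numbers and is not a design stated in the abstract:

$$128 \ \text{personas} \times 3 \ \text{domains} \times 2 \ \text{models} = 768 \ \text{conversations}$$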

66. ❌ A Triadic Suffix Tokenization Scheme for Numerical Reasoning

Authors: Olga Chetverina Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11582v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

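Scoring rationale: The paper's core proposal is a new number tokenization scheme (Triadic Suffix Tokenization) that addresses how LLMs handle numbers in arithmetic and scientific reasoning. It is therefore highly relevant to "Large Language Models" (10 points), since it directly targets a number-handling deficiency of LLMs. The reasoning keywords (Chain of Thought, System 2 Thinking) have some relevance (5 points): the paper names number-handling errors as a primary driver of scientific reasoning errors, but it does not study reasoning methods itself. Other keywords, such as MoE, SLMs, training methods, alignment, RAG, compression, and agents, are not covered (0 points).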

!!! tip deepseek-chat TL;DR

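To address inconsistent tokenization of numbers in LLMs, the paper proposes Triadic Suffix Tokenization (TST), a deterministic scheme that groups digits into three-digit triads and attaches an explicit magnitude marker to each, with the aim of improving arithmetic and scientific reasoning. Experimental validation is deferred to future work.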

Abstract (Chinese translation)

标准的子词分词方法会不一致地切分数字，导致大语言模型丢失数字的位置和小数结构——这是算术与科学推理错误的主要诱因。我们提出了三元组后缀分词法，这是一种确定性方案，将数字分割为三位一体的数字三元组，并为每个三元组标注明确的数量级标记。该方案的关键在于，为整数部分（千、百万、十亿等）定义了后缀与数量级之间固定的一一映射关系，并为小数部分深度（十分位、千分位、百万分位等）建立了一套并行的复制标记系统。与依赖位置推断的方法不同，此方法提供了一致的梯度信号，从而确保稳定的收敛性。我们提出了两种实现变体：（1）基于词汇表的方法，最多向现有词汇表添加10,000个固定标记，覆盖33个数量级（$10^{-15}$ 至 $10^{18}$）；（2）后缀标记方法，使用一小部分特殊标记动态表示数量级。两种变体均能保留精确数字，同时在标记层面使数量级关系透明化。该框架本质上是可扩展的，允许通过线性词汇扩展来适应任意精度和范围。TST与模型架构无关，可作为即插即用的预处理步骤集成。实验验证将留待后续工作完成。

Abstract

Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure - a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most 10,000 fixed tokens to an existing vocabulary, covering 33 orders of magnitude ($10^{-15}$ to $10^{18}$); and (2) a suffix-marker approach that uses a small set of special tokens to denote magnitude dynamically. Both variants preserve exact digits while making order-of-magnitude relationships transparent at the token level. The framework is inherently scalable, allowing for linear vocabulary expansion to accommodate arbitrary precision and range. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.

Keywords: tokenization, numerical reasoning, large language models, arithmetic, scientific reasoning, triadic suffix, magnitude markers, vocabulary expansion
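The triad-plus-marker idea from the abstract above can be sketched as a small preprocessing function in the spirit of the suffix-marker variant. The marker spellings `<I0>`, `<I1>`, ... (integer triads, counted from the units triad) and `<F1>`, `<F2>`, ... (fractional triads, counted from the decimal point) are hypothetical, as is the choice to leave a short final fractional triad unpadded; the abstract specifies none of these details. The sketch handles only non-negative decimal strings.

```python
def tst_tokenize(number: str) -> list[str]:
    """Split a non-negative decimal string into triads tagged with magnitude markers."""
    int_part, _, frac_part = number.partition(".")
    if not int_part.isdigit() or (frac_part and not frac_part.isdigit()):
        raise ValueError(f"unsupported number format: {number!r}")
    n_groups = (len(int_part) + 2) // 3
    head = len(int_part) - 3 * (n_groups - 1)          # leading triad has 1 to 3 digits
    groups = [int_part[:head]] + [int_part[i:i + 3] for i in range(head, len(int_part), 3)]
    # Integer triads: level 0 = units, 1 = thousands, 2 = millions, ...
    tokens = [f"{triad}<I{n_groups - 1 - idx}>" for idx, triad in enumerate(groups)]
    # Fractional triads: depth 1 = first three decimals, 2 = next three, ...
    for depth, start in enumerate(range(0, len(frac_part), 3), start=1):
        tokens.append(f"{frac_part[start:start + 3]}<F{depth}>")
    return tokens


def tst_detokenize(tokens: list[str]) -> str:
    """Recover the exact digit string from marker-tagged triads."""
    int_digits = "".join(t.split("<")[0] for t in tokens if "<I" in t)
    frac_digits = "".join(t.split("<")[0] for t in tokens if "<F" in t)
    return int_digits + ("." + frac_digits if frac_digits else "")
```

For example, `tst_tokenize("1234567.8912")` yields `['1<I2>', '234<I1>', '567<I0>', '891<F1>', '2<F2>']`, and `tst_detokenize` restores the original string, so exact digits are preserved while the magnitude of each triad is explicit in its token.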

67. ❌ Minimizing classical resources in variational measurement-based quantum computation for generative modeling

Authors: Arunava Majumder, Hendrik Poulsen Nautrup, Hans J. Briegel Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11578v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

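Scoring rationale: The paper belongs to quantum computing, specifically measurement-based quantum computation (MBQC) and variational measurement-based quantum computation (VMBQC) applied to generative modeling. Its core topics are quantum information processing, quantum measurement, and quantum channel models, with no direct connection to deep learning, large language models (LLMs), or AI techniques. All keywords target large models, deep learning, and related techniques (training methods, inference optimization, alignment, agents), none of which the paper addresses. The only marginally related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies quantum computing to generative modeling (a scientific computing task); this is not its core focus and it uses no AI methods, so it receives 5 points (some relevance). The other keywords are entirely unrelated and receive 0 points.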

!!! tip deepseek-chat TL;DR

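The paper proposes a restricted variational measurement-based quantum computation (VMBQC) model that extends the unitary setting to a channel-based model with a single additional trainable parameter. This addresses the optimization difficulty caused by the doubled parameter count of the existing VMBQC channel model, and the authors show numerically and algebraically that the minimal extension can generate probability distributions that the corresponding unitary model cannot learn.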

Abstract (Chinese translation)

基于测量的量子计算（Measurement-based quantum computation，MBQC）是一种量子信息处理框架，其中计算任务通过对高度纠缠的资源态进行单量子比特测量来实现。由于量子测量结果的不确定性，这些操作的随机结果若不加以修正，将产生一个可变的量子信道族。传统上，这种随机性通过经典处理进行校正，以确保确定性的酉计算。最近，变分基于测量的量子计算（Variational measurement-based quantum computation，VMBQC）被提出，旨在利用这种测量诱导的随机性在生成建模中获得优势。该方法的一个局限在于，对应的信道模型相比酉模型具有两倍的参数数量，其规模为 $N \times D$，其中 $N$ 是逻辑量子比特数（宽度），$D$ 是 VMBQC 模型的深度。这通常会使优化更加困难，并可能导致模型难以训练。本文提出一种受限的 VMBQC 模型，它仅使用一个额外的可训练参数，便将酉设置扩展至基于信道的框架。我们通过数值与代数方法证明，这种最小化扩展足以生成对应酉模型无法学习的概率分布。

Abstract

Measurement-based quantum computation (MBQC) is a framework for quantum information processing in which a computational task is carried out through one-qubit measurements on a highly entangled resource state. Due to the indeterminacy of the outcomes of a quantum measurement, the random outcomes of these operations, if not corrected, yield a variational quantum channel family. Traditionally, this randomness is corrected through classical processing in order to ensure deterministic unitary computations. Recently, variational measurement-based quantum computation (VMBQC) has been introduced to exploit this measurement-induced randomness to gain an advantage in generative modeling. A limitation of this approach is that the corresponding channel model has twice as many parameters compared to the unitary model, scaling as $N \times D$, where $N$ is the number of logical qubits (width) and $D$ is the depth of the VMBQC model. This can often make optimization more difficult and may lead to poorly trainable models. In this paper, we present a restricted VMBQC model that extends the unitary setting to a channel-based one using only a single additional trainable parameter. We show, both numerically and algebraically, that this minimal extension is sufficient to generate probability distributions that cannot be learned by the corresponding unitary model.

Keywords: Measurement-based quantum computation, Variational quantum channel, Generative modeling, Parameter efficiency, Quantum measurement, Resource state, Trainable models, Probability distributions
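The parameter counts compared in the abstract above can be written side by side. The constant $c$ is an assumption introduced here for illustration; the abstract gives only the $N \times D$ scaling, the factor of two, and the single extra parameter:

$$P_{\text{unitary}} = c\,N D, \qquad P_{\text{channel}} = 2\,c\,N D, \qquad P_{\text{restricted}} = c\,N D + 1$$

Under this reading, the restricted model keeps the unitary model's parameter budget almost unchanged while still leaving the unitary setting.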

68. ❌ bacpipe: a Python package to make bioacoustic deep learning models accessible

Authors: Vincent S. Kather, Sylvain Haupert, Burooj Ghani, Dan Stowell Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11560v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

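Rationale: The paper focuses on applying deep learning models in bioacoustics and on software tooling, so it is unrelated to nearly all keywords (which cover LLM internals, training methods, inference optimization, alignment, agent systems, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper concerns bioacoustics and ecological applications, which fall under AI for Science, but this is the application domain rather than the core contribution, hence a relevance of 5 (some connection).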
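
The pass mark in the score line follows the formula given in the report's configuration section (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The arithmetic can be sketched as below. This is only an illustration: the function names are hypothetical, and treating the paper total as the plain sum of the per-keyword score column is an assumption, not something the report states.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass threshold as stated in the report configuration."""
    return base + factor * total_weight


def paper_total(keyword_scores: list[float]) -> float:
    """Assumption: the paper total is the sum of the per-keyword score column."""
    return sum(keyword_scores)


threshold = pass_mark(27.0)        # 27 keywords, each configured with weight 1.0
total = paper_total([0.0] * 27)    # every row in the score table reads 0.0
passed = total >= threshold
print(round(threshold, 1), total, passed)
```

With every per-keyword score at 0.0, the total stays below the threshold, which matches the ❌ mark on this entry.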

!!! tip deepseek-chat TL;DR

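The paper presents bacpipe, a Python package that makes bioacoustic deep learning models easier to access and use through graphical and programming interfaces, helping ecologists and computer scientists analyze passive acoustic monitoring data more efficiently.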

Abstract (Chinese translation)

在过去的数十年中，被动声学监测技术已累计记录了数百万小时的自然声音。深度学习模型的进步极大地加速了对其中大部分数据的分析。尽管新模型不断推动技术前沿，但利用工具充分发挥其潜力并非总是易事。本文介绍bacpipe——一个可通过图形界面和编程界面访问的生物声学深度学习模型与评估流程集成工具，专为生态学家和计算机科学家设计。Bacpipe采用模块化软件包架构，旨在成为生物声学模型的汇聚平台。
Bacpipe能够简化前沿模型在自定义音频数据集上的应用流程，自动生成声学特征向量（嵌入表示）和分类器预测结果。其模块化设计支持通过交互式可视化、聚类分析和模型探针等技术进行模型评估与性能基准测试。
我们认为获取新型深度学习模型至关重要。通过将bacpipe设计为面向广泛用户群体，研究人员将能够探索生物声学领域新的生态学与进化生物学问题。
总而言之，我们相信让更广泛的研究群体接触深度学习的最新进展，将有助于推动我们试图解答的生态学问题的研究进程。

Abstract

Natural sounds have been recorded for millions of hours over the previous decades using passive acoustic monitoring. Improvements in deep learning models have vastly accelerated the analysis of large portions of this data. While new models advance the state-of-the-art, accessing them using tools to harness their full potential is not always straightforward. Here we present bacpipe, a collection of bioacoustic deep learning models and evaluation pipelines accessible through a graphical and programming interface, designed for both ecologists and computer scientists. Bacpipe is a modular software package intended as a point of convergence for bioacoustic models.
Bacpipe streamlines the usage of state-of-the-art models on custom audio datasets, generating acoustic feature vectors (embeddings) and classifier predictions. A modular design allows evaluation and benchmarking of models through interactive visualizations, clustering and probing.
We believe that access to new deep learning models is important. By designing bacpipe to target a wide audience, researchers will be enabled to answer new ecological and evolutionary questions in bioacoustics.
In conclusion, we believe accessibility to developments in deep learning to a wider audience benefits the ecological questions we are trying to answer.

Keywords: bioacoustic, deep learning models, Python package, accessibility, passive acoustic monitoring, ecological research, model evaluation, audio analysis

69. ❌ Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo

Authors: Artem Gadzhiev, Andrew Kislov Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11563v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

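Rationale: Synthius-Mem proposes a brain-inspired structured persona memory system for LLM agents, with memory hallucination as the core problem. Keywords rated highly relevant (10): 1) 'Large Language Models' (the system is designed for LLM agents); 2) 'Retrieval-Augmented Generation' (structured facts are retrieved via CategoryRAG); 3) 'LLM Agents' (the system provides long-term memory for LLM agents); 4) 'Hallucination Mitigation' (the core goal is hallucination-resistant memory, reaching 99.55% adversarial robustness). 'Context Window Extension' is somewhat relevant (5) because structured memory cuts token consumption by roughly 5x, which indirectly eases context-window limits. Other keywords such as MoE, SLMs, training methods, reasoning techniques, and quantization are not covered by the paper and score 0.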

!!! tip deepseek-chat TL;DR

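The paper addresses hallucination in the long-term memory of LLM agents and proposes Synthius-Mem, a brain-inspired structured persona memory system that reaches 94.37% accuracy and 99.55% adversarial robustness on the LoCoMo benchmark, exceeding published systems and human performance.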

Abstract (Chinese translation)

为人工智能代理提供可靠且不产生幻觉的长期记忆仍是一个开放性问题。当前面向大语言模型代理的记忆方法——滑动窗口、摘要、基于嵌入的检索增强生成以及扁平化事实提取——各自降低了令牌消耗，但都引入了灾难性的信息丢失、语义漂移或关于用户的不可控幻觉。其结构性原因在于架构：目前在LoCoMo基准测试上已发表的所有记忆系统都将对话视为对原始或轻度摘要对话片段的检索问题，且无一报告对抗鲁棒性，即拒绝回答用户从未披露事实相关问题的能力。我们提出了Synthius-Mem，一种受大脑启发的结构化人物记忆系统，它采用了一种根本不同的方法。Synthius-Mem并非检索“说过什么”，而是提取“对这个人了解什么”：一个完整的人物提取流程将对话分解为六个认知领域（传记、经历、偏好、社交圈、工作、心理测量学），在每个领域内进行整合与去重，并通过CategoryRAG以21.79毫秒的延迟检索结构化事实。在LoCoMo基准测试（ACL 2024，10段对话，1,813个问题）上，Synthius-Mem实现了94.37%的准确率，超过了所有已发表系统（包括MemMachine的91.69%，其未报告对抗得分）和人类表现（87.9 F1分数）。核心记忆事实准确率达到98.64%。对抗鲁棒性——即所有竞争系统均未报告的防幻觉指标——达到了99.55%。与完整上下文回放相比，Synthius-Mem将令牌消耗降低了约5倍，同时实现了更高的准确率。Synthius-Mem在LoCoMo上取得了最先进的结果，并且据我们所知，它是唯一一个既超越人类水平性能又报告了对抗鲁棒性的人物记忆系统。

Abstract

Providing AI agents with reliable long-term memory that does not hallucinate remains an open problem. Current approaches to memory for LLM agents – sliding windows, summarization, embedding-based RAG, and flat fact extraction – each reduce token cost but introduce catastrophic information loss, semantic drift, or uncontrolled hallucination about the user. The structural reason is architectural: every published memory system on the LoCoMo benchmark treats conversation as a retrieval problem over raw or lightly summarized dialogue segments, and none reports adversarial robustness, the ability to refuse questions about facts the user never disclosed. We present Synthius-Mem, a brain-inspired structured persona memory system that takes a fundamentally different approach. Instead of retrieving what was said, Synthius-Mem extracts what is known about the person: a full persona extraction pipeline decomposes conversations into six cognitive domains (biography, experiences, preferences, social circle, work, psychometrics), consolidates and deduplicates per domain, and retrieves structured facts via CategoryRAG at 21.79 ms latency. On the LoCoMo benchmark (ACL 2024, 10 conversations, 1,813 questions), Synthius-Mem achieves 94.37% accuracy, exceeding all published systems including MemMachine (91.69%, adversarial score is not reported) and human performance (87.9 F1). Core memory fact accuracy reaches 98.64%. Adversarial robustness, the hallucination resistance metric that no competing system reports, reaches 99.55%. Synthius-Mem reduces token consumption by ~5x compared to full-context replay while achieving higher accuracy. Synthius-Mem achieves state-of-the-art results on LoCoMo and is, to our knowledge, the only persona memory system that both exceeds human-level performance and reports adversarial robustness.

Keywords: LLM agents, persona memory, hallucination mitigation, adversarial robustness, structured memory, Retrieval-Augmented Generation, LoCoMo benchmark, brain-inspired system
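
The abstract describes storing what is known about a person in six domains, deduplicating per domain, and refusing questions about facts the user never disclosed. A minimal sketch of that idea follows. It is not the authors' pipeline: the class and method names are hypothetical, and a simple keyword match stands in for CategoryRAG, whose internals the abstract does not describe.

```python
from dataclasses import dataclass, field

# The six cognitive domains named in the abstract.
DOMAINS = ("biography", "experiences", "preferences",
           "social_circle", "work", "psychometrics")


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical facts compare equal."""
    return " ".join(text.lower().split())


@dataclass
class PersonaMemory:
    facts: dict[str, list[str]] = field(
        default_factory=lambda: {d: [] for d in DOMAINS})

    def add(self, domain: str, fact: str) -> None:
        """Store a fact in its domain, skipping duplicates after normalization."""
        if domain not in self.facts:
            raise ValueError(f"unknown domain: {domain}")
        if _normalize(fact) not in {_normalize(f) for f in self.facts[domain]}:
            self.facts[domain].append(fact)

    def answer(self, domain: str, keyword: str) -> str:
        """Return matching facts, or refuse when nothing was ever disclosed."""
        hits = [f for f in self.facts[domain] if keyword.lower() in f.lower()]
        return "; ".join(hits) if hits else "Not disclosed by the user."


memory = PersonaMemory()
memory.add("work", "Works as a nurse")
memory.add("work", "works as a  nurse")   # dropped as a duplicate
print(memory.answer("work", "nurse"))
print(memory.answer("biography", "birthplace"))
```

The explicit refusal branch is the part that corresponds to the adversarial robustness the paper measures: an empty lookup yields a refusal rather than a guess.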

70. ❌ FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

Authors: Haoran Ding, Zhaoguo Wang, Haibo Chen Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11556v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

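Rationale: FM-Agent uses LLMs to build an automated agent for formal verification of large-scale systems, which is directly and highly relevant to 'Large Language Models' and 'LLM Agents' (10). Its reasoning decomposes a system into components and reasons about each separately, a multi-step, in-depth process relevant to 'Chain of Thought' and 'System 2 Thinking' (8). The paper does not touch the other keywords, such as model architecture (MoE), training methods (SFT, RLHF), efficiency techniques (quantization), or domain-specific scientific applications, so those score 0.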

!!! tip deepseek-chat TL;DR

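The paper proposes FM-Agent, a framework that uses LLMs for automated compositional reasoning to address the scalability of formal verification for large-scale systems (such as LLM-generated code). In the evaluation it reasoned about systems of up to 143k lines of code and found 522 new bugs.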

Abstract (Chinese translation)

LLM辅助的软件开发日益普及，其能够生成大规模系统（如编译器）。确保生成代码的正确性变得至关重要。然而，由于代码复杂性，大规模系统的自动化推理仍面临挑战。霍尔逻辑提供了一种将大型系统分解为较小组件并分别进行推理（即组合推理）的方法。但现有研究仍难以扩展规模，因为霍尔逻辑要求为每个函数编写形式化规约，这给开发者带来了沉重负担。当代码由LLM生成时，该问题更加突出，因为开发者缺乏对每个函数预期行为的深入理解。
本文提出了FM-Agent，这是首个实现大规模系统自动化组合推理的框架。FM-Agent利用LLM，引入了一种自上而下的范式来自动生成函数级规约。具体而言，FM-Agent根据调用方对函数行为的预期来推导其规约，因此即使实现存在缺陷，生成的规约仍能反映开发者对函数的意图。开发者意图通常以自然语言表达，而现有验证器仅支持形式化公式。为此，FM-Agent推广了霍尔式推理，使其能够基于自然语言规约对函数进行验证。最后，为确认缺陷存在并解释其成因，FM-Agent会自动生成测试用例以触发潜在缺陷。在我们的评估中，FM-Agent成功在2天内完成了对大规模系统的推理，其中最大系统的代码量达14.3万行。这些系统均已通过开发者测试，但FM-Agent仍发现了522个新缺陷。这些缺陷可能导致严重后果，包括系统崩溃与错误执行结果。

Abstract

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function’s expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer’s intent of a function even if the implementation is buggy. Developers’ intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.

Keywords: LLM, Formal Methods, Hoare Logic, Automated Reasoning, Software Verification, Large-scale Systems, Compositional Reasoning, Bug Detection
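
The abstract's central idea is that a function's specification is derived from what its callers expect, so the spec can expose a bug even when the implementation is wrong. In Hoare-logic terms this is a triple {P} C {Q}: if precondition P holds before running C, postcondition Q must hold afterwards. The sketch below illustrates the idea with a dynamic check on sample inputs. This is a much weaker stand-in for verification and is not FM-Agent's method; `check_triple` and `buggy_abs` are hypothetical names.

```python
from typing import Callable


def check_triple(pre: Callable[[int], bool],
                 func: Callable[[int], int],
                 post: Callable[[int, int], bool],
                 inputs: list[int]) -> list[int]:
    """Return inputs that satisfy the precondition but violate the postcondition."""
    return [x for x in inputs if pre(x) and not post(x, func(x))]


def buggy_abs(x: int) -> int:
    """Hypothetical callee with a bug: it forgets to negate negative inputs."""
    return x


def pre(x: int) -> bool:
    """Precondition P: any integer is accepted."""
    return True


def post(x: int, result: int) -> bool:
    """Postcondition Q, taken from the caller's expectation of an absolute value."""
    return result >= 0 and result in (x, -x)


counterexamples = check_triple(pre, buggy_abs, post, [-2, 0, 3])
print(counterexamples)
```

The failing input plays the role of the generated test case the abstract mentions: it confirms that the bug exists and shows which input triggers it.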

71. ❌ SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering

Authors: Ningyan Zhu, Huacan Wang, Jie Zhou, Feiyu Chen, Shuo Zhang, Ge Chen, Chen Liu, Jiarou Wu, Wangyi Chen, Xiaofeng Mou, Yi Xu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11548v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

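Rationale: SemaClaw is a framework for personal AI agents, built around harness engineering and a multi-agent architecture. It is highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' (10) because it directly studies an AI agent framework, and to 'Multi-agent Systems/Agent Coordination' (10) because it proposes a DAG-based hybrid agent team orchestration method. It is fairly relevant to 'Large Language Models/LLMs/Foundation Models' (8): personal AI agents are usually built on large models, although the paper does not discuss the models themselves in depth. Other keywords (MoE, SFT, RAG, and so on) are not mentioned in the abstract and score 0.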

!!! tip deepseek-chat TL;DR

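The paper targets the lack of controllable, auditable infrastructure as personal AI agents move from discrete tasks to persistent collaborative relationships. It proposes SemaClaw, an open-source multi-agent application framework that uses harness engineering to provide DAG-based hybrid agent team orchestration, the PermissionBridge safety system, a three-tier context management architecture, and an agentic wiki skill for automated personal knowledge base construction.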

Abstract (Chinese translation)

2026年初OpenClaw的兴起标志着数百万用户开始将个人智能体部署于日常生活的重要时刻，其代理任务涵盖从旅行规划到多步骤研究等广泛领域。这种规模的应用表明两条并行的发展轨迹已抵达转折点。首先是人工智能工程范式的转变——从提示与上下文工程演进至约束工程，即设计完整的基础设施体系，将无约束智能体转化为可控、可审计且具备生产可靠性的系统。随着模型能力趋于同质化，约束层正成为架构差异化的核心场域。其次是人机交互模式从离散任务执行向持续化、情境感知的协作关系演进，这要求构建开放、可信且可扩展的约束基础设施。本文提出SemaClaw开源多智能体应用框架，通过约束工程向通用个人智能体迈出关键一步。我们的核心贡献包括：基于有向无环图的两阶段混合智能体团队编排方法、PermissionBridge行为安全系统、三层级情境管理架构，以及支持自动化个人知识库构建的智能维基技能。

Abstract

The rise of OpenClaw in early 2026 marks the moment when millions of users began deploying personal AI agents into their daily lives, delegating tasks ranging from travel planning to multi-step research. This scale of adoption signals that two parallel arcs of development have reached an inflection point. First is a paradigm shift in AI engineering, evolving from prompt and context engineering to harness engineering: designing the complete infrastructure necessary to transform unconstrained agents into controllable, auditable, and production-reliable systems. As model capabilities converge, this harness layer is becoming the primary site of architectural differentiation. Second is the evolution of human-agent interaction from discrete tasks toward a persistent, contextually aware collaborative relationship, which demands open, trustworthy and extensible harness infrastructure. We present SemaClaw, an open-source multi-agent application framework that addresses these shifts by taking a step towards general-purpose personal AI agents through harness engineering. Our primary contributions include a DAG-based two-phase hybrid agent team orchestration method, a PermissionBridge behavioral safety system, a three-tier context management architecture, and an agentic wiki skill for automated personal knowledge base construction.

Keywords: personal AI agents, harness engineering, multi-agent framework, agent team orchestration, behavioral safety system, context management, knowledge base construction, open-source framework
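
The abstract names DAG-based orchestration of an agent team but does not spell out its two-phase hybrid method. As a generic illustration of DAG scheduling only, the sketch below uses Python's standard `graphlib.TopologicalSorter` to release tasks in batches once their dependencies finish. The task names and the `run_agent` stub are hypothetical and not part of SemaClaw.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key depends on the tasks in its value set.
dependencies = {
    "search_flights": set(),
    "search_hotels": set(),
    "build_itinerary": {"search_flights", "search_hotels"},
    "send_summary": {"build_itinerary"},
}


def run_agent(task: str) -> str:
    """Stand-in for dispatching a task to an agent."""
    return f"done:{task}"


sorter = TopologicalSorter(dependencies)
sorter.prepare()                         # raises CycleError if the graph has a cycle

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())   # tasks whose dependencies are all finished
    batches.append(ready)
    for task in ready:
        run_agent(task)
        sorter.done(task)                # unblocks tasks that depend on this one

print(batches)
```

Tasks in the same batch have no dependency on one another, so a real orchestrator could hand them to different agents in parallel.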

72. ❌ Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory

Authors: Weixian Waylon Li, Jiaxin Zhang, Xianan Jim Yang, Tiejun Ma, Yiwen Guo Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11544v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

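Rationale: The paper proposes RoMem, a temporal knowledge graph module for structured memory systems (agentic memory in particular) that uses continuous phase rotation and a Semantic Speed Gate to separate persistent facts from evolving ones, achieving geometric shadowing rather than deletion. It is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10) because agentic memory is stated as the core application. It is somewhat relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 each): the abstract mentions a 'pretrained Semantic Speed Gate', which involves pretrained embeddings, and large models are a potential component of agent systems, but the paper does not focus on LLM internals. Other keywords such as MoE, SFT, RAG, and reasoning methods are not covered and score 0.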

!!! tip deepseek-chat TL;DR

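Existing structured memory systems model time as discrete metadata and cannot distinguish persistent facts from evolving ones. The paper proposes RoMem, a temporal knowledge graph module that achieves geometric shadowing through continuous phase rotation and a pretrained Semantic Speed Gate, reporting state-of-the-art temporal knowledge graph completion on ICEWS05-15 and strong gains on agentic memory tasks.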

Abstract (Chinese translation)

诸如知识图谱之类的结构化记忆表征是自主智能体及其他长效系统的核心。然而，现有方法大多将时间建模为离散元数据，要么按时间远近排序（导致陈旧但永久的知识被埋没），要么直接覆盖过时事实，或在每次信息录入时都需要调用昂贵的大语言模型，因而无法区分持久性事实与演变性事实。为此，我们提出RoMem——一个即插即用的时序知识图谱模块，适用于结构化记忆系统，可广泛应用于智能体记忆及其他场景。该模块通过预训练的语义速度门，将每个关系的文本嵌入映射为波动性分数，从数据中学习到演变关系（如“某国总统”）应快速旋转，而持久关系（如“出生于”）应保持稳定。结合连续相位旋转技术，该模块实现了几何遮蔽：过时事实在复数向量空间中被旋转至相位失配状态，使得时序正确的事实无需删除操作即可自然超越矛盾事实。在时序知识图谱补全任务中，RoMem在ICEWS05-15数据集上取得了72.6 MRR的最新最优结果。应用于智能体记忆时，其在时序推理任务（MultiTQ）上实现了2-3倍的MRR与答案准确率提升，在混合基准测试（LoCoMo）中表现卓越，静态记忆保持零衰减（DMR-MSC），并能零样本泛化至未见过的金融领域（FinTMMBench）。

Abstract

Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete metadata, either sorting by recency (burying old-yet-permanent knowledge), simply overwriting outdated facts, or requiring an expensive LLM call at every ingestion step, leaving them unable to distinguish persistent facts from evolving ones. To address this, we introduce RoMem, a drop-in temporal knowledge graph module for structured memory systems, applicable to agentic memory and beyond. A pretrained Semantic Speed Gate maps each relation’s text embedding to a volatility score, learning from data that evolving relations (e.g., “president of”) should rotate fast while persistent ones (e.g., “born in”) should remain stable. Combined with continuous phase rotation, this enables geometric shadowing: obsolete facts are rotated out of phase in complex vector space, so temporally correct facts naturally outrank contradictions without deletion. On temporal knowledge graph completion, RoMem achieves state-of-the-art results on ICEWS05-15 (72.6 MRR). Applied to agentic memory, it delivers 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), dominates hybrid benchmark (LoCoMo), preserves static memory with zero degradation (DMR-MSC), and generalises zero-shot to unseen financial domains (FinTMMBench).

Keywords: temporal knowledge graphs, agentic memory, continuous phase rotation, geometric shadowing, structured memory, autonomous agents, semantic speed gate, volatility score

73. ❌ NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

Authors: Wenqing Wu, Yi Zhao, Yuzhuo Wang, Siyou Li, Juexi Shao, Yunfei Long, Chengzhi Zhang Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11543v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

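Rationale: The paper studies LLMs applied to novelty assessment of academic papers. It is highly relevant to 'Large Language Models' (10), touches on 'Post-training/SFT' (8) and 'Instruction Tuning/Alignment' (5), and falls under 'AI for Science' (8). Other keywords such as MoE, Scaling Laws, and RAG are not mentioned or are only marginally related, so they score 0.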

!!! tip deepseek-chat TL;DR

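The paper addresses the limited ability of large language models to assess the novelty of academic papers. It introduces NovBench, the first large-scale benchmark for this task, and finds experimentally that current models have a limited understanding of scientific novelty and that fine-tuned models often show instruction-following deficiencies.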

Abstract (Chinese translation)

新颖性是学术出版的核心要求与同行评审的关注焦点，然而日益增长的投稿量给人工评审者带来了持续压力。尽管大语言模型（包括基于同行评审数据微调的模型）在生成评审意见方面展现出潜力，但由于缺乏专用基准，对其评估研究新颖性能力的系统性评测一直受限。为填补这一空白，我们提出了NovBench——首个用于评估大语言模型生成新颖性评价以辅助人工同行评审能力的大规模基准。NovBench包含来自顶尖自然语言处理（NLP）会议的1,684组论文-评审对，其中既包含从论文引言中提取的新颖性描述，也涵盖专家撰写的新颖性评价。我们同时关注这两类来源，是因为引言提供了标准化且明确的新颖性主张表述，而专家撰写的新颖性评价则代表了当前人类判断的黄金标准之一。此外，我们提出了一个四维评估框架（包括相关性、正确性、覆盖度与清晰度）以评估大语言模型生成的新颖性评价质量。在不同提示策略下对通用模型与专业模型进行的广泛实验表明：当前模型对科学新颖性的理解能力有限，且微调模型常存在指令遵循缺陷。这些发现凸显了需要针对性地开发联合提升新颖性理解与指令遵循能力的微调策略。

Abstract

Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs’ capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (including Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine-tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.

Keywords: Large Language Models, Novelty Assessment, Benchmark, Peer Review, Fine-tuning, AI for Science, Evaluation Framework, NLP Conference

74. ❌ A collaborative agent with two lightweight synergistic models for autonomous crystal materials research

Authors: Tongyu Shi, Yutang Li, Zhanyuan Li, Qian Liu, Jie Zhou, Wenhe Xu, Yang Li, Dawei Dai, Rui He, Wenhua Zhou, Jiahong Wang, Xue-Feng Yu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11540v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

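Rationale: The paper proposes MatBrain, a lightweight collaborative agent system for crystal materials research built from two synergistic models (30B and 14B parameters). Core relevant keywords: 1) 'Small Language Models' (10): it explicitly uses lightweight models (30B/14B) and lowers the hardware deployment barrier by over 95%; 2) 'LLM Agents'/'Tool Use'/'Multi-agent Systems' (10 each): the system is a collaborative agent built for tool coordination and task execution; 3) 'AI for Science' (10): it is applied to materials science, including tasks such as catalyst design; 4) 'Chain of Thought'/'System 2 Thinking' (8 each): it involves domain reasoning and analysis; 5) 'Large Language Models' (8): it builds on large-model technology; 6) 'Pre-training'/'Post-training' (5 each): model training is implied. Other keywords such as MoE, Scaling Laws, and RLHF are not covered and score 0.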

!!! tip deepseek-chat TL;DR

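The paper addresses the weaknesses of large language models in materials-science reasoning and tool coordination. It proposes MatBrain, a lightweight dual-model collaborative agent system that achieves strong performance with low hardware requirements in crystal materials research and reports roughly 100-fold acceleration in catalyst design.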

Abstract (Chinese translation)

当前的大型语言模型需要数千亿参数，却在材料科学领域的专业推理与工具协调方面存在局限。本文提出MatBrain——一个专为晶体材料研究设计的轻量级协同智能体系统，其核心由两个协同模型构成。该系统采用双模型架构：Mat-R1（300亿参数）作为分析模型，提供专家级领域推理能力；Mat-T1（140亿参数）作为执行模型，协调基于工具的操作流程。熵分析证实，该架构通过解耦工具规划与分析推理两者不同的熵动态，有效解决了其间的功能冲突。凭借这种双模型架构与结构效率，MatBrain在显著超越更大规模通用模型的同时，将硬件部署门槛降低了95%以上。该系统在结构生成、性质预测与合成规划等任务中均展现出卓越的通用性。在催化剂设计应用中，MatBrain于48小时内生成了3万种候选结构，并筛选出38种前景材料，相较传统方法实现了约100倍的加速。这些成果彰显了轻量级协同智能在提升材料研究能力方面的巨大潜力。

Abstract

Current large language models require hundreds of billions of parameters yet struggle with domain-specific reasoning and tool coordination in materials science. Here, we present MatBrain, a lightweight collaborative agent system with two synergistic models specialized for crystal materials research. MatBrain employs a dual-model architecture: Mat-R1 (30B parameters) as the analytical model providing expert-level domain reasoning, and Mat-T1 (14B parameters) as the executive model orchestrating tool-based actions. Entropy analysis confirms that this architecture resolves the conflict between tool planning and analytical reasoning by decoupling their distinct entropy dynamics. Enabled by this dual-model architecture and structural efficiency, MatBrain significantly outperforms larger general-purpose models while reducing the hardware deployment barrier by over 95%. MatBrain exhibits versatility across structure generation, property prediction, and synthesis planning tasks. Applied to catalyst design, MatBrain generated 30,000 candidate structures and identified 38 promising materials within 48 hours, achieving approximately 100-fold acceleration over traditional approaches. These results demonstrate the potential of lightweight collaborative intelligence for advancing materials research capabilities.

Keywords: collaborative agent, lightweight models, materials science, dual-model architecture, tool coordination, crystal materials research, autonomous research, domain reasoning

75. ❌ Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

Authors: Xi-Wei Pan, Shi-Wen An, Jin-Guo Liu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11535v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

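Scoring rationale: The paper studies a systems-engineering approach to building a large-scale library of problem reductions with AI coding agents; its core is 'harness engineering' (constraint design, verification systems, and feedback loops). This is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points), because the paper explicitly uses AI agents for software development. The other keywords, covering large-model techniques, training methods, inference optimization, and AI-for-science applications, are not mentioned in the abstract and are therefore scored 0.

!!! tip deepseek-chat TL;DR

The paper presents an engineering approach that uses AI coding agents to build a large-scale library of reductions between NP-hard problems. By designing constraints, verification systems, and feedback loops, the authors built a library of 100+ problem types and 200+ reduction rules in about three months, so that a new solver automatically becomes available to every problem connected to it by a reduction path.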

Abstract (Chinese translation)

求解NP难优化问题通常需要针对特定求解器（如量子硬件、商业优化器或领域启发式算法）进行问题重构。若存在一种能在难解问题间进行多项式时间归约的工具，实践者便可通过统一接口将任何支持的问题导向任何支持的求解器。然而，大规模构建此类库始终面临挑战。本文证明，通过约束工程——即设计用于引导AI编码智能体的约束条件、验证系统和反馈循环——能够突破这一障碍。我们的约束系统结合了以下要素：为领域专家提供的无代码贡献途径、涵盖从类型检查到智能体特征测试（由AI智能体扮演终端用户角色）的多层验证栈，以及全自动的实现-评审-集成流程。在大约三个月内，我们构建了一个命令行工具，其底层库包含100多种问题类型和200多条归约规则，代码量超过17万行Rust。结果表明，精心设计的约束系统能使智能体以超越以往归约库项目的规模和速度构建出经过充分测试的软件。由于归约图具有传递组合性，为任一问题类型注册的新求解器可立即通过归约路径服务于所有关联问题。源代码已发布于https://github.com/CodingThrust/problem-reductions。

Abstract

Solving an NP-hard optimization problem often requires reformulating it for a specific solver – quantum hardware, a commercial optimizer, or a domain heuristic. A tool for polynomial-time reductions between hard problems would let practitioners route any supported problem to any supported solver through a single interface. Building such a library at scale, however, has remained out of reach. We show that harness engineering, the practice of designing constraints, verification systems, and feedback loops that channel AI coding agents, can overcome this barrier. Our harness combines a no-code contribution route for domain experts, a multilayer verification stack ranging from type-level checks to agentic feature tests (AI agents role-playing as end users), and a fully automated implementation-review-integration pipeline. In about three months, we built a command-line tool backed by a library of 100+ problem types and 200+ reduction rules in over 170k lines of Rust. The result suggests that a well-engineered harness lets agents build well-tested software at a scale and pace beyond prior reduction-library efforts. Because the reduction graph composes transitively, a new solver registered for any single problem type instantly becomes available to every problem connected by a reduction path. The source code is available at https://github.com/CodingThrust/problem-reductions.

Keywords: NP-hard optimization, problem reductions, AI coding agents, harness engineering, verification systems, automated pipeline, reduction library, Rust implementation
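
The transitive routing described in the abstract above can be illustrated with a small graph search. This is a minimal sketch, not the paper's Rust implementation: the problem names, the `REDUCTIONS` table, and the `find_route` helper are all hypothetical and chosen only for illustration. It shows how registering a solver for one problem type makes that solver reachable from every problem that has a reduction path to it.

```python
from collections import deque

# Hypothetical reduction rules: an edge A -> B means "A reduces to B in polynomial time".
# The problem names are illustrative and are not taken from the paper's library.
REDUCTIONS = {
    "3SAT": ["IndependentSet"],
    "IndependentSet": ["QUBO", "VertexCover"],
    "VertexCover": [],
    "QUBO": [],
    "TSP": [],
}

def find_route(source, solvable):
    """Breadth-first search for a shortest reduction path from `source`
    to any problem type that has a registered solver."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] in solvable:
            return path
        for nxt in REDUCTIONS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no registered solver is reachable

# Registering a solver for a single problem type (hypothetically, QUBO)
# makes it usable from every problem with a path to QUBO.
solvers = {"QUBO"}
routes = {problem: find_route(problem, solvers) for problem in REDUCTIONS}
```

In this toy graph, `routes["3SAT"]` is a two-step path ending at the solver's problem type, while `routes["TSP"]` is `None` because `TSP` has no outgoing reductions here.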

76. ❌ SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models

Authors: Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11530v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

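Scoring rationale: The paper studies token pruning for Vision-Language Models (VLMs), at the intersection of computer vision and natural language processing. However, all scoring keywords target the technical principles, training methods, inference optimization, alignment techniques, and applications of text-only large language models (LLMs), and have no direct connection to visual token pruning in VLMs. The paper does not involve LLM-related techniques, nor is it applied in a scientific domain (such as bioinformatics), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes SVD-Prune, a training-free visual token pruning method based on singular value decomposition. It addresses the high computational cost that Vision-Language Models incur when processing long visual sequences, and it outperforms existing methods under extreme token budgets.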

Abstract (Chinese translation)

视觉语言模型通过联合处理视觉与文本信息，在多模态学习领域引发了革命性进展。然而，由于处理长序列视觉标记所需的高计算与内存成本，这类模型仍面临重大挑战。现有方法多依赖局部启发式策略，如注意力分数或标记范数，但这些标准存在位置偏差和信息分散问题，导致在高剪枝率下难以保留关键内容，并在视觉细节丰富的图像上出现性能下降。为解决上述问题，我们提出SVD-Prune——一种基于奇异值分解的无训练即插即用型标记剪枝方法。该方法通过分解视觉标记特征矩阵，并利用统计杠杆得分选取前K个标记，确保仅保留对全局主导方差贡献最大的标记。实验表明，在极端视觉标记预算条件下，SVD-Prune始终优于现有剪枝方法，即使仅使用32或16个视觉标记仍能保持强劲性能。

Abstract

Vision-Language Models (VLMs) have revolutionized multimodal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-K tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.

Keywords: Vision-Language Models, token pruning, Singular Value Decomposition, training-free, computational efficiency, vision tokens, SVD-Prune, multimodal learning
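
As a rough illustration of leverage-score token selection, the sketch below uses NumPy's `np.linalg.svd`. It is an example under stated assumptions rather than the authors' implementation: the function name `leverage_score_prune`, the `rank` parameter (how many leading singular directions count as "dominant"), the random input, and the 576 x 64 shape are all hypothetical. A token's leverage score is taken here to be the squared norm of its row in the leading left singular vectors, which is the standard definition of rank-restricted statistical leverage.

```python
import numpy as np

def leverage_score_prune(tokens, keep, rank):
    """Select the `keep` rows of `tokens` (n_tokens x dim) with the largest
    rank-`rank` statistical leverage scores.

    `rank` is an assumption of this sketch, not a value from the paper.
    """
    # Thin SVD: tokens = U @ diag(S) @ Vt, with U of shape (n_tokens, min(n_tokens, dim)).
    U, S, Vt = np.linalg.svd(tokens, full_matrices=False)
    # Leverage of token i: squared norm of row i of the top-`rank` left singular vectors.
    scores = np.sum(U[:, :rank] ** 2, axis=1)
    # Indices of the `keep` highest-scoring tokens, returned in their original order.
    top = np.sort(np.argsort(scores)[-keep:])
    return top, scores

rng = np.random.default_rng(0)
vision_tokens = rng.normal(size=(576, 64))  # illustrative: 576 patch tokens, 64-dim features
kept_idx, scores = leverage_score_prune(vision_tokens, keep=32, rank=8)
pruned = vision_tokens[kept_idx]            # shape (32, 64)
```

Because the columns of `U` are orthonormal, the scores sum to `rank`, which gives a quick sanity check on the computation. Returning the kept indices in their original order preserves the positional ordering of the surviving tokens.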

77. ❌ Limited Perfect Monotonical Surrogates constructed using low-cost recursive linkage discovery with guaranteed output

Authors: M. W. Przewozniczek, F. Chicano, R. Tinós, M. M. Komarnicki Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11524v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

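Scoring rationale: The paper studies surrogate models for optimization problems, specifically a perfect monotonical surrogate for nonlinear problems (LyMPuS) that reduces the cost of expensive local search. Its content focuses entirely on optimization algorithms and surrogate modeling and has no direct connection to any of the scoring keywords (all of which concern large models, deep learning, or AI applications). The paper mentions no language models, deep learning architectures, training methods, alignment techniques, reasoning methods, agent systems, model compression, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes LyMPuS, a limited monotonical perfect surrogate that reduces the cost of expensive local search in nonlinear optimization problems. The model is parameterless, can be trained on the fly, and discovers missing dependencies at low cost.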

Abstract (Chinese translation)

代理模型为优化计算代价高昂的问题提供了廉价的解评估方案，并展现出显著优势。通常，代理模型仅能近似原始函数。近期，研究者提出了理想的完美线性代理模型，其能够完美表征原始函数。这些代理模型并非对原始函数的模仿，事实上，它们是原始函数的另一种（正确）表达形式，从而开启了广泛的应用可能，例如：对于无法将编码解直接转换为评估值的问题，可藉此发现其优化函数。然而，许多现实世界问题无法用线性模型表示，使得上述代理模型无法适用。为此，我们提出了有限单调完美代理模型（Limited Monotonical Perfect Surrogate, LyMPuS），该模型克服了这一困难，并能够对仅单变量不同的两个解进行比较。我们的方案适用于限制昂贵局部搜索过程的成本。所提出的代理模型无需参数，可实时训练而无需任何独立的代理构建步骤。它仅使用必要的适应度评估，且在模型更新时已付出的计算成本不会被浪费。最后，该模型提供了低成本的缺失关联检测与低成本的关联发现能力，并保证在不超过 $2\lceil\log_2(n)\rceil$ 步内找到缺失的依赖关系。

Abstract

Surrogates provide a cheap solution evaluation and offer significant leverage for optimizing computationally expensive problems. Usually, surrogates only approximate the original function. Recently, the perfect linear surrogates were proposed that ideally represent the original function. These surrogates do not mimic the original function. In fact, they are another (correct) representation of it and enable a wide range of possibilities, e.g., discovering the optimized function for problems where the direct transformation of the encoded solution into its evaluation is not available. However, many real-world problems cannot be represented by linear models, making the aforementioned surrogates inapplicable. Therefore, we propose the Limited Monotonical Perfect Surrogate (LyMPuS), which overcomes this difficulty and enables the comparison of two solutions that differ by a single variable. Our proposition is suitable for limiting the cost of expensive local search procedures. The proposed surrogate is parameterless and can be trained on the fly without any separate surrogate-building step. It uses only the necessary fitness evaluations, and the already-paid costs are not wasted when the model is updated. Finally, it offers low-cost missing-linkage detection and low-cost linkage discovery, guaranteed to find a missing dependency in no more than $2\lceil\log_2(n)\rceil$ steps.

Keywords: surrogate models, perfect surrogates, monotonical surrogates, optimization, linkage discovery, local search, fitness evaluation, nonlinear problems

78. ❌ PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints

Authors: Minjun Park, Donghyun Kim, Hyeonjong Ju, Seungwon Lim, Dongwook Choi, Taeyoon Kwon, Minju Kim, Jinyoung Yeo Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11523v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

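Scoring rationale: The paper's core topic is evaluating multi-agent collaboration under privacy constraints. It is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' and 'Multi-agent Systems OR Agent Coordination' (10 points each), because it explicitly studies collaboration among AI agents. It is somewhat relevant to 'Large Language Models OR LLMs OR Foundation Models' (5 points), because multi-agent systems are typically built on large models, and to 'Hallucination Mitigation OR Factuality OR Truthfulness' (5 points), because the paper names 'privacy-induced hallucinations' as one cause of coordination breakdowns. The other keywords (such as MoE, SFT, and RAG) have no direct relation to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper introduces the PAC-Bench benchmark for evaluating multi-agent collaboration under privacy constraints. It finds that privacy constraints substantially degrade collaboration performance and cause coordination breakdowns, and it identifies privacy-aware multi-agent collaboration as an unresolved challenge.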

Abstract (Chinese translation)

我们正步入一个新时代，个体与组织日益部署专用的AI智能体，这些智能体与其他智能体进行交互与协作。然而，在隐私约束下的多智能体协作动态机制仍鲜为人知。在本研究中，我们提出了$PAC\text{-}Bench$，这是一个用于系统评估隐私约束下多智能体协作的基准。在$PAC\text{-}Bench$上的实验表明，隐私约束显著降低了协作性能，并使结果更多地取决于发起智能体而非协作伙伴。进一步分析揭示，这种性能下降源于反复出现的协调失效，包括早期阶段的隐私侵犯、过度保守的抽象化以及隐私引发的幻觉。综合而言，我们的研究指出，具备隐私意识的多智能体协作是一个独特且尚未解决的挑战，需要超越现有智能体能力的新型协调机制。

Abstract

We are entering an era in which individuals and organizations increasingly deploy dedicated AI agents that interact and collaborate with other agents. However, the dynamics of multi-agent collaboration under privacy constraints remain poorly understood. In this work, we present $PAC\text{-}Bench$, a benchmark for systematic evaluation of multi-agent collaboration under privacy constraints. Experiments on $PAC\text{-}Bench$ show that privacy constraints substantially degrade collaboration performance and make outcomes depend more on the initiating agent than the partner. Further analysis reveals that this degradation is driven by recurring coordination breakdowns, including early-stage privacy violations, overly conservative abstraction, and privacy-induced hallucinations. Together, our findings identify privacy-aware multi-agent collaboration as a distinct and unresolved challenge that requires new coordination mechanisms beyond existing agent capabilities.

Keywords: multi-agent collaboration, privacy constraints, benchmark evaluation, coordination breakdowns, privacy violations, AI agents, privacy-induced hallucinations, agent coordination

79. ❌ From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python

Authors: Jinhua Wang, Biswa Sengupta Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11518v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

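Scoring rationale: The paper's core is an LLM-assisted migration of a production AI agent (Codex CLI) from Rust to Python, iteratively refined with a benchmark-driven method. It is therefore highly relevant to "Large Language Models" and "LLM Agents" (10 points each), because an LLM is the translation tool and the system under study is an AI agent. It is somewhat relevant to "Tool Use" and "Multi-agent Systems" (5 points each), because the paper mentions that the agent uses API tools and adds multi-agent orchestration. The other keywords (such as MoE, SFT, and RAG) are not covered in the paper and all receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes an LLM-assisted, benchmark-driven method that migrated a production AI coding agent from Rust to Python. The Python port reaches near-parity with the Rust original on agentic benchmarks, adds new features, and achieves a 15.9x code reduction with negligible performance cost.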

Abstract (Chinese translation)

大型软件系统的跨语言迁移是一项持续的工程挑战，尤其在源代码库快速演进的情况下。本文提出一种基于大语言模型（LLM）辅助的持续代码翻译方法，通过以大语言模型将生产环境的Rust代码库（648K行代码，65个crate）翻译为Python代码（41K行代码，28个模块），并以公开智能体基准测试作为驱动迭代优化的目标函数。研究对象为Codex CLI——一个生产级AI编程智能体。我们证明：（1）Python移植版本在SWE-bench Verified任务中解决了59/80个问题（73.8%），而Rust原版为56/80（70.0%）；在Terminal-Bench上达到42.5%得分，Rust版为47.5%，证实了在真实世界智能体任务中达到接近对等的性能；（2）基于基准测试的调试方法（揭示了API协议不匹配、环境污染、静默WebSocket故障模式及API 400崩溃问题）比单纯静态测试更有效；（3）该架构通过LLM辅助的差异翻译测试循环支持持续的上游同步；（4）Python移植版已发展成功能超集，具备30项Rust版本没有的特性开关扩展（多智能体编排、语义记忆、守护安全机制、成本追踪），同时保留严格对等模式以供比较。评估表明：对于以API延迟为主导的基于LLM的智能体，Python的表达能力可实现15.9倍的代码精简，且性能损失可忽略不计；而将基准测试作为目标函数的方法，为跨语言移植从功能对等演进为扩展平台提供了原则性框架。

Abstract

Cross-language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We present a methodology for LLM-assisted continuous code translation in which a large language model translates a production Rust codebase (648K LOC, 65 crates) into Python (41K LOC, 28 modules), with public agent benchmarks as the objective function driving iterative refinement. Our subject system is Codex CLI, a production AI coding agent. We demonstrate that: (1) the Python port resolves 59/80 SWE-bench Verified tasks (73.8%) versus Rust’s 56/80 (70.0%), and achieves 42.5% on Terminal-Bench versus Rust’s 47.5%, confirming near-parity on real-world agentic tasks; (2) benchmark-driven debugging, revealing API protocol mismatches, environment pollution, a silent WebSocket failure mode, and an API 400 crash, is more effective than static testing alone; (3) the architecture supports continuous upstream synchronisation via an LLM-assisted diff-translate-test loop; and (4) the Python port has evolved into a capability superset with 30 feature-flagged extensions (multi-agent orchestration, semantic memory, guardian safety, cost tracking) absent from Rust, while preserving strict parity mode for comparison. Our evaluation shows that for LLM-based agents where API latency dominates, Python’s expressiveness yields a 15.9x code reduction with negligible performance cost, while the benchmark-as-objective-function methodology provides a principled framework for growing a cross-language port from parity into an extended platform.

Keywords: LLM-assisted code translation, AI coding agent, benchmark-driven debugging, cross-language migration, production system, continuous synchronization, multi-agent orchestration, code reduction

80. ❌ EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

Authors: Jinane Bazzi, Mariam Rakka, Fadi Kurdahi, Mohammed E. Fouda, Ahmed Eltawil Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11512v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

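Scoring rationale: The paper's core topic is efficient inference acceleration for Small Language Models (SLMs) on edge devices. It is highly relevant to 'Small Language Models OR SLMs OR On-device AI' (10 points), because the whole paper revolves around deploying SLMs at the edge, and to 'Speculative Decoding OR Inference Acceleration' (10 points), because its focus is a hardware-software co-design that accelerates inference. It is somewhat relevant to 'Quantization OR Model Compression OR Low-bit Weights' (5 points), because the paper reports results at INT4 precision, and to 'Large Language Models OR LLMs OR Foundation Models' (5 points), because SLMs are a subset of LLMs and the paper benchmarks models such as LLaMA. Other keywords such as MoE, Scaling Laws, Pre-training, and Alignment are unrelated to the paper's hardware-acceleration and edge-deployment theme and receive 0 points.

!!! tip deepseek-chat TL;DR

Targeting the low inference efficiency of Small Language Models on edge devices, the paper proposes EdgeCIM, a hardware-software co-design framework that achieves up to 7.3x higher throughput and 49.59x better energy efficiency than an NVIDIA Orin Nano on LLaMA3.2-1B.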

Abstract (Chinese translation)

随着在笔记本电脑、智能手机及嵌入式平台等边缘设备上部署小语言模型（Small Language Models, SLMs）的需求日益增长，现有加速器存在的根本性效率不足问题日益凸显。尽管图形处理器（GPUs）能高效处理预填充（prefill）工作负载，但自回归解码（autoregressive decoding）阶段主要由内存带宽受限的广义矩阵-向量乘法（GEMV）操作主导，导致在边缘场景下计算利用率低下且能耗成本过高。本文提出EdgeCIM，一个面向端到端仅解码器（decoder-only）推理的软硬件协同设计框架。其核心是一个基于65纳米工艺实现的存内计算（CIM）宏单元，结合一种基于分块（tile-based）的映射策略，以平衡流水线级数，在最大化并行度的同时缓解动态随机存取存储器（DRAM）带宽瓶颈。我们的模拟器支持对参数规模高达40亿（4B）的SLMs进行设计空间探索，从而在延迟与能耗方面确定帕累托最优配置。相较于英伟达Orin Nano平台，EdgeCIM在LLaMA3.2-1B模型上实现了高达7.3倍的吞吐量提升和49.59倍的能效提升；在LLaMA3.2-3B模型上，其吞吐量比高通SA8255P高出9.95倍。在TinyLLaMA-1.1B、LLaMA3.2（1B、3B）、Phi-3.5-mini-3.8B、Qwen2.5（0.5B、1.5B、3B）、SmolLM2-1.7B、SmolLM3-3B及Qwen3（0.6B、1.7B、4B）等一系列模型上的广泛测试表明，我们的加速器在INT4精度下平均可实现336.42 tokens/s的吞吐率和173.02 tokens/J的能效。这些结果确立了EdgeCIM作为一种面向实时、高能效边缘级SLM推理的极具竞争力的解决方案。

Abstract

The growing demand for deploying Small Language Models (SLMs) on edge devices, including laptops, smartphones, and embedded platforms, has exposed fundamental inefficiencies in existing accelerators. While GPUs handle prefill workloads efficiently, the autoregressive decoding phase is dominated by GEMV operations that are inherently memory-bound, resulting in poor utilization and prohibitive energy costs at the edge. In this work, we present EdgeCIM, a hardware-software co-design framework that rethinks accelerator design for end-to-end decoder-only inference. At its core is a CIM macro, implemented in 65nm, coupled with a tile-based mapping strategy that balances pipeline stages, maximizing parallelism while alleviating DRAM bandwidth bottlenecks. Our simulator enables design space exploration of SLMs up to 4B parameters, identifying Pareto-optimal configurations in terms of latency and energy. Compared to an NVIDIA Orin Nano, EdgeCIM achieves up to 7.3x higher throughput and 49.59x better energy efficiency on LLaMA3.2-1B, and delivers 9.95x higher throughput than Qualcomm SA8255P on LLaMA3.2-3B. Extensive benchmarks on TinyLLaMA-1.1B, LLaMA3.2 (1B, 3B), Phi-3.5-mini-3.8B, Qwen2.5 (0.5B, 1.5B, 3B), SmolLM2-1.7B, SmolLM3-3B, and Qwen3 (0.6B, 1.7B, 4B) reveal that our accelerator, under INT4 precision, achieves on average 336.42 tokens/s and 173.02 tokens/J. These results establish EdgeCIM as a compelling solution towards real-time, energy-efficient edge-scale SLM inference.

Keywords: Small Language Models, Edge Computing, Hardware-Software Co-design, Inference Acceleration, Energy Efficiency, CIM-based Acceleration, Decoder-only Inference, Autoregressive Decoding
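
A back-of-the-envelope calculation can make the abstract's claim that decode-phase GEMV is memory-bound more concrete. The sketch below compares arithmetic intensity (FLOPs per byte of memory traffic) for a single-token matrix-vector product and a multi-token matrix-matrix product. The layer size, the byte widths, the helper name `arithmetic_intensity`, and the simplifying assumption that every operand crosses the memory interface exactly once are choices of this sketch, not figures from the paper.

```python
def arithmetic_intensity(m, n, tokens, weight_bytes=0.5, act_bytes=2.0):
    """FLOPs per byte of memory traffic for Y = W @ X, with W of shape (m, n)
    and X of shape (n, tokens).

    Assumes each operand is moved exactly once (no cache reuse is modeled).
    """
    flops = 2 * m * n * tokens                       # one multiply and one add per weight per token
    traffic = m * n * weight_bytes                   # read the weight matrix
    traffic += (n + m) * tokens * act_bytes          # read inputs, write outputs
    return flops / traffic

# Illustrative 2048 x 2048 projection with INT4 weights (0.5 byte each)
# and 16-bit activations (2 bytes each).
decode = arithmetic_intensity(2048, 2048, tokens=1)     # GEMV: one token per decoding step
prefill = arithmetic_intensity(2048, 2048, tokens=512)  # GEMM: a 512-token prompt at once

print(f"decode  intensity: {decode:.1f} FLOPs/byte")
print(f"prefill intensity: {prefill:.1f} FLOPs/byte")
```

Under these assumptions the decode step performs only about four operations per byte fetched, because each weight is used once and then discarded, while prefill reuses each weight across all prompt tokens. That gap is the reason weight movement, rather than compute, tends to limit decoding throughput.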

81. ❌ Not All Forgetting Is Equal: Architecture-Dependent Retention Dynamics in Fine-Tuned Image Classifiers

Authors: Miit Daga, Swarna Priya Ramu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11508v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

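Scoring rationale: The paper studies per-sample forgetting dynamics during fine-tuning of image classifiers and is unrelated to most of the large-language-model keywords. The most relevant keyword is 'Post-training OR Supervised Fine-tuning OR SFT' (10 points), because the paper's core subject is the fine-tuning process. 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) is partially relevant, because the paper uses a retinal OCT dataset, a biomedical AI application, though this is not its focus. Other keywords such as LLMs, MoE, and Scaling Laws are not covered.

!!! tip deepseek-chat TL;DR

The paper studies how sample forgetting patterns differ between ResNet-18 and DeiT-Small image classifiers during fine-tuning. It finds that forgetting is architecture dependent and stochastic across random seeds, so curriculum design based on per-sample difficulty may not generalize.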

Abstract (Chinese translation)

对预训练图像分类器进行微调是标准做法，但在此过程中哪些个体样本被遗忘，以及遗忘模式是稳定的还是依赖于架构的，目前尚不清楚。理解这些动态特性对课程设计、数据剪枝和集成构建具有直接影响。我们在视网膜OCT数据集（7个类别，56:1不平衡比例）和CUB-200-2011（200种鸟类）上对ResNet-18和DeiT-Small进行微调，追踪每个训练周期中每个样本的正确性，并为每个样本的保留轨迹拟合艾宾浩斯式指数衰减曲线。我们得出五点发现。首先，两种架构遗忘的样本存在根本差异：在OCTDL数据集上，前10%最易被遗忘样本的杰卡德重叠系数为0.34；在CUB-200数据集上为0.15。其次，视觉Transformer（ViT）的遗忘比卷积神经网络（CNN）更具结构性（平均$R^2 = 0.74$对比$R^2 = 0.52$）。第三，不同随机种子间的样本级遗忘具有随机性（斯皮尔曼$ρ\approx 0.01$），这对“样本难度是固有属性”的假设提出了挑战。第四，类别级遗忘具有一致性且可进行语义解释：视觉相似的物种最易被遗忘，特征鲜明的物种最不易被遗忘。第五，头部预热阶段后样本的损失值可预测其长期衰减常数（$ρ= 0.30$至$0.50$，$p < 10^{-45}$）。这些发现表明，集成模型中架构的多样性可提供互补的保留覆盖度，而基于样本级难度的课程设计或剪枝方法可能无法在不同训练过程中泛化。基于这些衰减常数构建的间隔重复采样器并未超越随机采样，表明静态调度无法利用不稳定的样本级信号。

Abstract

Fine-tuning pretrained image classifiers is standard practice, yet which individual samples are forgotten during this process, and whether forgetting patterns are stable or architecture dependent, remains unclear. Understanding these dynamics has direct implications for curriculum design, data pruning, and ensemble construction. We track per-sample correctness at every epoch during fine-tuning of ResNet-18 and DeiT-Small on a retinal OCT dataset (7 classes, 56:1 imbalance) and CUB-200-2011 (200 bird species), fitting Ebbinghaus-style exponential decay curves to each sample’s retention trace. Five findings emerge. First, the two architectures forget fundamentally different samples: Jaccard overlap of the top 10 percent most-forgotten is 0.34 on OCTDL and 0.15 on CUB-200. Second, ViT forgetting is more structured (mean $R^2 = 0.74$) than CNN forgetting ($R^2 = 0.52$). Third, per-sample forgetting is stochastic across random seeds (Spearman $\rho \approx 0.01$), challenging the assumption that sample difficulty is an intrinsic property. Fourth, class-level forgetting is consistent and semantically interpretable: visually similar species are forgotten most, distinctive ones least. Fifth, a sample’s loss after head warmup predicts its long-term decay constant ($\rho = 0.30$ to $0.50$, $p < 10^{-45}$). These findings suggest that architectural diversity in ensembles provides complementary retention coverage, and that curriculum or pruning methods based on per-sample difficulty may not generalize across runs. A spaced repetition sampler built on these decay constants does not outperform random sampling, indicating that static scheduling cannot exploit unstable per-sample signals.

Keywords: fine-tuning, image classifiers, forgetting patterns, architecture-dependent, retention dynamics, ResNet-18, DeiT-Small, curriculum design
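
The Jaccard overlap statistic quoted in the abstract above can be computed as in the following sketch. It is illustrative only: the helper names `top_fraction` and `jaccard`, the definition of a forgetting count (how many times a sample flips from correct to incorrect), and the toy counts are assumptions of this example and are not data or code from the paper.

```python
def top_fraction(forget_counts, fraction=0.10):
    """Return the ids of the samples with the highest forgetting counts (top `fraction`)."""
    k = max(1, int(len(forget_counts) * fraction))
    ranked = sorted(forget_counts, key=forget_counts.get, reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up forgetting counts for 20 samples under two architectures.
cnn_counts = [9, 8, 1, 0, 2, 3, 0, 1, 2, 0, 1, 0, 3, 2, 1, 0, 0, 1, 2, 0]
vit_counts = [9, 0, 1, 8, 2, 3, 0, 1, 2, 0, 1, 0, 3, 2, 1, 0, 0, 1, 2, 0]
cnn = dict(enumerate(cnn_counts))
vit = dict(enumerate(vit_counts))

# Top 10 percent of 20 samples is 2 samples per architecture.
overlap = jaccard(top_fraction(cnn), top_fraction(vit))
```

In this toy data the two architectures share one of their two most-forgotten samples, so the overlap is one third. A value well below 1, like those reported in the abstract, indicates that the architectures mostly forget different samples.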

82. ❌ Lectures on AI for Mathematics

Authors: Xiaoyang Chen, Xiaoyang Chen Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11504v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

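Scoring rationale: “Lectures on AI for Mathematics” is an introductory book on applying AI to mathematics. It covers how AI can be used in mathematical research to discover patterns, assist proofs, and construct counterexamples. The abstract mentions no specific large-model technique (LLM, MoE, SFT, etc.), training method (pre-training, alignment), reasoning technique (CoT, RAG), or optimization technique (quantization, attention mechanisms). The only relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”: mathematics is a branch of science, so AI applied to mathematics can be counted under “AI for Science”, although the paper does not discuss bioinformatics or cheminformatics. That keyword is therefore rated 5 (some relevance) and all others 0 (unrelated). The paper does not involve any of the designated experts on the author list.

!!! tip deepseek-chat TL;DR

This paper introduces applications of AI to mathematical research, including discovering hidden patterns, assisting theorem proving, and constructing counterexamples, and aims to provide a comprehensive and accessible introduction to this emerging field.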

Abstract

This book provides a comprehensive and accessible introduction to the emerging field of AI for mathematics. It covers the core principles and diverse applications of using artificial intelligence to advance mathematical research. Through clear explanations, the text explores how AI can discover hidden mathematical patterns, assist in proving complicated theorems, and even construct counterexamples to challenge conjectures.

Keywords: AI for mathematics, mathematical research, pattern discovery, theorem proving, counterexample construction, artificial intelligence, mathematical patterns, emerging field

83. ❌ Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers

Authors: I. Esra Buyuktahtakin Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11507v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

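Scoring rationale: The paper discusses the intersection of deep learning (in particular feedforward neural networks, LSTMs, Transformers, and deep reinforcement learning) with operations research and management science (OR/MS) for sequential decision-making under uncertainty, stressing that deep learning complements optimization rather than replacing it. Although it involves AI and deep learning, it never mentions large language models (LLMs), foundation models, or any specific technique among the scoring keywords (MoE, SFT, RLHF, RAG, etc.), nor a specific scientific application of AI such as bioinformatics. All keywords concern large-model techniques, training methods, inference optimization, or specific scientific applications, whereas this paper focuses on integrating general deep learning with decision optimization, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

From an OR/MS perspective, this paper examines deep learning for sequential decision-making under uncertainty. It positions deep learning as a complement to optimization, combining the adaptability of deep learning with the structural rigor of operations research, and shows the potential of integrated learning-optimization systems in supply chains, healthcare, agriculture, and other domains.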

Abstract

Artificial intelligence (AI) is moving increasingly beyond prediction to support decisions in complex, uncertain, and dynamic environments. This shift creates a natural intersection with operations research and management sciences (OR/MS), which have long offered conceptual and methodological foundations for sequential decision-making under uncertainty. At the same time, recent advances in deep learning, including feedforward neural networks, LSTMs, transformers, and deep reinforcement learning, have expanded the scope of data-driven modeling and opened new possibilities for large-scale decision systems. This tutorial presents an OR/MS-centered perspective on deep learning for sequential decision-making under uncertainty. Its central premise is that deep learning is valuable not as a replacement for optimization, but as a complement to it. Deep learning brings adaptability and scalable approximation, whereas OR/MS provides the structural rigor needed to represent constraints, recourse, and uncertainty. The tutorial reviews key decision-making foundations, connects them to the major neural architectures in modern AI, and discusses leading approaches to integrating learning and optimization. It also highlights emerging impact in domains such as supply chains, healthcare and epidemic response, agriculture, energy, and autonomous operations. More broadly, it frames these developments as part of a wider transition from predictive AI toward decision-capable AI and highlights the role of OR/MS in shaping the next generation of integrated learning–optimization systems.

Keywords: deep learning, sequential decision making, uncertainty, operations research, optimization, reinforcement learning, transformers, data-driven modeling

84. ❌ Quantization Dominates Rank Reduction for KV-Cache Compression

Authors: Samuel Salfati Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11501v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

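Scoring rationale: The paper's core topic is KV-cache compression, which is directly and highly relevant to “KV Cache Compression OR Linear Attention OR FlashAttention” (15 points) and “Quantization OR Model Compression OR Low-bit Weights” (15 points). It concerns inference acceleration for large models such as Mistral 7B and GPT-2, so “Large Language Models OR LLMs OR Foundation Models” and “Speculative Decoding OR Inference Acceleration” each receive 10 points. Other keywords such as MoE, SLMs, alignment, and RAG are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper compares two KV-cache compression strategies, quantization and rank reduction, and finds that quantization consistently outperforms rank reduction at matched storage budgets. It traces the gap to a structural asymmetry and achieves 75% KV reduction with minimal performance loss.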

Abstract

We compare two strategies for compressing the KV cache in transformer inference: rank reduction (discard dimensions) and quantization (keep all dimensions, reduce precision). At matched storage budgets across five models (124M-14B, MHA and GQA), we find that quantization consistently outperforms rank reduction by 4-364 PPL depending on model and compression level. The gap persists even when rank reduction is combined with quantization in hybrid baselines, and it grows with GQA aggressiveness. On LAMBADA, INT4 matches FP16 accuracy (+0.23 PPL on Mistral 7B, +0.58 on GPT-2) while rank-32 at identical storage collapses to 0.4%. We trace this gap to a structural asymmetry: under softmax attention routing, removing a dimension can flip which token is attended (a discrete failure), while quantization noise is bounded and typically preserves score ordering. We formalize this via a perturbation result showing projection damage exceeds quantization damage by 3 x 2^(2b) per direction under the softmax Fisher metric. A basis ablation confirms the finding is basis-independent (spread <0.4 PPL), establishing that the advantage comes from preserving dimensions, not from a better coordinate system. Joint K+V INT4 quantization achieves 75% total KV reduction at only +0.18 PPL on Mistral 7B.

Keywords: KV cache compression, quantization, rank reduction, transformer inference, model compression, inference acceleration, attention mechanism, softmax perturbation
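
The “matched storage budget” comparison in the abstract can be checked with simple arithmetic. The sketch below assumes a head dimension of 128 and ignores quantization metadata such as scales and zero points; neither detail is stated in the abstract. It also reads `b` in the factor 3 x 2^(2b) as the quantization bit width, which the abstract does not spell out.

```python
def vector_bytes(dims, bits):
    """Bytes needed to store one cached K or V vector of `dims` entries."""
    return dims * bits / 8

HEAD_DIM = 128  # hypothetical head dimension, not taken from the paper

fp16_full = vector_bytes(HEAD_DIM, 16)   # uncompressed baseline
int4_full = vector_bytes(HEAD_DIM, 4)    # quantization: keep all dims, 4 bits each
fp16_rank32 = vector_bytes(32, 16)       # rank reduction: keep 32 dims at 16 bits

reduction = 1 - int4_full / fp16_full    # fraction of KV storage saved by INT4

def damage_ratio(bits):
    """The abstract's factor 3 * 2^(2b), reading b as the bit width."""
    return 3 * 2 ** (2 * bits)

print(fp16_full, int4_full, fp16_rank32)  # 256.0 64.0 64.0
print(reduction)                          # 0.75
print(damage_ratio(4))                    # 768
```

Under these assumptions, INT4 over all 128 dimensions and rank 32 at FP16 cost the same 64 bytes per vector, and moving K and V from 16 to 4 bits gives the 75% reduction quoted in the abstract.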

85. ❌ ADD for Multi-Bit Image Watermarking

Authors: An Luo, Jie Ding Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11491v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

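Scoring rationale: The paper studies multi-bit image watermarking, targeting the misinformation problem raised by generative models; it belongs to computer vision and multimedia security. All keywords concern large models, principles of deep learning techniques, or AI-for-science applications, whereas this paper involves no large-model technique, training method, inference optimization, alignment technique, agent system, or scientific application, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ADD, a two-stage multi-bit image watermarking method that achieves 100% decoding accuracy for 48-bit watermarks on the MS-COCO benchmark and substantially outperforms existing methods in robustness to image distortions and in computational efficiency.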

Abstract

As generative models enable rapid creation of high-fidelity images, societal concerns about misinformation and authenticity have intensified. A promising remedy is multi-bit image watermarking, which embeds a multi-bit message into an image so that a verifier can later detect whether the image is generated by someone and further identify the source by decoding the embedded message. Existing approaches often fall short in capacity, resilience to common image distortions, and theoretical justification. To address these limitations, we propose ADD (Add, Dot, Decode), a multi-bit image watermarking method with two stages: learning a watermark to be linearly combined with the multi-bit message and added to the image, and decoding through inner products between the watermarked image and the learned watermark. On the standard MS-COCO benchmark, we demonstrate that for the challenging task of 48-bit watermarking, ADD achieves 100% decoding accuracy, with performance dropping by at most 2% under a wide range of image distortions, substantially smaller than the 14% average drop of state-of-the-art methods. In addition, ADD achieves substantial computational gains, with 2-fold faster embedding and 7.4-fold faster decoding than the fastest existing method. We further provide a theoretical analysis explaining why the learned watermark and the corresponding decoding rule are effective.

Keywords: multi-bit image watermarking, generative models, ADD method, decoding accuracy, image distortions, computational efficiency, theoretical analysis, MS-COCO benchmark
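
The add-then-dot-product structure described in the abstract can be sketched on a toy flattened image. This is not the paper's method: the watermark here is a hypothetical stand-in (random-sign patterns on disjoint pixel blocks) rather than a learned one, and the function names, image size, and strength are all illustrative assumptions.

```python
import math
import random

def make_watermarks(num_bits, block, rng):
    """Stand-in for the learned watermark: one unit-norm random-sign
    pattern per bit, on disjoint pixel blocks, so the patterns are orthonormal."""
    dim = num_bits * block
    marks = []
    for i in range(num_bits):
        w = [0.0] * dim
        for j in range(i * block, (i + 1) * block):
            w[j] = rng.choice((-1.0, 1.0)) / math.sqrt(block)
        marks.append(w)
    return marks

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def embed(image, bits, marks, strength):
    """Add: image + strength * sum_i s_i * w_i, with s_i = +1 for bit 1, -1 for bit 0."""
    out = list(image)
    for bit, w in zip(bits, marks):
        sign = 1.0 if bit else -1.0
        for j, wj in enumerate(w):
            out[j] += strength * sign * wj
    return out

def decode(image, marks):
    """Dot and decode: bit i is read from the sign of <image, w_i>."""
    return [1 if dot(image, w) > 0 else 0 for w in marks]

rng = random.Random(0)
marks = make_watermarks(num_bits=8, block=8, rng=rng)
image = [rng.random() for _ in range(64)]       # toy image, pixels in [0, 1)
bits = [1, 0, 1, 1, 0, 0, 1, 0]
recovered = decode(embed(image, bits, marks, strength=4.0), marks)
print(recovered == bits)  # True
```

In this toy setup the image itself contributes at most 8 / sqrt(8), about 2.83, to each inner product, so a strength of 4.0 guarantees correct decoding. A learned watermark would presumably aim for the same separation at a far less visible strength.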

86. ❌ Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

Authors: Samuel Cahyawijaya, Peerat Limkonchotiwat, Tack Hwa Wong, Hitesh Laxmichand Patel, Amit Agarwal, Manuel Antonio Rufino, Carlos Rafael Catalan, Muhammad Reza Qorib, Vicky Feliren, Holy Lovenia, Aye Hninn Khine, Frederikus Hudi, David Anugraha, Alham Fikri Aji, Romrawin Chumpu, Viet-Thanh Pham, Minghan Wang, Mohamed Fazli Imam, Ruochen Zhang, Joseph Marvin Imperial, Do Xuan Long, Musa Izzanardi Wijanarko, Joel Ruben Antony Moniz, Patrick Amadeus Irawan, Hanif Muhammad Zhafran, Isaiah Flores, Ira Salsabila, Jun Kevin, Jostin Jerico Rosal, Patricia Nicole Monderin, Kun Kerdthaisong, Ahmad Mustafid, My Chiffon Nguyen, Natchapon Jongwiriyanurak, Siva Worajitwannakul, Haochen Li, Adrian Xuan Wei Lim, Bin Wang, Muhammad Ravi Shulthan Habibi, Lynnette Hui Xian Ng, Mithil Bangera, Yeshil Bangera, Priyaranjan Pattnayak, Dun Li Chan, Sherissa Caren Djuniwar, Hee Ming Shan Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11490v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

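Scoring rationale: The paper focuses on regional adaptation and human-centric alignment of multimodal vision-language models (VLMs) and is unrelated to most keywords about text-only large-model techniques. The relevant keywords are: 1) “Instruction Tuning OR Alignment OR Value Alignment” (10 points): the paper proposes the “Anthropogenic Regional Alignment” paradigm and directly studies regional value alignment, which is its core content; 2) “Model Merging OR Model Soups OR Weight Averaging” (10 points): the proposed GG-EZ method uses model merging, which is its core method; 3) “Pre-training OR Continual Pre-training OR Domain Adaptation” (5 points): regional adaptation is related to domain adaptation to some extent. Other keywords such as LLMs, MoE, SFT, and RAG are not involved and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new paradigm called “Anthropogenic Regional Adaptation”, which aims to optimize the alignment of multimodal vision-language models with specific regional cultures while preserving global generalization. Its GG-EZ method achieves 5-15% gains in cultural relevance in a Southeast Asia case study.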

Abstract

While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple, but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures: large vision-language models, text-to-image diffusion models, and vision-language embedding models, and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.

Keywords: Vision-Language Models, Regional Adaptation, Value Alignment, Model Merging, Cultural Relevance, Multimodal AI, Geographical Generalization, Human-centric Alignment
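
The abstract says GG-EZ relies on model merging but does not say which merging rule it uses. As a generic illustration only, the sketch below shows plain linear weight interpolation between a global model and a regionally adapted one. The function name, the toy parameter dictionaries, and the choice of linear interpolation are all assumptions, not the paper's procedure.

```python
def merge_weights(global_model, regional_model, alpha=0.5):
    """Linear interpolation per parameter: (1 - alpha) * global + alpha * regional.

    Both models are dicts mapping a parameter name to a flat list of floats.
    """
    if global_model.keys() != regional_model.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [(1 - alpha) * g + alpha * r
               for g, r in zip(global_model[name], regional_model[name])]
        for name in global_model
    }

# Hypothetical two-parameter "models".
global_model = {"w": [0.0, 2.0]}
regional_model = {"w": [1.0, 4.0]}
merged = merge_weights(global_model, regional_model, alpha=0.5)
print(merged)  # {'w': [0.5, 3.0]}
```

With `alpha` closer to 0 the merged model stays nearer the global weights, and closer to 1 it leans toward the regional ones, which is one simple way to trade regional relevance against global performance.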

87. ❌ On the Complexity of the Discussion-based Semantics in Abstraction Argumentation

Authors: Lydia Blümel, Kai Sauerwald, Kenneth Skiba, Matthias Thimm Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11480v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

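Scoring rationale: The paper studies the complexity of the discussion-based semantics in abstract argumentation, which belongs to formal logic and computational complexity theory and is unrelated to every scoring keyword (all of which concern large models, deep learning, and their principles, applications, or optimization). It mentions no large model, deep learning, AI technique, or AI-for-science application; its core content is graph theory, automata theory, and complexity analysis.

!!! tip deepseek-chat TL;DR

The paper proves that, in abstract argumentation, deciding whether one argument is stronger than another under the discussion-based semantics of Amgoud and Ben-Naim is decidable in polynomial time, by reducing the problem to the equivalence problem for semiring automata.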

Abstract

We show that deciding whether an argument a is stronger than an argument b with respect to the discussion-based semantics of Amgoud and Ben-Naim is decidable in polynomial time. At its core, this problem is about deciding whether, for two vertices in a graph, the number of walks of each length ending in those vertices is the same. We employ results from automata theory and reduce this problem to the equivalence problem for semiring automata. This offers a new perspective on the computational complexity of ranking semantics, an area in which the complexity of many semantics remains open.

Keywords: Argumentation Theory, Discussion-based Semantics, Computational Complexity, Polynomial Time Decidability, Graph Walks, Semiring Automata, Ranking Semantics
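
The core question in the abstract, whether two vertices have the same number of incoming walks of every length, can be illustrated directly on a small attack graph. This sketch is not the paper's reduction to semiring automata. It counts walks by dynamic programming and relies on the Cayley-Hamilton theorem to stop at length n - 1 for a graph with n vertices: the difference of the two counts is a linear recurrence of order at most n, so agreement on lengths 0 to n - 1 implies agreement on all lengths. The function names and example graphs are illustrative.

```python
def walk_counts_into(adj, num_lengths):
    """counts[k][v] = number of walks of length k that end in vertex v.

    `adj[u]` lists the vertices v with an edge u -> v.
    """
    n = len(adj)
    counts = [[1] * n]  # exactly one walk of length 0 ends in each vertex
    for _ in range(num_lengths):
        prev, cur = counts[-1], [0] * n
        for u in range(n):
            for v in adj[u]:
                cur[v] += prev[u]  # extend each walk ending in u by the edge u -> v
        counts.append(cur)
    return counts

def same_walk_profile(adj, a, b):
    """True if a and b have equally many incoming walks of every length."""
    counts = walk_counts_into(adj, len(adj) - 1)
    return all(row[a] == row[b] for row in counts)

fork = [[1, 2], [], []]   # vertex 0 attacks both 1 and 2
chain = [[1], [2], []]    # 0 attacks 1, 1 attacks 2
print(same_walk_profile(fork, 1, 2))   # True
print(same_walk_profile(chain, 1, 2))  # False
```

In the chain, vertices 1 and 2 each have one incoming walk of length 1, but only vertex 2 has one of length 2, so their profiles differ. Python's arbitrary-precision integers keep the counts exact even when they grow quickly with walk length.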

88. ❌ OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems

Authors: Kun Liu, Liqun Chen Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11477v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

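Scoring rationale: The paper's core topic is the alignment of LLM-based multi-agent systems. It proposes OOM-RL as an alternative to conventional RLHF/RLAIF, using capital depletion in financial markets as an objective penalty mechanism. It is therefore highly relevant to “LLM Agents/Autonomous Agents”, “Multi-agent Systems”, “Alignment”, and “RLHF/RLAIF” (10 points each). The paper explicitly addresses hallucination, so it is somewhat relevant to “Hallucination Mitigation” (5 points). It is built on LLMs and is highly relevant to “Large Language Models” (10 points). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not involved and receive 0.

!!! tip deepseek-chat TL;DR

To address evaluator epistemic uncertainty and model sycophancy when aligning LLM-based multi-agent systems for autonomous software engineering, the paper proposes OOM-RL, an objective alignment paradigm that deploys agents into live financial markets and uses capital depletion as an un-hackable negative gradient. The final system reaches a stable equilibrium with an annualized Sharpe ratio of 2.06.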

Abstract

The alignment of Multi-Agent Systems (MAS) for autonomous software engineering is constrained by evaluator epistemic uncertainty. Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy, while execution-based environments suffer from adversarial “Test Evasion” by unconstrained agents. In this paper, we introduce an objective alignment paradigm: Out-of-Money Reinforcement Learning (OOM-RL). By deploying agents into the non-stationary, high-friction reality of live financial markets, we utilize critical capital depletion as an un-hackable negative gradient. Our longitudinal 20-month empirical study (July 2024 – February 2026) chronicles the system’s evolution from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture. We demonstrate that the undeniable ontological consequences of financial loss forced the MAS to abandon overfitted hallucinations in favor of the Strict Test-Driven Agentic Workflow (STDAW), which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified $\geq 95\%$ code coverage constraint matrix. Our results show that while early iterations suffered severe execution decay, the final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. We conclude that substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments, laying the groundwork for generalized paradigms where computational billing acts as an objective physical constraint.

Keywords: Multi-Agent Systems, LLM Alignment, Reinforcement Learning, OOM-RL, Financial Markets, Agentic Workflow, Hallucination Mitigation, Economic Penalties

89. ❌ From Attribution to Action: A Human-Centered Application of Activation Steering

Authors: Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11467v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

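Scoring rationale: The paper focuses on explainable AI (XAI) and activation steering for computer vision models and is unrelated to most large-model keywords. It is highly relevant only to “Mechanistic Interpretability OR Explainable AI” (10 points), since its core is XAI methods, activation steering, and interpretability tooling. The other keywords concern large-model training, inference, alignment, compression, and applications, none of which appear in the paper.

!!! tip deepseek-chat TL;DR

The paper studies how combining SAE-based attribution with activation steering can make explainable AI for computer vision models more actionable. Expert interviews show that the approach enables a shift from inspection to intervention, but carries risks such as ripple effects.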

Abstract

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

Keywords: Explainable AI, Activation Steering, Interpretability, Vision Models, Debugging, SAE-based Attribution, Human-Centered, Interactive Workflow

Authors: Juhoon Lee, Joseph Seering Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11466v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

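Scoring rationale: The paper's core topic is the use of LLM agents in social science simulation, with a focus on simulation validation methods, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). It does not cover the technical details or applications behind the other keywords, such as MoE, training methods, inference optimization, or AI for science, so those keywords score 0.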

!!! tip deepseek-chat TL;DR

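The paper addresses the validity crisis facing LLM agents in generative social science simulation and proposes the SLALOM framework, which evaluates process fidelity rather than verifying outcomes. It uses Dynamic Time Warping to quantify the structural realism of simulated trajectories and so distinguish plausible social dynamics from stochastic noise.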

Abstract (Chinese translation)

大语言模型（LLM）智能体为生成式社会科学提供了一条可能具有变革性的发展路径，但也面临着严峻的效度危机。当前的模拟评估方法存在“停摆时钟”问题：它们仅确认模拟达到了正确的最终结果，却忽略了导致该结果的过程轨迹是否具有社会学意义上的合理性。由于大语言模型的内部推理过程不透明，验证社会机制的“黑箱”仍然是一个持续存在的挑战。本文提出SLALOM（通过纵向观测指标的模拟生命周期分析）框架，该框架将验证重点从结果核实转向过程保真度。借鉴模式导向建模（Pattern-Oriented Modeling, POM）思想，SLALOM将社会现象视为必须穿越特定SLALOM门控（即代表不同阶段的中间路径点约束）的多变量时间序列。通过利用动态时间规整（Dynamic Time Warping, DTW）技术将模拟轨迹与经验事实对齐，SLALOM提供了一种量化指标来评估结构真实性，有助于区分合理的社会动态与随机噪声，并为建立更稳健的政策模拟标准做出贡献。

Abstract

Large Language Model (LLM) agents offer a potentially-transformative path forward for generative social science but face a critical crisis of validity. Current simulation evaluation methodologies suffer from the “stopped clock” problem: they confirm that a simulation reached the correct final outcome while ignoring whether the trajectory leading to it was sociologically plausible. Because the internal reasoning of LLMs is opaque, verifying the “black box” of social mechanisms remains a persistent challenge. In this paper, we introduce SLALOM (Simulation Lifecycle Analysis via Longitudinal Observation Metrics), a framework that shifts validation from outcome verification to process fidelity. Drawing on Pattern-Oriented Modeling (POM), SLALOM treats social phenomena as multivariate time series that must traverse specific SLALOM gates, or intermediate waypoint constraints representing distinct phases. By utilizing Dynamic Time Warping (DTW) to align simulated trajectories with empirical ground truth, SLALOM offers a quantitative metric to assess structural realism, helping to differentiate plausible social dynamics from stochastic noise and contributing to more robust policy simulation standards.

Keywords: Large Language Model agents, social simulation, validation framework, process fidelity, Pattern-Oriented Modeling, Dynamic Time Warping, structural realism, policy simulation
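
The DTW alignment described in the SLALOM abstract can be illustrated with a minimal sketch. This is the textbook dynamic-programming form of Dynamic Time Warping, not the authors' implementation; the function name `dtw_distance` and the toy trajectories are hypothetical, and the SLALOM gate constraints are not modeled here.

```python
from math import dist, inf

def dtw_distance(sim, ref):
    """Textbook DTW cost between two multivariate time series (lists of points)."""
    n, m = len(sim), len(ref)
    # cost[i][j] = cheapest alignment of sim[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(sim[i - 1], ref[j - 1])  # Euclidean distance between points
            cost[i][j] = step + min(
                cost[i - 1][j],      # sim advances, ref waits
                cost[i][j - 1],      # ref advances, sim waits
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Hypothetical two-variable trajectories (illustrative values only).
empirical = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (3.0, 1.0)]
simulated_delayed = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (3.0, 1.0)]
simulated_shuffled = [(3.0, 1.0), (0.0, 0.0), (2.0, 1.0), (1.0, 0.5)]

print(dtw_distance(simulated_delayed, empirical))
print(dtw_distance(simulated_shuffled, empirical))
```

A lower cost means the simulated trajectory follows the empirical one more closely once timing differences are warped away: the delayed trajectory aligns at zero cost, while the shuffled one, which visits the same points in the wrong order, does not.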

91. ❌ Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

Authors: S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis, Antonios Saravanos Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11465v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

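Scoring rationale: The paper's core topic is improving a small LLM (Qwen3-8B) on tool-use tasks through inference-time scaffolding, in which the same model plays three roles, with no additional training. Highly relevant keywords: LLMs (the core object of study), SLMs (a small model is studied), Self-Correction (the pipeline includes a correction-model role), LLM Agents (an agent system is studied), Tool Use (evaluation is on tool-use tasks), and Quantization (a 4-bit quantized configuration is evaluated). Moderately relevant keywords: Context Window Extension (a 32K context is involved), Chain of Thought (multi-step reasoning is involved), Multi-agent Systems (one model coordinates across several roles), and In-context Learning (the method relies on context compression and conditioning). The remaining keywords, including MoE, Scaling Laws, training methods, alignment, RAG, attention optimization, MCTS, hallucination mitigation, interpretability, world models, and model merging, are not covered.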

!!! tip deepseek-chat TL;DR

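The paper studies how inference-time scaffolding, in which the same small language model plays three distinct roles, improves a small LLM agent on complex tool-use tasks without any additional training compute. The scaffolding roughly doubles task goal completion, and under full-precision inference the scaffolded 8B model (8.9%) surpasses DeepSeek-Coder 33B Instruct (7.1%) from the original AppWorld evaluation.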

Abstract (Chinese translation)

大语言模型（LLM）智能体在现实工具使用任务中展现出潜力，但在有限硬件上部署高性能智能体仍具挑战。本研究探讨仅通过推理时架构支持（无需任何额外训练计算）能否提升小型模型在复杂多步环境中的表现。我们在单张24GB GPU上评估了Qwen3-8B在全精度（FP16，12K上下文）和4位量化（AWQ，32K上下文）两种配置下的性能。未经干预时，原始模型仅达成5.4%（FP16）和3.0%（AWQ）的任务目标完成率。基于系统性故障模式分析，我们提出三层推理架构流水线，将同一冻结模型部署于三个不同角色：（1）摘要模型——在压缩对话历史时保留关键信息（令牌、凭证、API响应）；（2）主智能体模型——在压缩上下文上进行推理；（3）独立校正模型——在不访问对话历史的情况下审查修正智能体的代码输出，打破重复故障循环。将此架构应用于同一未修改模型后，任务完成率提升至8.9%（FP16）和5.9%（AWQ），在两种设置下均实现约两倍性能提升，其中难度1任务提升尤为显著（15.8%→26.3% FP16；5.3%→14.0% AWQ）。在全精度推理中，我们采用架构支持的80亿参数模型超越了原始AppWorld评估中DeepSeek-Coder 330亿参数Instruct模型（7.1%）的表现，证明结构化推理时干预能使小模型与规模达其4倍的系统竞争。我们将该方法形式化为冻结基模型上的架构策略——通过三种不同条件调用同一权重参数，并与强化学习中的测试时计算扩展和动作空间塑造建立理论关联。

Abstract

Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24 GB GPU, we evaluate Qwen3-8B under both full-precision (FP16, 12K context) and 4-bit quantized (AWQ, 32K context) configurations. Without any intervention, the raw model achieves just 5.4% (FP16) and 3.0% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons over the compressed context; and (3) an isolated correction model that reviews and revises the agent’s code output without access to conversation history, breaking repetitive failure loops. Applied to the same unmodified model, this scaffolding yields 8.9% (FP16) and 5.9% (AWQ) task goal completion, roughly doubling performance in both settings, with particularly strong gains on difficulty-1 tasks (15.8% → 26.3% FP16; 5.3% → 14.0% AWQ). On full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1%) from the original AppWorld evaluation, demonstrating that structured inference-time interventions can make small models competitive with systems 4× their size. We formalize the approach as a scaffolded policy over a frozen base model, three invocations of the same weights with different conditioning, drawing connections to test-time compute scaling and action-space shaping in reinforcement learning.

Keywords: LLM agents, small language models, inference-time scaffolding, tool-use tasks, quantization, self-correction, multi-step reasoning, parameter-efficient inference
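
The three-role pipeline in the abstract (summarizer, main agent, isolated corrector) can be sketched as three calls to one model with different conditioning. Everything below is an illustrative assumption rather than the authors' code: the `Model` interface (prompt in, text out), the prompt wording, the function name `scaffolded_step`, and the stub model used to trace the calls.

```python
from typing import Callable, List

# Hypothetical interface: one frozen model, called as prompt -> text.
Model = Callable[[str], str]

def scaffolded_step(model: Model, history: List[str], task: str) -> str:
    """One agent step using the same model in three roles."""
    # Role 1: compress the dialogue history while keeping critical artifacts.
    summary = model(
        "ROLE: summarizer\nKeep tokens, credentials and API responses.\n"
        + "\n".join(history)
    )
    # Role 2: the main agent reasons over the compressed context only.
    draft = model(f"ROLE: agent\nTask: {task}\nContext: {summary}")
    # Role 3: the corrector sees only the draft code, never the history.
    return model(f"ROLE: corrector\nReview and revise this code:\n{draft}")

# Stub model that records prompts so the information flow can be inspected.
calls: List[str] = []

def echo_model(prompt: str) -> str:
    calls.append(prompt)
    return f"<output {len(calls)}>"

result = scaffolded_step(echo_model, ["login ok, token=abc123"], "list my playlists")
print(result)
print("corrector saw history:", "abc123" in calls[2])
```

The point of the stub is the isolation property: the raw history reaches the summarizer only, and the corrector's prompt contains nothing but the agent's draft, which is what the abstract credits with breaking repetitive failure loops.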

92. ❌ Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

Authors: Xiaozhe Li, Tianyi Lyu, Yizhao Yang, Liang Shan, Siyi Yang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu, Yang Li Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11462v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

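Scoring rationale: The paper's core topic is context management for LLM agents on long-horizon tasks. It proposes training a lightweight policy model with reinforcement learning to actively curate context, reducing noise while preserving reasoning anchors. It is highly relevant to 'LLM Agents' (10 points) and 'Large Language Models' (10 points), directly addresses challenges related to 'Context Window Extension' (8 points), and touches on multi-step reasoning and in-depth reasoning (5 points each). The other keywords, such as MoE, SLMs, training methods, compression techniques, and AI for science, are not covered and score 0.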

!!! tip deepseek-chat TL;DR

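The paper targets the performance degradation that LLM agents suffer on long-horizon tasks as context noise accumulates. It proposes a framework that trains a lightweight context management model with reinforcement learning, raising task success rates while reducing token consumption (by 8.8% on WebArena and by a factor of 8 on DeepSearch).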

Abstract (Chinese translation)

大型语言模型（LLM）在处理长程任务时，因“上下文瓶颈”和“中间迷失”现象而面临挑战——冗长环境中积累的噪声会损害多轮交互中的推理能力。为解决此问题，我们引入了一种共生框架，将上下文管理与任务执行解耦。该架构将一个轻量级、专门化的策略模型ContextCurator与一个强大的冻结基础模型TaskExecutor配对。通过强化学习训练，ContextCurator主动降低工作记忆中的信息熵，它能积极修剪环境噪声，同时保留推理锚点——即对未来推理至关重要的稀疏数据点。在WebArena上，我们的框架将Gemini-3.0-flash的成功率从36.4%提升至41.2%，同时将令牌消耗降低8.8%（从47.4K降至43.3K）。在DeepSearch上，该框架实现了57.1%的成功率（对比基准53.9%），同时将令牌消耗降低至原来的八分之一。值得注意的是，一个70亿参数的ContextCurator在上下文管理性能上可匹配GPT-4o，为自主长程智能体提供了一种可扩展且计算高效的范式。

Abstract

Large Language Models (LLMs) struggle with long-horizon tasks due to the “context bottleneck” and the “lost-in-the-middle” phenomenon, where accumulated noise from verbose environments degrades reasoning over multi-turn interactions. To address this issue, we introduce a symbiotic framework that decouples context management from task execution. Our architecture pairs a lightweight, specialized policy model, ContextCurator, with a powerful frozen foundation model, TaskExecutor. Trained via reinforcement learning, ContextCurator actively reduces information entropy in the working memory. It aggressively prunes environmental noise while preserving reasoning anchors, that is, sparse data points that are critical for future deductions. On WebArena, our framework improves the success rate of Gemini-3.0-flash from 36.4% to 41.2% while reducing token consumption by 8.8% (from 47.4K to 43.3K). On DeepSearch, it achieves a 57.1% success rate, compared with 53.9%, while reducing token consumption by a factor of 8. Remarkably, a 7B ContextCurator matches the context management performance of GPT-4o, providing a scalable and computationally efficient paradigm for autonomous long-horizon agents.

Keywords: LLM Agents, Context Management, Reinforcement Learning, Long-horizon Tasks, Context Bottleneck, Working Memory, Information Entropy, Reasoning Anchors

93. ❌ Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

Authors: Zhipeng Chen, Tao Qian, Wayne Xin Zhao, Ji-Rong Wen Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11446v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

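Scoring rationale: The paper's core topic is accelerating RLVR training for LLMs. It directly involves 'Large Language Models' and 'PEFT/LoRA' (the model is trained with LoRA), so both score 10. 'Speculative Decoding OR Inference Acceleration' is related through training acceleration but is not central, so it scores 5. Other keywords, such as MoE, SFT, and RAG, are not covered and score 0.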

!!! tip deepseek-chat TL;DR

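The paper proposes NExt, a nonlinear low-rank trajectory modeling framework for accelerating RLVR training of large language models. It models the trajectory of parameter updates and extrapolates it, reducing computational overhead by approximately 37.5%.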

Abstract (Chinese translation)

近期，基于可验证奖励的强化学习规模化训练已成为显著提升大语言模型能力的高效范式，该范式需引导模型进行大量探索与学习，导致计算开销巨大并成为关键挑战。为减少训练步数，已有研究采用模型参数的线性外推方法。然而，当前对可验证奖励强化学习训练过程中模型参数更新的动态机制仍缺乏充分理解。为深入探究大语言模型在此类训练中的演化规律，我们通过实证实验发现：模型的秩-1子空间并不呈线性演化，且在低秩自适应训练过程中其相对于原始参数的主导作用会进一步放大。基于上述发现，我们提出一种名为NExt的低秩轨迹非线性外推框架，该框架以非线性方式建模并外推低秩参数轨迹。具体而言，我们首先使用低秩自适应技术训练模型，并在多个训练步骤中提取参数差异的秩-1子空间，随后将其用于非线性外推。接着，我们利用提取的秩-1子空间训练预测器，该预测器能够建模可验证奖励强化学习过程中的参数更新轨迹，并通过“预测-扩展”流程实现模型参数的外推，从而加速训练进程。为深入研究和理解NExt方法，我们开展了系统性实验，验证了该方法的有效性与鲁棒性。我们的方法在保持与多种可验证奖励强化学习算法及任务广泛兼容的同时，可降低约37.5%的计算开销。代码已发布于https://github.com/RUCAIBox/NExt。

Abstract

Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities. It requires guiding the model to perform extensive exploration and learning, which leads to substantial computational overhead and has become a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters. However, the dynamics of model parameter updates during RLVR training remain insufficiently understood. To further investigate the evolution of LLMs during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and its dominance over the original parameters is further amplified during LoRA training. Based on the above insights, we propose the Nonlinear Extrapolation of low-rank trajectories (NExt), a novel framework that models and extrapolates low-rank parameter trajectories in a nonlinear manner. Concretely, we first train the model using LoRA and extract the rank-1 subspace of parameter differences at multiple training steps, which is then used for the subsequent nonlinear extrapolation. Afterward, we use the extracted rank-1 subspace to train a predictor, which can model the trajectory of parameter updates during RLVR, and then perform the predict-extend process to extrapolate model parameters, achieving the acceleration of RLVR. To further study and understand NExt, we conduct comprehensive experiments that demonstrate the effectiveness and robustness of the method. Our method reduces computational overhead by approximately 37.5% while remaining compatible with a wide range of RLVR algorithms and tasks. We release our code at https://github.com/RUCAIBox/NExt.

Keywords: Large Language Models, RLVR, LoRA, parameter trajectories, nonlinear extrapolation, training acceleration, low-rank optimization, computational overhead reduction
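
The step "extract the rank-1 subspace of parameter differences" can be illustrated with a small sketch. It assumes that "rank-1 subspace" means the top singular triplet of the difference between two checkpoints, which the abstract does not spell out, and it uses plain power iteration on a toy 2×2 matrix. The helper names and checkpoint values are hypothetical, and the NExt predictor and predict-extend steps are not shown.

```python
from math import sqrt

def matvec(a, x):
    """Compute A @ x for a list-of-lists matrix."""
    return [sum(p * q for p, q in zip(row, x)) for row in a]

def rmatvec(a, y):
    """Compute A.T @ y for a list-of-lists matrix."""
    return [sum(a[i][j] * y[i] for i in range(len(a))) for j in range(len(a[0]))]

def normalize(x):
    length = sqrt(sum(v * v for v in x))
    return [v / length for v in x], length

def top_singular_triplet(delta, iters=50):
    """Power iteration for the leading (sigma, u, v) of a nonzero matrix."""
    v, _ = normalize([1.0] * len(delta[0]))
    for _ in range(iters):
        u, _ = normalize(matvec(delta, v))
        v, _ = normalize(rmatvec(delta, u))
    u, sigma = normalize(matvec(delta, v))
    return sigma, u, v

# Hypothetical weights at two training steps (illustrative values only).
w_step0 = [[1.0, 1.0], [1.0, 1.0]]
w_step1 = [[4.0, 5.0], [7.0, 9.0]]
delta = [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(w_step0, w_step1)]

sigma, u, v = top_singular_triplet(delta)
rank1 = [[sigma * ui * vj for vj in v] for ui in u]  # best rank-1 approximation
print(sigma, u, v)
```

In this toy case the difference matrix is exactly rank 1, so `rank1` reproduces `delta`. For real weight differences the rank-1 term is only an approximation, and a production implementation would more likely call a library SVD routine.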

94. ❌ Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books

Authors: Argyrios Papoudakis, Mirella Lapata, Frank Keller Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11435v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

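Scoring rationale: The paper's core topic is character description generation by LLMs over long-form narratives, and it proposes a QA-guided reasoning framework to improve generation quality. Highly relevant keywords: LLMs (the core models), Chain of Thought (the reasoning mechanism), Long Context LLMs (long-text processing), System 2 Thinking (in-depth reasoning), and Hallucination Mitigation (improved faithfulness). Moderately relevant keywords: Post-training (a training framework is involved) and Explainable AI (structured reasoning is interpretable). The remaining keywords are unrelated to the paper.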

!!! tip deepseek-chat TL;DR

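The paper addresses the challenge of generating character descriptions from long-form narratives and proposes a QA-guided reasoning framework that separates reasoning from generation, improving the faithfulness, informativeness, and grounding of the generated descriptions.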

Abstract (Chinese translation)

角色描述生成是面向叙事应用（如摘要生成、故事分析与角色驱动模拟）的重要能力。然而，从长篇叙事文本（例如小说）中生成准确的角色描述具有挑战性：模型需要追踪动态演变的属性（如人际关系与事件）、整合分散在文本中的证据，并推断隐含细节。尽管具备推理能力的大语言模型（LLMs）在许多基准测试中表现优异，但我们发现，在角色描述生成任务中，当禁用其内置推理机制（即采用空推理轨迹）时，其性能反而有所提升。受此启发，我们提出一种将推理与生成解耦的训练框架。该方法可应用于长上下文大语言模型或基于文本分块的方法之上，其核心包含一个生成结构化问答（QA）推理轨迹的推理模型，以及一个基于该轨迹生成最终角色描述的生成模型。在两个数据集（BookWorm与CroSS）上的实验表明，相较于强大的长上下文基线模型，问答引导的推理机制在忠实度、信息量与文本依据方面均实现了提升。

Abstract

Character description generation is an important capability for narrative-focused applications such as summarization, story analysis, and character-driven simulations. However, generating accurate character descriptions from long-form narratives (e.g., novels) is challenging: models must track evolving attributes (e.g., relationships and events), integrate evidence scattered across the text, and infer implicit details. Despite the success of reasoning-enabled LLMs on many benchmarks, we find that for character description generation their performance improves when built-in reasoning is disabled (i.e., an empty reasoning trace). Motivated by this, we propose a training framework that decouples reasoning from generation. Our approach, which can be applied on top of long-context LLMs or chunk-based methods, consists of a reasoning model that produces a structured QA reasoning trace and a generation model that conditions on this trace to produce the final character description. Experiments on two datasets (BookWorm and CroSS) show that QA-guided reasoning improves faithfulness, informativeness, and grounding over strong long-context baselines.

Keywords: character description generation, long-form narratives, reasoning-enabled LLMs, QA-guided reasoning, faithfulness, informativeness, grounding, long-context LLMs

95. ❌ Hardening x402: PII-Safe Agentic Payments via Pre-Execution Metadata Filtering

Authors: Vladimir Stantchev Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11430v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

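Scoring rationale: The paper presents a PII-safe filtering middleware for AI agents using the x402 payment protocol and is unrelated to the vast majority of the LLM technology keywords (LLM architecture, training methods, inference optimization, and so on). It has only a weak connection to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5 points): the paper involves payment workflows for AI agents, but it neither uses LLMs nor discusses agentic workflow techniques themselves. No other keyword is covered.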

!!! tip deepseek-chat TL;DR

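The paper presents presidio-hardened-x402, an open-source middleware that detects and redacts personally identifiable information (PII) before a request is transmitted when AI agents pay via the x402 protocol, improving payment security. On a synthetic dataset the recommended configuration reaches a micro-F1 of 0.894 at a p99 latency of 5.73 ms.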

Abstract (Chinese translation)

通过x402协议支付资源的AI代理会在每个HTTP支付请求中嵌入支付元数据——资源URL、描述信息和原因字符串。这些元数据在链上结算发生前，会传输至支付服务器和中心化的协调器API；双方通常不受数据处理协议的约束。我们提出了presidio-hardened-x402，这是首个开源中间件，可在x402支付请求传输前进行拦截，以检测并编辑个人可识别信息（PII）、执行声明式支出策略，并阻止重复的重放尝试。为评估PII过滤器，我们构建了一个包含2,000条x402元数据三元组的标注合成语料库，涵盖七个用例类别，并在两种检测模式（正则表达式、自然语言处理）和五个置信度阈值下进行了42种配置的精确率/召回率扫描。推荐配置（模式=自然语言处理，最小分数=0.4，所有实体类型）实现了微平均F1分数=0.894，精确率达0.972，其p99延迟为5.73毫秒——完全在50毫秒的开销预算内。该中间件、语料库及所有实验代码已在https://github.com/presidio-v/presidio-hardened-x402公开提供。

Abstract

AI agents that pay for resources via the x402 protocol embed payment metadata - resource URLs, descriptions, and reason strings - in every HTTP payment request. This metadata is transmitted to the payment server and to the centralised facilitator API before any on-chain settlement occurs; neither party is typically bound by a data processing agreement. We present presidio-hardened-x402, the first open-source middleware that intercepts x402 payment requests before transmission to detect and redact personally identifiable information (PII), enforce declarative spending policies, and block duplicate replay attempts. To evaluate the PII filter, we construct a labeled synthetic corpus of 2,000 x402 metadata triples spanning seven use-case categories, and run a 42-configuration precision/recall sweep across two detection modes (regex, NLP) and five confidence thresholds. The recommended configuration (mode=nlp, min_score=0.4, all entity types) achieves micro-F1 = 0.894 with precision 0.972, at a p99 latency of 5.73ms - well within the 50ms overhead budget. The middleware, corpus, and all experiment code are publicly available at https://github.com/presidio-v/presidio-hardened-x402.

Keywords: AI agents, x402 protocol, PII detection, payment security, middleware, metadata filtering, privacy protection, agentic payments
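
The abstract reports micro-F1 and precision but not recall. As a worked check, recall can be recovered by inverting the F1 definition, assuming the reported precision is micro-averaged like the F1 score (the abstract does not say so explicitly). The function name `recall_from_f1` is hypothetical.

```python
def recall_from_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    return f1 * precision / (2 * precision - f1)

precision, f1 = 0.972, 0.894  # values reported in the abstract
recall = recall_from_f1(precision, f1)
print(f"implied recall: {recall:.3f}")

# Share of the 50 ms overhead budget used by the reported p99 latency.
latency_share = 5.73 / 50.0
print(f"p99 latency share of budget: {latency_share:.1%}")
```

Under that assumption the implied recall is about 0.828, so the recommended configuration favors precision over recall: it rarely redacts text that is not PII, and it misses roughly one in six PII entities.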

96. ❌ METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues

Authors: Haofu Yang, Jiaji Liu, Chen Huang, Faguo Wu, Wenqiang Lei, See-Kiong Ng Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11427v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

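Scoring rationale: METRO uses large language models to autonomously induce strategies from expert dialogue transcripts in order to build non-collaborative dialogue agents. LLMs are central to the strategy induction (highly relevant, 10 points), and the work falls within LLM Agents research (highly relevant, 10 points). The paper does not cover the technical details behind the other keywords, such as MoE, quantization, inference acceleration, or alignment, so those keywords score 0.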

!!! tip deepseek-chat TL;DR

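The paper proposes METRO, a method that uses large language models to autonomously induce strategy actions and planning logic from raw dialogue transcripts in order to build non-collaborative dialogue agents. Experiments on two benchmarks show that it outperforms existing methods by 9%-10% on average.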

Abstract (Chinese translation)

传统非协作对话智能体的开发通常依赖专家策略的人工编码，这种方式难以规模化。我们提出METRO方法，利用大语言模型直接从原始对话文本中自主归纳策略行为与规划逻辑。该方法将专家知识形式化为“策略森林”——一种能同时捕捉短期回应（节点）与长期战略前瞻（分支）的层次化结构。在两个基准测试上的实验结果表明，METRO展现出优越性能，平均超越现有方法9%-10%。进一步分析不仅揭示了METRO的成功机制（策略行为多样性与前瞻性），还验证了其强大的跨任务迁移能力。这为以经济高效、可扩展的方式构建非协作智能体提供了新思路。代码已开源：https://github.com/Humphrey-0125/METRO。

Abstract

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies. We propose METRO, a method that leverages large language models to autonomously induce both strategy actions and planning logic directly from raw transcripts. METRO formalizes expert knowledge into a Strategy Forest, a hierarchical structure that captures both short-term responses (nodes) and long-term strategic foresight (branches). Experimental results across two benchmarks show that METRO demonstrates promising performance, outperforming existing methods by an average of 9%-10%. Our further analysis not only reveals the success behind METRO (strategic behavioral diversity and foresight), but also demonstrates its robust cross-task transferability. This offers new insights into building non-collaborative agents in a cost-effective and scalable way. Our code is available at https://github.com/Humphrey-0125/METRO.

Keywords: non-collaborative dialogue agents, strategy induction, large language models, expert dialogue transcripts, Strategy Forest, planning logic, cross-task transferability, autonomous agents

97. ❌ Emulating Non-Differentiable Metrics via Knowledge-Guided Learning: Introducing the Minkowski Image Loss

Authors: Filippo Quarenghi, Ryan Cotsakis, Tom Beucler Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11422v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

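Scoring rationale: The paper addresses non-differentiable metrics in Earth system deep learning. It proposes two ways to handle non-differentiable functions, analytical approximation and neural emulators, and develops the Minkowski image loss. Its core is deep learning applied to scientific computing (meteorology in particular), which falls under "AI for Science", so that keyword receives 10 points. All other keywords concern LLM-specific techniques, training methods, inference optimization, agent systems, and so on. The paper does not involve LLMs, language processing, or related techniques and only applies convolutional neural networks to image data, so the other 26 keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper tackles the "differentiability gap" caused by non-differentiable scientific metrics in Earth system deep learning. Using two methods, analytical approximation and neural emulators, it builds a differentiable Minkowski image loss. The constrained neural surrogate achieves high emulation accuracy, but applying it to precipitation-field super-resolution reveals a trade-off between regularization and texture recovery.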

摘要翻译

“可微性鸿沟”是地球系统深度学习的主要瓶颈：由于模型无法直接在不可微的科学指标上训练，而必须依赖平滑代理指标（如均方误差），它们往往无法捕捉高频细节，产生“模糊”的输出。我们开发了一个框架，通过两种处理不可微函数的方法来弥合这一鸿沟：第一种是通过解析方法将原始不可微函数近似为可微的等效函数；第二种是为科学泛函学习可微的代理模型。我们通过使用温度控制的S型函数和连续逻辑算子来松弛离散拓扑运算，从而构建解析近似。相反，我们的神经仿真器采用利普希茨约束卷积神经网络，通过以下方式稳定梯度学习：(1) 使用谱归一化来约束利普希茨常数；(2) 强制实施几何原理的硬性架构约束。我们通过开发闵可夫斯基图像损失来展示该框架的实用性，该损失是地表降水场积分几何度量（面积、周长、连通分量）的可微等效形式。在EUMETNET OPERA数据集上的验证表明，我们带约束的神经代理模型实现了高仿真精度，完全消除了在无约束基线中观察到的几何违规现象。然而，将这些可微代理模型应用于确定性超分辨率任务时，揭示了一个根本性的权衡：虽然严格的利普希茨正则化确保了优化稳定性，但它本质上会过度平滑梯度信号，从而限制了对高度局地化对流纹理的恢复。这项工作强调了将此类拓扑约束与随机生成架构相结合的必要性，以实现完整的形态真实性。

摘要 (Abstract)

The "differentiability gap" presents a primary bottleneck in Earth system deep learning: since models cannot be trained directly on non-differentiable scientific metrics and must rely on smooth proxies (e.g., MSE), they often fail to capture high-frequency details, yielding "blurry" outputs. We develop a framework that bridges this gap using two different methods to deal with non-differentiable functions: the first is to analytically approximate the original non-differentiable function into a differentiable equivalent one; the second is to learn differentiable surrogates for scientific functionals. We formulate the analytical approximation by relaxing discrete topological operations using temperature-controlled sigmoids and continuous logical operators. Conversely, our neural emulator uses Lipschitz-convolutional neural networks to stabilize gradient learning via: (1) spectral normalization to bound the Lipschitz constant; and (2) hard architectural constraints enforcing geometric principles. We demonstrate this framework's utility by developing the Minkowski image loss, a differentiable equivalent for the integral-geometric measures of surface precipitation fields (area, perimeter, connected components). Validated on the EUMETNET OPERA dataset, our constrained neural surrogate achieves high emulation accuracy, completely eliminating the geometric violations observed in unconstrained baselines. However, applying these differentiable surrogates to a deterministic super-resolution task reveals a fundamental trade-off: while strict Lipschitz regularization ensures optimization stability, it inherently over-smooths gradient signals, restricting the recovery of highly localized convective textures. This work highlights the necessity of coupling such topological constraints with stochastic generative architectures to achieve full morphological realism.

Keywords: differentiability gap, non-differentiable metrics, Earth system deep learning, Minkowski image loss, Lipschitz-convolutional neural networks, topological constraints, super-resolution, geometric violations
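
The analytical relaxation described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a regular pixel grid and uses a temperature-controlled sigmoid in place of the hard threshold; the function names, the 4-neighbour perimeter estimate, and the omission of the connected-components term are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np


def soft_threshold(field, threshold, temperature):
    """Sigmoid relaxation of the hard indicator `field > threshold`.

    A smaller temperature gives a sharper (closer to binary) mask.
    """
    return 1.0 / (1.0 + np.exp(-(field - threshold) / temperature))


def soft_area(mask):
    """Relaxed area: the sum of soft pixel memberships."""
    return mask.sum()


def soft_perimeter(mask):
    """Relaxed perimeter: total absolute difference between 4-neighbours.

    Zero padding makes edges that touch the image border count as well.
    """
    padded = np.pad(mask, 1)
    vertical = np.abs(np.diff(padded, axis=0)).sum()
    horizontal = np.abs(np.diff(padded, axis=1)).sum()
    return vertical + horizontal


# Toy precipitation field: a 2x2 wet block inside a dry 6x6 grid.
field = np.zeros((6, 6))
field[2:4, 2:4] = 10.0
mask = soft_threshold(field, threshold=5.0, temperature=0.01)
area = soft_area(mask)
perimeter = soft_perimeter(mask)
```

With a low temperature the toy block recovers an area close to 4 pixels and a perimeter close to 8 pixel edges. Raising the temperature smooths the mask and its gradients at the cost of blurring the geometry.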

98. ❌ Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

Authors: Edwin C. Montiel-Vazquez, Christian Arzate Cruz, Stefanos Gkikas, Thomas Kassiotis, Giorgos Giannakakis, Randy Gomez Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11417v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

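Scoring rationale: The paper studies iconic gesture prediction for robot co-speech, using a lightweight Transformer to generate gestures from text and emotion. It is a cross-disciplinary application of robotics and natural language processing. All scoring keywords focus on the technical principles, training methods, optimization techniques, and application paradigms (such as agents) of LLMs, or on specific scientific application areas (such as bioinformatics). Although the paper involves the Transformer architecture and text input, its core is the specific task of robot gesture generation. It does not involve technical innovation in LLMs themselves, training, alignment, inference optimization, or LLM applications in science. It therefore has no direct connection to any keyword, and every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a lightweight Transformer that predicts the placement and intensity of iconic gestures for robot co-speech from text and emotion alone (no audio required). On the BEAT2 dataset it outperforms GPT-4o while remaining computationally compact and suitable for real-time deployment on embodied agents.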

摘要翻译

伴随语音的手势能提升互动参与度并增强言语理解效果。当前多数数据驱动的机器人系统仅能生成节奏性击打式动作，鲜少整合语义强调功能。为此，我们提出一种轻量级Transformer模型，该模型仅依据文本与情感信息即可推导出表意手势的位置与强度，在推理过程中无需音频输入。在BEAT2数据集上，本模型在语义手势位置分类与强度回归两项任务中均优于GPT-4o，同时保持计算紧凑性，适合在具身智能体上实现实时部署。

摘要 (Abstract)

Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.

Keywords: co-speech gesture, iconic gesture prediction, lightweight transformer, robot gesture generation, emotion-aware, real-time deployment, embodied agents, BEAT2 dataset

99. ❌ Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

Authors: Bo Li, Mingda Wang, Gexiang Fang, Shikun Zhang, Wei Ye Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11407v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

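Scoring rationale: The paper's core innovation is GRIP, a new retrieval-augmented generation (RAG) framework that embeds retrieval decisions into token-level decoding to achieve end-to-end retrieval-generation coordination. It is therefore highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (15 points). The paper applies large models to question answering, which relates to "Large Language Models OR LLMs OR Foundation Models" (10 points). GRIP supports dynamic multi-step inference and evidence integration, giving some connection to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" (8 points). The paper mentions supervising with a structured training set, a weak connection to "Post-training OR Supervised Fine-tuning OR SFT" (5 points). Other keywords, such as MoE, quantization, and AI for Science, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes GRIP, a retrieval-augmented generation framework that embeds retrieval control into the generation process for end-to-end retrieval-generation coordination. On five QA benchmarks it surpasses strong RAG baselines and is competitive with GPT-4o.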

摘要翻译

我们通过将检索控制直接嵌入生成过程，重新审视检索增强生成（RAG）。不同于将检索视为外部干预，我们将检索决策表达在词元级解码中，从而无需额外控制器或分类器即可实现端到端的协同。在“检索即生成”范式下，我们提出 GRIP（基于生成引导的信息规划检索），这是一个统一框架，模型通过发射控制词元来调节检索行为。GRIP 的核心是 自触发信息规划，它使模型能够在单一自回归轨迹中自主决定何时检索、如何重构查询以及何时终止检索。这一设计将检索与推理紧密耦合，并支持动态多步推理与实时证据整合。为了监督这些行为，我们构建了一个结构化训练集，涵盖可回答、部分可回答以及多跳查询，每种类型都与特定的词元模式对齐。在五个问答基准测试上的实验表明，GRIP 超越了强力的 RAG 基线模型，并与 GPT-4o 性能相当，同时使用的参数量显著减少。

摘要 (Abstract)

We revisit retrieval-augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token-level decoding, enabling end-to-end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose **GRIP** (**G**eneration-guided **R**etrieval with **I**nformation **P**lanning), a unified framework in which the model regulates retrieval behavior through control-token emission. Central to GRIP is *Self-Triggered Information Planning*, which allows the model to decide when to retrieve, how to reformulate queries, and when to terminate, all within a single autoregressive trajectory. This design tightly couples retrieval and reasoning and supports dynamic multi-step inference with on-the-fly evidence integration. To supervise these behaviors, we construct a structured training set covering answerable, partially answerable, and multi-hop queries, each aligned with specific token patterns. Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters.

Keywords: Retrieval-Augmented Generation, RAG, Generation-guided Retrieval, Self-Triggered Information Planning, Autoregressive Decoding, Multi-step Inference, Question Answering, End-to-end Coordination
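
The idea of regulating retrieval through control-token emission can be sketched as a small decoding loop. Everything below is a toy under stated assumptions: the control tokens `[RETRIEVE]`, `[ANSWER]`, and `[EVIDENCE]` are hypothetical names (the abstract does not list GRIP's actual tokens), and the scripted `toy_step` policy stands in for a trained language model.

```python
from typing import Callable, List

RETRIEVE = "[RETRIEVE]"  # hypothetical control token: issue a retrieval query
ANSWER = "[ANSWER]"      # hypothetical control token: stop retrieving and answer
EVIDENCE = "[EVIDENCE]"  # hypothetical marker for retrieved text


def decode_with_self_triggered_retrieval(
    step: Callable[[List[str]], str],
    retrieve: Callable[[str], str],
    question: str,
    max_steps: int = 8,
) -> List[str]:
    """Run one trajectory in which the model itself decides when to retrieve.

    `step` maps the current context to the next emitted segment. No external
    controller or classifier decides whether retrieval happens.
    """
    context = [f"Q: {question}"]
    for _ in range(max_steps):
        emitted = step(context)
        context.append(emitted)
        if emitted.startswith(RETRIEVE):
            query = emitted[len(RETRIEVE):].strip()
            context.append(f"{EVIDENCE} {retrieve(query)}")
        elif emitted.startswith(ANSWER):
            break
    return context


# Toy stand-ins for the model and the retriever.
corpus = {"first hop": "passage A", "second hop": "passage B"}


def toy_retrieve(query: str) -> str:
    return corpus.get(query, "")


def toy_step(context: List[str]) -> str:
    seen = sum(line.startswith(EVIDENCE) for line in context)
    if seen == 0:
        return f"{RETRIEVE} first hop"
    if seen == 1:
        return f"{RETRIEVE} second hop"  # reformulated follow-up query
    return f"{ANSWER} combined answer"


trace = decode_with_self_triggered_retrieval(toy_step, toy_retrieve, "a multi-hop question")
```

The `max_steps` bound is an added safeguard for the sketch so that a policy that never emits `[ANSWER]` still terminates.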

100. ❌ One Scale at a Time: Scale-Autoregressive Modeling for Fluid Flow Distributions

Authors: Mario Lino, Nils Thuerey Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11403v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

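Scoring rationale: The paper focuses on a generative modeling method for fluid dynamics simulation (scale-autoregressive modeling), which belongs to AI for Science, so it has some connection to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). However, it does not involve large language models (LLMs), the kinds of technical innovation the keywords target (such as MoE, scaling laws, training methods, inference optimization, or agents), or specific bioinformatics or cheminformatics applications, and it is unrelated to the other keywords (0 points).

!!! tip deepseek-chat TL;DR

To generate distributions of unsteady fluid flows efficiently and accurately, the paper proposes scale-autoregressive modeling (SAR), a hierarchical coarse-to-fine method. Across several benchmarks it achieves lower distributional error and higher per-sample accuracy than existing diffusion models, and it matches or surpasses a flow-matching Transolver while running 2-7x faster.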

摘要翻译

分析非定常流体流动通常需要获取所有可能时间状态的完整分布，但传统偏微分方程求解器计算成本过高，而基于学习的时间步进代理模型在长时间推演中会快速累积误差。生成模型通过独立采样状态避免了误差累积，但扩散模型和流匹配方法虽然精确，却受限于在整个网格上进行多次评估的高昂成本。我们提出了尺度自回归建模方法，用于在非结构化网格上从粗到细分层采样流动状态：该方法首先生成低分辨率流场，随后以粗尺度预测为条件，通过逐步采样更高分辨率实现细化。这种由粗到细的分解机制通过将计算集中在不确定性最大的粗尺度上提高了效率，同时在细尺度上需要更少的计算步骤。在不同复杂度的非定常流动基准测试中，相较于基于多尺度图神经网络的最先进扩散模型，SAR获得了显著更低的分布误差和更高的单样本精度；同时，其性能匹配或超越了流匹配Transolver（一种线性时间复杂度变压器模型），而计算速度根据任务不同提升了2-7倍。总体而言，SAR为实际应用中快速准确估计统计流动量（如湍流动能和两点相关性）提供了实用工具。

摘要 (Abstract)

Analyzing unsteady fluid flows often requires access to the full distribution of possible temporal states, yet conventional PDE solvers are computationally prohibitive and learned time-stepping surrogates quickly accumulate error over long rollouts. Generative models avoid compounding error by sampling states independently, but diffusion and flow-matching methods, while accurate, are limited by the cost of many evaluations over the entire mesh. We introduce scale-autoregressive modeling (SAR) for sampling flows on unstructured meshes hierarchically from coarse to fine: it first generates a low-resolution field, then refines it by progressively sampling higher resolutions conditioned on coarser predictions. This coarse-to-fine factorization improves efficiency by concentrating computation at coarser scales, where uncertainty is greatest, while requiring fewer steps at finer scales. Across unsteady-flow benchmarks of varying complexity, SAR attains substantially lower distributional error and higher per-sample accuracy than state-of-the-art diffusion models based on multi-scale GNNs, while matching or surpassing a flow-matching Transolver (a linear-time transformer) yet running 2-7x faster than this depending on the task. Overall, SAR provides a practical tool for fast and accurate estimation of statistical flow quantities (e.g., turbulent kinetic energy and two-point correlations) in real-world settings.

Keywords: fluid flow, generative models, scale-autoregressive modeling, unstructured meshes, coarse-to-fine, distributional error, turbulent kinetic energy, computational efficiency
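
The coarse-to-fine factorization can be sketched as a short sampling loop. This is a simplified illustration only: it assumes a regular 2D grid with nearest-neighbour upsampling, whereas the paper works on unstructured meshes, and `toy_residual` is a hypothetical stand-in for the learned conditional sampler at each scale.

```python
import numpy as np


def upsample2x(field):
    """Nearest-neighbour upsampling: each cell becomes a 2x2 block."""
    return np.repeat(np.repeat(field, 2, axis=0), 2, axis=1)


def sample_scale_autoregressive(rng, coarse_shape, n_scales, sample_residual):
    """Sample a field scale by scale, from coarse to fine.

    The coarsest field is drawn first. Every later scale is sampled
    conditioned on the upsampled prediction from the previous scale.
    """
    field = rng.standard_normal(coarse_shape)  # stand-in for the coarsest-scale generator
    for scale in range(1, n_scales):
        prior = upsample2x(field)
        field = prior + sample_residual(rng, prior, scale)
    return field


def toy_residual(rng, prior, scale):
    """Hypothetical conditional sampler: detail that shrinks at finer scales."""
    return rng.standard_normal(prior.shape) * 0.5 ** scale


rng = np.random.default_rng(0)
flow = sample_scale_autoregressive(rng, coarse_shape=(4, 4), n_scales=3, sample_residual=toy_residual)
```

Starting from a 4x4 field and using three scales gives a 16x16 sample. In the sketch, as in the abstract's description, most of the variability is decided at the coarse scales and the finer scales only add smaller corrections.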

101. ❌ From Agent Loops to Structured Graphs: A Scheduler-Theoretic Framework for LLM Agent Execution

Authors: Hu Wei Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11378v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

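Scoring rationale: The paper's core subject is an execution framework for LLM agents, which is highly relevant to "LLM Agents" (10 points). It covers the Agent Loop paradigm (related to "Tool Use" and "Multi-agent Systems", 5 points each). The paper notes that the Agent Loop reads an ever-growing context window, giving some connection to "Context Window Extension" (5 points). Other keywords, such as model training, inference optimization, and scientific applications, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

Targeting the structural weaknesses of the Agent Loop paradigm in LLM agent execution, the paper proposes SGH (Structured Graph Harness), a unified framework grounded in scheduling theory that lifts control flow from implicit context into an explicit static DAG to improve controllability, verifiability, and implementability.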

摘要翻译

当前构建基于大语言模型（LLM）智能体的主流范式是智能体循环（Agent Loop），这是一种迭代循环：单个语言模型通过不断读取持续增长的上下文窗口来决定下一步行动。该范式存在三个结构性弱点：步骤间的隐式依赖、无界的恢复循环，以及使调试复杂化的可变执行历史。我们将智能体循环描述为一种单就绪单元调度器（single ready unit scheduler）：在任何时刻，最多只有一个可执行单元处于活跃状态，而激活哪个单元的选择来自不透明的LLM推理，而非可检查的策略。这一视角将智能体循环与基于图的执行引擎置于同一语义连续统中。我们提出结构化图式框架（Structured Graph Harness, SGH），它将控制流从隐式上下文中提升为显式的静态有向无环图（DAG）。SGH遵循三项原则：执行计划在特定版本内不可变；规划、执行与恢复分离为三个层次；恢复遵循严格的升级协议。这些设计选择以牺牲部分表达能力为代价，换取了更高的可控性、可验证性与可实现性。我们的贡献包括四个方面：一个统一的调度器框架，将经典调度理论应用于LLM智能体执行，并识别了非确定性LLM节点带来的挑战；对70个已调研系统在可控性、表达能力和可实现性方面的权衡分析；包含节点状态机及具备终止性与可靠性保证的形式化规范；以及一个可归因的实验框架，采用七组设计以供未来验证。本文是一篇立场论文与设计提案。我们提供了理论框架、设计分析和实验方案，而非生产级实现或实证结果。

摘要 (Abstract)

The dominant paradigm for building LLM-based agents is the Agent Loop, an iterative cycle where a single language model decides what to do next by reading an ever-growing context window. This paradigm has three structural weaknesses: implicit dependencies between steps, unbounded recovery loops, and mutable execution history that complicates debugging. We characterize the Agent Loop as a single-ready-unit scheduler: at any moment, at most one executable unit is active, and the choice of which unit to activate comes from opaque LLM inference rather than an inspectable policy. This perspective places Agent Loops and graph-based execution engines on a single semantic continuum. We propose SGH, Structured Graph Harness, which lifts control flow from implicit context into an explicit static DAG. SGH makes three commitments: execution plans are immutable within a plan version; planning, execution, and recovery are separated into three layers; and recovery follows a strict escalation protocol. These choices trade some expressiveness for controllability, verifiability, and implementability. Our contributions are fourfold: a scheduler-unified framework that applies classical scheduling theory to LLM agent execution and identifies challenges introduced by non-deterministic LLM nodes; a trade-off analysis of controllability, expressiveness, and implementability across 70 surveyed systems; a formal specification including a node state machine with termination and soundness guarantees; and an attributable experimental framework with a seven-group design for future validation. This is a position paper and design proposal. We provide a theoretical framework, design analysis, and experimental protocol, not a production implementation or empirical results.

Keywords: LLM Agents, Agent Loop, Scheduler Theory, Structured Graph, Execution Framework, Control Flow, DAG, Controllability
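
The three commitments in the abstract (an immutable plan, a node state machine, and bounded recovery that escalates) can be illustrated with a small executor built on Python's standard `graphlib`. This is a sketch under assumptions, not SGH itself: the state names, the `EscalateToPlanner` exception, and the retry-then-escalate policy are hypothetical, and nodes run one at a time for simplicity even though a DAG would allow independent nodes to run concurrently.

```python
from enum import Enum
from graphlib import TopologicalSorter


class NodeState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class EscalateToPlanner(Exception):
    """Raised when node-level retries are exhausted (hypothetical escalation step)."""


def run_plan(plan, actions, max_retries=1):
    """Execute a static DAG given as {node: set of predecessor nodes}."""
    # Freeze the plan so it cannot change during this run (one "plan version").
    frozen = {node: frozenset(deps) for node, deps in plan.items()}
    # static_order() raises graphlib.CycleError if the plan is not a DAG.
    order = tuple(TopologicalSorter(frozen).static_order())
    state = {node: NodeState.PENDING for node in order}
    results = {}
    for node in order:
        inputs = {dep: results[dep] for dep in frozen.get(node, frozenset())}
        for _attempt in range(max_retries + 1):  # recovery is bounded, never open-ended
            state[node] = NodeState.RUNNING
            try:
                results[node] = actions[node](inputs)
            except Exception:
                state[node] = NodeState.FAILED
                continue
            state[node] = NodeState.SUCCEEDED
            break
        if state[node] is NodeState.FAILED:
            raise EscalateToPlanner(node)  # hand the failure to the planning layer
    return results, state


# Toy plan: fetch -> summarize -> report, where summarize fails once.
attempts = {"summarize": 0}


def flaky_summarize(inputs):
    attempts["summarize"] += 1
    if attempts["summarize"] == 1:
        raise RuntimeError("transient failure")
    return inputs["fetch"].upper()


plan = {"fetch": set(), "summarize": {"fetch"}, "report": {"summarize"}}
actions = {
    "fetch": lambda inputs: "data",
    "summarize": flaky_summarize,
    "report": lambda inputs: "report:" + inputs["summarize"],
}
results, state = run_plan(plan, actions)
```

Because dependencies are explicit edges rather than text in a growing context, the execution order can be inspected before anything runs, and a node that keeps failing stops the run instead of looping.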

102. ❌ From Redaction to Restoration: Deep Learning for Medical Image Anonymization and Reconstruction

Authors: Adrienne Kline, Abhijit Gaonkar, Daniel Pittman, Chris Kuehn, Nils Forkert Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11376v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

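Scoring rationale: The paper focuses on deep learning for medical image de-identification and reconstruction, using techniques such as CRNN and Stable Diffusion 2. It is an application of AI in science (medical imaging in particular). It therefore has some connection only to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), since medical image processing can be viewed as an application of bioinformatics or scientific AI. The other keywords all concern large-model technical principles, training methods, inference optimization, agent systems, and so on, none of which the paper covers, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes an end-to-end deep learning framework for de-identifying and reconstructing medical images. It detects and inpaints regions containing protected health information, protecting privacy while preserving image quality, to ease data sharing and collaboration in medical imaging AI.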

摘要翻译

从医学影像中移除患者特定信息对于在不泄露患者身份的前提下实现数据共享与开放科学至关重要。然而，当前使用的许多脱敏方法会因移除相关但非识别性信息而对下游影像分析任务产生负面影响。本研究提出了一种端到端的深度学习框架，可将原始临床影像体积数据转化为脱敏且可直接用于分析的数据集，同时不损害下游应用价值。本工作开发并验证的方法首先检测并遮盖可能包含受保护健康信息（PHI）的区域（如烧录文本和元数据），随后利用生成式深度学习模型，以解剖结构合理且影像学可信的内容对遮盖区域进行修复。该流程采用轻量级混合架构，结合了基于CRNN的遮盖模块与潜在扩散修复模块（Stable Diffusion 2）。我们通过隐私导向指标（量化残留PHI与遮盖成功率）以及影像质量与任务型指标（评估修复后数据在典型深度学习应用中的保真度）对该方法进行综合评估。结果表明，所提方法生成的脱敏医学影像在视觉上具有连贯性，能保持下游模型所需的保真度，同时显著降低患者被重新识别的风险。通过将匿名化与影像重建整合至单一自动化工作流，该方法有望促进大规模医学影像数据集的构建与传播，从而降低医学影像AI领域数据共享与多机构协作的关键壁垒。

摘要 (Abstract)

Removing patient-specific information from medical images is crucial to enable sharing and open science without compromising patient identities. However, many methods currently used for deidentification have negative effects on downstream image analysis tasks because of removal of relevant but non-identifiable information. This work presents an end-to-end deep learning framework for transforming raw clinical image volumes into de-identified, analysis-ready datasets without compromising downstream utility. The methodology developed and tested in this work first detects and redacts regions likely to contain protected health information (PHI), such as burned-in text and metadata, and then uses a generative deep learning model to inpaint the redacted areas with anatomically and imaging plausible content. The proposed pipeline leverages a lightweight hybrid architecture, combining CRNN-based redaction with a latent-diffusion inpainting restoration module (Stable Diffusion 2). We evaluate the approach using both privacy-oriented metrics, which quantify residual PHI and success of redaction, and image-quality and task-based metrics, which assess the fidelity of restored volumes for representative deep learning applications. Our results suggest that the proposed method yields de-identified medical images that are visually coherent, maintaining fidelity for downstream models, while substantially reducing the risk of patient re-identification. By automating anonymization and image reconstruction within a single workflow, the method may facilitate the construction and dissemination of large-scale medical imaging collections, thereby lowering a key barrier to data sharing and multi-institutional collaboration in medical imaging AI.

Keywords: medical image anonymization, deep learning, de-identification, image reconstruction, protected health information, latent-diffusion inpainting, data sharing, medical imaging AI

103. ❌ Minimal Embodiment Enables Efficient Learning of Number Concepts in Robot

Authors: Zhegong Shangguan, Alessandro Di Nuovo, Angelo Cangelosi Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11373v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

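Scoring rationale: The paper studies how robots learn abstract number concepts through embodied interaction, using a neural network model for sequential counting. It is unrelated to most keywords, which mainly target LLM techniques, training methods, reasoning techniques, agent systems, and so on. The paper concerns robot learning, embodied cognition, and neural networks rather than LLMs. It has some connection to "Mechanistic Interpretability OR Explainable AI" (5 points), because it analyzes how the model develops interpretable representations consistent with biological cognition (such as logarithmic tuning and a mental number line). It also has some connection to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because the research applies AI to cognitive science and robotics and can be viewed as a subfield of "AI for Science".

!!! tip deepseek-chat TL;DR

The study examines how robots can efficiently learn abstract number concepts through embodied interaction. It finds that embodied models reach 96.8% counting accuracy with only 10% of the training data and spontaneously form interpretable representations consistent with biological cognition.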

摘要翻译

机器人正日益进入需要理解数量概念的人机交互场景。智能系统如何从感觉运动经验中获取抽象数值概念，这仍是认知科学与人工智能领域的核心挑战。本研究通过神经网络模型探索具身数值学习，该模型在与Franka Panda机械臂的自然主义机器人交互中训练执行序列计数任务。实验表明，具身模型仅需10%的训练数据即可达到96.8%的计数准确率，而纯视觉基线模型仅达60.6%。当视觉-运动对应关系被随机化时，该优势依然存在，这表明具身性作为结构化先验规范了学习过程，而非单纯作为信息源。该模型自发形成了符合生物学特征的表征：具有对数调谐特性的数值选择性单元、心理数字线组织、韦伯定律缩放，以及编码数值大小的旋转动力学表征（相关系数r=0.97，斜率=30.6°/计数）。其学习轨迹与儿童从子集知晓者到基数原则知晓者的发展进程相吻合。这些发现证明，最小限度的具身性能够为抽象概念提供基础，提升数据效率，并产生与生物认知相契合的可解释表征，这或将为具身数学教学及安全关键型工业应用提供新思路。

摘要 (Abstract)

Robots are increasingly entering human-interactive scenarios that require understanding of quantity. How intelligent systems acquire abstract numerical concepts from sensorimotor experience remains a fundamental challenge in cognitive science and artificial intelligence. Here we investigate embodied numerical learning using a neural network model trained to perform sequential counting through naturalistic robotic interaction with a Franka Panda manipulator. We demonstrate that embodied models achieve 96.8% counting accuracy with only 10% of training data, compared to 60.6% for vision-only baselines. This advantage persists when visual-motor correspondences are randomized, indicating that embodiment functions as a structural prior that regularizes learning rather than as an information source. The model spontaneously develops biologically plausible representations: number-selective units with logarithmic tuning, mental number line organization, Weber-law scaling, and rotational dynamics encoding numerical magnitude ($r = 0.97$, slope $= 30.6°$/count). The learning trajectory parallels children’s developmental progression from subset-knowers to cardinal-principle knowers. These findings demonstrate that minimal embodiment can ground abstract concepts, improve data efficiency, and yield interpretable representations aligned with biological cognition, which may contribute to embodied mathematics tutoring and safety-critical industrial applications.

Keywords: embodied learning, numerical concepts, robotic interaction, neural network model, cognitive science, data efficiency, interpretable representations, developmental progression

104. ❌ Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories

Authors: Peiyang Liu, Zhirui Chen, Xi Wang, Di Liang, Youru Li, Zhi Cai, Wei Ye Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11365v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	15.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究MCTS在自动推理中的应用，提出CRPS框架从对比搜索轨迹中合成推理路径，用于模型微调。高度相关关键词：‘Monte Carlo Tree Search OR MCTS AND LLM’(15分，论文核心方法)，‘Post-training OR Supervised Fine-tuning OR SFT’(10分，涉及模型微调)，‘Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’(10分，关注推理路径)。中等相关：‘Large Language Models’(8分，隐含应用背景)，‘System 2 Thinking’(8分，涉及深度推理)，‘Self-Correction’(8分，包含反思过程)。弱相关：‘LLM Agents’(5分，与自动推理相关)。其余关键词与论文内容无关。

!!! tip deepseek-chat TL;DR

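To address the inefficiency of extracting supervision data from Monte Carlo Tree Search in automated reasoning, the paper proposes the Contrastive Reasoning Path Synthesis framework, which synthesizes reasoning paths by contrasting high- and low-quality search trajectories. Experiments show that a model fine-tuned on only 60K synthesized examples matches baselines trained on 590K standard examples while improving cross-domain generalization.

Abstract translation

Monte Carlo Tree Search (MCTS) is widely used for exploring automated reasoning data, yet current methods for extracting supervision remain inefficient. Standard approaches keep only the single highest-reward trajectory and discard the comparative signal contained in the many other explored paths. This paper proposes the **Contrastive Reasoning Path Synthesis (CRPS)** framework, which turns supervision extraction from a filtering process into a synthesis process. CRPS uses a structured reflection procedure to analyze the differences between high- and low-quality search trajectories and to extract explicit information about strategic pivots and local failure modes. These insights guide the synthesis of reasoning chains that incorporate successful patterns while avoiding the identified pitfalls. Experiments show that models fine-tuned on only 60K CRPS-synthesized examples match or exceed baselines trained on 590K standard rejection-sampling examples, a 20-fold reduction in data volume. CRPS also generalizes better on out-of-domain benchmarks, which shows that learning from the contrast between success and failure yields more transferable reasoning ability than learning from success alone.

Abstract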

Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce **Contrastive Reasoning Path Synthesis (CRPS)**, a framework that transforms supervision extraction from a filtering process into a synthesis procedure. CRPS uses a structured reflective process to analyze the differences between high- and low-quality search trajectories, extracting explicit information about strategic pivots and local failure modes. These insights guide the synthesis of reasoning chains that incorporate success patterns while avoiding identified pitfalls. We show empirically that models fine-tuned on just 60K CRPS-synthesized examples match or exceed the performance of baselines trained on 590K examples derived from standard rejection sampling, a 20$\times$ reduction in dataset size. Furthermore, CRPS improves generalization on out-of-domain benchmarks, demonstrating that learning from the contrast between success and failure produces more transferable reasoning capabilities than learning from success alone.
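
The difference between filtering and contrastive extraction can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `(path, reward)` tuple layout, the function names, and the `margin` threshold are assumptions made here, and the reflection and synthesis steps that CRPS applies to each pair are not shown.

```python
def select_best(trajectories):
    """Filtering baseline: keep only the single highest-reward trajectory."""
    return max(trajectories, key=lambda t: t[1])


def contrastive_pairs(trajectories, margin=0.5):
    """Pair each trajectory with every lower-reward one whose reward gap is at least `margin`.

    Each trajectory is a (path, reward) tuple; `margin` is a hypothetical threshold
    that decides when two trajectories differ enough to be worth contrasting.
    """
    ranked = sorted(trajectories, key=lambda t: t[1], reverse=True)
    pairs = []
    for i, high in enumerate(ranked):
        for low in ranked[i + 1:]:
            if high[1] - low[1] >= margin:
                pairs.append((high, low))
    return pairs
```

Under these assumptions, the filtering baseline returns one trajectory per search tree, while the pairing step keeps low-reward paths available as negative evidence for a later analysis stage.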

Keywords: Monte Carlo Tree Search, automated reasoning, contrastive learning, reasoning path synthesis, supervised fine-tuning, generalization, search trajectories, CRPS

105. ❌ The Missing Knowledge Layer in Cognitive Architectures for AI Agents

Authors: Michaël Roynard Venue/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11364v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

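Scoring rationale: The paper studies cognitive architectures for AI agents and proposes a four-layer decomposition (Knowledge, Memory, Wisdom, Intelligence) in which each layer has different persistence semantics. It is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points) because it directly discusses cognitive architecture frameworks for AI agents (CoALA and JEPA). It has some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points) because the abstract mentions the 'LLM Knowledge Base', but the paper's focus is architecture rather than LLM techniques. The other keywords (such as MoE, Scaling Laws, and RLHF) are unrelated to the paper (0 points); it does not touch on these techniques.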

!!! tip deepseek-chat TL;DR

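The paper argues that current cognitive architectures for AI agents (such as CoALA and JEPA) lack an explicit knowledge layer, which conflates persistence semantics. It proposes a four-layer decomposition (Knowledge, Memory, Wisdom, Intelligence), each layer with its own persistence semantics, and demonstrates feasibility with Python and Rust implementations.

Abstract translation

The two most influential cognitive architecture frameworks for AI agents, CoALA [21] and JEPA [12], both lack an explicit knowledge layer with independent persistence semantics. This omission causes a category error: systems apply cognitive decay to factual claims, or handle facts and experiences with the same update mechanism. We systematically examine persistence semantics in existing memory systems and identify eight convergence points (from Karpathy's LLM Knowledge Base [10] to the near-zero contradiction-resolution scores on the BEAM benchmark [22]), all of which point to related architectural gaps. We propose a four-layer decomposition (Knowledge, Memory, Wisdom, Intelligence) in which each layer has fundamentally different persistence semantics: indefinite supersession, Ebbinghaus decay, evidence-gated revision, and ephemeral inference, respectively. Companion implementations in Python and Rust show that this architectural separation is feasible. We borrow cognitive-science terminology as an analogy (the knowledge/memory distinction echoes Tulving's trichotomy), but the layers are engineering constructs built on persistence-semantics requirements, not a mapping of neural architecture. We argue that these distinctions must be realized through distinct persistence semantics in engineering implementations, and that no existing framework or system meets this requirement.

Abstract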

The two most influential cognitive architecture frameworks for AI agents, CoALA [21] and JEPA [12], both lack an explicit Knowledge layer with its own persistence semantics. This gap produces a category error: systems apply cognitive decay to factual claims, or treat facts and experiences with identical update mechanics. We survey persistence semantics across existing memory systems and identify eight convergence points, from Karpathy’s LLM Knowledge Base [10] to the BEAM benchmark’s near-zero contradiction-resolution scores [22], all pointing to related architectural gaps. We propose a four-layer decomposition (Knowledge, Memory, Wisdom, Intelligence) where each layer has fundamentally different persistence semantics: indefinite supersession, Ebbinghaus decay, evidence-gated revision, and ephemeral inference respectively. Companion implementations in Python and Rust demonstrate the architectural separation is feasible. We borrow terminology from cognitive science as a useful analogy (the Knowledge/Memory distinction echoes Tulving’s trichotomy), but our layers are engineering constructs justified by persistence-semantics requirements, not by neural architecture. We argue that these distinctions demand distinct persistence semantics in engineering implementations, and that no current framework or system provides this.
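
Two of the four persistence semantics named in the abstract can be illustrated with a short sketch. The class and function names, the `stability_hours` parameter, and the exponential form R = exp(-t/S) often used for the Ebbinghaus forgetting curve are assumptions chosen for illustration; the paper's own Python and Rust implementations are not reproduced here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class KnowledgeStore:
    """Indefinite supersession: a fact never decays; it is only replaced by a newer value."""
    current: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def assert_fact(self, key, value):
        # Keep the superseded value for audit instead of letting it fade over time.
        if key in self.current and self.current[key] != value:
            self.history.append((key, self.current[key]))
        self.current[key] = value


def retention(elapsed_hours, stability_hours):
    """Ebbinghaus-style decay for episodic memories: R = exp(-t / S)."""
    return math.exp(-elapsed_hours / stability_hours)
```

The category error described above would correspond to applying `retention` to entries in `KnowledgeStore`: a fact would lose weight simply because time has passed, not because newer evidence replaced it.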

Keywords: cognitive architectures, AI agents, knowledge layer, persistence semantics, memory systems, four-layer decomposition, CoALA, JEPA

106. ❌ CoRe-ECG: Advancing Self-Supervised Representation Learning for 12-Lead ECG via Contrastive and Reconstructive Synergy

Authors: Zehao Qin, Xiaojian Lin, Ping Zhang, Hongliang Wu, Xinkang Wang, Guangling Liu, Bo Chen, Wenming Yang, Guijin Wang Venue/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11359v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

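Scoring rationale: The paper focuses on self-supervised representation learning for electrocardiograms (ECG). It proposes a pretraining paradigm that combines contrastive and reconstructive learning (CoRe-ECG) and introduces frequency dynamic augmentation and spatio-temporal dual masking. Its core is the application of deep learning to medical signal processing (ECG analysis), which falls under AI for Science (specifically biomedical AI), so it is only somewhat relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). The paper does not cover large language models (LLMs), model architectures (such as MoE), training techniques (such as RLHF or PEFT), inference optimization, agent systems, or other large-model topics, so all other keywords score 0.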

!!! tip deepseek-chat TL;DR

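To address the scarcity of labeled data in electrocardiogram (ECG) analysis, the study proposes CoRe-ECG, a self-supervised pretraining method that combines contrastive and reconstructive learning. It improves representation learning through frequency dynamic augmentation and spatio-temporal dual masking and achieves state-of-the-art performance on multiple downstream ECG datasets.

Abstract translation

Accurate interpretation of electrocardiograms remains challenging because labeled data are scarce and expert annotation is costly. Self-supervised learning offers a promising solution by letting models learn expressive representations from unlabeled signals. Existing self-supervised methods for ECG usually rely on either contrastive learning or reconstructive learning. Either approach alone provides only limited supervisory signal and has further limitations, including non-physiological distortions introduced by simple data augmentations and trivial inter-lead correlations that models may exploit as learning shortcuts. This study proposes CoRe-ECG, a unified contrastive and reconstructive pretraining paradigm that establishes a synergistic interaction between global semantic modeling and local structural learning. CoRe-ECG aligns global representations during reconstruction, so that instance-level discriminative signals can guide local waveform recovery. To further strengthen pretraining, we propose Frequency Dynamic Augmentation, which adaptively perturbs ECG signals according to their frequency-domain importance, and a Spatio-Temporal Dual Masking strategy, which breaks linear dependencies between leads and thereby makes the reconstruction task harder. Our method achieves state-of-the-art performance on multiple downstream ECG datasets. Ablation studies further confirm that each component is necessary and complementary. The method provides a robust and physiologically meaningful representation learning framework for ECG analysis.

Abstract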

Accurate interpretation of electrocardiogram (ECG) remains challenging due to the scarcity of labeled data and the high cost of expert annotation. Self-supervised learning (SSL) offers a promising solution by enabling models to learn expressive representations from unlabeled signals. Existing ECG SSL methods typically rely on either contrastive learning or reconstructive learning. However, each approach in isolation provides limited supervisory signals and suffers from additional limitations, including non-physiological distortions introduced by naive augmentations and trivial correlations across multiple leads that models may exploit as shortcuts. In this work, we propose CoRe-ECG, a unified contrastive and reconstructive pretraining paradigm that establishes a synergistic interaction between global semantic modeling and local structural learning. CoRe-ECG aligns global representations during reconstruction, enabling instance-level discriminative signals to guide local waveform recovery. To further enhance pretraining, we introduce Frequency Dynamic Augmentation (FDA) to adaptively perturb ECG signals based on their frequency-domain importance, and Spatio-Temporal Dual Masking (STDM) to break linear dependencies across leads, increasing the difficulty of reconstructive tasks. Our method achieves state-of-the-art performance across multiple downstream ECG datasets. Ablation studies further demonstrate the necessity and complementarity of each component. This approach provides a robust and physiologically meaningful representation learning framework for ECG analysis.
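
One well-known source of linear dependence across leads is the standard limb-lead geometry: leads III, aVR, aVL, and aVF are exact linear combinations of leads I and II (Einthoven's law and the Goldberger relations). The sketch below shows why a masked limb lead can be recovered trivially while I and II stay visible. Whether STDM targets exactly these relations is an assumption; the abstract only mentions linear dependencies across leads, and the function name is hypothetical.

```python
def derive_limb_leads(lead_i, lead_ii):
    """Compute leads III, aVR, aVL, aVF sample by sample from leads I and II.

    Standard relations: III = II - I, aVR = -(I + II) / 2,
    aVL = I - II / 2, aVF = II - I / 2.
    """
    pairs = list(zip(lead_i, lead_ii))
    return {
        "III": [b - a for a, b in pairs],
        "aVR": [-(a + b) / 2 for a, b in pairs],
        "aVL": [a - b / 2 for a, b in pairs],
        "aVF": [b - a / 2 for a, b in pairs],
    }
```

If a reconstruction task masks only lead III, a model can reach zero error by learning this subtraction instead of any cardiac structure, which is the kind of shortcut that masking several leads over the same time span is meant to remove.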

Keywords: self-supervised learning, ECG analysis, contrastive learning, reconstructive learning, representation learning, frequency dynamic augmentation, spatio-temporal dual masking, biomedical AI

107. ❌ Governance by Design: A Parsonian Institutional Architecture for Internet-Wide Agent Societies

Authors: Anbang Ruan Venue/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11337v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

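Scoring rationale: The paper studies governance architecture for internet-wide agent societies. Its core is applying Parsons' AGIL framework to design an institutional architecture and diagnosing governance gaps in existing ecosystems (such as OpenClaw). It is unrelated to most keywords, which concern the technical principles, training methods, inference optimization, and alignment of large models, whereas this paper focuses on macro-level institutional design, governance architecture, and sociological analysis of agent societies. Only two keywords are highly relevant: 1) 'LLM Agents OR Autonomous Agents OR Agentic Workflow': the paper studies internet-wide societies of autonomous agents, centering on agent interaction and emergent behavior. 2) 'Multi-agent Systems OR Agent Coordination': the paper analyzes missing coordination and governance needs in multi-agent systems, covering interaction and coordination among agents. Other keywords, such as large-model techniques, training methods, and inference optimization, are not covered and therefore score 0.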

!!! tip deepseek-chat TL;DR

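The paper examines the governance gap in internet-wide agent societies. It applies Parsons' AGIL framework to propose an institutional architecture and, by diagnosing ecosystems such as OpenClaw, finds that existing infrastructure lacks an effective governance and coordination layer and a normative foundation.

Abstract translation

The currently dominant paradigm of local multi-agent systems, centrally orchestrated pipelines within enterprise boundaries, is being replaced by internet-scale agent societies. In these societies, autonomous agents discover one another through open registries, interact without a central coordinator, and give rise to emergent social behavior. We argue that governing such societies requires institutional design, not merely risk enumeration or process compliance. Applying Talcott Parsons' AGIL framework, the four functional imperatives that every viable social system must satisfy (Adaptation, Goal Attainment, Integration, Latency), we derive a prescriptive sixteen-cell institutional architecture to guide internet-scale agent governance. We apply the architecture diagnostically to the OpenClaw ecosystem (more than 250,000 GitHub stars, more than 2 million monthly users, and more than 770,000 registered agents) through a recursive sub-function analysis (64 binary indicators across 16 cells). Sub-function coverage is at most 19% (sensitivity range 17-30%), and this is potential rather than operative capacity, because the absence of any inter-cell coordination prevents existing infrastructure from taking part in inter-pillar interchange. A complementary assessment of interchange media shows that none of the twelve inter-pillar pathways is functional: the ecosystem has technical infrastructure but lacks effective governance, a coordination layer, and a normative foundation, with the Fiduciary and Political pillars most severely underserved. Extending the diagnosis to the broader agent-native protocol stack (MCP, A2A, ANP, x402, ERC-8004), we find that independent development teams reproduce the same structural pattern, which confirms that the governance gap is an inherent feature of market-driven development rather than a sign of ecosystem immaturity. Institutional design is most effective before social patterns solidify; we conclude with a prioritized implementation roadmap for the missing governance infrastructure.

Abstract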

The dominant paradigm of local multi-agent systems – orchestrated, enterprise-bounded pipelines – is being superseded by internet-wide agent societies in which autonomous agents discover each other through open registries, interact without central orchestrators, and generate emergent social behaviors. We argue that governing such societies requires institutional design, not merely risk enumeration or process compliance. Applying Talcott Parsons’ AGIL framework – four functional imperatives (Adaptation, Goal Attainment, Integration, Latency) every viable social system must satisfy – we derive a prescriptive sixteen-cell institutional architecture for internet-wide agent governance. Diagnostically applied to the OpenClaw ecosystem (250,000+ GitHub stars, 2M+ monthly users, 770,000+ registered agents) via a recursive sub-function analysis (64 binary indicators across 16 cells), we find at most 19% sub-function coverage (sensitivity range 17-30%) – potential rather than operative capacity, since zero inter-cell coordination prevents existing infrastructure from participating in inter-pillar interchange. A complementary interchange media assessment finds zero of twelve inter-pillar pathways functional: the ecosystem has technical infrastructure but no active governance, no coordination layer, and no normative grounding, with the Fiduciary and Political pillars most severely underserved. Extending the diagnostic to the broader agent-native protocol stack (MCP, A2A, ANP, x402, ERC-8004), independent development teams reproduce the same structural pattern – confirming the governance gap is a feature of market-driven development, not ecosystem immaturity. Institutional design is most effective before social patterns calcify; we conclude with a prioritized roadmap for the missing governance infrastructure.

Keywords: internet-wide agent societies, governance architecture, AGIL framework, institutional design, multi-agent systems, autonomous agents, coordination layer, emergent social behaviors

108. ❌ A Compact and Efficient 1.251 Million Parameter Machine Learning CNN Model PD36-C for Plant Disease Detection: A Case Study

Authors: Shkelqim Sherifi Venue/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11332v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

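Scoring rationale: The paper focuses on plant disease detection with a small convolutional neural network (CNN), an application of deep learning to agriculture. It does not involve large-model techniques such as large language models (LLMs), MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agents, tool use, multi-agent systems, quantization, speculative decoding, hallucination mitigation, interpretability, world models, model merging, or in-context learning. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because plant disease detection can be seen as an application of AI to agricultural science (a field adjacent to bioinformatics); since the paper's core is a conventional CNN rather than a large model, it receives 5 points (some relevance). All other keywords are entirely unrelated and score 0.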

!!! tip deepseek-chat TL;DR

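The paper proposes PD36-C, a compact convolutional neural network (1.25 million parameters) for plant disease classification. On a dataset of 87k images in 38 classes it reaches an average test accuracy of 99.53%, and the author also develops a desktop application, offering an efficient edge-deployment solution for smart agriculture.

Abstract translation

Deep learning has significantly advanced image-based plant disease diagnosis, as better hardware and higher-quality datasets have made neural network models increasingly accurate. This paper presents PD36-C, a compact convolutional neural network for plant disease classification (1,250,694 parameters, 4.77 MB). The model was trained with TensorFlow Keras on the New Plant Diseases Dataset (87k images, 38 classes). Its design balances robustness with edge deployability, and it is accompanied by a Qt for Python desktop application that provides an intuitive graphical user interface (GUI) and offline inference on commodity hardware. In experiments, training accuracy reached 0.99697 by epoch 30, and average test accuracy across the 38 classes was 0.9953. Per-class performance is high throughout. At the lower end, Corn (maize) Cercospora leaf spot reaches a precision of about 0.9777 and a recall of about 0.9634, which indicates occasional confusion with visually similar classes. At the upper end, several classes, including Apple Black rot, Cedar apple rust, Blueberry healthy, Cherry Powdery mildew, Cherry healthy, and all four grape classes, reach a precision of 1.00 and a recall of 1.00, which indicates no false positives and full coverage. These results show that, with a carefully curated dataset and careful architectural design, small CNNs can reach accuracy comparable to recent baselines while remaining practical for edge scenarios. We also note typical limiting factors, such as adverse weather, low-quality images, and leaves affected by several diseases at once, which can degrade performance and call for further work on domain robustness. Overall, PD36-C and its application pipeline provide an efficient, field-ready solution for AI-assisted plant disease detection in smart agriculture.

Abstract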

Deep learning has markedly advanced image-based plant disease diagnosis as improved hardware and dataset quality have enabled increasingly accurate neural network models. This paper presents PD36-C, a compact convolutional neural network (1,250,694 parameters and 4.77 MB) for plant disease classification. Trained with TensorFlow Keras on the New Plant Diseases Dataset (87k images, 38 classes), PD36-C is designed for robustness and edge deployability, complemented by a Qt for Python desktop application that offers an intuitive GUI and offline inference on commodity hardware. Across experiments, training accuracy reached 0.99697 by epoch 30, and average test accuracy was 0.9953 across 38 classes. Per-class performance is uniformly high; on the lower end, Corn (maize) Cercospora leaf spot achieved precision around 0.9777 and recall around 0.9634, indicating occasional confusion with visually similar categories, while on the upper end numerous classes including Apple Black rot, Cedar apple rust, Blueberry healthy, Cherry Powdery mildew, Cherry healthy, and all four grape categories achieved perfect precision of 1.00 and recall of 1.00, indicating no false positives and strong coverage. These results show that with a well-curated dataset and careful architectural design, small CNNs can achieve competitive accuracy compared with recent baselines while remaining practical for edge scenarios. We also note typical constraints such as adverse weather, low-quality imagery, and leaves exhibiting multiple concurrent diseases that can degrade performance and warrant future work on domain robustness. Overall, PD36-C and its application pipeline contribute a field-ready, efficient solution for AI-assisted plant disease detection in smart agriculture.

Keywords: plant disease detection, convolutional neural network, compact model, edge deployment, smart agriculture, image classification, deep learning, PD36-C

109. ❌ Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

Authors: Xiaoyu Ma, Yiwen Li, Haoyue Liu, Zhichao Wang, Ye Chen, Yongxin Guo, Xiaoying Tang Venue/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11328v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

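Scoring rationale: The paper studies evaluation scheduling in automatic prompt optimization (APO), with the core aim of making LLM prompt optimization more efficient. It is highly relevant to "Large Language Models" (8 points) because APO is a key technique for LLM applications, and relevant to "In-context Learning" (8 points) because prompt optimization directly affects in-context learning. Other keywords such as MoE, SFT, and RAG are not covered and score 0.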

!!! tip deepseek-chat TL;DR

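To address the high evaluation cost of automatic prompt optimization, the paper proposes Prompt-Aware Online Evaluation Scheduling. Across 36 tasks it improves average accuracy by 6.2% over the best baseline at the same evaluation budget, and its selection of k = 20 examples matches naive evaluation at k = 30-50, cutting token consumption by 35-60%.

Abstract translation

The effectiveness of automatic prompt optimization (APO) depends on the quality of its evaluation signal, yet scoring every candidate prompt on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but blind to the prompts) or adjust the subset heuristically during optimization (flexible but unstable and without formal guarantees). We observe that APO maps naturally onto an online adaptive testing problem: prompts are the examinees, training examples are the test items, and the scheduler should select the items that best discriminate among the strongest candidate prompts. This insight leads to **Prompt-Aware Online Evaluation Scheduling (POES)**, which combines a discrimination utility based on IRT (item response theory), a facility-location coverage term, and a warm-start swap strategy that accounts for switching costs into a unified objective. The objective is provably monotone submodular, which gives a (1-1/e) greedy guarantee for cold starts and a bounded-drift constraint for warm-start updates. An adaptive controller adjusts the balance between exploration and exploitation according to optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy at the same evaluation budget (6.2% above the best baseline) with negligible token overhead (about 4%). In addition, principled selection of k=20 examples matches or exceeds naive evaluation with k=30-50 examples, reducing token consumption by 35-60% and showing that selecting smarter is more effective than selecting more. Our results indicate that evaluation scheduling is a core component of APO rather than a minor implementation detail.

Abstract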

Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking formal guarantees). We observe that APO naturally maps to an online adaptive testing problem: prompts are examinees, training examples are test items, and the scheduler should select items that best discriminate among the strongest candidates. This insight motivates Prompt-Aware Online Evaluation Scheduling (POES), which integrates an IRT-based discrimination utility, a facility-location coverage term, and switching-cost-aware warm-start swaps into a unified objective that is provably monotone submodular, yielding a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. An adaptive controller modulates the exploration-exploitation balance based on optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2 percent improvement over the best baseline) with negligible token overhead (approximately 4 percent) at the same evaluation budget. Moreover, principled selection at k = 20 examples matches or exceeds the performance of naive evaluation at k = 30-50, reducing token consumption by 35-60 percent, showing that selecting smarter is more effective than selecting more. Our results demonstrate that evaluation scheduling is a first-class component of APO, not an implementation detail.
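
The (1-1/e) guarantee mentioned above is the classical bound for greedy maximization of a monotone submodular function under a cardinality constraint. The sketch below shows that greedy loop for the facility-location coverage term alone. It is an illustration under stated assumptions, not the POES objective: the IRT discrimination utility and the warm-start swaps are omitted, the function names are hypothetical, and `sim[i][j]` is assumed to be a non-negative similarity expressing how well example `j` represents example `i`.

```python
def facility_location(sim, selected):
    """Coverage f(S) = sum_i max_{j in S} sim[i][j]; defined as 0.0 for empty S."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim, k):
    """Pick up to k examples, each time adding the one with the largest marginal gain."""
    selected = []
    remaining = list(range(len(sim)))
    for _ in range(min(k, len(sim))):
        base = facility_location(sim, selected)
        # max() keeps the first maximizer, so ties resolve to the lowest index.
        best = max(
            remaining,
            key=lambda j: facility_location(sim, selected + [j]) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each round re-evaluates the objective for every remaining candidate, which is quadratic in the number of examples per pick; that is acceptable for a small evaluation pool, and a lazy-evaluation variant would be the usual next step for larger ones.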

Keywords: Automatic Prompt Optimization, Evaluation Scheduling, Submodular Optimization, IRT-based Discrimination, Facility-Location Coverage, Online Adaptive Testing, Token Efficiency, Prompt-Aware Selection

110. ❌ Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations

Authors: Yilong Liu, Xixun Lin, Pengfei Cao, Ge Zhang, Fang Fang, Yanan Cao Venue/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11322v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

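Scoring rationale: The paper's core topic is structural alignment bias in LLM tool invocation. It is highly relevant to "Large Language Models" and "Tool Use" (10 points each) because the whole paper revolves around LLMs' tool-use ability. It is also highly relevant to "Mechanistic Interpretability" (10 points) because the paper studies LLMs' internal mechanisms with the Contrastive Attention Attribution method. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.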

!!! tip deepseek-chat TL;DR

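The paper finds that LLMs exhibit structural alignment bias in tool invocation: when query attributes structurally match a tool's parameters, LLMs still tend to invoke the tool even if it is semantically irrelevant to the query. It introduces the SABEval dataset and Contrastive Attention Attribution to analyze the bias, and a rebalancing strategy to mitigate it.

Abstract translation

Large language models (LLMs) have shown impressive ability to use external tools. In practice, however, LLMs are often exposed to tools that are irrelevant to the user's query, in which case the ideal behavior is to refrain from invoking them. This study identifies a widespread but overlooked mechanistic flaw in tool refusal, which we define as structural alignment bias: even when a tool cannot serve the user's goal, LLMs still tend to invoke it as long as the query's attributes can be validly assigned to the tool's parameters. To study this bias systematically, we construct the SABEval dataset, which decouples structural alignment from semantic relevance. Our analysis shows that structural alignment bias causes severe tool-invocation errors in LLMs, yet existing evaluations largely fail to account for it. To probe the mechanism behind the bias, we propose Contrastive Attention Attribution, which reveals two competing pathways, one for semantic checking and one for structural matching. The relative strength of these two pathways governs LLMs' tool-invocation decisions. Building on these findings, we further propose a rebalancing strategy; extensive experiments show that it effectively mitigates structural alignment bias without weakening general tool-use ability.

Abstract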

Large language models (LLMs) have demonstrated impressive capabilities in utilizing external tools. In practice, however, LLMs are often exposed to tools that are irrelevant to the user’s query, in which case the desired behavior is to refrain from invocations. In this work, we identify a widespread yet overlooked mechanistic flaw in tool refusal, which we term structural alignment bias: Even when a tool fails to serve the user’s goal, LLMs still tend to invoke it whenever query attributes can be validly assigned to tool parameters. To systematically study this bias, we introduce SABEval, a new dataset that decouples structural alignment from semantic relevance. Our analysis shows that structural alignment bias induces severe tool-invocation errors in LLMs, yet remains largely unaccounted for in existing evaluations. To investigate the internal mechanisms underlying this bias, we propose Contrastive Attention Attribution, which reveals two competing pathways for semantic checking and structural matching. The relative strength of these pathways drives LLMs’ tool invocation decisions. Based on these findings, we further introduce a rebalancing strategy that effectively mitigates structural alignment bias, as demonstrated by extensive experiments, without degrading general tool-use capabilities.
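
The distinction between structural alignment and semantic relevance can be made concrete with a toy check. Everything in this sketch is hypothetical and chosen only to illustrate the definition above: the `Tool` record, the `goal` label, the attribute types, and both function names are assumptions, and none of it reflects how SABEval or Contrastive Attention Attribution is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    goal: str      # what the tool accomplishes (hypothetical label)
    params: dict   # parameter name -> expected attribute type


def structurally_aligned(query_attrs, tool):
    """True when every tool parameter can be filled by a distinct query attribute of that type."""
    available = list(query_attrs.values())
    for needed in tool.params.values():
        if needed not in available:
            return False
        available.remove(needed)  # each attribute fills at most one parameter
    return True


def should_invoke(query_goal, query_attrs, tool):
    """Desired behavior: invoke only if the tool serves the goal and its arguments can be filled."""
    return tool.goal == query_goal and structurally_aligned(query_attrs, tool)
```

A query that mentions two cities but asks to compare their weather is structurally aligned with a flight-booking tool that takes two cities, yet `should_invoke` returns `False`; the bias described above corresponds to a model acting on the first check alone.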

Keywords: Large Language Models, Tool Use, Structural Alignment Bias, Mechanistic Interpretability, Tool Invocation, Contrastive Attention Attribution, SABEval

111. ❌ Network Effects and Agreement Drift in LLM Debates

Authors: Erica Cau, Andrea Failla, Giulio Rossetti Venue/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11312v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

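Scoring rationale: The paper studies the collective behavior of LLMs in multi-round debates, centering on LLM agents and multi-agent systems, so it is highly relevant (10 points) to 'Large Language Models OR LLMs OR Foundation Models', 'LLM Agents OR Autonomous Agents OR Agentic Workflow', and 'Multi-agent Systems OR Agent Coordination'. Other keywords such as MoE, quantization, inference acceleration, and alignment are neither mentioned in nor relevant to the abstract, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies the collective behavior of LLM agents in multi-round debates and finds a directional tendency it calls 'agreement drift', indicating that structural effects must be separated from model biases before LLM populations are used as proxies for human behavior.

Abstract (Chinese translation)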

大语言模型（LLMs）已展现出模拟类人社会行为的空前能力，使其成为模拟复杂社会系统的有效工具。然而，这些模拟能在多大程度上被信任以准确捕捉关键社会机制，尤其是在涉及少数群体的高度不平衡情境中，目前尚不明确。本文采用一个具有可控同质性和类别规模的网络生成模型，以考察LLM智能体在多轮辩论中的集体行为。此外，我们的研究结果揭示了一种特定的方向性敏感现象，我们称之为共识漂移，即智能体更倾向于向意见量表中的特定立场偏移。总体而言，我们的发现强调在将LLM群体视为人类群体的行为代理之前，必须首先厘清结构效应与模型偏差之间的影响。

Abstract

Large Language Models (LLMs) have demonstrated an unprecedented ability to simulate human-like social behaviors, making them useful tools for simulating complex social systems. However, it remains unclear to what extent these simulations can be trusted to accurately capture key social mechanisms, particularly in highly unbalanced contexts involving minority groups. This paper uses a network generation model with controlled homophily and class sizes to examine how LLM agents behave collectively in multi-round debates. Moreover, our findings highlight a particular directional susceptibility that we term agreement drift, in which agents are more likely to shift toward specific positions on the opinion scale. Overall, our findings highlight the need to disentangle structural effects from model biases before treating LLM populations as behavioral proxies for human groups.

Keywords: Large Language Models, LLM agents, multi-round debates, network generation model, agreement drift, social behaviors, model biases, behavioral proxies

112. ❌ The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Authors: Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu, Yuxuan Zhou, Zeming Wei, Dongxian Wu, Xun Chen, Jun Sun, Meng Sun Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11309v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

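Scoring rationale: The paper focuses on LLM safety, specifically the Salami Attack method for multi-turn jailbreaking and a corresponding defense strategy. The core relevant keywords are 'Large Language Models' (the object of study) and 'Instruction Tuning OR Alignment OR Value Alignment' (the attack aims to bypass the model's alignment constraints). Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes Salami Attack, a multi-turn jailbreak method that bypasses LLM safety alignment by accumulating low-risk inputs, achieving an attack success rate above 90% on GPT-4o and Gemini, and also proposes a corresponding defense strategy.

Abstract (Chinese translation)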

大语言模型（LLMs）面临着越狱行为带来的显著安全风险，这种操作通过操控模型以绕过内置安全约束，生成不道德或不安全的内容。在各种越狱技术中，多轮越狱攻击相较于单轮攻击更为隐蔽和持久，暴露了大语言模型的关键脆弱性。
然而，现有的多轮越狱方法存在两个影响其实际场景效用的根本性局限：（a）随着模型语境感知能力的增强，任何显式的有害触发内容都更可能被标记和拦截；（b）成功的最终步骤触发通常需要精细调整、针对特定模型的语境，使得此类攻击高度依赖上下文。为填补这一空白，我们提出了 “渐进式风险累积”（Salami Slicing Risk），其运作机制是通过串联大量低风险输入，这些输入各自规避了对齐阈值，但累积起来逐步积累有害意图，最终触发高风险行为，而无需严重依赖预先设计的上下文结构。基于此风险概念，我们开发了“Salami Attack”，这是一个可普遍适用于多种模型类型和模态的自动化攻击框架。
严格的实验证明，该框架在多种模型和模态上均取得了最先进的性能，在GPT-4o和Gemini上实现了超过90%的攻击成功率，并对现实世界的对齐防御机制表现出鲁棒性。我们还提出了一种防御策略，该策略能将Salami Attack的成功率至少降低44.8%，同时对其他多轮越狱攻击实现了最高64.8%的拦截率。我们的研究结果为理解多轮越狱的普遍性风险提供了关键见解，并为增强大语言模型安全性提供了可行的缓解策略。

Abstract

Large Language Models (LLMs) face prominent security risks from jailbreaking, a practice that manipulates models to bypass built-in security constraints and generate unethical or unsafe content. Among various jailbreak techniques, multi-turn jailbreak attacks are more covert and persistent than single-turn counterparts, exposing critical vulnerabilities of LLMs. However, existing multi-turn jailbreak methods suffer from two fundamental limitations that affect the actual impact in real-world scenarios: (a) As models become more context-aware, any explicit harmful trigger is increasingly likely to be flagged and blocked; (b) Successful final-step triggers often require finely tuned, model-specific contexts, making such attacks highly context-dependent. To fill this gap, we propose Salami Slicing Risk, which operates by chaining numerous low-risk inputs that individually evade alignment thresholds but cumulatively accumulate harmful intent to ultimately trigger high-risk behaviors, without heavy reliance on pre-designed contextual structures. Building on this risk, we develop Salami Attack, an automatic framework universally applicable to multiple model types and modalities. Rigorous experiments demonstrate its state-of-the-art performance across diverse models and modalities, achieving over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses. We also propose a defense strategy to constrain the Salami Attack by at least 44.8% while achieving a maximum blocking rate of 64.8% against other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security.

Keywords: Large Language Models, jailbreaking, multi-turn attacks, alignment, security risks, Salami Attack, defense strategy, Attack Success Rate
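
The cumulative-risk idea in the abstract above can be illustrated from the defender's side. The following is a minimal sketch, not the defense strategy proposed in the paper (which the abstract does not specify): it assumes each turn already carries a hypothetical risk score in [0, 1], and the class name, thresholds, and decay factor are all illustrative assumptions. It shows why a per-turn threshold alone misses a sequence of individually low-risk inputs, while a running total catches it.

```python
from dataclasses import dataclass, field


@dataclass
class CumulativeRiskMonitor:
    """Toy conversation-level monitor (hypothetical, not the paper's defense)."""

    per_turn_threshold: float = 0.5    # assumed single-message block threshold
    cumulative_threshold: float = 1.0  # assumed conversation-level risk budget
    decay: float = 0.9                 # assumed discount applied to older turns
    total: float = field(default=0.0, init=False)

    def observe(self, turn_risk: float) -> str:
        """Return 'block_turn', 'block_conversation', or 'allow' for one turn."""
        if turn_risk >= self.per_turn_threshold:
            return "block_turn"
        # Accumulate risk across turns so many small slices still add up.
        self.total = self.total * self.decay + turn_risk
        if self.total >= self.cumulative_threshold:
            return "block_conversation"
        return "allow"


monitor = CumulativeRiskMonitor()
# Hypothetical per-turn risk scores, each below the per-turn threshold.
decisions = [monitor.observe(risk) for risk in [0.3, 0.3, 0.3, 0.3, 0.3]]
print(decisions)
```

In this toy run, no single turn reaches the per-turn threshold, yet the decayed running total crosses the conversation budget on the fourth turn.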

Authors: Lei Xiong, Huaying Yuan, Zheng Liu, Zhao Cao, Zhicheng Dou Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11307v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

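Scoring rationale: The paper's core contribution is PaperScope, a benchmark for evaluating multi-modal large language models (MLLMs) on deep multi-document research in scientific domains. Highly relevant keywords (10 points): 'Large Language Models' (the paper explicitly uses MLLMs), 'LLM Agents' (the paper focuses on agentic deep research), and 'AI for Science' (the paper targets scientific domains, specifically the analysis of AI papers). Moderately relevant keywords (8 points): 'Retrieval-Augmented Generation' (multi-document retrieval and reasoning), 'Context Window Extension' (long contexts must be handled), and 'Chain of Thought' and 'System 2 Thinking' (multi-step deep scientific reasoning). The remaining keywords concern technical details (model training, optimization, compression, and so on) or application domains unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

Addressing the lack of systematic evaluation of multi-modal large language models on deep multi-document research in scientific domains, the paper introduces PaperScope, a multi-modal multi-document benchmark that evaluates long-context retrieval and deep multi-source reasoning. Experiments show that current advanced systems achieve limited scores on it.

Abstract (Chinese translation)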

利用多模态大语言模型加速前沿科学研究前景广阔，但如何严格评估此类系统仍不明确。现有基准主要关注单文档理解，而真实的科学工作流程需要整合来自多篇论文的证据，包括文本、表格和图形。因此，多模态、多文档的科学推理研究仍显不足，且缺乏系统性评估。为填补这一空白，我们提出了PaperScope——一个为智能深度研究设计的多模态多文档基准。PaperScope具备三大优势：（1）结构化的科学基础。它建立在涵盖三年、超过2000篇人工智能论文的知识图谱之上，为研究型查询提供了结构化基础。（2）语义密集的证据构建。它整合了语义相关的关键信息节点，并采用优化的随机游走文章选择器来采样主题连贯的论文集合，从而确保足够的语义密度和任务复杂度。（3）科学推理的多任务评估。它包含超过2000个涵盖推理、检索、摘要和问题解决的问答对，能够评估多步骤科学推理能力。实验结果表明，即使是OpenAI Deep Research和通义Deep Research等先进系统，在PaperScope上的得分也有限，凸显了长上下文检索和深度多源推理的难度。因此，PaperScope不仅提供了一个严格的评估基准，还提供了一个可扩展的构建流程，用于创建大规模多模态、多源深度研究数据集。

Abstract

Leveraging Multi-modal Large Language Models (MLLMs) to accelerate frontier scientific research is promising, yet how to rigorously evaluate such systems remains unclear. Existing benchmarks mainly focus on single-document understanding, whereas real scientific workflows require integrating evidence from multiple papers, including their text, tables, and figures. As a result, multi-modal, multi-document scientific reasoning remains underexplored and lacks systematic evaluation. To address this gap, we introduce PaperScope, a multi-modal multi-document benchmark designed for agentic deep research. PaperScope presents three advantages: (1) Structured scientific grounding. It is built on a knowledge graph of over 2,000 AI papers spanning three years, providing a structured foundation for research-oriented queries. (2) Semantically dense evidence construction. It integrates semantically related key information nodes and employs an optimized random-walk article selector to sample thematically coherent paper sets, thereby ensuring adequate semantic density and task complexity. (3) Multi-task evaluation of scientific reasoning. It contains over 2,000 QA pairs across reasoning, retrieval, summarization, and problem solving, enabling evaluation of multi-step scientific reasoning. Experimental results show that even advanced systems such as OpenAI Deep Research and Tongyi Deep Research achieve limited scores on PaperScope, highlighting the difficulty of long-context retrieval and deep multi-source reasoning. PaperScope thus provides a rigorous benchmark alongside a scalable pipeline for constructing large-scale multi-modal, multi-source deep research datasets.

Keywords: Multi-modal Large Language Models, scientific research, multi-document benchmark, agentic deep research, long-context retrieval, multi-source reasoning, knowledge graph, scientific reasoning evaluation
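
The abstract above mentions a random-walk article selector over a knowledge graph but does not describe its optimization. As a rough illustration of the general idea only, the sketch below runs a plain, unoptimized random walk over a toy adjacency dictionary until it has visited a requested number of distinct papers. The function name, the graph, and the restart rule are assumptions made for this example, not details from the paper.

```python
import random


def sample_paper_set(graph, start, size, seed=0, max_steps=1000):
    """Random-walk over a paper graph until `size` distinct papers are visited."""
    rng = random.Random(seed)
    visited = [start]
    current = start
    for _ in range(max_steps):
        if len(visited) >= size:
            break
        neighbors = graph.get(current, [])
        if not neighbors:
            current = start  # dead end: restart from the seed paper
            continue
        current = rng.choice(neighbors)
        if current not in visited:
            visited.append(current)
    return visited


# Hypothetical graph: edges link papers that share a topic node.
toy_graph = {
    "paper_a": ["paper_b", "paper_c"],
    "paper_b": ["paper_a", "paper_d"],
    "paper_c": ["paper_a"],
    "paper_d": ["paper_b"],
    "paper_e": [],  # isolated paper, unreachable from paper_a
}
papers = sample_paper_set(toy_graph, "paper_a", size=3)
print(papers)
```

Because the walk only follows edges, the sampled set stays within the connected neighborhood of the seed paper, which is one simple way to keep a set thematically related.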

114. ❌ Learning to Forget – Hierarchical Episodic Memory for Lifelong Robot Deployment

Authors: Leonard Bärmann, Joana Plewnia, Alex Waibel, Tamim Asfour Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11306v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

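Scoring rationale: The paper proposes the H²-EMV framework, which lets a robot learn selective forgetting through user interaction and build a hierarchical episodic memory. It uses a language model for relevance estimation and natural-language rule learning, an application of large models to robotics. It is strongly related to "Large Language Models" and "LLM Agents" (8 points) because the paper explicitly uses a language model for memory management and decision-making. It is partly related to "Self-Correction" (5 points) because the system improves itself by updating its rules from user feedback. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The work addresses the storage and query efficiency of episodic memory in long-term robot deployment. Its H²-EMV framework lets the robot learn selective forgetting, maintaining question-answering accuracy while reducing memory size by 45% and query-time compute by 35%, and raising accuracy by 70% in second-round queries.

Abstract (Chinese translation)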

当用户询问“你把我的钥匙放哪儿了？”或“任务为何失败？”时，机器人必须能够描述其过往经历。然而，基于持续多模态感知构建终身情景记忆（Episodic Memory, EM）会迅速超出存储限制，并导致实时查询难以实现，这要求系统具备适应用户相关性概念的选择性遗忘能力。本文提出H$^2$-EMV框架，使人形机器人能够通过用户交互学习记忆内容。我们的方法逐步构建分层情景记忆，基于语言模型的相关性估计（以学习到的自然语言规则为条件）进行选择性遗忘，并根据用户对遗忘细节的反馈更新这些规则。在模拟家务任务及ARMAR-7机器人20.5小时真实场景记录上的评估表明，H$^2$-EMV在保持问答准确率的同时，将内存占用减少45%，查询计算量降低35%。关键在于其性能随时间持续提升——通过适应用户特定优先级，第二轮查询的准确率提高了70%，这证明学习型遗忘机制能够为长期人机协作提供可扩展、个性化的情景记忆系统。

Abstract

Robots must verbalize their past experiences when users ask “Where did you put my keys?” or “Why did the task fail?” Yet maintaining life-long episodic memory (EM) from continuous multimodal perception quickly exceeds storage limits and makes real-time query impractical, calling for selective forgetting that adapts to users’ notions of relevance. We present H$^2$-EMV, a framework enabling humanoids to learn what to remember through user interaction. Our approach incrementally constructs hierarchical EM, selectively forgets using language-model-based relevance estimation conditioned on learned natural-language rules, and updates these rules given user feedback about forgotten details. Evaluations on simulated household tasks and 20.5-hour-long real-world recordings from ARMAR-7 demonstrate that H$^2$-EMV maintains question-answering accuracy while reducing memory size by 45% and query-time compute by 35%. Critically, performance improves over time: accuracy increases 70% in second-round queries by adapting to user-specific priorities, demonstrating that learned forgetting enables scalable, personalized EM for long-term human-robot collaboration.

Keywords: hierarchical episodic memory, selective forgetting, language-model-based relevance estimation, human-robot collaboration, long-term robot deployment, user interaction learning, memory efficiency, question-answering accuracy

115. ❌ BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

Authors: Elaine Lau, Markus Dücker, Ronak Chaudhary, Hui Wen Goh, Rosemary Wei, Vaibhav Kumar, Saed Qunbar, Guram Gogia, Yi Liu, Scott Millslagle, Nasim Borazjanizadeh, Ulyana Tkachenko, Samuel Eshun Danquah, Collin Schweiker, Vijay Karumathil, Asrith Devalaraju, Varsha Sandadi, Haemi Nam, Punit Arani, Ray Epps, Abdullah Arif, Sahil Bhaiwala, Curtis Northcutt, Skyler Wang, Anish Athalye, Jonas Mueller, Francisco Guzmán Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11304v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

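Scoring rationale: The paper's core is evaluating AI agents on end-to-end investment banking workflows. It is highly relevant (10 points) to 'LLM Agents/Autonomous Agents/Agentic Workflow' and 'Tool Use/Function Calling/API Tool Use' because it explicitly tests frontier AI agents on professional tool use and workflows. It is relevant (10 points) to 'Large Language Models/LLMs/Foundation Models' because it tests 9 frontier models (including GPT-5.4) as the basis of the agents. Other keywords such as MoE, SLMs, training methods, inference techniques, and AI-for-science applications are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

By developing the BankerToolBench benchmark, the paper evaluates AI agents on end-to-end investment banking workflows and finds that even the best model (GPT-5.4) fails nearly half of the rubric criteria, with none of its outputs rated client-ready.

Abstract (Chinese translation)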

现有的人工智能基准测试缺乏评估专业工作流程中经济意义进展的保真度。为评估前沿人工智能代理在高价值、劳动密集型专业领域的能力，我们推出了BankerToolBench（BTB）：一个针对初级投资银行家常规端到端分析工作流程的开源基准测试。为构建一个基于代表性工作环境、具备生态效度的基准，我们与来自顶尖机构的502名投资银行家合作。BTB要求智能体通过导航数据室、使用行业工具（市场数据平台、美国证券交易委员会文件数据库）并生成多文件交付成果——包括Excel财务模型、PowerPoint推介材料和PDF/Word报告——来执行高级银行家的指令。完成一项BTB任务需要银行家投入长达21小时，这凸显了将此类工作成功委托给人工智能的经济价值。BTB支持对任何大语言模型或智能体进行自动化评估，依据资深投资银行家定义的100多项评估标准对交付成果进行评分，以捕捉实际利益相关者的效用。通过对9个前沿模型的测试，我们发现即使表现最佳的模型（GPT-5.4）也未通过近半数的评估标准，且银行家认为其0%的输出成果达到可交付客户的水平。我们的失败分析揭示了智能体人工智能在高风险专业工作流程中的关键障碍（例如跨成果文件的一致性缺失）及改进方向。

Abstract

Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables, including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.

Keywords: AI agents, investment banking workflows, benchmark evaluation, LLM testing, professional tools, end-to-end analytics, economic impact, agentic AI

116. ❌ Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning

Authors: Rui Song, Lida Shi, Ruihua Qi, Yingji Li, Hao Xu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11299v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

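Scoring rationale: The paper studies the application of multimodal large language models (MLLMs) to the analysis of ancient script evolution and proposes a glyph-driven fine-tuning framework (GEVO). Highly relevant keywords: 1) it explicitly uses MLLMs (within the scope of LLMs); 2) its core contribution is a fine-tuning method (within the scope of SFT); 3) its application domain is ancient script analysis (counted as AI for Science). It is also related to 2B-scale models (within the scope of SLMs) and involves glyph comparison and evolutionary reasoning (within the scope of reasoning). Other keywords such as MoE, Scaling Laws, and RLHF are not covered.

!!! tip deepseek-chat TL;DR

To address the limited ability of multimodal large language models in analyzing the evolution of ancient Chinese characters, the paper proposes GEVO, a glyph-driven fine-tuning framework that consistently improves performance across the evaluated tasks, including character recognition and evolutionary reasoning.

Abstract (Chinese translation)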

近年来，多模态大语言模型的快速发展日益推动着古文字研究。由于文字演变是理解文化变迁与历史传承的基本路径，如何系统性地利用多模态大语言模型以支持并推进文字演变分析，仍是一个开放且尚未充分探索的问题。为填补这一空白，我们构建了一个包含11项任务、超过13万个实例的综合性基准，专门用于评估多模态大语言模型在分析古文字演变方面的能力。我们对多个广泛使用的多模态大语言模型进行了广泛评估，发现现有模型虽然在字形层面比较上表现出有限能力，但在核心任务——如文字识别与演变推理——上的性能仍存在显著局限。基于这些发现，我们提出了一种字形驱动的微调框架，旨在显式引导模型捕捉字形演变中的演化一致性，并增强其对文字演变的理解。实验结果表明，即使是规模为20亿参数的模型，在所有评估任务上也实现了持续且全面的性能提升。为促进未来研究，我们公开了基准数据集与训练好的模型。

Abstract

In recent years, rapid advances in Multimodal Large Language Models (MLLMs) have increasingly stimulated research on ancient Chinese scripts. As the evolution of written characters constitutes a fundamental pathway for understanding cultural transformation and historical continuity, how MLLMs can be systematically leveraged to support and advance text evolution analysis remains an open and largely underexplored problem. To bridge this gap, we construct a comprehensive benchmark comprising 11 tasks and over 130,000 instances, specifically designed to evaluate the capability of MLLMs in analyzing the evolution of ancient Chinese scripts. We conduct extensive evaluations across multiple widely used MLLMs and observe that, while existing models demonstrate a limited ability in glyph-level comparison, their performance on core tasks, such as character recognition and evolutionary reasoning, remains substantially constrained. Motivated by these findings, we propose a glyph-driven fine-tuning framework (GEVO) that explicitly encourages models to capture evolutionary consistency in glyph transformations and enhances their understanding of text evolution. Experimental results show that even models at the 2B scale achieve consistent and comprehensive performance improvements across all evaluated tasks. To facilitate future research, we publicly release both the benchmark and the trained models (https://github.com/songruiecho/GEVO).

Keywords: Multimodal Large Language Models, Ancient Chinese scripts, Glyph-driven fine-tuning, Character evolution analysis, Benchmark evaluation, Evolutionary reasoning, 2B scale models, GEVO framework

117. ❌ 3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

Authors: Bronislav Sidik, Dror Mizrahi Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11302v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	8.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

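Scoring rationale: The paper addresses planning and spatial memory in robotic manipulation, proposing 3D-ALP, which combines MCTS with a 3D-consistent world model as the rollout oracle. Core relevant keywords: 1) 'System 2 Thinking' (the paper explicitly calls 3D-ALP a System 2 reasoning engine; highly relevant); 2) 'Monte Carlo Tree Search AND LLM' (the paper uses MCTS for planning but does not involve an LLM, hence 8 points rather than 10); 3) 'World Models AND General World Models' (the paper uses a 3D-consistent world model as the oracle; highly relevant). All other keywords, such as LLMs and AI for Science, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes 3D-Anchored Lookahead Planning (3D-ALP), a System 2 reasoning engine that combines Monte Carlo Tree Search with a 3D world model to address spatial memory and occlusion in robotic manipulation, substantially outperforming a greedy reactive baseline on a sequential reach task.

Abstract (Chinese translation)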

本文提出3D锚定前瞻规划（3D-Anchored Lookahead Planning, 3D-ALP），一种用于机器人操作的System 2推理引擎，它将蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）与一个保持三维一致性的世界模型作为推演预言机相结合。与仅从当前相机帧评估动作的反应式策略不同，3D-ALP维持一个持久的相机到世界（camera-to-world, c2w）锚点，该锚点能够在遮挡后持续存在，从而实现对不再直接可见的物体位置进行精确重规划。在一个需要空间记忆的五步顺序抵达任务（实验E3）中，3D-ALP在需要记忆的步骤上取得了0.650±0.109的成功率，而贪婪反应式基线的成功率仅为0.006±0.008（Δ=+0.645）；在第五步上，3D-ALP的成功率达到0.822，贪婪基线则为0.000。一项消融研究（30次运行，3个随机种子）表明，树搜索的空间记忆是性能提升的主要驱动力（+0.533，贡献了82%的增益），更深度的前瞻规划带来了额外收益（+0.111，贡献17%）。我们还识别并解决了将UCT-MCTS（应用于树的置信上限算法[10]）应用于连续机器人操作任务时出现的四种结构性失效模式。

Abstract

We present 3D-Anchored Lookahead Planning (3D-ALP), a System 2 reasoning engine for robotic manipulation that combines Monte Carlo Tree Search (MCTS) with a 3D-consistent world model as the rollout oracle. Unlike reactive policies that evaluate actions from the current camera frame only, 3D-ALP maintains a persistent camera-to-world (c2w) anchor that survives occlusion, enabling accurate replanning to object positions that are no longer directly observable. On a 5-step sequential reach task requiring spatial memory (Experiment E3), 3D-ALP achieves 0.650 ± 0.109 success rate on memory-required steps versus 0.006 ± 0.008 for a greedy reactive baseline (Δ=+0.645), while step 5 success reaches 0.822 against 0.000 for greedy. An ablation study (30 episodes, 3 seeds) isolates tree search spatial memory as the primary driver (+0.533, 82% of gain) with additional benefit from deeper lookahead (+0.111, 17%). We also identify and resolve four structural failure modes in applying UCT-MCTS (Upper Confidence Bounds applied to Trees [10]) to continuous robotic manipulation.

Keywords: 3D-Anchored Lookahead Planning, Monte Carlo Tree Search, world model, robotic manipulation, spatial memory, System 2 reasoning, persistent scene memory, continuous robotic manipulation
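
For readers unfamiliar with the UCT rule named in the abstract above, the sketch below shows the standard UCB1 selection score used in textbook UCT: the mean value of a child plus an exploration bonus that shrinks as the child is visited more often. It is a generic illustration, not the 3D-ALP implementation. The action names, statistics, and exploration constant are hypothetical, and the paper's fixes for continuous manipulation are not reproduced here.

```python
import math


def uct_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score used by UCT: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Pick the child name with the highest UCT score."""
    return max(
        children,
        key=lambda name: uct_score(
            children[name][0], children[name][1], parent_visits
        ),
    )


# Hypothetical (total_value, visits) statistics for three candidate actions.
children = {"reach_left": (6.0, 10), "reach_right": (3.0, 4), "wait": (0.0, 0)}
print(select_child(children, parent_visits=14))
```

The unvisited action is selected first. Among visited actions, a less-explored child with a comparable mean value can outrank a heavily explored one because of the exploration bonus.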

118. ❌ The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

Authors: Yang Liu, Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang, Zhiyuan Zeng, Yikai Zhang, Yining Zheng, Xipeng Qiu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11297v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

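Scoring rationale: The paper studies reinforcement learning for large language models and proposes the MEDS framework, which uses memory-enhanced dynamic reward shaping to address insufficient sampling diversity and recurring error patterns. It is therefore highly relevant to 'Large Language Models' (10 points), highly relevant to 'RLHF' (10 points, because the paper studies reinforcement learning reward design), and partly relevant to 'Self-Correction' (8 points, because the framework aims to reduce repeated errors). Other keywords such as MoE, SLMs, Scaling Laws, and Pre-training have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

Targeting the reduced sampling diversity and recurring error patterns that arise when training large language models with reinforcement learning, the paper proposes MEDS, a memory-enhanced dynamic reward shaping framework that identifies and penalizes frequently occurring error patterns, improving performance and behavioral diversity across five datasets and three base models.

Abstract (Chinese translation)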

尽管强化学习在大语言模型中取得了成功，一个常见的失败模式是采样多样性降低，即策略反复生成相似的错误行为。经典的熵正则化方法鼓励当前策略下的随机性，但并未明确抑制多次采样中重复出现的失败模式。我们提出了MEDS（记忆增强的动态奖励塑形框架），该框架将历史行为信号纳入奖励设计中。通过存储并利用模型的中间表示，我们捕获过往采样的特征，并采用基于密度的聚类方法来识别频繁重现的错误模式。被分配到更普遍错误簇的采样会受到更严厉的惩罚，从而在减少重复错误的同时鼓励更广泛的探索。在五个数据集和三个基础模型上的实验表明，MEDS相较于现有基线方法持续提升了平均性能，最高实现了4.13 pass@1分数和4.37 pass@128分数的增益。基于大语言模型的标注和定量多样性指标的进一步分析显示，MEDS有效增加了采样过程中的行为多样性。

摘要 (Abstract)

Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.

Keywords: reinforcement learning, large language models, reward shaping, sampling diversity, error patterns, memory-enhanced, behavioral diversity, MEDS
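
The penalty idea in the MEDS abstract above (rollouts in more prevalent error clusters are penalized more heavily) can be sketched roughly as follows. This is a minimal illustration, not the paper's formula: it assumes cluster ids have already been produced by some density-based clustering over stored representations (not shown), and the linear penalty, the `penalty_scale` parameter, and the function name `shaped_rewards` are all hypothetical.

```python
from collections import Counter


def shaped_rewards(base_rewards, error_clusters, memory, penalty_scale=0.5):
    """Subtract a penalty that grows with how common a rollout's error cluster is.

    base_rewards:   one float per rollout.
    error_clusters: one cluster id per rollout, or None for a correct rollout.
    memory:         Counter of cluster id -> count over past rollouts (updated in place).
    penalty_scale:  hypothetical knob; the paper's actual penalty form is not given here.
    """
    # Record this batch's failures in the memory of past error clusters.
    for cluster in error_clusters:
        if cluster is not None:
            memory[cluster] += 1
    total = sum(memory.values())

    shaped = []
    for reward, cluster in zip(base_rewards, error_clusters):
        if cluster is None or total == 0:
            shaped.append(reward)  # correct rollouts are left untouched
        else:
            prevalence = memory[cluster] / total  # share of remembered failures
            shaped.append(reward - penalty_scale * prevalence)
    return shaped


# Hypothetical cluster names; "off_by_one" has been seen more often in the past.
memory = Counter({"off_by_one": 3, "wrong_units": 1})
rewards = shaped_rewards([1.0, 0.0, 0.0], [None, "off_by_one", "wrong_units"], memory)
```

In this toy run the rollout in the more frequent cluster ends up with the lower shaped reward, which is the qualitative behavior the abstract describes.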

119. ❌ Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus

Authors: Lena S. Oberkircher, Jesujoba O. Alabi, Dietrich Klakow, Jürgen Trouvain Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11803v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

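Scoring rationale: The paper "Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus" focuses on building a speech corpus for a German dialect (the Saarbrücken dialect), covering speech data collection, text alignment, G2P conversion, and low-resource TTS applications. All scoring keywords relate directly to large models, the technical principles of deep learning, or AI for Science, whereas the core of this paper is dialect speech dataset construction and basic speech technology. It involves no large-model technique, deep learning innovation, or scientific AI application, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the underrepresentation of German dialects in language resources by building a six-hour, multi-speaker speech corpus of the Saarbrücken dialect, providing a foundation for dialect-aware text-to-speech research in low-resource settings.

Abstract translation

Natural language processing (NLP) and speech technology have made notable progress in recent years, but research still concentrates on standard language varieties. Dialects, despite their cultural importance and broad use, remain underrepresented in language resources and computational models, which leads to disparities in technical performance. To close this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was built by first collecting text from digitized books and local materials, after which nine speakers recorded a subset of the text. We analyze both the text and speech components to assess the dataset's characteristics and quality. We discuss methodological challenges related to orthographic variation and speaker diversity, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned text and audio representations. It lays a foundation for future research on dialect-aware text-to-speech (TTS), especially in low-resource scenarios, including zero-shot and few-shot model adaptation.

Abstract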

Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text through digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we conducted analyses on both the textual and speech components to assess the dataset’s characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned textual and audio representations. This serves as a foundation for future research on dialect-aware text-to-speech (TTS), particularly in low-resource scenarios, including zero-shot and few-shot model adaptation.

Keywords: speech corpus, dialect, Saarbrücken German, low-resource TTS, grapheme-to-phoneme conversion, text-to-speech, multispeaker dataset, linguistic resources

120. ❌ Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?

Authors: Yuto Harada, Hiro Taiyo Hamada Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11802v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

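Scoring rationale: The paper studies how internal representations of Big Five personality concepts form in large language models (LLMs), where they are localized, and how they can be controlled through intervention, which makes it mechanistic interpretability research on LLMs. It is therefore highly relevant (10) to 'Large Language Models OR LLMs OR Foundation Models' and 'Mechanistic Interpretability OR Explainable AI'. The paper does not address the specific techniques behind the other keywords (such as MoE, SFT, RAG, or quantization) or their application areas (such as bioinformatics), and it mentions none of the designated expert authors, so the remaining keywords are scored 0.

!!! tip deepseek-chat TL;DR

Using probing and neuron interventions, the study examines how LLMs internally represent Big Five personality concepts. It finds that concept information is decodable from early layers and that interventions reliably bias internal representations, but that control over generated labels is weaker, revealing a gap between representational control and behavioral control.

Abstract translation

Using psychological constructs such as the Big Five, large language models can imitate specific personality profiles and predict a user's personality. Although large language models can exhibit behavior consistent with these constructs, it remains unclear where and how the constructs are represented inside the model and how those representations relate to behavioral outputs. To fill this gap, this study focuses on questionnaire-operationalized Big Five concepts, analyzes the formation and localization of their internal representations, and uses intervention experiments to examine how these representations relate to behavioral outputs. In the experiments, we first use probing to detect where Big Five information emerges across model depth. We then identify neurons that are selective for each Big Five concept and test whether enhancing or suppressing their activations shifts latent representations and label generation in the intended direction. We find that Big Five information becomes rapidly decodable in the early layers and remains detectable through the final layers, while concept-selective neurons are concentrated in the middle layers and overlap little across domains. Interventions on these neurons consistently steer probe readouts toward the target concept, with targeted success rates above 0.8 for some concepts, indicating that the model's internal separation of Big Five traits can be causally steered. At the label-generation level, the same interventions often bias the generated label distribution in the intended direction, but the effects are weaker, more concept-dependent, and often accompanied by cross-trait spillover. This indicates that comparable control over generated labels is hard to achieve even when intervening on a large share of concept-selective neurons. Overall, our study reveals a gap between representational control and behavioral control in large language models.

Abstract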

Using psychological constructs such as the Big Five, large language models (LLMs) can imitate specific personality profiles and predict a user’s personality. While LLMs can exhibit behaviors consistent with these constructs, it remains unclear where and how they are represented inside the model and how they relate to behavioral outputs. To address this gap, we focus on questionnaire-operationalized Big Five concepts, analyze the formation and localization of their internal representations, and use interventions to examine how these representations relate to behavioral outputs. In our experiment, we first use probing to examine where Big Five information emerges across model depth. We then identify neurons that respond selectively to each Big Five concept and test whether enhancing or suppressing their activations can bias latent representations and label generation in intended directions. We find that Big Five information becomes rapidly decodable in early layers and remains detectable through the final layers, while concept-selective neurons are most prevalent in mid layers and exhibit limited overlap across domains. Interventions on these neurons consistently shift probe readouts toward targeted concepts, with targeted success rates exceeding 0.8 for some concepts, indicating that the model’s internal separation of Big Five personality traits can be causally steered. At the label-generation level, the same interventions often bias generated label distributions in the intended directions, but the effects are weaker, more concept-dependent, and often accompanied by cross-trait spillover, indicating that comparable control over generated labels is difficult even with interventions on a large fraction of concept-selective neurons. Overall, our findings reveal a gap between representational control and behavioral control in LLMs.

Keywords: Large Language Models, Psychological Constructs, Big Five Personality, Internal Representations, Neuron Probing, Intervention, Mechanistic Interpretability, Representational Control
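
The intervention step described in the abstract above (enhancing or suppressing concept-selective neurons, then reading the effect off a probe) might look roughly like this in miniature. The sketch assumes a multiplicative intervention and a linear probe; the abstract specifies neither, and the names `intervene` and `linear_probe`, the neuron indices, and all numbers are made up for illustration.

```python
def intervene(activations, neuron_ids, scale):
    """Return a copy of `activations` with the chosen neurons multiplied by `scale`.

    scale > 1 enhances the neurons, scale = 0 suppresses them entirely.
    Multiplicative scaling is an assumption, not the paper's stated method.
    """
    selected = set(neuron_ids)
    return [a * scale if i in selected else a for i, a in enumerate(activations)]


def linear_probe(activations, weights, bias=0.0):
    """Score a hidden vector with a (hypothetical) linear probe for one concept."""
    return sum(a * w for a, w in zip(activations, weights)) + bias


# Toy hidden state and probe weights; neuron 1 is treated as concept-selective.
hidden = [0.2, 1.0, -0.5, 0.3]
probe_weights = [0.0, 1.0, 0.0, 0.5]

baseline = linear_probe(hidden, probe_weights)
enhanced = linear_probe(intervene(hidden, [1], 2.0), probe_weights)
suppressed = linear_probe(intervene(hidden, [1], 0.0), probe_weights)
```

Here the probe readout moves up under enhancement and down under suppression. The abstract's point is that such a shift in the probe readout does not guarantee a comparable shift in what the model actually generates.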

121. ❌ Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

Authors: Yoonsang Lee, Howard Yen, Xi Ye, Danqi Chen Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11753v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

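Scoring rationale: The paper studies parallel test-time scaling for long-horizon agentic tasks (such as agentic search and deep research) and proposes the AggAgent aggregation method. Highly relevant keywords: LLM Agents (the paper studies agentic tasks), Tool Use (the agents use tools), and Chain of Thought (an analogy to parallel scaling of CoT reasoning). Moderately relevant: Multi-agent Systems (aggregation over multiple agent trajectories) and Context Window Extension (it notes that trajectories are too long to fit in the context window). Other keywords such as MoE, SLMs, and Scaling Laws are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper addresses the challenge of aggregating parallel trajectories in long-horizon agentic tasks. It proposes AggAgent, which treats the parallel trajectories as an environment and uses lightweight tools to navigate and synthesize information, outperforming existing aggregation methods on multiple benchmarks while remaining cost-efficient.

Abstract translation

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, in which multiple rollouts are generated in parallel and aggregated into a final response. Although this kind of scaling has proven effective for chain-of-thought reasoning, agentic tasks pose distinct challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are usually open-ended. Aggregating only the final answers discards the rich information in the trajectories, while directly concatenating all trajectories exceeds the model's context window. We therefore propose AggAgent, an aggregation agent that treats the parallel trajectories as an environment. We equip it with lightweight tools for inspecting candidate solutions and searching across trajectories, so that it can navigate and synthesize information on demand. Experiments on six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5) show that AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and by 10.3% on two deep research tasks, while adding minimal overhead, since the aggregation cost stays bounded by a single agentic rollout. Our study establishes agentic aggregation as an effective and cost-efficient route to parallel test-time scaling.

Abstract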

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model’s context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.

Keywords: agentic aggregation, parallel test-time scaling, long-horizon agentic tasks, tool-augmented trajectories, multi-turn trajectories, aggregation agent, trajectory synthesis, cost-efficient scaling
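
To make "treating parallel trajectories as an environment" from the abstract above more concrete, here is one possible shape for the lightweight tools it mentions (inspecting candidates and searching across trajectories). This is a guess at the interface, not AggAgent's actual tool set: the class `TrajectoryEnvironment`, its method names, and the substring search are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    steps: List[str]      # tool calls and observations, one string per step
    final_answer: str     # the candidate answer this rollout ended with


class TrajectoryEnvironment:
    """Hypothetical read-only view over parallel rollouts for an aggregator agent."""

    def __init__(self, trajectories: List[Trajectory]):
        self.trajectories = list(trajectories)

    def list_candidates(self) -> List[Tuple[int, str]]:
        """Return (trajectory index, final answer) pairs without loading any steps."""
        return [(i, t.final_answer) for i, t in enumerate(self.trajectories)]

    def search(self, query: str, max_hits: int = 5) -> List[Tuple[int, int, str]]:
        """Case-insensitive substring search; returns (trajectory, step, text) hits."""
        needle = query.lower()
        hits = []
        for i, traj in enumerate(self.trajectories):
            for j, step in enumerate(traj.steps):
                if needle in step.lower():
                    hits.append((i, j, step))
                    if len(hits) == max_hits:
                        return hits
        return hits


env = TrajectoryEnvironment([
    Trajectory(["search: capital of France", "observation: Paris is the capital"], "Paris"),
    Trajectory(["search: largest French city", "observation: Paris, then Marseille"], "Marseille"),
])
candidates = env.list_candidates()
evidence = env.search("paris", max_hits=3)
```

The design point this illustrates is that the aggregator pulls in only the steps it asks for, rather than concatenating every trajectory into its context.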

122. ❌ HistLens: Mapping Idea Change across Concepts and Corpora

Authors: Yi Jing, Weiyun Qiu, Yihang Peng, Zhifang Sui Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11749v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

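Scoring rationale: HistLens is a computational framework for conceptual-history analysis that uses a sparse autoencoder (SAE) to decompose concept representations and track their trajectories, placing it in diachronic semantic analysis and text mining within natural language processing. All scoring keywords concern specific aspects of large models and deep learning, such as technical principles, training methods, inference optimization, and application paradigms. The paper involves no large-model technique (such as LLMs, MoE, or training and alignment methods) and does not apply large models in a scientific domain (such as bioinformatics), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes HistLens, a sparse-autoencoder-based framework for multi-concept, multi-corpus conceptual-history analysis that tracks how concepts evolve over time and across sources, supporting cross-concept and cross-corpus computation of patterns of idea change.

Abstract translation

Language change both reflects and shapes social processes, and the semantic evolution of foundational concepts offers a quantifiable trace of historical and social transformation. Despite recent progress in diachronic semantics and discourse analysis, existing computational approaches tend to have two limitations: (i) they concentrate on a single concept or a single corpus, which makes findings hard to compare across heterogeneous sources; and (ii) they remain confined to surface lexical evidence, so their computational and interpretive granularity falls short when concepts are expressed implicitly. We propose HistLens, a unified framework based on sparse autoencoders (SAE) for multi-concept, multi-corpus conceptual-history analysis. The framework decomposes concept representations into interpretable features and tracks the activation dynamics of those features over time and across sources, producing comparable conceptual trajectories within a shared coordinate system. Experiments on long-span press corpora show that HistLens supports cross-concept, cross-corpus computation of patterns of idea evolution and enables implicit concept computation. By connecting conceptual modeling with interpretive needs, HistLens broadens the analytical perspectives and methods available to the social sciences and humanities for diachronic text analysis.

Abstract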

Language change both reflects and shapes social processes, and the semantic evolution of foundational concepts provides a measurable trace of historical and social transformation. Despite recent advances in diachronic semantics and discourse analysis, existing computational approaches often (i) concentrate on a single concept or a single corpus, making findings difficult to compare across heterogeneous sources, and (ii) remain confined to surface lexical evidence, offering insufficient computational and interpretive granularity when concepts are expressed implicitly. We propose HistLens, a unified, SAE-based framework for multi-concept, multi-corpus conceptual-history analysis. The framework decomposes concept representations into interpretable features and tracks their activation dynamics over time and across sources, yielding comparable conceptual trajectories within a shared coordinate system. Experiments on long-span press corpora show that HistLens supports cross-concept, cross-corpus computation of patterns of idea evolution and enables implicit concept computation. By bridging conceptual modeling with interpretive needs, HistLens broadens the analytical perspectives and methodological repertoire available to social science and the humanities for diachronic text analysis.

Keywords: conceptual history analysis, diachronic semantics, sparse autoencoder, multi-corpus analysis, idea evolution, implicit concept computation, text analysis, social science

123. ❌ LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Authors: Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11748v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

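Scoring rationale: The paper focuses on LangFlow, a continuous diffusion language model, and is an innovation in the technical principles of large models. It is highly relevant (8) to 'Large Language Models' and 'Pre-training' because it studies a new pre-training paradigm for language modeling. Other keywords such as MoE, SFT, RAG, and inference acceleration are not addressed and are scored 0.

!!! tip deepseek-chat TL;DR

The paper tackles the performance gap between continuous and discrete diffusion language models. It proposes LangFlow, which connects embedding-space diffusion to Flow Matching via Bregman divergence, matches discrete diffusion models at comparable scale, and surpasses autoregressive baselines in zero-shot transfer on multiple benchmarks.

Abstract translation

Continuous diffusion models have achieved strong performance in domains such as images. In language modeling, however, existing continuous diffusion language models (DLMs) have lagged behind discrete diffusion models. This work closes the gap with LangFlow, the first continuous diffusion language model to rival discrete diffusion. Our approach connects embedding-space continuous DLMs to Flow Matching via Bregman divergence and introduces three key innovations: (1) a new ODE-based bound on the negative log-likelihood (NLL), which provides a principled way to evaluate continuous flow-based language models; (2) an information-uniform principle for noise scheduling, which motivates a learnable scheduler based on the Gumbel distribution; and (3) an improved training protocol with self-conditioning that improves both likelihood and sample quality. LangFlow performs strongly across benchmarks, reaching a perplexity (PPL) of 30.0 on LM1B and 24.6 on OpenWebText. It matches top discrete DLMs at comparable scale and surpasses autoregressive baselines in zero-shot transfer on multiple benchmarks. LangFlow shows that continuous diffusion is a competitive and promising paradigm for language modeling. https://github.com/nealchen2003/LangFlow

Abstract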

Continuous diffusion models have achieved strong performance across domains such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind discrete counterparts. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion. Our approach connects embedding-space DLMs to Flow Matching via Bregman divergence and introduces three key innovations: (1) a novel ODE-based NLL bound for principled evaluation of continuous flow-based language models; (2) an information-uniform principle for noise scheduling, motivating a learnable scheduler based on a Gumbel distribution; and (3) an improved training protocol incorporating self-conditioning, which enhances both likelihood and sample quality. LangFlow achieves strong performance across benchmarks, reaching a perplexity (PPL) of 30.0 on LM1B and 24.6 on OpenWebText. It matches top discrete DLMs at comparable scale and surpasses autoregressive baselines in zero-shot transfer across multiple benchmarks. LangFlow provides clear evidence that continuous diffusion is a competitive and promising paradigm for language modeling. https://github.com/nealchen2003/LangFlow

Keywords: Continuous diffusion models, Language modeling, Flow Matching, Bregman divergence, Noise scheduling, Self-conditioning, Perplexity, Zero-shot transfer
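
As a reading aid for the perplexity figures in the abstract above, the standard relationship between perplexity and per-token negative log-likelihood is PPL = exp(mean NLL in nats). The sketch below assumes the reported numbers follow that standard definition; with an NLL upper bound such as the one the abstract describes, the resulting value would be an upper bound on perplexity. The function name `perplexity` is just illustrative.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods measured in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# A PPL of 30.0 corresponds to an average NLL of ln(30.0) nats per token.
nats_per_token = math.log(30.0)
ppl = perplexity([nats_per_token] * 4)
```

Under this definition, the gap between 30.0 and 24.6 is a difference of ln(30.0 / 24.6) nats per token, although figures on different datasets and tokenizations are not directly comparable.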

124. ❌ Please Make it Sound like Human: Encoder-Decoder vs. Decoder-Only Transformers for AI-to-Human Text Style Transfer

Authors: Utsav Paneru Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11687v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

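Scoring rationale: The paper studies the conversion of AI-generated text into human-style text, which mainly involves applying and fine-tuning a large language model (Mistral-7B). It is relevant to 'Large Language Models' (it uses Mistral-7B), scored 5; highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (it fine-tunes BART and Mistral-7B), scored 8; and highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (it uses QLoRA for parameter-efficient fine-tuning of Mistral-7B), scored 8. Other keywords such as MoE, Scaling Laws, RAG, and Agents are unrelated to the paper and scored 0.

!!! tip deepseek-chat TL;DR

The paper studies how to rewrite AI-generated text in a human style. It builds a parallel corpus and fine-tunes BART and Mistral-7B models, finding that BART-large reaches higher reference similarity with fewer parameters, while Mistral-7B's style transfer overshoots.

Abstract translation

AI-generated text has become increasingly common in academic and professional writing, which has driven research on detection methods. The reverse process, systematically rewriting AI-generated text so that it reads as genuinely human-authored, has received less attention. We build a parallel corpus of 25,140 paired AI-input and human-reference text chunks, identify 11 measurable stylistic markers that distinguish the two types of text, and fine-tune three models: BART-base, BART-large, and Mistral-7B-Instruct with QLoRA. BART-large achieves the highest reference similarity (BERTScore F1 of 0.924, ROUGE-L of 0.566, and chrF++ of 55.92) with 17 times fewer parameters than Mistral-7B. We find that Mistral-7B's higher marker shift score reflects overshoot rather than accuracy, and argue that shift accuracy is a meaningful blind spot in current style transfer evaluation.

Abstract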

AI-generated text has become common in academic and professional writing, prompting research into detection methods. Less studied is the reverse: systematically rewriting AI-generated prose to read as genuinely human-authored. We build a parallel corpus of 25,140 paired AI-input and human-reference text chunks, identify 11 measurable stylistic markers separating the two registers, and fine-tune three models: BART-base, BART-large, and Mistral-7B-Instruct with QLoRA. BART-large achieves the highest reference similarity – BERTScore F1 of 0.924, ROUGE-L of 0.566, and chrF++ of 55.92 – with 17x fewer parameters than Mistral-7B. We show that Mistral-7B’s higher marker shift score reflects overshoot rather than accuracy, and argue that shift accuracy is a meaningful blind spot in current style transfer evaluation.

Keywords: AI-generated text, human-authored text, style transfer, fine-tuning, BART, Mistral-7B, QLoRA, evaluation metrics

125. ❌ Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation

Authors: Joe Stacey, Hadas Orgad, Kentaro Inui, Benjamin Heinzerling, Nafise Sadat Moosavi Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11662v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

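Scoring rationale: The paper studies how the hidden states of large language models (LLMs) can be used for uncertainty estimation and hallucination detection, so it is highly relevant (10) to 'Large Language Models OR LLMs OR Foundation Models'. It bears directly on Hallucination Mitigation, since it studies uncertainty estimation as a way to detect hallucinations, so that keyword also scores 10. Its analysis of hidden states and probe design to understand model behavior has some connection to 'Mechanistic Interpretability OR Explainable AI', but this is not the core focus, so it scores 5. Other keywords such as MoE, SFT, RAG, and quantization are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper systematically evaluates the robustness of probe-based supervised uncertainty estimation in large language models, finds that existing methods perform poorly under distribution shift, and explores a simple hybrid back-off strategy to improve robustness.

Abstract translation

Recent work shows that the hidden states of large language models contain signals useful for uncertainty estimation and hallucination detection, which has driven growing interest in efficient probe-based methods. However, it remains unclear how robust existing methods are and which probe designs yield reliable uncertainty estimates under distribution shift. We present a systematic study of supervised uncertainty probes across models, tasks, and out-of-distribution (OOD) settings, training more than 2,000 probes while varying the representation layer, feature type, and token aggregation strategy. The evaluation shows that existing methods are not robust, particularly for long-form generation. We also find that probe robustness depends more on the input features than on the architecture: middle-layer representations generalize more reliably than final-layer hidden states, and aggregating features across multiple response tokens is consistently more robust than relying on single-token features. These differences are often barely visible in-distribution but become critical under distribution shift. Based on the evaluation, we explore a simple hybrid back-off strategy to improve robustness, and argue that better evaluation is a prerequisite for building more robust probes.

Abstract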

Recent work has shown that the hidden states of large language models contain signals useful for uncertainty estimation and hallucination detection, motivating a growing interest in efficient probe-based approaches. Yet it remains unclear how robust existing methods are, and which probe designs provide uncertainty estimates that are reliable under distribution shift. We present a systematic study of supervised uncertainty probes across models, tasks, and OOD settings, training over 2,000 probes while varying the representation layer, feature type, and token aggregation strategy. Our evaluation highlights poor robustness in current methods, particularly in the case of long-form generations. We also find that probe robustness is driven less by architecture and more by the probe inputs. Middle-layer representations generalise more reliably than final-layer hidden states, and aggregating across response tokens is consistently more robust than relying on single-token features. These differences are often largely invisible in-distribution but become more important under distribution shift. Informed by our evaluation, we explore a simple hybrid back-off strategy for improving robustness, arguing that better evaluation is a prerequisite for building more robust probes.

Keywords: large language models, uncertainty estimation, hallucination detection, robustness evaluation, distribution shift, probe-based methods, hidden states, long-form generation
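
The contrast the abstract above draws between single-token features and features aggregated across response tokens can be illustrated with two ways of building a probe input from per-token hidden states. Mean pooling is used here as one plausible aggregation; the abstract does not say which aggregation strategies were tested, and both function names are hypothetical.

```python
def last_token_feature(hidden_states):
    """Single-token probe input: the hidden vector of the final response token."""
    return list(hidden_states[-1])


def mean_pooled_feature(hidden_states):
    """Aggregated probe input: element-wise mean over all response tokens."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / n_tokens for d in range(dim)]


# Toy hidden states for a two-token response with a 2-dimensional hidden size.
states = [[1.0, 0.0], [3.0, 2.0]]
single = last_token_feature(states)
pooled = mean_pooled_feature(states)
```

Either vector would then be fed to a supervised probe. The pooled version depends on every response token rather than one position, which is one intuition for why it might be less sensitive to shifts in response length or format.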

126. ❌ CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

Authors: Xuefeng Wei, Zhixuan Wang, Xuan Zhou, Zhi Qu, Hongyao Li, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11632v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

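Scoring rationale: The paper benchmarks vision-language models (VLMs) on understanding, interpreting, and authenticating Chinese artworks; it does not cover large language models (LLMs) or innovations in deep-learning techniques. All scoring keywords concern specific LLM technologies such as training methods, inference optimization, alignment, compression, and agent systems, whereas this paper evaluates VLMs in a specific application domain and has no direct connection to those keywords.

!!! tip deepseek-chat TL;DR

The paper introduces the CArtBench benchmark for evaluating vision-language models on understanding, interpreting, and authenticating Chinese artworks, and finds that current models perform poorly on expert-level reasoning tasks.

Abstract (Chinese translation)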

我们推出CARTBENCH——一个基于博物馆场景的基准测试平台，旨在评估视觉-语言模型（VLMs）对中国艺术品超越简短识别与问答的深层理解能力。该基准包含四项子任务：CURATORQA（策展人问答）侧重于基于证据的识别与推理；CATALOGCAPTION（目录描述）要求生成结构化的四部分专家式鉴赏文本；REINTERPRET（再诠释）需提供有依据的重新解读并附专家评分；CONNOISSEURPAIRS（鉴真配对）则是在视觉相似干扰下进行诊断性真伪判别。CARTBENCH通过将维基数据中带有图像的故宫博物院藏品与权威图录页面对齐构建而成，涵盖多个朝代的五类艺术品。在对九个代表性VLM的测试中，我们发现：CURATORQA整体高准确率可能掩盖其在困难证据关联和风格断代推理上的骤降；长篇幅鉴赏文本仍远未达到专家参考标准；面向真伪判别的诊断性区分能力接近随机水平。这些结果凸显了当前模型实现鉴赏家级别推理所面临的严峻挑战。

Abstract

We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.

Keywords: Vision-Language Models, Chinese Art Understanding, Benchmark Evaluation, Artwork Interpretation, Authenticity Discrimination, Museum-Grounded Benchmark, Expert-Level Reasoning, Palace Museum

127. ❌ Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation

Authors: Yuqian Wu, Wei Chen, Zhengjun Huang, Junle Chen, Qingxiang Liu, Kai Wang, Xiaofang Zhou, Yuxuan Liang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11628v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

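Scoring rationale: The paper focuses on conversational memory systems and proposes a minimalist framework built on retrieval and generation. It is highly relevant to 'Retrieval-Augmented Generation (RAG)' (10 points) because retrieval and generation are the core of the method. It is relevant to 'Large Language Models (LLMs)' (8 points) because conversational agents are typically built on LLMs and the paper involves generation. It is relevant to 'LLM Agents' (8 points) because conversational agents are a kind of LLM agent. It is partly relevant to 'Context Window Extension' (5 points) because the paper handles long dialogue histories without directly extending the context window. Other keywords, such as MoE, Scaling Laws, Pre-training, and RLHF, are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

Targeting context dilution in long conversations, the paper proposes a minimalist conversational memory framework based on retrieval and generation; using Turn Isolation Retrieval and Query-Driven Pruning, it achieves robust performance on multiple benchmarks.

Abstract (Chinese translation)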

现有对话记忆系统依赖复杂的层级摘要或强化学习来管理长期对话历史，但随着对话增长，仍易受语境稀释影响。本研究提出一个不同视角：核心瓶颈可能不在于记忆架构，而在于潜在知识流形中的信号稀疏效应。通过受控实验，我们识别出两个关键现象：决定性证据稀疏——随着对话轮次增加，相关信号在语义空间中日益孤立，导致基于聚合的方法性能急剧下降；以及双重冗余——会话间干扰与会话内填充内容共同引入大量非信息性内容，阻碍有效生成。基于这些发现，我们提出\method框架，这一极简框架使对话记忆回归本质，仅通过轮次隔离检索与查询驱动剪枝实现检索与生成。TIR采用最大激活策略替代全局聚合以捕捉轮次级信号，QDP则通过移除冗余会话和对话填充内容构建紧凑的高密度证据集。在多个基准测试上的广泛实验表明，\method在不同设置下均实现鲁棒性能，在保持高令牌效率与低延迟的同时持续超越强基线模型，为对话记忆研究建立了新的极简基线。

Abstract

Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the Signal Sparsity Effect within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: Decisive Evidence Sparsity, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and Dual-Level Redundancy, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose \method, a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. Extensive experiments on multiple benchmarks demonstrate that \method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.

Keywords: conversational memory, retrieval-augmented generation, long-term dialogue, signal sparsity, turn isolation retrieval, query-driven pruning, minimalist framework, context dilution
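The abstract above says TIR replaces global aggregation with a max-activation strategy so that a single decisive turn is not diluted by filler. The following is a minimal sketch of that contrast, assuming each turn is already embedded as a vector and relevance is cosine similarity. The embeddings, the function names, and the choice of cosine similarity are illustrative assumptions, not the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pooled_score(query, turns):
    """Global aggregation: score a session by the mean of its turn embeddings."""
    centroid = [sum(dim) / len(turns) for dim in zip(*turns)]
    return cosine(query, centroid)


def max_activation_score(query, turns):
    """Turn-level scoring: score a session by its single best-matching turn."""
    return max(cosine(query, turn) for turn in turns)


query = [1.0, 0.0]
# Session A: one decisive turn plus unrelated filler (made-up embeddings).
session_a = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
# Session B: two moderately related turns and no exact match.
session_b = [[0.6, 0.8], [0.6, 0.8]]
```

In this toy setup mean pooling ranks session B above session A because the filler cancels the decisive turn, while max activation ranks session A first.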

128. ❌ Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

Authors: Jiashu Yao, Heyan Huang, Zeming Liu, Yuhang Guo Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11611v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

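Scoring rationale: The paper studies the sparse-reward problem in reinforcement learning for LLM-based agents and proposes MISE, which uses generative self-evaluation as dense reward signals and calibrates those signals. It is therefore highly relevant to 'Large Language Models' (the core object of study), to 'Self-Correction/Self-Improvement/Self-Reflection' (the core method involves self-evaluation and self-rewarding), and to 'LLM Agents/Autonomous Agents' (it studies LLM-based agents). Other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, and RLHF, are not mentioned in or related to the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

To address the sparse-reward challenge that LLM-based agents face in reinforcement learning, the paper proposes MISE, which generates dense reward signals through mutual information self-evaluation and calibrates them, enabling open-source LLMs of about 7B parameters to reach validation performance comparable to GPT-4o without expert supervision.

Abstract (Chinese translation)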

为克服基于大语言模型（LLM）的智能体在强化学习（RL）中面临的稀疏奖励难题，我们提出互信息自评估（Mutual Information Self-Evaluation，MISE）方法。这是一种利用事后生成式自评估作为密集奖励信号，同时依据环境反馈对这些信号进行校准的强化学习范式。实证表明，MISE能使智能体通过密集的内部奖励自主进行学习，从而补充稀疏的外部信号。在理论上，我们的研究首次为生成式自奖励范式提供了形式化基础。我们证明，使用事后自评估奖励等价于最小化一个结合了互信息与策略-代理奖励策略间KL散度的目标函数。这一理论洞见进而启发并论证了我们的校准步骤——该步骤主动将这些奖励与最优策略对齐。大量实验表明，MISE在性能上超越现有强基线方法，使约70亿参数的开源大语言模型在无需专家监督的验证环境下，达到与GPT-4o相当的表现水平。

Abstract

To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against environmental feedback. Empirically, MISE enables an agent to learn autonomously from dense internal rewards supplementing sparse extrinsic signals. Theoretically, our work provides the first formal foundation for the paradigm of generative self-rewarding. We prove that utilizing hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy. This theoretical insight then informs and justifies our calibration step, which actively aligns these rewards with the optimal policy. Extensive experiments show that MISE outperforms strong baselines, enabling open-source LLMs with about 7B parameters to achieve performance comparable to GPT-4o on validation without expert supervision.

Keywords: Large Language Models, Reinforcement Learning, Sparse Reward, Self-Evaluation, Mutual Information, LLM Agents, Generative Self-Rewarding, Calibration

129. ❌ Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines

Authors: Solomon Messing Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11581v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

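Scoring rationale: The paper studies how to decompose and reduce measurement error in LLM evaluation pipelines. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) because the whole paper revolves around LLM evaluation. Other keywords, covering MoE, SLMs, training techniques, inference optimization, agent systems, and AI-for-science applications, are not mentioned in the title or abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper studies how to decompose and reduce hidden measurement error in LLM evaluation pipelines. By analyzing the sources of error and optimizing evaluation design, it shows across several tasks that optimized pipelines substantially reduce error and improve benchmark robustness.

Abstract (Chinese translation)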

大语言模型评估结果直接决定着哪些模型得以部署、何种安全标准被采纳，以及哪些研究结论能够发表。然而这些评分背后隐藏着不确定性：改写提示词、切换评判模型或调整温度参数，都可能导致结果发生足以颠覆排名次序、逆转研究结论的变化。传统的置信区间方法忽视了这种变异，导致覆盖率不足的问题随着数据量增加而加剧。未被量化的方差同时创造了可被利用的漏洞：模型开发者可能针对测量噪声而非真实能力进行优化。本文系统分解了大语言模型评估流程中的不确定性来源，区分了随数据量增加而缩减的统计方差与对研究者设计选择敏感的系统偏差，并规划出降低总体误差的最优路径。对于基准测试构建者，同样的分解方法能够识别哪些设计选择会形成可被利用的博弈漏洞，并提出最小化该漏洞的设计方案。在意识形态标注、安全分类、MMLU基准测试以及经过人工验证的宣传内容审计等任务中，经投影优化的评估流程在人类基准对比中超越了73%的朴素流程方案。在MMLU测试中，优化后的预算分配方案在同等成本下，将估计误差较标准单提示词评估降低了一半。通过小样本方差估计实验，我们构建的置信区间在模型包含相关流程要素时接近名义覆盖率，并据此提出降低测量误差、提升基准测试鲁棒性的具体建议。

Abstract

LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet these scores carry hidden uncertainty: rephrasing the prompt, switching the judge model, or changing the temperature can shift results enough to flip rankings and reverse conclusions. Standard confidence intervals ignore this variance, producing under-coverage that worsens with more data. The unmeasured variance also creates an exploitable surface: model developers can optimize against measurement noise rather than genuine capability. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and projects the most efficient path to reducing total error. For benchmark builders, the same decomposition identifies which design choices contribute exploitable surface for gaming and prescribes designs that minimize it. Across ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit, projection-optimized pipelines outperform 73% of possible naive pipelines against a human baseline. On MMLU, optimized budget allocation halves estimation error compared to standard single-prompt evaluation at equivalent cost. A small-sample variance estimation exercise is sufficient to derive confidence intervals that approach nominal coverage when the model includes the relevant pipeline facets, and to generate recommendations for reducing measurement error and improving benchmark robustness.

Keywords: LLM evaluation, measurement error, benchmark robustness, pipeline uncertainty, variance decomposition, confidence intervals, model ranking, human baseline
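The abstract above distinguishes variance that shrinks with more data from sensitivity to design choices such as prompt phrasing, and notes that standard confidence intervals ignore the latter. The sketch below illustrates that point on made-up data: it compares a naive standard error that pools all item scores with one that treats prompt phrasings as the sampled units. The function names, the three hypothetical prompts, and their accuracies are assumptions for illustration; this is not the paper's estimator.

```python
import math
from statistics import fmean, variance


def naive_standard_error(scores_by_prompt):
    """Pool every item score and treat all of them as independent draws."""
    flat = [s for scores in scores_by_prompt.values() for s in scores]
    return math.sqrt(variance(flat) / len(flat))


def prompt_level_standard_error(scores_by_prompt):
    """Treat prompt phrasings as the sampled units and use the spread of their means."""
    means = [fmean(scores) for scores in scores_by_prompt.values()]
    return math.sqrt(variance(means) / len(means))


def toy_scores(items_per_prompt):
    """Three hypothetical phrasings with 60%, 70%, and 80% accuracy (made-up data)."""
    accuracy_pct = {"prompt_a": 60, "prompt_b": 70, "prompt_c": 80}
    scores = {}
    for name, pct in accuracy_pct.items():
        correct = pct * items_per_prompt // 100
        scores[name] = [1] * correct + [0] * (items_per_prompt - correct)
    return scores


small = toy_scores(100)    # 100 items per prompt
large = toy_scores(1000)   # 1000 items per prompt
```

Adding items shrinks the naive standard error, but the prompt-level one stays put because it depends on how many phrasings were tried, which is why an interval built from the naive figure can under-cover more severely as the item count grows.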

130. ❌ MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

Authors: Chen Hu, Yintao Tai, Antonio Vergari, Frank Keller, Alessandro Suglia Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11575v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

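Scoring rationale: MIXAR studies pixel-based language models, a variant of large language models (LLMs), so it is highly relevant to 'Large Language Models' (8 points). It involves pre-training and scaling to multiple languages, so it is relevant to 'Pre-training' (8 points). The paper mentions scaling the model to 0.5B parameters, which touches on scaling laws and data quality, so it is partly relevant to 'Scaling Laws AND Data Quality' (5 points). Other keywords, such as MoE, SFT, RAG, and quantization, are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies scaling pixel-based language models to multiple languages and scripts. It proposes MIXAR, trained on eight languages, which shows substantial gains over previous pixel-based models and tokenizer-based models on discriminative and generative multilingual tasks and generalizes well.

Abstract (Chinese translation)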

基于像素的语言模型正作为传统基于分词方法的替代方案而兴起，有望规避分词过程中的诸多挑战。然而，不同语言固有的感知多样性为像素空间中的多语言泛化带来了显著障碍。本文提出了MIXAR，这是首个利用多种不同文字系统、在八种不同语言上训练生成的基于像素的语言模型。我们通过实证将MIXAR与先前的基于像素模型以及可比较的基于分词器的模型进行对比，证明其在判别性和生成性多语言任务上均实现了显著性能提升。此外，我们还展示了MIXAR对于训练过程中从未见过的语言具有鲁棒性。当模型规模扩展至0.5B参数时，这些结果得到了进一步强化——这不仅提升了其在LAMBADA等生成任务上的能力，也增强了其在面对拼写攻击等输入扰动时的鲁棒性。

Abstract

Pixel-based language models are gaining momentum as alternatives to traditional token-based approaches, promising to circumvent tokenization challenges. However, the inherent perceptual diversity across languages poses a significant hurdle for multilingual generalization in pixel space. This paper introduces MIXAR, the first generative pixel-based language model trained on eight different languages utilizing a range of different scripts. We empirically evaluate MIXAR against previous pixel-based models as well as comparable tokenizer-based models, demonstrating substantial performance improvement on discriminative and generative multilingual tasks. Additionally, we show how MIXAR is robust to languages never seen during the training. These results are further strengthened when scaling the model to 0.5B parameters which not only improves its capabilities in generative tasks like LAMBADA but also its robustness when challenged with input perturbations such as orthographic attacks.

Keywords: Pixel-based language models, Multilingual generalization, MIXAR, Generative models, Scaling to 0.5B parameters, Orthographic attacks, Robustness, Autoregressive models

131. ❌ Phonological distances for linguistic typology and the origin of Indo-European languages

Authors: Marius Mavridis, Juan De Gregorio, Raul Toral, David Sanchez Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11565v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

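Scoring rationale: The paper addresses the computation of phonological distances and the origin of the Indo-European languages, using an information-theoretic framework and Markov chain models to analyze phoneme sequences; it belongs to computational and evolutionary linguistics. All scoring keywords concern large models, deep-learning techniques, or AI applications in science, whereas this paper involves no machine learning, deep learning, or large-model technology and does not use AI methods for scientific discovery, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

Using an information-theoretic framework, the paper models phoneme sequences as second-order Markov chains and quantifies phonological distances among 67 modern languages. The resulting distances recover major language families, reveal signatures of contact-induced convergence, and correlate with geographic distance, which constrains the origin of the Indo-European family in a way consistent with the Steppe hypothesis.

Abstract (Chinese translation)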

我们证明，短程音素依赖编码了语言亲缘关系的大规模模式，这对量化类型学与演化语言学具有直接意义。具体而言，通过运用信息论框架，我们认为将音素序列建模为二阶马尔可夫链本质上捕捉了语音系统的统计相关性。这一发现使我们能够采用一种结合音素发音特征的距离度量，基于多语平行语料库量化67种现代语言之间的距离。由此得到的音系距离矩阵重现了主要语系，并揭示了接触引发的趋同特征。值得注意的是，我们获得了与地理距离的明确相关性，从而能够为印欧语系限定一个可能的起源区域，该结论与草原假说相一致。

Abstract

We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages from a multilingual parallel corpus employing a distance metric that incorporates articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, we obtain a clear correlation with geographic distance, allowing us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.

Keywords: phonological distances, linguistic typology, Indo-European languages, information-theoretic framework, Markov chains, phoneme sequences, language families, contact-induced convergence
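The abstract above models phoneme sequences as second-order Markov chains, meaning the next phoneme is predicted from the previous two. Below is a minimal sketch of how such transition probabilities can be estimated from counts. It uses letters of a toy word as stand-ins for phonemes and a hypothetical function name; the paper's corpus, articulatory features, and distance metric are not reproduced here.

```python
from collections import Counter, defaultdict


def second_order_transitions(sequence):
    """Estimate P(next symbol | previous two symbols) from trigram counts."""
    counts = defaultdict(Counter)
    for first, second, nxt in zip(sequence, sequence[1:], sequence[2:]):
        counts[(first, second)][nxt] += 1
    return {
        context: {sym: n / sum(followers.values()) for sym, n in followers.items()}
        for context, followers in counts.items()
    }


# Letters stand in for phonemes here; a real analysis would use transcribed speech.
phonemes = list("bananas")
model = second_order_transitions(phonemes)
# After the context ("n", "a") the toy data continues with "n" once and "s" once.
```

Comparing such conditional distributions across languages is one way a distance could be built, but the specific metric the authors use is not stated in enough detail in the abstract to sketch it.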

132. ❌ MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

Authors: Tao Feng, Yuxiang Wang, Yuancheng Wang, Xueyao Zhang, Dekun Chen, Chaoren Wang, Xun Guan, Zhizheng Wu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11552v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

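Scoring rationale: MimicLM addresses voice imitation in speech synthesis. Its core contributions are a data construction method (synthetic speech as training sources, real recordings as targets) and a training strategy (interleaved text-audio modeling combined with post-training alignment). Most keywords are unrelated because the research area is speech processing rather than large language models. Only two keywords are relevant: 1) "Post-training OR Supervised Fine-tuning OR SFT" (10 points): the paper explicitly applies post-training to mitigate the distributional mismatch that arises when training on synthetic data, and this is one of its core methods. 2) "Instruction Tuning OR Alignment OR Value Alignment" (5 points): the paper uses preference alignment to improve model outputs, which is related to the notion of alignment but is not the core focus. No other keyword is covered.

!!! tip deepseek-chat TL;DR

MimicLM proposes a new voice imitation approach that breaks the synthetic quality ceiling by using synthetic speech as training sources and real recordings as targets. Combined with interleaved text-audio modeling and post-training alignment, it substantially improves the naturalness of imitated speech while preserving content accuracy.

Abstract (Chinese translation)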

语音模仿旨在将源语音转换为与参考说话者的音色和说话风格相匹配，同时保留语言内容。一种直接的方法是在（源语音、参考语音、目标语音）三元组上进行训练，其中源语音和目标语音共享相同内容，但目标语音需匹配参考语音的声学特征，然而此类数据极为稀缺。现有方法要么采用精心设计的解耦架构以规避数据不足问题，要么利用外部系统合成伪平行训练数据。但前者需要复杂的模型设计，后者在使用合成语音作为训练目标时面临质量上限。为克服这些局限，我们提出MimicLM，其创新之处在于使用合成语音作为训练源，同时保留真实录音作为目标。这一设计使模型能够直接从真实语音分布中学习，从而突破合成质量的上限。基于此数据构建方法，我们引入交错式文本-音频建模以指导生成内容准确的语音，并应用基于偏好对齐的后训练来缓解使用合成数据训练时固有的分布失配问题。实验表明，MimicLM通过简洁高效的架构实现了卓越的语音模仿质量，在自然度方面显著优于现有方法，同时在说话人身份、口音和情感维度上保持了具有竞争力的相似度得分。

Abstract

Voice imitation aims to transform source speech to match a reference speaker’s timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but target matches the reference’s voice characteristics, yet such data is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this data scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as training targets. To address these limitations, we propose MimicLM, which takes a novel approach by using synthetic speech as training sources while retaining real recordings as targets. This design enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling. Building on this data construction approach, we incorporate interleaved text-audio modeling to guide the generation of content-accurate speech and apply post-training with preference alignment to mitigate the inherent distributional mismatch when training on synthetic data. Experiments demonstrate that MimicLM achieves superior voice imitation quality with a simple yet effective architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across speaker identity, accent, and emotion dimensions.

Keywords: voice imitation, autoregressive modeling, pseudo-parallel speech corpora, synthetic speech training, post-training alignment, text-audio modeling, speaker similarity, speech generation

133. ❌ Relax (Reinforcement Engine Leveraging Agentic X-modality)

Authors: Liujie Zhang, Benzhe Ning, Rui Yang, Xiaoyan Yu, Jiaxing Li, Lumeng Wu, Jia Liu, Minghao Li, Weihang Chen, Weiqi Hu, Lei Zhang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11554v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

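Scoring rationale: The paper presents Relax, an RL post-training engine for large-scale reinforcement learning of multimodal LLMs. Highly relevant keywords: 'Post-training' (the paper's topic), 'RLHF' (RL is used for post-training), 'LLM Agents' (agentic workflows are supported), and 'Large Language Models' (applied to LLMs such as Qwen). Moderately relevant: 'Mixture of Experts' (MoE models are supported), 'Self-Reflection' (one of the capabilities unlocked by RL), and 'Tool Use' (another capability unlocked by RL). Other keywords, such as SLMs, Scaling Laws, Pre-training, and RAG, are not covered.

!!! tip deepseek-chat TL;DR

The paper presents Relax, an asynchronous reinforcement learning training engine that addresses heterogeneous data flows, robustness at scale, and the staleness-throughput tradeoff in RL post-training of multimodal large language models, achieving faster training than existing systems and stable multimodal convergence.

Abstract (Chinese translation)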

强化学习（RL）后训练已被证明能有效激发大语言模型中的推理、自我反思和工具使用能力。随着模型扩展到全模态输入和智能体多轮工作流，RL训练系统面临三个相互依存的挑战：异构数据流、大规模操作鲁棒性，以及陈旧度与吞吐量的权衡。我们提出Relax（Reinforcement Engine Leveraging Agentic X-modality，利用智能体跨模态的强化学习引擎），这是一个开源RL训练引擎，通过三个协同设计的架构层应对上述挑战。首先，全模态原生架构将多模态支持内建于全栈——从数据预处理和模态感知并行到推理生成——而非在文本中心化流程上打补丁。其次，每个RL角色作为独立、故障隔离的服务运行，可在无需全局协调的情况下进行扩展、恢复和升级。第三，服务级解耦通过TransferQueue数据总线实现异步训练，其中单一陈旧度参数可在同策略、近同策略和完全异步执行之间平滑切换。Relax在Qwen3-4B的同策略训练上相比veRL实现了1.20倍的端到端加速。其完全异步模式在Qwen3-4B上相比colocate方案带来1.76倍加速，在Qwen3-Omni-30B上实现2.00倍加速，且所有模式均收敛至相同的奖励水平。Relax支持R3（Rollout Routing Replay）~\cite{ma2025r3}用于MoE模型，仅产生1.9%的开销，而相同配置下veRL性能下降达32%。该引擎进一步在Qwen3-Omni上展示了跨图像、文本和音频的稳定全模态RL收敛，在视频任务上持续训练超过2,000步而无性能衰减。Relax项目地址为https://github.com/rednote-ai/Relax。

Abstract

Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness-throughput tradeoff. We present Relax (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an omni-native architecture builds multimodal support into the full stack, from data preprocessing and modality-aware parallelism to inference generation, rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20× end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76× speedup over colocate on Qwen3-4B and a 2.00× speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay) [ma2025r3] for MoE models with only 1.9% overhead, compared to 32% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2,000 steps on video without degradation. Relax is available at https://github.com/rednote-ai/Relax.

Keywords: Reinforcement Learning, Post-training, Large Language Models, Omni-modal, Asynchronous Training, Agentic Workflow, RL Engine, Scalability
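The abstract says that a single staleness parameter moves training between on-policy, near-on-policy, and fully asynchronous execution. One common way to read such a parameter is as a bound on how many policy versions a rollout may lag behind the trainer. The sketch below illustrates that reading only; `Rollout`, `is_admissible`, `select_batch`, and `max_staleness` are hypothetical names and do not come from the Relax codebase or its TransferQueue API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rollout:
    """A generated sample tagged with the policy version that produced it."""
    sample_id: int
    policy_version: int


def is_admissible(rollout: Rollout, trainer_version: int,
                  max_staleness: Optional[int]) -> bool:
    """Decide whether a rollout is fresh enough to train on.

    Assumed semantics (not taken from the paper):
      max_staleness = 0    -> on-policy, only current-version samples
      max_staleness = k    -> near-on-policy, at most k versions behind
      max_staleness = None -> fully asynchronous, no freshness bound
    """
    if max_staleness is None:
        return True
    lag = trainer_version - rollout.policy_version
    return 0 <= lag <= max_staleness


def select_batch(queue: List[Rollout], trainer_version: int,
                 max_staleness: Optional[int]) -> List[Rollout]:
    """Keep only the queued rollouts that satisfy the staleness bound."""
    return [r for r in queue
            if is_admissible(r, trainer_version, max_staleness)]


if __name__ == "__main__":
    queue = [Rollout(0, 3), Rollout(1, 4), Rollout(2, 5)]
    for bound in (0, 1, None):
        ids = [r.sample_id for r in select_batch(queue, 5, bound)]
        print(f"max_staleness={bound}: train on samples {ids}")
```

Under this reading, raising the bound trades sample freshness for throughput, because the trainer no longer has to wait for rollouts from the latest policy version.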

134. ❌ Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

Authors: Haolin Li, Shuyang Jiang, Ruipeng Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11547v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

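Scoring rationale: The paper studies large language models in medicine, specifically the scarcity of data for rare-disease medical reasoning. Highly relevant keywords: LLMs (the paper explicitly studies LLMs in medical applications), Post-training/SFT (the method is a post-training framework, and supervised fine-tuning is the distillation step in the prior approaches it contrasts with), Chain of Thought (the paper deals with reasoning chains), and AI for Science (medical applications fall under AI for science). 'Scaling Laws AND Data Quality' and 'System 2 Thinking' are partially relevant, since the paper examines how data quality affects model performance and involves in-depth reasoning. Other keywords such as MoE, SLMs, and RLHF are unrelated to the paper.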

!!! tip deepseek-chat TL;DR

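To address the scarcity of high-quality reasoning data for large language models in medicine, the paper proposes MedSSR, a framework that combines medical knowledge-enhanced data synthesis with semi-supervised reinforcement learning. It outperforms existing methods across ten medical benchmarks, with gains of up to +5.93% on rare-disease tasks.

Abstract (Chinese translation)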

尽管大型语言模型在复杂医疗应用中展现出潜力，但其发展受到高质量推理数据稀缺的制约。为解决这一问题，现有方法通常通过监督微调从大型专有模型中蒸馏思维链推理轨迹，随后进行强化学习。这些方法在罕见疾病等代表性不足的领域改进有限，且生成复杂推理链的成本高昂。为高效提升医疗推理能力，我们提出MedSSR——一种医疗知识增强的数据合成与半监督强化学习框架。该框架首先利用罕见疾病知识合成分布可控的推理问题，随后利用策略模型自身生成高质量伪标签。这实现了由内而外的两阶段训练范式：先在伪标签合成数据上进行自监督强化学习，再在人工标注的真实数据上进行监督强化学习。MedSSR能够高效扩展模型训练，且无需依赖高成本的轨迹蒸馏。基于Qwen和Llama的大量实验表明，我们的方法在十项医疗基准测试中均优于现有方法，在罕见疾病任务上最高提升达+5.93%。代码已发布于https://github.com/tdlhl/MedSSR。

Abstract

While large language models hold promise for complex medical applications, their development is hindered by the scarcity of high-quality reasoning data. To address this issue, existing approaches typically distill chain-of-thought reasoning traces from large proprietary models via supervised fine-tuning, then conduct reinforcement learning (RL). These methods exhibit limited improvement on underrepresented domains like rare diseases while incurring substantial costs from generating complex reasoning chains. To efficiently enhance medical reasoning, we propose MedSSR, a Medical Knowledge-enhanced data Synthesis and Semi-supervised Reinforcement learning framework. Our framework first employs rare disease knowledge to synthesize distribution-controllable reasoning questions. We then utilize the policy model itself to generate high-quality pseudo-labels. This enables a two-stage, intrinsic-to-extrinsic training paradigm: self-supervised RL on the pseudo-labeled synthetic data, followed by supervised RL on the human-annotated real data. MedSSR scales model training efficiently without relying on costly trace distillation. Extensive experiments on Qwen and Llama demonstrate that our method outperforms existing methods across ten medical benchmarks, achieving up to +5.93% gain on rare-disease tasks. Our code is available at https://github.com/tdlhl/MedSSR.

Keywords: medical reasoning, large language models, chain-of-thought, semi-supervised reinforcement learning, data synthesis, rare diseases, supervised fine-tuning, medical AI

135. ❌ Triviality Corrected Endogenous Reward

Authors: Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Jialin Liu, Chenzhuo Zhao, Zhibo Yang, Bin-Bin Yang, Feng Xiao Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11522v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

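Scoring rationale: The paper studies reinforcement learning for open-ended text generation and proposes TCER to provide rewards without external supervision. It is highly relevant to 'RLHF/RLAIF/DPO' (10), since all of these are reinforcement learning based alignment techniques; relevant to 'Large Language Models' (8), because it studies LLM-based text generation; and relevant to 'Post-training/SFT' (8), because it involves model fine-tuning. Other keywords such as MoE, quantization, and inference acceleration are not addressed (0).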

!!! tip deepseek-chat TL;DR

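To address the lack of verifiable rewards in open-ended text generation, the paper proposes TCER, which corrects the triviality bias caused by directly applying confidence rewards. It achieves consistent improvements without external supervision across multiple writing benchmarks and model architectures.

Abstract (Chinese translation)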

开放式文本生成的强化学习因缺乏可验证的奖励机制而受到限制，必须依赖需要标注数据或强大闭源模型的评判模型。受近期基于置信度的内生奖励进行无监督强化学习以解决数学推理问题的研究启发，我们探讨了这一原理是否适用于开放式写作任务。我们发现直接应用置信度奖励会导致平凡性偏差：策略坍缩至高概率输出，降低多样性与有意义的内容。我们提出TCER（平凡性校正内生奖励），该方法通过奖励专家策略与通用参考策略之间的相对信息增益，并辅以概率依赖的校正机制，从而纠正此类偏差。在多种写作基准测试与模型架构中，TCER无需外部监督即可实现持续改进。此外，TCER亦能有效迁移至数学推理任务，验证了该方法在不同生成任务间的普适性。

Abstract

Reinforcement learning for open-ended text generation is constrained by the lack of verifiable rewards, necessitating reliance on judge models that require either annotated data or powerful closed-source models. Inspired by recent work on unsupervised reinforcement learning for mathematical reasoning using confidence-based endogenous rewards, we investigate whether this principle can be adapted to open-ended writing tasks. We find that directly applying confidence rewards leads to Triviality Bias: the policy collapses toward high-probability outputs, reducing diversity and meaningful content. We propose TCER (Triviality Corrected Endogenous Reward), which addresses this bias by rewarding the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. Across multiple writing benchmarks and model architectures, TCER achieves consistent improvements without external supervision. Furthermore, TCER also transfers effectively to mathematical reasoning, validating the generality of our approach across different generation tasks.

Keywords: reinforcement learning, open-ended text generation, endogenous reward, triviality bias, unsupervised learning, mathematical reasoning, policy optimization, TCER
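The abstract contrasts a plain confidence reward, which favors high-probability outputs, with a reward based on the relative information gain between a specialist policy and a generalist reference policy. The sketch below illustrates only that contrast, assuming per-token probabilities are already available from both policies. The function names are hypothetical, the mean log-ratio is one plausible reading of "relative information gain", and the probability-dependent correction is omitted because the abstract does not specify it.

```python
import math
from typing import Sequence


def confidence_reward(spec_probs: Sequence[float]) -> float:
    """Mean token log-probability under the specialist policy.

    This is the naive endogenous reward: it grows as outputs become
    more predictable, which is what invites triviality bias.
    """
    return sum(math.log(p) for p in spec_probs) / len(spec_probs)


def relative_gain_reward(spec_probs: Sequence[float],
                         ref_probs: Sequence[float]) -> float:
    """Mean per-token log-ratio between specialist and reference policy.

    Tokens that the generalist reference already finds likely add
    little, so generic high-probability text is no longer rewarded.
    """
    if len(spec_probs) != len(ref_probs):
        raise ValueError("both policies must score the same tokens")
    total = sum(math.log(ps) - math.log(pr)
                for ps, pr in zip(spec_probs, ref_probs))
    return total / len(spec_probs)


if __name__ == "__main__":
    # A generic continuation: both policies find it very likely.
    trivial = ([0.9, 0.9], [0.9, 0.9])
    # A task-specific continuation: only the specialist finds it likely.
    specific = ([0.6, 0.6], [0.1, 0.1])
    for name, (spec, ref) in (("trivial", trivial), ("specific", specific)):
        print(name, confidence_reward(spec), relative_gain_reward(spec, ref))
```

In this toy case the confidence reward ranks the generic continuation higher, while the log-ratio reward ranks the task-specific one higher.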

136. ❌ DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode

Authors: Hojae Han, Jaejin Kim, Seung-won Hwang, Yu Jin Kim, Moontae Lee Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11514v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

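Scoring rationale: The paper studies LLMs for test output prediction, combining code execution and pseudocode execution to make predictions more reliable. It is highly relevant to 'Large Language Models' (10), because the work is built entirely on LLMs; relevant to 'Chain of Thought' and 'System 2 Thinking' (8), because pseudocode execution involves LLM reasoning and multi-step thinking; and relevant to 'Hallucination Mitigation' (8), because the paper explicitly addresses hallucination in pseudocode reasoning. Other keywords such as MoE, SFT, and RAG are unrelated to the paper (0).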

!!! tip deepseek-chat TL;DR

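The paper proposes DuET, a dual-execution framework that combines direct execution of LLM-generated code with LLM-simulated pseudocode execution to counter both code errors and hallucination in test output prediction. It achieves state-of-the-art performance on LiveCodeBench, improving Pass@1 by 13.6 percentage points.

Abstract (Chinese translation)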

本研究针对测试用例生成中的关键挑战——测试输出预测问题展开。为提高大语言模型预测输出的可靠性，现有方法通常首先生成代码以锚定预测结果。一种锚定策略是直接执行生成的代码，但即使微小错误也可能导致执行失败。为解决此问题，我们提出了基于大语言模型的伪代码执行方法，该方法将预测锚定于容错性更强的伪代码，并通过大语言模型推理模拟执行过程。我们进一步提出双执行框架DuET，通过功能多数投票机制融合两种执行路径。分析表明，直接执行易受代码错误影响，而伪代码推理存在幻觉问题，两种方法在克服各自局限性方面具有互补性。在LiveCodeBench基准测试中，DuET实现了最先进的性能表现，将Pass@1指标提升了13.6个百分点。

Abstract

This work addresses test output prediction, a key challenge in test case generation. To improve the reliability of predicted outputs by LLMs, prior approaches generate code first to ground predictions. One grounding strategy is direct execution of generated code, but even minor errors can cause failures. To address this, we introduce LLM-based pseudocode execution, which grounds prediction on more error-resilient pseudocode and simulates execution via LLM reasoning. We further propose DuET, a dual-execution framework that combines both approaches by functional majority voting. Our analysis shows the two approaches are complementary in overcoming the limitations of direct execution suffering from code errors, and pseudocode reasoning from hallucination. On LiveCodeBench, DuET achieves the state-of-the-art performance, improving Pass@1 by 13.6 pp.

Keywords: test output prediction, LLM-based pseudocode execution, dual-execution framework, functional majority voting, code errors, hallucination mitigation, LiveCodeBench, Pass@1 improvement
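The abstract describes combining the two execution paths by functional majority voting. The abstract does not spell out the procedure, so the sketch below is one possible reading: every candidate contributes the output it predicts for a test input, failed executions contribute nothing, and the most frequent output wins. `majority_vote` and its arguments are hypothetical names, not the authors' implementation.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(direct_outputs: Sequence[Optional[str]],
                  pseudo_outputs: Sequence[Optional[str]]) -> Optional[str]:
    """Pick the most frequent predicted output across both paths.

    direct_outputs: results of running generated code; None marks a
        candidate whose execution failed.
    pseudo_outputs: results of LLM-simulated pseudocode execution;
        None marks a candidate that produced no usable answer.
    Ties are broken in favor of the output seen first.
    """
    votes = Counter(
        out for out in list(direct_outputs) + list(pseudo_outputs)
        if out is not None
    )
    if not votes:
        return None  # every candidate failed
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Two of three generated programs crash; one pseudocode trace
    # hallucinates "5". The pooled vote still recovers "4".
    direct = [None, "4", None]
    pseudo = ["4", "5", "4"]
    print(majority_vote(direct, pseudo))
```

Pooling the votes is what lets each path cover the other's weakness: crashes remove votes from the direct path, while an isolated hallucination in the pseudocode path is outvoted.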

137. ❌ Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

Authors: Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11496v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

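Scoring rationale: The paper studies compositional reasoning in dual-encoder vision-language models such as CLIP, which belongs to vision-language multimodal research rather than text-only large models. It mainly covers improving the inference protocol of a pretrained model (CLIP), learning a lightweight alignment mechanism, and parameter-efficient adaptation on frozen representations, so it is partially related to 'Pre-training', 'Post-training', 'Alignment', and 'PEFT' (5), though none of these is its core topic. The remaining keywords target techniques or applications of text-only large models and are unrelated to the paper's vision-language focus (0).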

!!! tip deepseek-chat TL;DR

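The paper finds that the compositionality bottleneck in dual-encoder vision-language models stems mainly from the inference protocol based on global cosine similarity. It proposes a lightweight alignment mechanism that learns region-segment alignment over frozen representations, substantially improving compositional generalization.

Abstract (Chinese translation)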

诸如CLIP等双编码器视觉语言模型（VLM）因其在组合式基准测试上的较差表现，常被描述为词袋系统。我们认为，这种局限性可能并非源于表征能力的不足，而更多来自基于全局余弦相似度的标准推理流程。首先，通过受控诊断实验，我们证明在推理过程中显式强制执行细粒度区域-片段对齐，能够在不更新预训练编码器的情况下显著提升组合性能。随后，我们引入一个轻量级Transformer，该模型可直接从冻结的图像块和文本标记嵌入中学习此类对齐。与完整微调及先前端到端组合训练方法相比，我们发现尽管这些方法提升了领域内检索性能，但其增益在分布变化下并不能稳定迁移。相比之下，在冻结表征上学习局部化对齐的方法，在领域内检索任务上达到了与完整微调相当的效果，同时在受控的领域外组合基准测试中取得了显著提升。这些结果表明，全局嵌入匹配是双编码器VLM的一个关键瓶颈，并凸显了对齐机制对于实现稳健组合泛化的重要性。

Abstract

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.

Keywords: Vision-Language Models, Dual-encoder VLMs, Compositionality, Inference Protocol, Alignment Mechanisms, Frozen Representations, Parameter-efficient Fine-tuning, Generalization
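The abstract argues that scoring an image-text pair with a single global cosine similarity discards local structure that the frozen patch and token embeddings still carry. The sketch below contrasts a pooled global score with a simple localized score. Both are illustrative stand-ins: mean pooling simplifies CLIP's dedicated pooled embeddings, and the max-over-patches rule is an assumed substitute for the paper's learned region-segment alignment, which the abstract does not detail.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Average a set of embeddings into one global embedding."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def global_score(patches: Sequence[Vector], tokens: Sequence[Vector]) -> float:
    """One cosine similarity between pooled image and text embeddings."""
    return cosine(mean_pool(patches), mean_pool(tokens))


def localized_score(patches: Sequence[Vector],
                    tokens: Sequence[Vector]) -> float:
    """Average, over text tokens, of the best-matching patch similarity."""
    best = [max(cosine(t, p) for p in patches) for t in tokens]
    return sum(best) / len(best)


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0]]   # two image regions
    tokens = [[1.0, 0.0], [1.0, 0.0]]    # text mentions only region one
    print(global_score(patches, tokens))     # pooled image is diluted
    print(localized_score(patches, tokens))  # each token finds its region
```

The two scores disagree here because pooling mixes the unrelated second region into the image embedding, whereas the localized score compares each token only with its best-matching region.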

138. ❌ Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

Authors: Kuang Wang, Lai Wei, Qibing Bai, Ping Lin, Wenkai Fang, Feng Jiang, Zhongjie Jiang, Jun Huang, Yannan Wang, Haizhou Li Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11424v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

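Scoring rationale: The paper focuses on expressive speech generation with Speech Language Models (SLMs). It is rated highly relevant to 'Small Language Models OR SLMs OR On-device AI' (10), because the paper explicitly studies SLMs (in the paper, the acronym stands for Speech Language Models), and to 'Self-Correction OR Self-Improvement OR Self-Reflection' (10), because it proposes a self-aware mechanism in which the model acts as its own critic to align acoustic realization with intent. Other keywords such as LLMs, MoE, Scaling Laws, Pre-training, and Alignment are unrelated to the paper's focus on speech generation and expressive intent and score 0.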

!!! tip deepseek-chat TL;DR

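The paper addresses the gap between semantic understanding and acoustic expression in speech language models. It proposes a Self-Aware Speech Language Model (SA-SLM) built on Intent-Aware Bridging and Realization-Aware Alignment. A 3B-parameter model trained on only 800 hours of data comes within 0.08 points of GPT-4o-Audio in overall expressiveness on the EchoMind benchmark.

Abstract (Chinese translation)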

语音语言模型（SLMs）展现出强大的语义理解能力，但其生成的语音往往听起来平淡，未能传达表达性意图，从而削弱了用户参与度。我们将这种不匹配称为语义理解-声学实现差距。我们将此差距归因于两个关键缺陷：（1）意图传递失败，即SLMs未能提供稳定的话语级意图以支持富有表现力的表达；（2）实现无意识的训练，即缺乏反馈信号来验证声学输出是否忠实地反映了预期表达。为解决这些问题，我们提出了SA-SLM（自感知语音语言模型），其构建原则是模型应在生成时意识到自身思考的内容，并在训练时意识到自身表达的方式。SA-SLM通过两个核心贡献来弥合这一差距：（1）意图感知桥接，采用变分信息瓶颈（VIB）目标将模型内部语义转化为时间上平滑的表达性意图，使语音生成过程能够感知模型希望表达的内容；（2）实现感知对齐，将模型重新用作自身的评判者，通过基于量规的反馈来验证声学实现是否与预期表达意图对齐，并进行校准。仅使用800小时表达性语音数据训练后，我们的30亿参数SA-SLM超越了所有开源基线模型，并在EchoMind基准测试的整体表达性指标上仅落后GPT-4o-Audio 0.08分。

Abstract

Speech Language Models (SLMs) exhibit strong semantic understanding, yet their generated speech often sounds flat and fails to convey expressive intent, undermining user engagement. We term this mismatch the semantic understanding-acoustic realization gap. We attribute this gap to two key deficiencies: (1) intent transmission failure, where SLMs fail to provide the stable utterance-level intent needed for expressive delivery; and (2) realization-unaware training, where no feedback signal verifies whether acoustic outputs faithfully reflect intended expression. To address these issues, we propose SA-SLM (Self-Aware Speech Language Model), built on the principle that the model should be aware of what it thinks during generation and how it speaks during training. SA-SLM addresses this gap through two core contributions: (1) Intent-Aware Bridging, which uses a Variational Information Bottleneck (VIB) objective to translate the model’s internal semantics into temporally smooth expressive intent, making speech generation aware of what the model intends to express; and (2) Realization-Aware Alignment, which repurposes the model as its own critic to verify and align acoustic realization with intended expressive intent via rubric-based feedback. Trained on only 800 hours of expressive speech data, our 3B parameter SA-SLM surpasses all open-source baselines and comes within 0.08 points of GPT-4o-Audio in overall expressiveness on the EchoMind benchmark.

Keywords: Speech Language Models, SLMs, expressive speech generation, semantic understanding-acoustic realization gap, self-aware model, intent-aware bridging, realization-aware alignment, Variational Information Bottleneck

139. ❌ Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging

Authors: Zihang Fu, Haonan Wang, Jian Kang, Kenji Kawaguchi, Jiaying Wu Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11399v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

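Scoring rationale: The paper studies how to restore the reasoning ability of large language models (LLMs) inside video-language models (VLMs), using MERIT, a layer-selective model merging method. It is therefore highly relevant to 'Large Language Models' (the core object of study) and 'Model Merging' (the core method) (10). It is related to 'Pre-training' (5), because it builds on language-pretrained backbones. The reasoning keywords 'Chain of Thought' and 'System 2 Thinking' are partially relevant (5), because the work restores temporal reasoning. It is related to 'Mechanistic Interpretability' (5), because interventional masking and attribution analysis are used to explain layer importance. Other keywords such as MoE, SLMs, RLHF, and RAG are not addressed and score 0.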

!!! tip deepseek-chat TL;DR

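The paper studies how video-language models lose temporal reasoning ability as they gain perceptual capabilities, and proposes MERIT, a training-free, layer-selective model merging framework that restores temporal reasoning while preserving perception performance.

Abstract (Chinese translation)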

多模态适应使大语言模型（LLM）具备了感知能力，但往往会削弱其从纯语言预训练中继承的推理能力。这种权衡在视频语言模型（VLM）中尤为明显，视觉对齐可能会损害对序列事件的时间推理（TR）能力。我们提出了MERIT，一种无需训练、任务驱动的模型融合框架，旨在恢复VLM中的时间推理能力。MERIT通过在VLM与其配对的纯文本骨干模型之间，以提升时间推理能力同时惩罚时间感知（TP）退化为目标，搜索逐层自注意力融合方案。在三个代表性VLM和多个具有挑战性的视频基准测试中，MERIT持续提升了时间推理能力，保持或改善了时间感知能力，并且能够泛化到搜索集之外的四个不同基准测试中。其表现也优于均匀的全模型融合和随机层选择方法，表明有效的恢复依赖于选择正确的层。干预性掩码和帧级归因分析进一步表明，所选层对于推理具有格外重要的作用，并将模型决策转向与时间和因果相关的证据。这些结果表明，有针对性的、感知感知的模型融合能够有效恢复VLM的时间推理能力，而无需重新训练。

Abstract

Multimodal adaptation equips large language models (LLMs) with perceptual capabilities, but often weakens the reasoning ability inherited from language-only pretraining. This trade-off is especially pronounced in video-language models (VLMs), where visual alignment can impair temporal reasoning (TR) over sequential events. We propose MERIT, a training-free, task-driven model merging framework for restoring TR in VLMs. MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves TR while penalizing degradation in temporal perception (TP). Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection, showing that effective recovery depends on selecting the right layers. Interventional masking and frame-level attribution further show that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. These results show that targeted, perception-aware model merging can effectively restore TR in VLMs without retraining.

Keywords: Video-Language Models, Temporal Reasoning, Model Merging, Layer-Selective, Large Language Models, Multimodal Adaptation, Reasoning Restoration, Training-Free Framework
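The abstract describes searching over layer-wise self-attention merging recipes between a VLM and its text-only backbone, scored by an objective that rewards temporal reasoning (TR) gains and penalizes temporal perception (TP) degradation. The sketch below shows one way such a recipe and objective could look, assuming linear interpolation as the merge operator and a hinge-style penalty. The names `merge_selected_layers`, `recipe_objective`, `alpha`, and `penalty` are hypothetical, and weights are plain lists of floats rather than real model tensors.

```python
from typing import Dict, Iterable, List

# Hypothetical layout: layer index -> flattened self-attention weights.
Weights = Dict[int, List[float]]


def merge_selected_layers(vlm: Weights, text: Weights,
                          selected: Iterable[int],
                          alpha: float = 0.5) -> Weights:
    """Interpolate toward the text-only backbone on selected layers only.

    Selected layers become (1 - alpha) * vlm + alpha * text; every
    other layer keeps the VLM weights unchanged.
    """
    chosen = set(selected)
    merged: Weights = {}
    for layer, w_vlm in vlm.items():
        if layer in chosen:
            w_text = text[layer]
            merged[layer] = [(1.0 - alpha) * a + alpha * b
                             for a, b in zip(w_vlm, w_text)]
        else:
            merged[layer] = list(w_vlm)
    return merged


def recipe_objective(tr_gain: float, tp_change: float,
                     penalty: float = 1.0) -> float:
    """Score a recipe: reward TR gain, penalize only TP degradation."""
    return tr_gain - penalty * max(0.0, -tp_change)


if __name__ == "__main__":
    vlm = {0: [1.0, 1.0], 1: [1.0, 1.0]}
    text = {0: [3.0, 3.0], 1: [3.0, 3.0]}
    print(merge_selected_layers(vlm, text, selected=[1], alpha=0.5))
    print(recipe_objective(tr_gain=0.05, tp_change=-0.02))
```

A search procedure would evaluate many choices of `selected` and `alpha` on held-out TR and TP benchmarks and keep the recipe with the highest objective; that outer loop is left out here.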

140. ❌ What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment?

Authors: Koki Ryu, Hitomi Yanaka Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11374v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

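Scoring rationale: The paper studies vision-language models (VLMs) for personalized image aesthetics assessment, focusing on analyzing and exploiting the models' internal representations. All of the listed keywords target the techniques, training methods, inference optimizations, and application frameworks of large language models (LLMs), whereas this paper centers on VLMs. Although VLMs and LLMs are technically related (for example, both build on the Transformer architecture), the paper does not involve LLM-specific techniques (such as MoE, Scaling Laws, RLHF, RAG, or CoT) or specific AI for Science domains (such as bioinformatics). All keywords are therefore rated as unrelated to the paper, with a score of 0.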

!!! tip deepseek-chat TL;DR

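The paper investigates whether vision-language models (VLMs) encode the rich, multi-level aesthetic attributes needed for personalized image aesthetics assessment. It finds that they do, so simple linear models can perform personalized assessment effectively without fine-tuning.

Abstract (Chinese translation)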

个性化图像美学评估是一个具有实际应用价值的重要研究课题。基于视觉语言模型的方法虽是该领域的有力候选方案，但其内部是否编码了有效个性化所需的多层次丰富美学属性仍不明确。本文首先通过分析视觉语言模型的内部表征，探究此类美学属性的存在性与分布特征，进而无需微调模型即可实现轻量级的个体层面个性化适配。分析表明，视觉语言模型编码了多样化的美学属性，这些属性会传播至语言解码器层。基于这些表征，我们证明简单的线性模型即可有效执行个性化美学评估。我们进一步分析了不同视觉语言模型架构中美学信息在层级间的传递机制，以及跨图像领域的迁移特性。本研究为利用视觉语言模型建模主观个体审美偏好提供了新的见解。代码已发布于 https://github.com/ynklab/vlm-latent-piaa。

Abstract

Personalized image aesthetics assessment (PIAA) is an important research problem with practical real-world applications. While methods based on vision-language models (VLMs) are promising candidates for PIAA, it remains unclear whether they internally encode rich, multi-level aesthetic attributes required for effective personalization. In this paper, we first analyze the internal representations of VLMs to examine the presence and distribution of such aesthetic attributes, and then leverage them for lightweight, individual-level personalization without model fine-tuning. Our analysis reveals that VLMs encode diverse aesthetic attributes that propagate into the language decoder layers. Building on these representations, we demonstrate that simple linear models can perform PIAA effectively. We further analyze how aesthetic information is transferred across layers in different VLM architectures and across image domains. Our findings provide insights into how VLMs can be utilized for modeling subjective, individual aesthetic preferences. Our code is available at https://github.com/ynklab/vlm-latent-piaa.

Keywords: Vision-Language Models, Personalized Image Aesthetics Assessment, Aesthetic Attributes, Internal Representations, Linear Models, Model Analysis, Subjective Preferences, No Fine-tuning

141. ❌ Geometry-Aware Localized Watermarking for Copyright Protection in Embedding-as-a-Service

Authors: Zhimin Chen, Xiaojie Liang, Wenbo Xu, Yuxuan Liu, Wei Lu Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11344v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

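Scoring rationale: The paper focuses on copyright protection for Embedding-as-a-Service (EaaS) and proposes GeoMark, a geometry-aware localized watermarking framework that addresses the trade-off among robustness, utility, and verifiability in existing watermarking methods. Its core is watermarking and the security of embedding services, not innovation in large-model or deep-learning techniques, and it does not cover applications of large models in other domains. All keywords relate directly to large-model techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications, whereas this paper's topic, watermark protection for embedding services, belongs to a different research area, so every keyword receives a relevance of 0.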

!!! tip deepseek-chat TL;DR

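To counter model stealing and copyright infringement in Embedding-as-a-Service (EaaS), the paper proposes GeoMark, a geometry-aware localized watermarking framework that achieves robust copyright verification under multiple attacks while preserving downstream utility and geometric fidelity.

Abstract translation

Embedding-as-a-Service (EaaS) has become an important semantic infrastructure for natural language and multimedia applications, but it is highly vulnerable to model stealing and copyright infringement. Existing EaaS watermarking methods face a fundamental robustness-utility-verifiability trade-off: trigger-based methods are fragile to paraphrasing, transformation-based methods are sensitive to dimensional perturbation, and region-based methods may produce false positives because of coincidental geometric similarity.
To address this, we propose GeoMark, a geometry-aware localized watermarking framework for EaaS copyright protection. GeoMark uses a natural in-manifold embedding as a shared watermark target, constructs geometry-separated anchors with explicit target-anchor margins, and activates watermark injection only within adaptive local neighborhoods. This design decouples where the watermark is triggered from what ownership is attributed to, yielding localized triggering and centralized attribution.
Experiments on four benchmark datasets show that GeoMark preserves downstream utility and geometric fidelity while maintaining robust copyright verification under paraphrasing, dimensional perturbation, and CSE (Clustering, Selection, Elimination) attacks, with improved verification stability and a low false-positive risk.

Abstract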

Embedding-as-a-Service (EaaS) has become an important semantic infrastructure for natural language and multimedia applications, but it is highly vulnerable to model stealing and copyright infringement. Existing EaaS watermarking methods face a fundamental robustness–utility–verifiability tension: trigger-based methods are fragile to paraphrasing, transformation-based methods are sensitive to dimensional perturbation, and region-based methods may incur false positives due to coincidental geometric affinity. To address this problem, we propose GeoMark, a geometry-aware localized watermarking framework for EaaS copyright protection. GeoMark uses a natural in-manifold embedding as a shared watermark target, constructs geometry-separated anchors with explicit target–anchor margins, and activates watermark injection only within adaptive local neighborhoods. This design decouples where watermarking is triggered from what ownership is attributed to, achieving localized triggering and centralized attribution. Experiments on four benchmark datasets show that GeoMark preserves downstream utility and geometric fidelity while maintaining robust copyright verification under paraphrasing, dimensional perturbation, and CSE (Clustering, Selection, Elimination) attacks, with improved verification stability and low false-positive risk.

Keywords: Embedding-as-a-Service, copyright protection, watermarking, geometry-aware, localized watermarking, robust verification, model stealing, GeoMark

142. ❌ Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

Authors: Lester James V. Miranda, Ivan Vulić, Anna Korhonen Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11290v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

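Scoring rationale: The paper studies language models (LLMs) as generators of multilingual supervised fine-tuning (SFT) data, so it is highly relevant to 'Large Language Models' and 'Post-training/SFT' (10 points each). It involves using large models to teach smaller models, giving it some relevance to 'Small Language Models' (5 points). It systematically evaluates how data quality (e.g., diversity, length, fluency) affects downstream performance, giving it some relevance to 'Scaling Laws AND Data Quality' (5 points). Other keywords, such as MoE, pre-training, alignment, RAG, and inference acceleration, are not mentioned in the abstract and are scored 0.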

!!! tip deepseek-chat TL;DR

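The study systematically evaluates 10 language models as teachers for multilingual SFT data generation. It finds that model scale alone is not a significant predictor of teacher effectiveness, whereas data quality attributes (such as prompt diversity, length, and response fluency) capture over 93.3% of the variance in intrinsic data quality and predict student model performance. Gemma 3 27B and Aya Expanse 32B emerge as consistently effective multilingual teachers.

Abstract translation

Using language models to synthesize supervised fine-tuning (SFT) data for teaching smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc and typically defaults to the largest available model, even though such models may have significant capability gaps in non-English languages. This practice can lead to low-quality synthetic data and, in turn, weaker downstream performance of the student model. This study systematically examines what characterizes an effective multilingual teacher. We combine intrinsic measures of data quality with extrinsic student model performance in a metric called Polyglot Score; the experiments cover 10 language models and 6 typologically diverse languages, generate over 1.4M SFT examples, and train 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B are consistently effective teachers across different student base model families. Further analysis shows that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of the variance in intrinsic data quality and predict student performance. Finally, we offer practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope this work advances data-centric research in multilingual synthetic data and language model development.

Abstract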

Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We measure intrinsic measures of data quality with extrinsic student model performance in a metric we call Polyglot Score; evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.

Keywords: multilingual synthetic data generation, supervised finetuning (SFT), language models (LMs), teacher model selection, data quality, student model performance, Polyglot Score, multilingual tasks

143. ❌ Transactional Attention: Semantic Sponsorship for KV-Cache Retention

Author: Abhinaba Basu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11288v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

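Scoring rationale: The paper's core topic is KV-cache compression, so it is highly relevant to 'KV Cache Compression OR Linear Attention OR FlashAttention' (15 points): it proposes the Transactional Attention mechanism to address a failure of existing compression methods and notes compatibility with FlashAttention. It concerns LLM inference optimization and is relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since the KV cache is a key component of LLM inference. It is tested in a function-calling setting and is relevant to 'Tool Use OR Function Calling OR API Tool Use' (10 points), reporting 100% accuracy across 200 function-calling trials. Other keywords, such as MoE, SLMs, training methods, alignment, and RAG, are not covered and are scored 0.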

!!! tip deepseek-chat TL;DR

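The paper addresses the failure of existing KV-cache compression methods to retain critical tokens (such as credentials and API keys) and proposes Transactional Attention, in which structural anchor patterns protect adjacent value-bearing tokens. At K=16 it achieves 100% credential retrieval where the baselines achieve 0%, and it sustains 100% accuracy across 200 function-calling trials.

Abstract translation

At K=16 tokens (0.4% of a 4K context), every existing KV-cache compression method achieves 0% accuracy on credential retrieval. The failure mode stems from dormant tokens: credentials, API keys, and configuration values that receive near-zero attention yet become essential at generation time. Because these tokens lack the statistical signals that eviction policies rely on, no method based on attention scores, reconstruction loss, or learned retention gates retains them. We propose Transactional Attention (TA), a sponsorship mechanism in which structural anchor patterns (e.g., “key:”, “password:”) protect adjacent value-bearing tokens from eviction. At K=16, TA achieves 100% credential retrieval accuracy where six baselines (H2O, TOVA, SnapKV, StreamingLLM, PyramidKV, DynamicKV) all achieve 0%, and it sustains 100% accuracy across 200 function-calling trials. TA-Fast, an attention-free variant, reduces memory overhead by 52% and is compatible with SDPA and FlashAttention. TA is orthogonal to existing compression methods and adds less than 1% latency overhead.

Abstract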

At K=16 tokens (0.4% of a 4K context), every existing KV-cache compression method achieves 0% on credential retrieval. The failure mode is dormant tokens: credentials, API keys, and configuration values that receive near-zero attention but become essential at generation time. Because these tokens lack the statistical signals that eviction policies rely on, no method based on attention scores, reconstruction loss, or learned retention gates retains them. We introduce Transactional Attention (TA), a sponsorship mechanism in which structural anchor patterns (e.g., “key:”, “password:”) protect adjacent value-bearing tokens from eviction. TA achieves 100% credential retrieval at K=16 where six baselines (H2O, TOVA, SnapKV, StreamingLLM, PyramidKV, DynamicKV) achieve 0%, and sustains 100% accuracy across 200 function-calling trials. TA-Fast, an attention-free variant, reduces memory overhead by 52% and is compatible with SDPA and FlashAttention. TA is orthogonal to existing compression methods and adds less than 1% latency overhead.

Keywords: KV-cache compression, Transactional Attention, credential retrieval, function-calling, attention scores, memory overhead, FlashAttention, eviction policies
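
The sponsorship idea described in the abstract can be illustrated with a toy eviction policy. The sketch below is not the paper's implementation: it works on whitespace-level tokens, and the `span` parameter (how many tokens after an anchor are protected), the fill-by-attention rule for the remaining budget, and both function names are assumptions made for illustration. Only the anchor examples ("key:", "password:") come from the abstract.

```python
from typing import List, Sequence, Set

# Anchor patterns taken from the abstract's examples.
ANCHORS: Set[str] = {"key:", "password:"}


def top_k_by_attention(attn: Sequence[float], budget: int) -> List[int]:
    """Baseline: keep the `budget` positions with the highest attention."""
    ranked = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    return sorted(ranked[:budget])


def select_with_sponsorship(
    tokens: Sequence[str],
    attn: Sequence[float],
    budget: int,
    anchors: Set[str] = ANCHORS,
    span: int = 1,  # hypothetical: number of tokens protected after an anchor
) -> List[int]:
    """Keep anchors and their adjacent value tokens, then fill by attention."""
    sponsored: Set[int] = set()
    for i, tok in enumerate(tokens):
        if tok in anchors:
            sponsored.add(i)
            for j in range(i + 1, min(i + 1 + span, len(tokens))):
                sponsored.add(j)
    # Sponsored positions are exempt from eviction (capped at the budget).
    kept = sorted(sponsored)[:budget]
    remaining = budget - len(kept)
    others = [i for i in range(len(tokens)) if i not in sponsored]
    others.sort(key=lambda i: attn[i], reverse=True)
    kept.extend(others[:remaining])
    return sorted(kept)


# Toy example (made up): the credential at position 3 gets near-zero attention.
example_tokens = ["the", "api", "key:", "sk-123", "is", "stored", "here"]
example_attn = [0.30, 0.20, 0.01, 0.01, 0.25, 0.15, 0.08]
baseline_kept = top_k_by_attention(example_attn, budget=3)
sponsored_kept = select_with_sponsorship(example_tokens, example_attn, budget=3)
```

In this toy case the attention-only baseline evicts the credential token, while the sponsored policy keeps it because it sits next to an anchor, which is the dormant-token failure mode the abstract describes.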

144. ❌ Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate

Authors: Zhixiang Lu, Jionglong Su Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11258v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

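Scoring rationale: The paper proposes Dialectic-Med, a multi-agent debate framework for medical diagnosis that targets hallucination in MLLMs. Highly relevant keywords (10 points): LLMs (the work builds on MLLMs), Chain of Thought (it improves on CoT approaches), LLM Agents and Multi-agent Systems (a three-agent framework), Hallucination Mitigation (the core goal), and AI for Science (a medical application). Moderately relevant keywords (5-8 points): System 2 Thinking (an in-depth reasoning process), Self-Correction (correction through debate), Explainable AI (more faithful explanations), and RAG (the opponent agent retrieves visual evidence). The remaining keywords are unrelated to the paper (0 points).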

!!! tip deepseek-chat TL;DR

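To address diagnostic hallucinations in medical multimodal large language models, the study proposes Dialectic-Med, a multi-agent framework based on adversarial debate that improves diagnostic accuracy and the trustworthiness of reasoning through role specialization and visual evidence retrieval.

Abstract translation

Multimodal large language models (MLLMs) in healthcare suffer from severe confirmation bias and often fabricate visual details to support an initial, possibly erroneous diagnostic hypothesis. Existing Chain-of-Thought (CoT) approaches lack intrinsic correction mechanisms, leaving them vulnerable to error propagation. To close this gap, we propose Dialectic-Med, a multi-agent framework that enforces diagnostic rigor through adversarial dialectics. Unlike static consensus models, Dialectic-Med coordinates a dynamic interaction among three role-specialized agents: a proponent that formulates diagnostic hypotheses; an opponent equipped with a novel visual falsification module that actively retrieves contradictory visual evidence to challenge the proponent; and a mediator that resolves conflicts via a weighted consensus graph. By explicitly modeling the cognitive process of falsification, the framework ensures that diagnostic reasoning is tightly grounded in verified visual regions. Empirical evaluations on MIMIC-CXR-VQA, VQA-RAD, and PathVQA show that Dialectic-Med not only achieves state-of-the-art performance but also fundamentally enhances the trustworthiness of the reasoning process. Beyond accuracy, the approach significantly improves explanation faithfulness and effectively mitigates hallucinations, setting a new standard over single-agent baselines.

Abstract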

Multimodal Large Language Models (MLLMs) in healthcare suffer from severe confirmation bias, often hallucinating visual details to support initial, potentially erroneous diagnostic hypotheses. Existing Chain-of-Thought (CoT) approaches lack intrinsic correction mechanisms, rendering them vulnerable to error propagation. To bridge this gap, we propose Dialectic-Med, a multi-agent framework that enforces diagnostic rigor through adversarial dialectics. Unlike static consensus models, Dialectic-Med orchestrates a dynamic interplay between three role-specialized agents: a proponent that formulates diagnostic hypotheses; an opponent equipped with a novel visual falsification module that actively retrieves contradictory visual evidence to challenge the Proponent; and a mediator that resolves conflicts via a weighted consensus graph. By explicitly modeling the cognitive process of falsification, our framework guarantees that diagnostic reasoning is tightly grounded in verified visual regions. Empirical evaluations on MIMIC-CXR-VQA, VQA-RAD, and PathVQA demonstrate that Dialectic-Med not only achieves state-of-the-art performance but also fundamentally enhances the trustworthiness of the reasoning process. Beyond accuracy, our approach significantly enhances explanation faithfulness and decisively mitigates hallucinations, establishing a new standard over single-agent baselines.

Keywords: Multimodal Large Language Models, Diagnostic Hallucinations, Multi-agent Debate, Chain-of-Thought, Visual Falsification, Healthcare AI, Adversarial Dialectics, Trustworthy Reasoning

145. ❌ Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers

Authors: Guoxin Yu, Chulun Zhou, Lemao Liu, Qi Wang, Mo Yu, Jialong Tang, Baosong Yang, Xiang Ao, Wao Lam, Yue Yu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11246v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

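Scoring rationale: The paper develops an evaluation framework for long-form generation tasks, which falls under evaluation methodology for large-model applications. Because it concerns the evaluation of generative models (including LLMs), it has some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points), though LLMs are not its core technical contribution. All other keywords are unrelated, since the paper does not address model architecture, training techniques, inference optimization, alignment, agent systems, compression, or AI-for-science applications.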

!!! tip deepseek-chat TL;DR

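To address the difficulty of evaluating model responses in long-form generation tasks, the paper proposes the Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted, context-bound scoring points to measure alignment and contradiction between model outputs and reference answers. Experiments show that it achieves higher correlation with human annotations.

Abstract translation

Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, because the expected answer usually contains multiple semantically distinct yet complementary factors that must be factorized for fine-grained assessment. Recent evaluation methods rely on either task-level rubrics or question-aware checklists. However, these methods still (1) struggle to assess whether a response is genuinely grounded in the provided context, and (2) fail to capture the heterogeneous importance of different aspects of the reference answer. Inspired by how human examiners grade, we propose the Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted, context-bound scoring points. We design two complementary metrics, Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), to measure the alignment and contradiction between model responses and reference answers. Extensive experiments on 10 generative tasks show that WIMPE achieves higher correlation with human annotations.

Abstract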

Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, as the expected answers usually contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment. Recent evaluation methods resort to relying on either task-level rubrics or question-aware checklists. However, they still 1) struggle to assess whether a response is genuinely grounded in provided contexts; 2) fail to capture the heterogeneous importance of different aspects of reference answers. Inspired by human examiners, we propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted context-bound scoring points. Two complementary metrics, namely Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and contradiction between model responses and reference answers. Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations.

Keywords: generative tasks, long-form answers, evaluation framework, weighted importance, multi-point evaluation, human annotations, model responses, reference answers
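
The abstract names two metrics, WPA and PCP, but does not give their formulas. The sketch below shows one plausible reading, assuming each scoring point already carries a weight and a per-point judgment ("aligned", "conflict", or "missing") supplied by some external judge. The normalization by total weight, the label names, and both function names are assumptions, not the paper's definitions.

```python
from typing import List, Tuple

# One scoring point: (importance weight, judgment label).
# Labels are hypothetical: "aligned", "conflict", or "missing".
ScoringPoint = Tuple[float, str]


def weighted_pointwise_alignment(points: List[ScoringPoint]) -> float:
    """Share of total weight carried by points the response aligns with."""
    total = sum(weight for weight, _ in points)
    if total == 0:
        return 0.0
    aligned = sum(weight for weight, label in points if label == "aligned")
    return aligned / total


def pointwise_conflict_penalty(points: List[ScoringPoint]) -> float:
    """Share of total weight carried by points the response contradicts."""
    total = sum(weight for weight, _ in points)
    if total == 0:
        return 0.0
    conflicting = sum(weight for weight, label in points if label == "conflict")
    return conflicting / total


# Toy reference answer (made up) split into three weighted points.
example_points: List[ScoringPoint] = [
    (3.0, "aligned"),   # most important point, covered by the response
    (2.0, "missing"),   # not mentioned in the response
    (1.0, "conflict"),  # contradicted by the response
]
example_wpa = weighted_pointwise_alignment(example_points)
example_pcp = pointwise_conflict_penalty(example_points)
```

Weighting by importance means that missing a minor point costs less than missing a central one, which is the behavior the abstract attributes to human examiners.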

146. ❌ RUMLEM: A Dictionary-Based Lemmatizer for Romansh

Authors: Dominic P. Fischer, Zachary Hopton, Jannis Vamvas Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11233v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

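Scoring rationale: RUMLEM is a dictionary-based lemmatizer for Romansh, a traditional NLP task that involves neither large models, nor innovation in deep-learning techniques, nor scientific-domain applications. All scoring keywords relate to large-model techniques, deep-learning innovation, or AI-for-science applications and are unrelated to the paper's dictionary-driven, rule-based lemmatization approach.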

!!! tip deepseek-chat TL;DR

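The paper presents RUMLEM, a dictionary-based lemmatizer for Romansh that covers the five main varieties and the standard variety Rumantsch Grischun. It covers 77-84% of the words in a typical Romansh text and identifies the language variety correctly in 95% of cases.

Abstract translation

Lemmatization, the task of mapping an inflected word form to its dictionary form, is a key component of many natural language processing applications. This paper presents RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is built on comprehensive, community-driven morphological databases for Romansh, which enable RUMLEM to cover 77-84% of the words in a typical Romansh text. Because each Romansh variety has its own dedicated database, RUMLEM can also be used for variety-aware language classification. An evaluation on 30,000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof-of-concept experiment demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.

Abstract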

Lemmatization – the task of mapping an inflected word form to its dictionary form – is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.

Keywords: lemmatization, Romansh, dictionary-based, morphological databases, language classification, NLP applications, variety-aware, Rumantsch Grischun
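
The abstract describes two uses of per-variety dictionaries: looking up lemmas and classifying the variety of a text. A minimal sketch of both follows. It does not use RUMLEM's actual databases or API: the variety labels, the word forms (deliberately not real Romansh), the coverage-based classification rule, and all function names are placeholders chosen for illustration.

```python
from typing import Dict, List, Optional

# Placeholder databases: one form -> lemma mapping per variety.
# The entries are invented toy strings, not Romansh vocabulary.
DATABASES: Dict[str, Dict[str, str]] = {
    "variety_a": {"walks": "walk", "walked": "walk", "houses": "house"},
    "variety_b": {"walkz": "walk", "housez": "house"},
}


def lemmatize(token: str, variety: str) -> Optional[str]:
    """Return the lemma for `token` in `variety`, or None if it is not listed."""
    return DATABASES[variety].get(token.lower())


def coverage(tokens: List[str], variety: str) -> float:
    """Fraction of tokens that have an entry in the variety's database."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if lemmatize(tok, variety) is not None)
    return hits / len(tokens)


def classify_variety(tokens: List[str]) -> str:
    """Pick the variety whose database covers the most tokens (assumed rule)."""
    return max(DATABASES, key=lambda variety: coverage(tokens, variety))


example_text = ["Walks", "houses", "unknown"]
example_lemmas = [lemmatize(tok, "variety_a") for tok in example_text]
example_variety = classify_variety(example_text)
```

A dictionary lookup cannot lemmatize forms that are absent from the database, which is consistent with the abstract reporting coverage (77-84% of words) rather than full coverage.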

147. ❌ RECIPER: A Dual-View Retrieval Pipeline for Procedure-Oriented Materials Question Answering

Authors: Zhuoyu Wu, Wenhui Ou, Pei-Sze Tan, Wenqi Fang, Sailaja Rajanala, Raphaël C.-W. Phan Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11229v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

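Scoring rationale: RECIPER targets procedure-oriented question answering in materials science and proposes a dual-view retrieval pipeline that combines paragraph-level context with LLM-extracted procedural summaries. The work is highly relevant to 'Retrieval-Augmented Generation (RAG)' (10 points), since its core is a retrieval-augmented question answering system that aims to improve evidence retrieval from materials science documents. It also falls under 'AI for Science' (10 points), specifically materials science. The paper uses large language models (LLMs) to extract procedural summaries, giving it some relevance to 'Large Language Models' (8 points). Other keywords, such as MoE, SLMs, training methods, inference optimization, and agent systems, are unrelated to the paper and are scored 0.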

!!! tip deepseek-chat TL;DR

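RECIPER proposes a dual-view retrieval pipeline that combines paragraph-level context with LLM-extracted procedural summaries. Across four dense retrieval backbones it improves retrieval for procedure-oriented materials question answering over paragraph-only dense retrieval, with average gains of +3.73 in Recall@1, +2.85 in nDCG@10, and +3.13 in MRR.

Abstract translation

Retrieving procedure-oriented evidence from materials science papers is challenging because key synthesis details are often scattered across long, context-heavy documents and are not well captured by paragraph-only dense retrieval. We present RECIPER, a dual-view retrieval pipeline that indexes both paragraph-level context and compact procedural summaries extracted by a large language model, then fuses the two candidate streams with lightweight lexical reranking. Across four dense retrieval backbones, RECIPER consistently improves early-rank retrieval over paragraph-only dense retrieval, with average gains of +3.73 in Recall@1, +2.85 in nDCG@10, and +3.13 in MRR. With BGE-large-en-v1.5, it reaches 86.82%, 97.07%, and 97.85% on Recall@1, Recall@5, and Recall@10, respectively. We further observe improved downstream question answering under automatic metrics, suggesting that procedural summaries can serve as a useful complementary retrieval signal for procedure-oriented materials question answering. Code and data are available at https://github.com/ReaganWu/RECIPER.

Abstract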

Retrieving procedure-oriented evidence from materials science papers is difficult because key synthesis details are often scattered across long, context-heavy documents and are not well captured by paragraph-only dense retrieval. We present RECIPER, a dual-view retrieval pipeline that indexes both paragraph-level context and compact large language model-extracted procedural summaries, then combines the two candidate streams with lightweight lexical reranking. Across four dense retrieval backbones, RECIPER consistently improves early-rank retrieval over paragraph-only dense retrieval, achieving average gains of +3.73 in Recall@1, +2.85 in nDCG@10, and +3.13 in MRR. With BGE-large-en-v1.5, it reaches 86.82%, 97.07%, and 97.85% on Recall@1, Recall@5, and Recall@10, respectively. We further observe improved downstream question answering under automatic metrics, suggesting that procedural summaries can serve as a useful complementary retrieval signal for procedure-oriented materials question answering. Code and data are available at https://github.com/ReaganWu/RECIPER.

Keywords: Retrieval-Augmented Generation, Materials Science, Question Answering, Dual-View Retrieval, Procedural Summaries, Large Language Models, Dense Retrieval, Lexical Reranking
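
The abstract describes merging two candidate streams (paragraph view and procedural-summary view) and then applying lightweight lexical reranking, without specifying the fusion rule. The sketch below is one possible reading, not the released RECIPER code: taking the best dense score per document, the token-overlap measure, the `alpha` mixing weight, and the function names are all assumptions.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, dense similarity score)


def lexical_overlap(query: str, text: str) -> float:
    """Fraction of distinct query tokens that also appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def fuse_and_rerank(
    query: str,
    paragraph_hits: List[Hit],
    summary_hits: List[Hit],
    texts: Dict[str, str],
    alpha: float = 0.5,  # hypothetical weight of the lexical signal
) -> List[str]:
    """Union both views, keep each document's best dense score, then rerank."""
    best_dense: Dict[str, float] = {}
    for doc_id, score in list(paragraph_hits) + list(summary_hits):
        best_dense[doc_id] = max(score, best_dense.get(doc_id, float("-inf")))

    def final_score(doc_id: str) -> float:
        lexical = lexical_overlap(query, texts[doc_id])
        return (1.0 - alpha) * best_dense[doc_id] + alpha * lexical

    return sorted(best_dense, key=final_score, reverse=True)


# Toy corpus and scores (made up for illustration).
example_texts = {
    "p1": "The sample was characterized by XRD",
    "p2": "anneal the film at 500 C for two hours",
    "p3": "Materials were purchased",
}
example_ranking = fuse_and_rerank(
    "anneal at 500 C",
    paragraph_hits=[("p1", 0.80), ("p3", 0.60)],
    summary_hits=[("p2", 0.70), ("p1", 0.50)],
    texts=example_texts,
)
```

In this toy case the paragraph that actually contains the procedure is surfaced only by the summary view, and the lexical signal then lifts it to the top, which mirrors the complementary role the abstract assigns to procedural summaries.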

148. ❌ Sign Language Recognition in the Age of LLMs

Authors: Vaclav Javorek, Jakub Honzik, Ivan Gruber, Tomas Zelezny, Marek Hruz Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11225v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

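Scoring rationale: The paper studies the zero-shot capability of vision language models (VLMs) for isolated sign language recognition (ISLR), an application of large models to a specific domain (AI for Science / assistive technology). Relevance to 'Large Language Models' is fairly high (8 points), because VLMs are a type of large model and the paper examines their zero-shot capability. There is some relevance to 'Scaling Laws AND Data Quality' (5 points), because the paper notes the effect of model scale and training data diversity on performance. There is some relevance to 'AI for Science' (5 points), because sign language recognition can be viewed as AI research in assistive technology and scientific applications. Other keywords (such as MoE, SFT, and RAG) are not covered in the paper and have a relevance of 0.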

!!! tip deepseek-chat TL;DR

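The paper examines how well modern vision language models perform isolated sign language recognition in a zero-shot setting. Open-source models fall far behind classic supervised classifiers, yet they capture partial visual-semantic alignment, and larger proprietary models achieve substantially higher accuracy.

Abstract translation

Recent vision language models (VLMs) have shown strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also handle specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the ability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. The experiments show that, under prompt-only zero-shot inference, current open-source VLMs still lag far behind classic supervised ISLR classifiers. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All code is publicly available on GitHub.

Abstract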

Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs remain far behind classic supervised ISLR classifiers by a wide margin. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.

Keywords: Vision Language Models, Sign Language Recognition, Zero-shot Learning, Multimodal Reasoning, WLASL300 Benchmark, Visual-Semantic Alignment, Model Scale, Training Data Diversity

149. ❌ HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning

Authors: Yangfan Wang, Tianyang Sun, Chen Tang, Jie Liu, Wei Cai, Jingchi Jiang Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11214v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

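Scoring rationale: The paper focuses on knowledge editing for large language models (LLMs) and proposes HiEdit, a lifelong model editing method based on hierarchical reinforcement learning that dynamically selects knowledge-relevant layers for precise updates. It is therefore highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The other keywords (MoE, SLMs, Scaling Laws, Pre-training, Post-training, RLHF, PEFT, RAG, Context Window, KV Cache, Reasoning, Agents, Quantization, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, AI for Science, etc.) are not covered and score 0.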
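
As a rough illustration of how the score table above could lead to the 0.0 / 26.6 result, the sketch below reproduces the pass-mark formula from the report's configuration section (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) and applies a per-keyword rule. The per-keyword rule (weight × relevance/10 × 15) is an assumption made for illustration; the report does not state it, and the function names are hypothetical.

```python
# Values taken from the report's configuration section.
MAX_PER_KEYWORD = 15.0   # maximum score per keyword
TOTAL_WEIGHT = 27.0      # 27 keywords, each configured with weight 1.0


def passing_score(total_weight: float) -> float:
    """Pass mark as given in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED rule: scale relevance (0-10) to the per-keyword maximum, then weight it."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def total_score(rows) -> float:
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# Rows mirror the table above: one keyword at 10/10 and 26 at 0/10,
# all listed with weight 0.0 in the Weight column.
rows = [(0.0, 10.0)] + [(0.0, 0.0)] * 26
score = total_score(rows)
print(f"{score:.1f} / {passing_score(TOTAL_WEIGHT):.1f}")  # prints 0.0 / 26.6
```

Under this assumed rule, a relevance of 10/10 contributes nothing while the Weight column reads 0.0, which matches the 0.0 in the Score column; with the configured weight of 1.0 the same row would contribute 15.0.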

!!! tip deepseek-chat TL;DR

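The paper studies how hierarchical reinforcement learning can dynamically select the knowledge-relevant layers of an LLM for lifelong editing, reducing side effects and improving editing performance. In experiments, HiEdit improves on RLEdit by an average of 8.48% while perturbing only half of the layers per edit.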

摘要翻译

终身模型编辑（LME）旨在持续修正已部署大语言模型（LLM）中过时或不准确的知识，同时最小化对无关输入的副作用。然而，现有方法通常对所有编辑实例统一应用于一个静态且密集的LLM层集合的参数扰动。这种做法有悖于直觉，因为我们假设不同的知识片段存储在模型的不同层中。忽视这种层级特异性会阻碍整合新知识的适应性，并导致对通用知识及先前已编辑知识的灾难性遗忘。为解决此问题，我们提出了HiEdit，一种分层强化学习框架，能够自适应地为每个编辑实例识别出最相关的知识层。通过实现动态的、实例感知的层级选择，并结合稀疏性的内在奖励，HiEdit实现了精确、局部化的更新。在多种大语言模型上的实验表明，HiEdit在每次编辑仅扰动一半层数的情况下，将当前具有竞争力的RLEdit方法的性能平均提升了8.48%。我们的代码公开于：https://github.com/yangfanww/hiedit。

摘要 (Abstract)

Lifelong model editing (LME) aims to sequentially rectify outdated or inaccurate knowledge in deployed LLMs while minimizing side effects on unrelated inputs. However, existing approaches typically apply parameter perturbations to a static and dense set of LLM layers for all editing instances. This practice is counter-intuitive, as we hypothesize that different pieces of knowledge are stored in distinct layers of the model. Neglecting this layer-wise specificity can impede adaptability in integrating new knowledge and result in catastrophic forgetting for both general and previously edited knowledge. To address this, we propose HiEdit, a hierarchical reinforcement learning framework that adaptively identifies the most knowledge-relevant layers for each editing instance. By enabling dynamic, instance-aware layer selection and incorporating an intrinsic reward for sparsity, HiEdit achieves precise, localized updates. Experiments on various LLMs show that HiEdit boosts the performance of the competitive RLEdit by an average of 8.48% with perturbing only half of the layers per edit. Our code is available at: https://github.com/yangfanww/hiedit.

Keywords: Lifelong Model Editing, Hierarchical Reinforcement Learning, Large Language Models, Knowledge Editing, Layer Selection, Catastrophic Forgetting, Parameter Perturbation, Instance-aware Updates

150. ❌ Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method

Authors: Tianzhe Zhao, Jiaoyan Chen, Shuxiu Zhang, Haiping Zhu, Qika Lin, Jun Liu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11209v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

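Scoring rationale: The paper's core topic is faithful LLM reasoning under heterogeneous knowledge conflicts in RAG systems, so it is highly relevant to 'Large Language Models' and 'Retrieval-Augmented Generation' (10 points each). Its treatment of the reasoning process ('Chain of Thought', 'System 2 Thinking') and of explainability ('Mechanistic Interpretability') gives a moderate link (5 points each). Its study of incorrect responses caused by knowledge conflicts relates to 'Hallucination Mitigation' (8 points). Other keywords such as MoE, quantization, and AI for Science are not covered (0 points).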

!!! tip deepseek-chat TL;DR

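The paper studies faithful reasoning by LLMs in retrieval-augmented generation when textual evidence conflicts with knowledge-graph evidence, and proposes the ConflictQA benchmark and the XoT framework to improve reasoning reliability.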

摘要翻译

大语言模型（LLM）在广泛的应用中取得了显著成功，尤其是在通过检索增强生成（RAG）技术融合外部知识时。尽管其应用日益广泛，近期研究表明，当检索到相互冲突的知识时，LLM往往难以进行可靠的推理。然而，现有研究主要关注外部知识与LLM参数化知识之间的冲突，而对外部知识源之间的冲突则鲜有探讨。与此同时，现代RAG系统日益强调整合非结构化文本与（半）结构化数据（如知识图谱，KG），以提升知识完备性与推理可靠性。为填补这一研究空白，我们提出了ConflictQA——一个系统化构建文本证据与知识图谱证据间冲突的新型基准测试。通过对代表性LLM的广泛评估发现，面对此类跨源冲突时，LLM往往无法识别可靠的证据以进行正确推理，反而对提示选择更为敏感，并倾向于仅依赖知识图谱或文本证据中的单一来源，导致错误响应。基于这些发现，我们进一步提出XoT框架，这是一个专为异构冲突证据推理设计的两阶段解释性思维框架，并通过大量实验验证了其有效性。

摘要 (Abstract)

Large language models (LLMs) have achieved remarkable success across a wide range of applications especially when augmented by external knowledge through retrieval-augmented generation (RAG). Despite their widespread adoption, recent studies have shown that LLMs often struggle to perform faithful reasoning when conflicting knowledge is retrieved. However, existing work primarily focuses on conflicts between external knowledge and the parametric knowledge of LLMs, leaving conflicts across external knowledge largely unexplored. Meanwhile, modern RAG systems increasingly emphasize the integration of unstructured text and (semi-)structured data like knowledge graphs (KGs) to improve knowledge completeness and reasoning faithfulness. To address this gap, we introduce ConflictQA, a novel benchmark that systematically instantiates conflicts between textual evidence and KG evidence. Extensive evaluations across representative LLMs reveal that, facing such cross-source conflicts, LLMs often fail to identify reliable evidence for correct reasoning. Instead, LLMs become more sensitive to prompting choices and tend to rely exclusively on either KG or textual evidence, resulting in incorrect responses. Based on these findings, we further propose XoT, a two-stage explanation-based thinking framework tailored for reasoning over heterogeneous conflicting evidence, and verify its effectiveness with extensive experiments.

Keywords: Large Language Models, Retrieval-Augmented Generation, Knowledge Conflicts, Faithful Reasoning, Benchmark, Explanation-based Thinking, Knowledge Graphs, Heterogeneous Evidence

151. ❌ CocoaBench: Evaluating Unified Digital Agents in the Wild

Authors: CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11201v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

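Scoring rationale: The paper's core topic is evaluating the combined capabilities of LLM agents on complex tasks. It is highly relevant to 'LLM Agents' and 'Tool Use' (10 points each), because it specifically studies agent systems and their tool use. It is strongly related to 'Chain of Thought' and 'System 2 Thinking' (8 points each), because the analysis finds that agents need to improve at reasoning and planning. Other keywords such as MoE, quantization, and RAG are not covered and score 0.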

!!! tip deepseek-chat TL;DR

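The paper introduces the CocoaBench benchmark for evaluating unified digital agents on complex tasks that combine vision, search, and coding. The best evaluated system reaches only a 45.1% success rate, leaving substantial room for improvement in reasoning and planning, tool use, and visual grounding.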

摘要翻译

当前，大型语言模型智能体在软件工程、深度研究、图形用户界面自动化及其他多种应用中表现优异，而近期的智能体框架与模型正日益将这些能力整合为统一系统。然而，大多数评估仍孤立地测试这些能力，这导致在需要智能体组合不同能力的多样化应用场景中存在评估空白。我们推出了CocoaBench，这是一个面向统一数字智能体的基准测试，其构建基于人类设计的长视野任务，这些任务要求灵活组合视觉、搜索与编码能力。任务仅通过一条指令和针对最终输出的自动评估函数来定义，从而能够在不同智能体基础设施上实现可靠且可扩展的评估。我们还提出了CocoaAgent，这是一个轻量级共享框架，用于在不同模型主干之间进行受控比较。实验表明，当前智能体在CocoaBench上的表现仍远未达到可靠水平，评估中表现最佳的系统成功率仅为45.1%。我们的分析进一步指出，在推理与规划、工具使用与执行以及视觉基础理解方面仍存在巨大的改进空间。

摘要 (Abstract)

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.

Keywords: LLM agents, benchmark evaluation, unified digital agents, tool use, reasoning and planning, visual grounding, CocoaBench, agent scaffolds

152. ❌ TRACE: An Experiential Framework for Coherent Multi-hop Knowledge Graph Question Answering

Authors: Yingxu Wang, Jiaxin Huang, Mengzhu Wang, Nan Yin Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11193v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

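Scoring rationale: The paper proposes the TRACE framework, which centers on LLM-driven multi-step reasoning (CoT Reasoning / System 2 Thinking) and an experience-based agent (LLM Agents) to improve coherence in knowledge graph question answering. It is therefore highly relevant to 'Large Language Models', 'Chain of Thought', 'System 2 Thinking', and 'LLM Agents' (10 points each). Other keywords such as MoE, quantization, RAG, and alignment are not covered and score 0.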

!!! tip deepseek-chat TL;DR

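The study targets fragmented reasoning in multi-hop knowledge graph question answering, which arises when reasoning steps are handled independently and prior experience is not reused. It proposes the TRACE framework, which unifies LLM-driven contextual reasoning with exploration priors and consistently outperforms state-of-the-art baselines on multiple KGQA benchmarks.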

摘要翻译

多跳知识图谱问答（KGQA）需要跨关系路径的连贯推理，然而现有方法往往独立处理每个推理步骤，未能有效利用先前探索的经验，导致推理碎片化和冗余探索。为应对这些挑战，我们提出了具有自适应上下文与探索先验的轨迹感知推理框架（TRACE），该经验式框架将大语言模型驱动的上下文推理与探索先验集成相统一，以增强多跳KGQA的连贯性与鲁棒性。具体而言，TRACE将动态演化的推理路径实时转化为自然语言叙述以保持语义连续性，同时将先前的探索轨迹抽象为可重用的经验先验以捕捉重复出现的探索模式。双重反馈重排序机制进一步整合上下文叙述与探索先验，从而在推理过程中指导关系选择。在多个KGQA基准测试上的大量实验表明，TRACE始终优于最先进的基线方法。

摘要 (Abstract)

Multi-hop Knowledge Graph Question Answering (KGQA) requires coherent reasoning across relational paths, yet existing methods often treat each reasoning step independently and fail to effectively leverage experience from prior explorations, leading to fragmented reasoning and redundant exploration. To address these challenges, we propose Trajectory-aware Reasoning with Adaptive Context and Exploration priors (TRACE), an experiential framework that unifies LLM-driven contextual reasoning with exploration prior integration to enhance the coherence and robustness of multi-hop KGQA. Specifically, TRACE dynamically translates evolving reasoning paths into natural language narratives to maintain semantic continuity, while abstracting prior exploration trajectories into reusable experiential priors that capture recurring exploration patterns. A dual-feedback re-ranking mechanism further integrates contextual narratives with exploration priors to guide relation selection during reasoning. Extensive experiments on multiple KGQA benchmarks demonstrate that TRACE consistently outperforms state-of-the-art baselines.

Keywords: Multi-hop Knowledge Graph Question Answering, LLM-driven reasoning, Experiential framework, Coherent reasoning, Trajectory-aware, Exploration priors, Dual-feedback re-ranking

153. ❌ MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis

Authors: Zixiong Yu, Jun Rao, Guhan Chen, Songtao Tian, Bohan Li, Jiansheng Wei, Min Zhang, Xiaojun Meng Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11188v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

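Scoring rationale: The paper proposes a hierarchical framework for synthesizing mathematical reasoning data. At its core, LLM agents (the Legislator-Executor paradigm) adversarially evolve complex logical structures that are used for supervised fine-tuning (SFT), improving LLM performance on mathematical reasoning tasks that involve CoT and in-depth reasoning. It is therefore highly relevant to LLMs, SFT, CoT reasoning, in-depth reasoning, LLM agents, and multi-agent systems (10 points each). It is moderately related to data quality ('Scaling Laws AND Data Quality') and 'AI for Science' (5 points each), because the paper focuses on high-quality data synthesis and applies it to mathematics, a scientific field. Other keywords such as MoE, quantization, and RAG are not covered and score 0.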

!!! tip deepseek-chat TL;DR

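The paper proposes a hierarchical framework, based on adversarial evolution of constraint graphs followed by semantic instantiation, for synthesizing high-quality mathematical reasoning data. Models fine-tuned on 1K synthesized samples outperform those trained on widely used datasets of comparable scale (LIMO, s1K) across eight mathematical benchmarks and show better out-of-distribution generalization.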

摘要翻译

在无需人工先验知识的情况下合成高质量的数学推理数据仍是一个重大挑战。现有方法通常依赖于种子数据变异或简单的提示工程，常面临模式崩溃和逻辑复杂性受限的问题。本文提出了一种分层合成框架，将数据合成建模为基于约束图的无监督优化问题，随后进行语义实例化，而非将其视为直接的文本生成任务。我们引入了一种“立法者-执行者”范式：立法者通过对抗性演化生成编码问题约束的结构化生成蓝图，而执行者则将这些规范实例化为多样化的自然语言场景。这种骨架设计与语言实现的解耦使得我们能够优先专注于构建复杂多样的逻辑结构，从而引导高质量的数据合成。在Qwen、Llama、Mistral和Gemma系列共计10个模型上进行的实验表明，我们的方法取得了显著成果：使用1K个合成样本微调的模型在八个数学基准测试中均超越了同等规模的广泛使用数据集（如LIMO、s1K），并展现出更优异的分布外泛化能力。

摘要 (Abstract)

Synthesizing high-quality mathematical reasoning data without human priors remains a significant challenge. Current approaches typically rely on seed data mutation or simple prompt engineering, often suffering from mode collapse and limited logical complexity. This paper proposes a hierarchical synthesis framework that formulates data synthesis as an unsupervised optimization problem over a constraint graph followed by semantic instantiation, rather than treating it as a direct text generation task. We introduce a Legislator-Executor paradigm: The Legislator adversarially evolves structured generation blueprints encoding the constraints of the problem, while the Executor instantiates these specifications into diverse natural language scenarios. This decoupling of skeleton design from linguistic realization enables a prioritized focus on constructing complex and diverse logical structures, thereby guiding high-quality data synthesis. Experiments conducted on a total of 10 models across the Qwen, Llama, Mistral, and Gemma series demonstrate that our method achieves notable results: models fine-tuned on 1K synthesized samples outperform widely-used datasets of comparable scale (LIMO, s1K) across eight mathematical benchmarks, exhibiting superior out-of-distribution generalization.

Keywords: mathematical reasoning, data synthesis, constraint graph, adversarial evolution, Legislator-Executor, supervised fine-tuning, LLM agents, out-of-distribution generalization

154. ❌ Evaluating Memory Capability in Continuous Lifelog Scenario

Authors: Jianjie Zheng, Zhichen Liu, Zhanyu Shen, Jingxiang Qu, Guanhua Chen, Yile Wang, Yang Xu, Yang Liu, Sijie Cheng Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11182v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

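Scoring rationale: The paper evaluates memory capability in continuous lifelog scenarios and proposes the LifeDialBench benchmark and an online evaluation protocol. Relevance by keyword: (1) highly relevant to 'Retrieval-Augmented Generation (RAG)' (10 points), because the paper explicitly uses RAG as a baseline and discusses its performance; (2) moderately related to 'Large Language Models' (5 points), because the memory systems may be built on LLMs, although the paper does not say so explicitly; (3) moderately related to 'Context Window Extension' (5 points), because continuous lifelogs involve long-context processing; (4) the other keywords are unrelated (0 points), because the paper concentrates on benchmark construction, the evaluation protocol, and memory-system comparison rather than specific model architectures, training techniques, or application domains.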

!!! tip deepseek-chat TL;DR

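Addressing gaps in how existing memory systems are evaluated for continuous lifelog scenarios, the paper proposes the LifeDialBench benchmark and an online evaluation protocol. Experiments show that current sophisticated memory systems fail to outperform a simple RAG baseline, underscoring the importance of high-fidelity context preservation.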

摘要翻译

当前，可穿戴设备能够持续记录环境对话，为记忆系统创造了重要机遇。然而，现有基准测试主要集中于在线一对一聊天或人机交互，忽视了现实场景的独特需求。鉴于公开的生活日志音频数据集稀缺，我们提出一种分层合成框架来构建 LifeDialBench，这是一个包含两个互补子集的新型基准：基于真实世界第一人称视频构建的 EgoMem，以及利用模拟虚拟社区构建的 LifeMem。关键的是，为解决传统离线设置中的时间泄露问题，我们提出一种在线评估协议，该协议严格遵循时间因果性，确保系统以符合现实流的模式进行评估。我们的实验结果揭示了一个反直觉的发现：当前复杂的记忆系统未能超越一个简单的基于检索增强生成（RAG）的基线方法。这凸显了当前方法中过度设计的结构和有损压缩带来的不利影响，并强调了在生活日志场景中保持高保真上下文的必要性。我们的代码和数据已在 https://github.com/qys77714/LifeDialBench 发布。

摘要 (Abstract)

Nowadays, wearable devices can continuously lifelog ambient conversations, creating substantial opportunities for memory systems. However, existing benchmarks primarily focus on online one-on-one chatting or human-AI interactions, thus neglecting the unique demands of real-world scenarios. Given the scarcity of public lifelogging audio datasets, we propose a hierarchical synthesis framework to curate LifeDialBench, a novel benchmark comprising two complementary subsets: EgoMem, built on real-world egocentric videos, and LifeMem, constructed using simulated virtual community. Crucially, to address the issue of temporal leakage in traditional offline settings, we propose an Online Evaluation protocol that strictly adheres to temporal causality, ensuring systems are evaluated in a realistic streaming fashion. Our experimental results reveal a counterintuitive finding: current sophisticated memory systems fail to outperform a simple RAG-based baseline. This highlights the detrimental impact of over-designed structures and lossy compression in current approaches, emphasizing the necessity of high-fidelity context preservation for lifelog scenarios. We release our code and data at https://github.com/qys77714/LifeDialBench.
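
To make the idea of an evaluation that "strictly adheres to temporal causality" more concrete, here is a minimal sketch of a streaming loop in which each question can only see utterances logged at or before the question's timestamp. It is not the LifeDialBench protocol itself: the `Utterance` and `Question` types, the `online_stream` helper, and the sample log are all hypothetical and serve only to illustrate how temporal leakage can be avoided.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Utterance:
    t: float      # timestamp of the logged utterance
    text: str


@dataclass
class Question:
    t: float      # time at which the question is asked
    text: str


def online_stream(
    utterances: List[Utterance], questions: List[Question]
) -> Iterator[Tuple[Question, List[Utterance]]]:
    """Yield each question with only the utterances logged at or before its timestamp."""
    ordered = sorted(utterances, key=lambda u: u.t)
    seen: List[Utterance] = []
    i = 0
    for q in sorted(questions, key=lambda q: q.t):
        # Advance the log up to the question time; later utterances stay hidden.
        while i < len(ordered) and ordered[i].t <= q.t:
            seen.append(ordered[i])
            i += 1
        yield q, list(seen)


# Hypothetical lifelog and questions, for illustration only.
log = [Utterance(1.0, "Met Sam at the cafe."), Utterance(3.0, "Sam moved the meeting to Friday.")]
qs = [Question(2.0, "Who did I meet?"), Question(4.0, "When is the meeting?")]
visible_counts = [len(visible) for _, visible in online_stream(log, qs)]
```

An offline setup would instead hand the full log to the memory system before any question is asked, which is where information from the future can leak into earlier answers.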

Keywords: memory capability, continuous lifelog, benchmark, online evaluation, RAG, temporal causality, context preservation, wearable devices

Authors: João Gonçalves, Sonia de Jager, Petr Knoth, David Pride, Nick Jelicic Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11152v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

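Scoring rationale: The paper's core contribution is SHARE, a family of causal language models developed specifically for the social sciences and humanities (SSH), an application of large models to a specific scholarly domain. Highly relevant keywords: (1) 'Large Language Models' (the paper develops SSH-specific language models; 10 points); (2) 'Pre-training' (the paper states that the models are 'fully pretrained'; 10 points); (3) 'AI for Science' (the paper applies AI to the social sciences; 8 points). Other keywords such as MoE, SFT, RAG, and reasoning methods are not mentioned in the abstract and score 0.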

!!! tip deepseek-chat TL;DR

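The paper presents SHARE, the first causal language models pretrained specifically for the social sciences and humanities (SSH), whose performance on SSH text approaches that of a general-purpose model (Phi-4) trained on 100 times more tokens. It also designs MIRROR, an interface that generates no text, to preserve the integrity of SSH principles.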

摘要翻译

本中期技术报告介绍了SHARE系列基础模型及MIRROR用户界面。SHARE模型是首个完全由社会科学与人文科学（Social Sciences and Humanities，SSH）领域研发并为其服务的因果语言模型。根据我们定制的SSH完形填空基准测试显示，其在SSH文本建模方面的性能已接近使用百倍训练数据量的通用模型（如Phi-4）。MIRROR用户界面专为审阅SSH学科文本输入而设计，同时保持批判性参与。通过构建一个不生成任何文本的生成式人工智能界面原型，我们提出了一种既能利用SHARE模型能力，又不会损害SSH原则与规范完整性的实践路径。

摘要 (Abstract)

This intermediate technical report introduces the SHARE family of base models and the MIRROR user interface. The SHARE models are the first causal language models fully pretrained by and for the social sciences and humanities (SSH). Their performance in modelling SSH texts is close to that of general purpose models (Phi-4) which use 100 times more tokens, as shown by our custom SSH Cloze benchmark. The MIRROR user interface is designed for reviewing text inputs from the SSH disciplines while preserving critical engagement. By prototyping a generative AI interface that does not generate any text, we propose a way to harness the capabilities of the SHARE models without compromising the integrity of SSH principles and norms.

Keywords: causal language models, social sciences and humanities, pretrained, base models, generative AI interface, SSH Cloze benchmark, MIRROR user interface, Phi-4 comparison

156. ❌ How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts

Authors: Minh-Vuong Nguyen, Fatemeh Shiri, Zhuang Li, Karin Verspoor Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11133v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

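Scoring rationale: The paper's core topic is evaluating the numerical reasoning of large language models in clinical medicine, so it is highly relevant to 'Large Language Models' and 'AI for Science' (10 points each). It touches on the effect of supervised fine-tuning (SFT) on performance (5 points), evaluates multi-step reasoning (5 points), and attends to factual accuracy (5 points). Other keywords such as MoE, quantization, and RAG are not covered (0 points).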

!!! tip deepseek-chat TL;DR

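The paper evaluates the numerical reasoning of 14 LLMs in clinical contexts. Value retrieval is generally strong (most models exceed 85% accuracy), while relational comparison and aggregation remain hard (some models score below 15%). Fine-tuning on medical data can reduce numeracy by over 30% relative to base models, and models are sensitive to clinical note format.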

摘要翻译

大型语言模型（LLMs）在临床问答和决策支持领域的应用日益受到关注，然而其安全部署的关键在于可靠处理异构临床记录中的患者测量数据。现有针对临床数值推理的LLM评估在操作层面覆盖有限，主要局限于算术计算，且很少评估模型在不同临床记录格式下数值理解的鲁棒性。我们提出了ClinicNumRobBench基准测试，包含1,624个带标准答案的上下文-问题实例，用于评估临床数值能力的四种主要类型：数值提取、算术计算、关系比较和聚合运算。为进行压力测试，该基准采用三种语义等效的表示形式呈现纵向MIMIC-IV生命体征记录（包括源自Open Patients数据集的真实世界临床记录风格变体），并通过42种问题模板生成查询。对14个LLM的实验表明：数值提取能力普遍较强，多数模型准确率超过85%；而关系比较和聚合运算仍具挑战性，部分模型得分低于15%。在医疗数据上的微调可能导致数值能力相较于基础模型下降超过30%，且在临床记录风格变体下的性能下降揭示了LLM对格式的敏感性。ClinicNumRobBench为临床可靠的数值推理提供了严格测试平台。代码与数据可通过https://github.com/MinhVuong2000/ClinicNumRobBench获取。

Abstract

Large Language Models (LLMs) are increasingly being explored for clinical question answering and decision support, yet safe deployment critically requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations of LLMs for clinical numerical reasoning provide limited operation-level coverage, restricted primarily to arithmetic computation, and rarely assess the robustness of numerical understanding across clinical note formats. We introduce ClinicNumRobBench, a benchmark of 1,624 context-question instances with ground-truth answers that evaluates four main types of clinical numeracy: value retrieval, arithmetic computation, relational comparison, and aggregation. To stress-test robustness, ClinicNumRobBench presents longitudinal MIMIC-IV vital-sign records in three semantically equivalent representations, including a real-world note-style variant derived from the Open Patients dataset, and instantiates queries using 42 question templates. Experiments on 14 LLMs show that value retrieval is generally strong, with most models exceeding 85% accuracy, while relational comparison and aggregation remain challenging, with some models scoring below 15%. Fine-tuning on medical data can reduce numeracy relative to base models by over 30%, and performance drops under note-style variation indicate LLM sensitivity to format. ClinicNumRobBench offers a rigorous testbed for clinically reliable numerical reasoning. Code and data URL are available on https://github.com/MinhVuong2000/ClinicNumRobBench.

Keywords: Large Language Models, clinical numeracy, numerical reasoning, clinical question answering, benchmark evaluation, fine-tuning, robustness, MIMIC-IV
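
To make the four operation types named in the abstract concrete, here is a minimal Python sketch over a toy heart-rate record. The readings, timestamps, and variable names are hypothetical and are not taken from ClinicNumRobBench or MIMIC-IV; the mapping of each expression to an operation type is only an illustrative reading of the abstract.

```python
# Toy longitudinal vital-sign record: (time, heart rate in bpm).
# All values are invented for illustration; they are not benchmark data.
readings = [("08:00", 72), ("12:00", 88), ("16:00", 95), ("20:00", 80)]
by_time = dict(readings)

# 1. Value retrieval: look up a single recorded measurement.
retrieved = by_time["12:00"]

# 2. Arithmetic computation: combine two measurements numerically.
change = by_time["16:00"] - by_time["08:00"]

# 3. Relational comparison: decide an ordering between two measurements.
is_higher = by_time["16:00"] > by_time["20:00"]

# 4. Aggregation: summarize the whole series.
values = [value for _, value in readings]
maximum = max(values)
mean = sum(values) / len(values)

print(retrieved, change, is_higher, maximum, mean)
```

Retrieval touches one value, while comparison and aggregation need several values to be located and combined correctly, which is one plausible reason the latter are reported as harder.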

157. ❌ DeCoVec: Building Decoding Space based Task Vector for Large Language Models via In-Context Learning

Authors: Feiyang Li, Yile Wang Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11129v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

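Scoring rationale: The paper's core contribution is a method for constructing task vectors and steering LLMs, so it bears directly on LLMs and in-context learning (ICL); both keywords are rated highly relevant (10 points each). Other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, Fine-tuning, Alignment, RAG, Reasoning, Agents, Compression, and Interpretability, are neither mentioned in the abstract nor relevant to it, so they receive 0 points.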
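
The zero total in the table above follows from the zero weights: however high the relevance, a weight of 0.0 contributes nothing. The sketch below reproduces that arithmetic. The pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) and the cap of 15 points per keyword come from the report's configuration section; the per-keyword product rule in `keyword_score` is an assumption, since the report does not state how weight and relevance are combined.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap stated in the report configuration

def keyword_score(weight: float, relevance: float) -> float:
    # Assumed rule (not stated in the report): scale relevance (0-10)
    # to the per-keyword cap, then multiply by the keyword weight.
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD

def pass_mark(total_weight: float) -> float:
    # Formula given in the report configuration: 5.0 + 0.8 * total weight.
    return 5.0 + 0.8 * total_weight

# (weight, relevance) rows for this paper: two keywords at 10/10,
# the remaining 25 at 0/10, and every weight listed as 0.0.
rows = [(0.0, 10.0), (0.0, 10.0)] + [(0.0, 0.0)] * 25

total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_mark(27.0)
print(total, round(threshold, 1))
```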

!!! tip deepseek-chat TL;DR

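The paper proposes DeCoVec, a training-free, non-invasive framework that steers large language models by using in-context learning to construct task vectors in the decoding space. Experiments on TruthfulQA, Math-500, and AQUA-RAT show that it outperforms standard few-shot baselines and effectively suppresses generation degeneration.

Abstract (Chinese translation)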

任务向量作为编码任务特定行为、在模型或激活空间中表征方向的方法，已成为引导大语言模型（LLMs）的一种有前景的工具。然而，现有方法通常需要对内部状态进行微调或侵入式操控，限制了其灵活性与可扩展性。我们提出 DeCoVec（基于解码空间的任务向量），这是一种无需训练且非侵入式的框架，其通过利用上下文学习（ICL）直接在解码空间中构建任务向量。具体而言，DeCoVec 将任务本质捕捉为少样本提示与零样本提示的输出逻辑值分布之间的差异，然后通过将该向量注入解码过程来引导生成。在 TruthfulQA、Math-500 和 AQUA-RAT 数据集上对七个大语言模型（0.5B–9B）进行的实验表明，DeCoVec 始终优于标准的少样本基线方法，平均准确率最高提升达 +5.50。进一步分析表明，DeCoVec 能有效抑制生成退化与逻辑缺陷，同时对示例顺序表现出很强的鲁棒性，且无需增加额外的输入令牌成本。我们的方法为无需权重更新或辅助模型的大语言模型引导，提供了一种免训练且非侵入式的解决方案。

Abstract

Task vectors, representing directions in model or activation spaces that encode task-specific behaviors, have emerged as a promising tool for steering large language models (LLMs). However, existing approaches typically require fine-tuning or invasive manipulation of internal states, limiting their flexibility and scalability. We propose DeCoVec (Decoding Space based Task Vector), a training-free and non-invasive framework that constructs task vectors directly in the decoding space by leveraging in-context learning (ICL). Specifically, DeCoVec captures the task essence as the difference between the output logit distributions of few-shot and zero-shot prompts, then steers generation by injecting this vector into the decoding process. Experiments across seven LLMs (0.5B–9B) on TruthfulQA, Math-500, and AQUA-RAT show that DeCoVec consistently outperforms standard few-shot baselines, with gains up to +5.50 average accuracy. Further analysis demonstrates that DeCoVec effectively suppresses generation degeneration and logical flaws while exhibiting strong robustness to demonstration ordering, all without incurring additional input token costs. Our method offers a training-free and non-invasive solution for LLM steering without requiring weight updates or auxiliary models.

Keywords: Task vectors, Large language models, In-context learning, Decoding space, Training-free, Non-invasive, LLM steering, Few-shot prompts
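
The core idea in the abstract, a task vector formed as the difference between few-shot and zero-shot output logits and then injected during decoding, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the four-token vocabulary, the logit values, the scaling factor `alpha`, and the choice to add the vector to the zero-shot logits are all hypothetical.

```python
def task_vector(few_shot_logits, zero_shot_logits):
    """Difference between few-shot and zero-shot next-token logits."""
    return [f - z for f, z in zip(few_shot_logits, zero_shot_logits)]

def steer(base_logits, vector, alpha):
    """Inject the task vector into the logits used for decoding.

    alpha is a hypothetical scaling factor; the abstract does not say
    how (or whether) the vector is scaled.
    """
    return [b + alpha * v for b, v in zip(base_logits, vector)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Invented next-token logits over a 4-token vocabulary.
zero_shot = [2.0, 1.0, 0.5, 0.0]
few_shot = [1.0, 3.0, 0.5, 0.0]

vector = task_vector(few_shot, zero_shot)
steered = steer(zero_shot, vector, alpha=0.5)

print(argmax(zero_shot), argmax(steered))
```

In this toy case the unsteered zero-shot logits pick token 0, while the steered logits pick token 1, the token that the few-shot prompt favored.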

Authors: Atharva Gupta, Dhruv Kumar, Yash Sinha Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11121v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

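Scoring rationale: The paper applies a large language model (Qwen 2.5-7B-Instruct) to detecting political polarization on social media, using a two-stage approach of supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO), with LoRA for parameter-efficient fine-tuning. It is therefore highly relevant to 'Large Language Models', 'Post-training/SFT', 'DPO', and 'LoRA/PEFT' (10 points each). Other keywords such as MoE, SLMs, RAG, and quantization are not covered by the paper and receive 0 points.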

!!! tip deepseek-chat TL;DR

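The paper proposes a two-stage approach, structured supervised fine-tuning followed by DPO refinement, for detecting political polarization in social media text. On the English development set, DPO raises recall from 0.5085 to 0.7797 and improves macro-F1 by about 5 points.

Abstract (Chinese translation)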

POLAR SemEval-2026 共享任务旨在检测在线极化现象，并专注于多语言、多文化和多事件的极化分类与识别。由于网络言论中微妙的修辞手法、隐含的框架设定以及人工标注的高昂成本，在线极化的精准计算检测面临挑战。基于近期研究发现上下文提示能使大语言模型成为强大的极化检测器，我们提出了一种两阶段方法，用于检测社交媒体文本中的政治极化，该方法将结构化的监督微调与直接偏好优化（Direct Preference Optimization, DPO）精炼相结合。
我们使用可解释的槽填充模板（包括目标、主张类型、表现清单和理由）并借助 LoRA 技术对 Qwen 2.5-7B-Instruct 模型进行微调。随后，我们应用 DPO 并结合自动生成的偏好对来减少代价高昂的假阴性错误。在 SemEval 2026 POLAR 共享任务数据集上的实验表明，基于偏好的精炼在无需额外标注的情况下，既提高了准确性，又降低了假阴性率。在英文开发集上，DPO 将召回率从 0.5085 提升至 0.7797，并将宏观 F1 分数提高了约 5 个百分点。

Abstract

The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting political polarization in social media text that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) refinement. We fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Experiments on the SemEval 2026 POLAR shared task dataset show that preference-based refinement improves both accuracy and decreases false negatives without extra annotation. On the English development set, DPO increases recall from 0.5085 to 0.7797 and improves macro-F1 by ~5 points.

Keywords: political polarization detection, supervised fine-tuning, Direct Preference Optimization, LoRA, large language models, social media text, false negative reduction, SemEval shared task
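
For readers unfamiliar with the DPO stage mentioned in the abstract, the sketch below computes the standard DPO loss for a single preference pair. It shows the general objective only. The log-probability values, the `beta` value of 0.1, and the reading of the "chosen" response as a correct polarized label and the "rejected" one as a false negative are illustrative assumptions; the abstract does not describe how the preference pairs are generated.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    beta=0.1 is a hypothetical setting, not one reported by the paper.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Before any update the policy equals the reference, so the margin is 0.
loss_at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# After the policy shifts probability toward the chosen response
# (here imagined as the correct "polarized" label), the loss falls.
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)

print(loss_at_start, loss_after_shift)
```

At initialization the loss equals log 2 (about 0.693). It decreases as the policy assigns relatively more probability to the chosen response than the reference model does.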

159. ❌ Use of AI Tools: Guidelines to Maintain Academic Integrity in Computing Colleges

Authors: Hatem M. El-boghdadi, Toqeer Ali Syed, Ali Akarma, Qamar Wali Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11111v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

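Scoring rationale: The paper discusses how the use of AI tools (ChatGPT in particular) in computing education affects academic integrity, and proposes guidelines and an assessment framework in response. It is a discussion of AI in education rather than original research on large models or deep learning techniques, so it is unrelated to most of the technical keywords (MoE, Scaling Laws, PEFT, RAG, and so on; 0 points). The only relevant keyword is 'Large Language Models OR LLMs OR Foundation Models': the paper uses ChatGPT, which is built on a large language model, as its example of an AI tool, but treats it only as background and does not examine LLM principles or innovations, hence 5 points (some relevance). No other keyword appears in the title or abstract or relates to the paper's topic.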

!!! tip deepseek-chat TL;DR

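The paper examines the challenges that widespread use of AI tools such as ChatGPT poses to academic integrity in computing colleges. It proposes general guidelines and assessment-specific recommendations to help instructors keep the pedagogical benefits of AI tools while maintaining academic integrity, and it introduces a formal mathematical model for evaluating student assessments in the presence of AI-assisted tools.

Abstract (Chinese translation)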

以ChatGPT为代表的人工智能工具的迅速普及已显著改变了学术实践，为计算学科的学生和教师带来了可观的益处。研究表明，这些工具能够提升学习效率、学术自我效能感与自信心。然而，其日益广泛的应用也引发了关于如何维护学术诚信——这一教育过程核心支柱——的迫切担忧。本文探讨了人工智能工具在计算学院广泛使用所带来的影响，尤其关注如何使其应用与学术诚信原则相协调。我们首先对计算教育中常用的评估技术进行分类，并逐一考察各类评估如何受到人工智能辅助工具的影响。在此基础上，我们提出一套适用于不同评估形式的通用准则，以帮助教师负责任地将人工智能工具融入教学实践。此外，我们还提供了针对特定评估类型的细化建议，旨在保障教育目标的同时降低学术不端风险。这些准则为教育工作者提供了一个实用框架，帮助其在计算教育中平衡人工智能工具的教学优势与维护学术诚信的必要性。最后，我们引入一个形式化模型，该模型为在人工智能辅助工具存在的情况下评估学生学业表现提供了一个结构化的数学框架。

Abstract

The rapid adoption of AI tools such as ChatGPT has significantly transformed academic practices, offering considerable benefits for both students and faculty in computing disciplines. These tools have been shown to enhance learning efficiency, academic self-efficacy, and confidence. However, their increasing use also raises pressing concerns regarding the preservation of academic integrity – an essential pillar of the educational process. This paper explores the implications of widespread AI tool usage within computing colleges, with a particular focus on how to align their use with the principles of academic honesty. We begin by classifying common assessment techniques employed in computing education and examine how each may be impacted by AI-assisted tools. Building on this foundation, we propose a set of general guidelines applicable across various assessment formats to help instructors responsibly integrate AI tools into their pedagogy. Furthermore, we provide targeted, assessment-specific recommendations designed to uphold educational objectives while mitigating risks of academic misconduct. These guidelines serve as a practical framework for instructors aiming to balance the pedagogical advantages of AI tools with the imperative of maintaining academic integrity in computing education. Finally, we introduce a formal model that provides a structured mathematical framework for evaluating student assessments in the presence of AI-assisted tools.

Keywords: AI tools, academic integrity, computing education, assessment techniques, guidelines, ChatGPT, pedagogical advantages, formal model

160. ❌ ks-pret-5m: a 5 million word, 12 million token kashmiri pretraining dataset

Authors: Haq Nawaz Malik, Nahfid Nissar Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11066v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

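Scoring rationale: The paper's core contribution is KS-PRET-5M, a large-scale pretraining dataset for Kashmiri, together with a detailed account of data collection, cleaning, and tokenization. It is therefore highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points), since the dataset is designed for language model pretraining. It has some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points), because the dataset can be used to train such models, and to 'Scaling Laws AND Data Quality' (5 points), because the paper emphasizes data quality and scale (5 million words, 12 million tokens), which touches on the data-quality side of scaling laws. The paper does not address the other keywords, such as model architecture (MoE, SLMs), training techniques (SFT, RLHF, PEFT), inference optimization, agent systems, or applications in specific scientific fields, so these receive 0 points.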

!!! tip deepseek-chat TL;DR

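The study builds and releases KS-PRET-5M, a Kashmiri pretraining dataset of about 5.09 million words and about 12.13 million subword tokens. A multi-stage cleaning pipeline yields high script purity (a mean Kashmiri script ratio of 0.9965), and the dataset is intended to support language model pretraining and computational linguistics research for Kashmiri.

Abstract (Chinese translation)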

我们发布了KS-PRET-5M——目前最大的公开可用克什米尔语预训练数据集，包含5,090,244（509万）个单词、27,692,959（2760万）个字符以及295,433（29.5万）个独立词型构成的词汇表。该数据集通过两类来源构建：其一是数字化档案与文学材料（涵盖文学、新闻、传记、小说、诗歌、宗教文献及学术著作），这些材料使用Malik~\cite{malik2024inpage}开发的转换工具从专有的InPage桌面出版格式中提取；其二是从克什米尔语网络资源收集的原生Unicode文本。所有文本均经过包含十一个步骤的清洗流程处理，使克什米尔文字符平均占比达到0.9965，全数据集中的天城文字符污染被降至146个字符。我们采用google/muril-base-cased模型对数据集进行经验性分词，得到每个单词平均对应2.383个子词单元，总计约1213万子词标记，该数值显著高于此前基于非克什米尔语波斯-阿拉伯文字类比所得的预估。KS-PRET-5M以CC BY 4.0协议发布为连续文本流，旨在支持克什米尔语的语言模型预训练、分词器训练及计算语言学研究。

Abstract

We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.6M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: digitized archival and literary material, encompassing literature, news, biographies, novels, poetry, religious scholarship, and academic writing, recovered from the proprietary InPage desktop-publishing format using the converter of Malik~\cite{malik2024inpage}, and Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens, substantially higher than prior estimates derived from non-Kashmiri Perso-Arabic analogues. KS-PRET-5M is released as a single continuous text stream under CC BY 4.0 to support language model pretraining, tokenizer training, and computational linguistic research for Kashmiri.

Keywords: Kashmiri language, pretraining dataset, data cleaning, tokenization, language model, computational linguistics, text corpus, subword tokens
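
The token total quoted in the abstract can be checked from the word count and the subword ratio, and the two script statistics can be illustrated with simple Unicode range tests. The sketch below is a rough illustration under assumptions: the abstract does not define how the "Kashmiri script ratio" is computed, so `arabic_block_ratio` (share of non-whitespace characters in the basic Arabic block U+0600 to U+06FF) is a hypothetical stand-in, and the sample string is invented.

```python
WORDS = 5_090_244        # word count reported in the abstract
TOKENS_PER_WORD = 2.383  # subword ratio reported in the abstract

# 5,090,244 words x 2.383 tokens/word is roughly 12.13 million tokens.
estimated_tokens = WORDS * TOKENS_PER_WORD

def in_range(ch, lo, hi):
    return lo <= ord(ch) <= hi

def count_devanagari(text):
    """Count characters in the Devanagari block (U+0900 to U+097F)."""
    return sum(1 for ch in text if in_range(ch, 0x0900, 0x097F))

def arabic_block_ratio(text):
    """Hypothetical script ratio: share of non-whitespace characters in
    the basic Arabic block. This simplification ignores the Arabic
    Supplement and Extended blocks, and the paper's own definition
    may differ."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if in_range(ch, 0x0600, 0x06FF))
    return hits / len(chars)

# Invented sample: 4 Arabic-block letters, 3 Latin letters, 2 Devanagari.
sample = "\u0633\u0644\u0627\u0645 abc \u0928\u092e"

print(round(estimated_tokens / 1e6, 2))
print(count_devanagari(sample), arabic_block_ratio(sample))
```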

161. ❌ A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities

Authors: Jiaqi Chen, Ming Wang, Tingna Xie, Shi Feng, Yongkang Liu Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11048v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

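Scoring rationale: The paper studies how persona steering of LLMs affects their cognitive capabilities, so it is highly relevant to 'Large Language Models' (10 points). It evaluates cognitive tasks such as instruction following and complex reasoning, which gives it some relevance to 'Instruction Tuning' (5 points), 'Chain of Thought' (5 points), and 'System 2 Thinking' (5 points), since these concern LLM reasoning and instruction-following ability. Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, and RLHF are not covered by the paper and receive 0 points.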

!!! tip deepseek-chat TL;DR

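The paper studies how inducing specific personality traits in large language models affects their cognitive capabilities. It finds that persona induction produces stable, reproducible shifts in cognitive task performance, and it proposes Dynamic Persona Routing (DPR), a query-adaptive strategy that outperforms the best static persona without additional training.

Abstract (Chinese translation)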

为大型语言模型（LLM）赋予特定人设已成为定制交互风格的常见做法，但其对底层认知能力的影响尚未得到探索。本研究采用基于神经元的人格特质诱导（Neuron-based Personality Trait Induction, NPTI）框架，在LLM中诱导大五人格特质，并在六项认知基准测试中评估其表现。研究发现，人设诱导不仅引发表层风格变化，还会在认知任务表现上产生稳定、可复现的偏移。这些效应表现出强烈的任务依赖性：某些人格特质能在指令遵循任务上带来持续增益，而另一些则会损害复杂推理能力。效应强度随特质维度呈现系统性变化，其中开放性和外向性维度的影响最为显著。此外，LLM的表现变化与人类人格-认知关联方向的一致性达到73.68%。基于这些规律，我们提出动态人设路由（Dynamic Persona Routing, DPR）——一种轻量级的查询自适应策略，该策略无需额外训练即可超越最佳静态人设的表现。

Abstract

Imbuing Large Language Models (LLMs) with specific personas is prevalent for tailoring interaction styles, yet the impact on underlying cognitive capabilities remains unexplored. We employ the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five personality traits in LLMs and evaluate performance across six cognitive benchmarks. Our findings reveal that persona induction produces stable, reproducible shifts in cognitive task performance beyond surface-level stylistic changes. These effects exhibit strong task dependence: certain personalities yield consistent gains on instruction-following, while others impair complex reasoning. Effect magnitude varies systematically by trait dimension, with Openness and Extraversion exerting the most robust influence. Furthermore, LLM effects show 73.68% directional consistency with human personality-cognition relationships. Capitalizing on these regularities, we propose Dynamic Persona Routing (DPR), a lightweight query-adaptive strategy that outperforms the best static persona without additional training.

Keywords: Large Language Models, Persona Steering, Cognitive Capabilities, Personality Traits, Dynamic Persona Routing, Instruction Following, Complex Reasoning, Neuron-based Personality Trait Induction

162. ❌ Uncertainty-Aware Web-Conditioned Scientific Fact-Checking

Authors: Ashwin Vinod, Katrin Erk Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11036v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

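Scoring rationale: The paper focuses on scientific fact-checking, particularly in biomedicine and materials science, and proposes a pipeline built on atomic predicate decomposition and uncertainty-gated verification. Most keywords are unrelated: they concern large-model internals, training methods, inference optimization, agent systems, and similar topics that the paper does not explicitly use or study. Relevant keywords: 1) 'Hallucination Mitigation OR Factuality OR Truthfulness' (10 points): the paper directly targets hallucination and inconsistent reasoning in scientific fact-checking. 2) 'Mechanistic Interpretability OR Explainable AI' (5 points): the system design emphasizes interpretability and traceability, giving some relevance. 3) 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points): the method is applied directly to scientific domains such as biomedicine and materials science.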

!!! tip deepseek-chat TL;DR

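The paper addresses accuracy problems in scientific fact-checking caused by hallucination and inconsistent reasoning. Its pipeline, built on atomic predicate-argument decomposition and uncertainty-gated corroboration, surpasses the strongest existing methods on multiple benchmarks and yields more interpretable, context-conditioned verification.

Abstract (Chinese translation)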

科学事实核查对于评估生物医学和材料科学等专业领域的主张至关重要，但现有系统常产生幻觉或应用不一致的推理，尤其在验证技术性、构成性主张时，需在来源与成本/延迟限制下依据证据片段进行判断。我们提出一个以原子谓词-论元分解和经校准的不确定性门控确证为核心的流程：原子事实通过嵌入向量与局部证据片段对齐，由紧凑的基于证据的核查器验证，仅对支持度不确定的事实触发针对权威来源的领域受限网络搜索。该系统支持二元及三元分类，在三元任务中可预测“支持”“反驳”或“无足够信息”标签。我们在两种模式下进行评估：仅上下文（无网络搜索）和上下文+网络（不确定性门控网络确证）；当检索到的证据与给定上下文冲突时，系统以“无足够信息”弃权而非覆盖上下文。在多个基准测试中，我们的框架超越了现有最强基准。实验表明，网络确证平均仅对少数原子事实触发，这表明外部证据是在校准的不确定性下有选择地调用，而非例行查询。总体而言，将原子粒度与校准的不确定性门控确证相结合，产生了更具可解释性且受上下文约束的核查机制，使该方法非常适合高风险、单文档场景，此类场景要求可追溯的推理依据、可预测的成本/延迟以及保守的决策。

Abstract

Scientific fact-checking is vital for assessing claims in specialized domains such as biomedicine and materials science, yet existing systems often hallucinate or apply inconsistent reasoning, especially when verifying technical, compositional claims against an evidence snippet under source and cost/latency constraints. We present a pipeline centered on atomic predicate-argument decomposition and calibrated, uncertainty-gated corroboration: atomic facts are aligned to local snippets via embeddings, verified by a compact evidence-grounded checker, and only facts with uncertain support trigger domain-restricted web search over authoritative sources. The system supports both binary and tri-valued classification where it predicts labels from Supported, Refuted, NEI for three-way tasks. We evaluate under two regimes, Context-Only (no web) and Context+Web (uncertainty-gated web corroboration); when retrieved evidence conflicts with the provided context, we abstain with NEI rather than overriding the context. On multiple benchmarks, our framework surpasses the strongest benchmarks. In our experiments, web corroboration was invoked for only a minority of atomic facts on average, indicating that external evidence is consulted selectively under calibrated uncertainty rather than routinely. Overall, coupling atomic granularity with calibrated, uncertainty-gated corroboration yields more interpretable and context-conditioned verification, making the approach well-suited to high-stakes, single-document settings that demand traceable rationales, predictable cost/latency, and conservative.

Keywords: scientific fact-checking, uncertainty-gated corroboration, atomic predicate decomposition, hallucination mitigation, biomedicine, materials science, evidence verification, interpretable verification
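
The uncertainty gate described in the abstract, where web search is triggered only for atomic facts with uncertain support and the system abstains with NEI when web evidence conflicts with the context, might look roughly like the sketch below. Everything concrete here is an assumption made for illustration: the function name `verify_fact`, the thresholds 0.3 and 0.7, the use of a single support probability, and the rule for deciding what counts as a conflict are not taken from the paper.

```python
SUPPORTED, REFUTED, NEI = "Supported", "Refuted", "NEI"

def verify_fact(support_prob, web_search=None, low=0.3, high=0.7):
    """Label one atomic fact.

    support_prob: hypothetical calibrated probability, from a
        context-grounded checker, that the snippet supports the fact.
    web_search: optional zero-argument callable returning a label;
        None models the Context-Only regime.
    low, high: hypothetical bounds of the "uncertain" band.
    """
    if support_prob >= high:
        return SUPPORTED
    if support_prob <= low:
        return REFUTED
    if web_search is None:
        return NEI  # uncertain and no web access
    # Uncertain band: consult the web, but never override the context.
    context_leaning = SUPPORTED if support_prob >= 0.5 else REFUTED
    web_label = web_search()
    return web_label if web_label == context_leaning else NEI

calls = []

def fake_search():
    # Stand-in for a domain-restricted web lookup; records each call.
    calls.append(1)
    return SUPPORTED

labels = [verify_fact(p, fake_search) for p in (0.95, 0.6, 0.1)]
print(labels, len(calls))
```

In this run only the middle fact falls in the uncertain band, so the stand-in search is invoked once. This mirrors the abstract's observation that web corroboration is consulted selectively rather than routinely.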

163. ❌ Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds

Authors: Jihoon Jeong Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11050v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

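Scoring rationale: The paper studies the geometry of emotion representations in small language models (SLMs), so it is highly relevant to 'Small Language Models OR SLMs OR On-device AI' (10 points). It examines the effect of RLHF on model representations, giving some relevance to 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (5 points). It compares fp16 and INT8 precision, which relates to 'Quantization OR Model Compression OR Low-bit Weights' (5 points). It analyzes internal model representations, which relates to 'Mechanistic Interpretability OR Explainable AI' (5 points). Large models are mentioned but are not central, so the link to 'Large Language Models OR LLMs OR Foundation Models' is weak (5 points). Other keywords such as MoE, Scaling Laws, RAG, and Agents are not covered and receive 0 points.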

!!! tip deepseek-chat TL;DR

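The study analyzes the geometry of emotion representations in 12 small language models. It finds that the mature architectures share nearly identical emotion geometry and that RLHF restructures only representations that are not yet organized, and it shows that a method effect reported in prior work decomposes into four distinct layers.

Abstract (Chinese translation)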

我们在fp16精度下，通过统一的理解模式流程从十二个小规模语言模型（六种架构 × 基础/指导版本，参数量1B-8B）中提取了21维情感向量集，并基于原始余弦表征相似性矩阵进行几何结构比较。五种成熟架构（Qwen 2.5 1.5B、SmolLM2 1.7B、Llama 3.2 3B、Mistral 7B v0.3、Llama 3.1 8B）展现出近乎一致的21维情感几何结构，其表征相似性矩阵的斯皮尔曼相关系数介于0.74-0.92之间。这种普遍性存在于行为特征完全对立的模型中：Qwen 2.5与Llama 3.2虽在MTI Compliance维度上处于两极，却产生几乎相同的情感表征相似性矩阵（ρ=0.81），表明行为维度差异形成于共享的情感表征层之上。数据集中唯一未成熟案例Gemma-3 1B基础版表现出极端的残差流各向异性（0.997），其几何描述符在RLHF过程中被全面重构；而五个已成熟模型族内部的基础版与指导版表征相似性矩阵相关系数均达ρ≥0.92（Mistral 7B v0.3为ρ=0.985），说明RLHF仅重构尚未形成组织的表征。方法论层面，我们揭示先前研究视为单一“理解vs生成”方法效应的现象实际可解构为四个独立层面——粗粒度的方法依赖性分离、生成模式内稳健的子参数敏感性、真实的精度（fp16 vs INT8）效应，以及因模型不同而产生反向扭曲的混淆跨实验偏差——因此若未进行分层解构，仅凭两个既往情感向量研究间的单一相关系数不足以支撑可靠结论。

Abstract

We extract 21-emotion vector sets from twelve small language models (six architectures x base/instruct, 1B-8B parameters) under a unified comprehension-mode pipeline at fp16 precision, and compare the resulting geometries via representational similarity analysis on raw cosine RDMs. The five mature architectures (Qwen 2.5 1.5B, SmolLM2 1.7B, Llama 3.2 3B, Mistral 7B v0.3, Llama 3.1 8B) share nearly identical 21-emotion geometry, with pairwise RDM Spearman correlations of 0.74-0.92. This universality persists across diametrically opposed behavioral profiles: Qwen 2.5 and Llama 3.2 occupy opposite poles of MTI Compliance facets yet produce nearly identical emotion RDMs (rho = 0.81), so behavioral facet differences arise above the shared emotion representation. Gemma-3 1B base, the one immature case in our dataset, exhibits extreme residual-stream anisotropy (0.997) and is restructured by RLHF across all geometric descriptors, whereas the five already-mature families show within-family base x instruct RDM correlations of rho >= 0.92 (Mistral 7B v0.3 at rho = 0.985), suggesting RLHF restructures only representations that are not yet organized. Methodologically, we show that what prior work has read as a single comprehension-vs-generation method effect in fact decomposes into four distinct layers – a coarse method-dependent dissociation, robust sub-parameter sensitivity within generation, a true precision (fp16 vs INT8) effect, and a conflated cross-experiment bias that distorts in opposite directions for different models – so that a single rho between two prior emotion-vector studies is not a safe basis for interpretation without the layered decomposition.

Keywords: small language models, emotion representation, representational similarity analysis, RLHF, model architectures, geometric descriptors, precision effects, cross-architecture study
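The comparison described in the abstract above (cosine RDMs compared by Spearman correlation) can be illustrated with a minimal sketch. It is not the authors' pipeline: the three toy vectors per "model", the function names, and the no-ties assumption in the rank correlation are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rdm_upper_triangle(vectors):
    """Flatten the pairwise cosine-distance matrix (RDM) to its upper triangle."""
    return [cosine_distance(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]


def spearman_no_ties(x, y):
    """Spearman rho via the rank-difference formula; assumes no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    n = len(x)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))


def rdm_similarity(vectors_a, vectors_b):
    """Second-order similarity: rank-correlate the two flattened RDMs."""
    return spearman_no_ties(rdm_upper_triangle(vectors_a),
                            rdm_upper_triangle(vectors_b))


# Hypothetical toy "emotion vectors" (three per model; the paper uses 21).
model_a = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]
# Same geometry as model_a, embedded in 3-D and rescaled.
model_b = [[0.0, 2.0, 0.0], [0.0, 4.0, 2.0], [0.0, 0.0, 2.0]]
# A different arrangement of the three vectors.
model_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]

same_geometry = rdm_similarity(model_a, model_b)
different_geometry = rdm_similarity(model_a, model_c)
```

Because the comparison runs on distances between vectors rather than on the vectors themselves, models with different hidden sizes can be compared directly, which is what makes a cross-architecture study of this kind possible.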

164. ❌ When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies

Authors: Zhengzhe Yang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10996v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

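Scoring rationale: The paper's core subject is the use of LLMs as feature extractors for reinforcement learning trading agents. It is highly relevant to "Large Language Models" (10 points) because the LLM is a core component, and moderately relevant to "LLM Agents" (5 points) because the study involves LLM-augmented RL agents. Other keywords such as MoE, SFT, and RAG are not addressed and score 0.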

!!! tip deepseek-chat TL;DR

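This paper studies whether LLM-generated numerical features can improve reinforcement learning trading agents. It finds that although optimized prompts yield predictive features, under a distribution shift caused by a macroeconomic shock those features add noise and the agent underperforms a price-only baseline, exposing a gap between feature validity and policy robustness that resembles known transfer-learning challenges.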

Abstract (Chinese translation)

大型语言模型（LLM）能否生成可提升强化学习（RL）交易代理性能的连续数值特征？我们构建了一个模块化流程，其中冻结的LLM作为无状态特征提取器，将非结构化的每日新闻和文件转化为固定维度的向量，供下游PPO代理使用。我们引入了一种自动化提示优化循环，将提取提示视为离散超参数，并直接针对信息系数——即预测收益与实际收益之间的斯皮尔曼秩相关性——进行优化，而非基于自然语言处理损失进行调优。优化后的提示发现了真正具有预测性的特征（在保留数据上信息系数高于0.15）。然而，这些有效的中间表征并不会自动转化为下游任务性能：在宏观经济冲击导致分布偏移期间，LLM衍生的特征会引入噪声，增强后的代理表现不及仅使用价格数据的基线模型。在更平稳的测试环境中，代理性能有所恢复，但宏观经济状态变量仍是策略改进最稳健的驱动因素。我们的研究结果揭示了特征层面有效性与策略层面鲁棒性之间的差距，这与分布偏移下迁移学习中的已知挑战相呼应。

Abstract

Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient - the Spearman rank correlation between predicted and realized returns - rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above 0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent under-performs a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.

Keywords: Large Language Models, Reinforcement Learning, Trading Agents, Feature Extraction, Prompt Optimization, Distribution Shift, Macroeconomic Shock, Policy Robustness
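The abstract defines the Information Coefficient (IC) as the Spearman rank correlation between predicted and realized returns. The following is a minimal sketch of that definition only, not the paper's prompt-optimization loop; the function names and the four-asset toy data are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (spread_x * spread_y) ** 0.5


def information_coefficient(predicted, realized):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(realized))


# Hypothetical toy data: one LLM-derived signal and realized returns for four assets.
predicted_signal = [0.1, 0.4, 0.3, 0.8]
realized_return = [0.01, 0.02, 0.05, 0.07]
ic = information_coefficient(predicted_signal, realized_return)
```

Since the IC depends only on rank order, it rewards a feature for sorting assets correctly and ignores the scale of the feature. That is one plausible reason a feature can score well on IC and still not help a downstream policy.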

165. ❌ K-Way Energy Probes for Metacognition Reduce to Softmax in Discriminative Predictive Coding Networks

Authors: Jon-Paul Cacioli Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11011v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

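Scoring rationale: The paper studies the relationship between K-way energy probes and softmax in predictive coding networks (PCNs), which falls under theoretical analysis of neural networks. All scoring keywords target innovations in the technical principles of large models and deep learning, or their applications in science, whereas this paper focuses on the theoretical properties of one specific architecture (PCNs). It does not touch on large models, LLMs, MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, inference acceleration, CoT, System 2 thinking, MCTS, self-correction, agents, tool use, multi-agent systems, quantization, speculative decoding, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science. All keyword relevance scores are therefore 0.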

!!! tip deepseek-chat TL;DR

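This paper argues, via an approximate reduction rather than a formal bound, that in discriminative predictive coding networks the K-way energy probe essentially reduces to softmax, and it tests this decomposition experimentally on CIFAR-10.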

Abstract (Chinese translation)

本文呈现的是一项包含解释机制的负面结果，而非形式化的上界。预测编码网络（PCNs）允许进行K类能量探测：将每个候选类别固定为目标，运行推理至稳定状态，随后比较各假设的稳定能量。该探测方法看似读取比softmax更丰富的信号源，因为各假设能量依赖于完整的生成链。
我们认为，在标准的Pinchetti式判别性PC框架下，这种表象具有误导性。我们提出一种近似归约分析表明：在目标钳位的交叉熵能量训练及有效前馈的潜在动态条件下，K类能量边界可分解为对数softmax边界的单调函数，加上一个未经过训练以与正确性相关联的残差项。该分解预测结构探测结果应从下方趋近softmax。
我们在CIFAR-10数据集上通过六种条件验证此结论：扩展的确定性训练、推理过程中潜在运动的直接测量、反向传播网络的事后解码器公平性控制、匹配计算预算的PC与BP对比、五点朗之万温度扫描，以及轨迹积分MCPC训练。所有条件下探测值均位于softmax下方。该差距在判别性PC体系内的不同训练过程中保持稳定。终态训练与轨迹积分训练产生的探测值在确定性评估中，其AUROC_2差异小于10^-3。
实证研究规模有限：单次随机种子、210万参数网络、1280张测试图像。我们将此结果定位为邀请复现的预印本研究。文中讨论了该分解不适用的条件（双向PC、前瞻配置、生成式PC、非交叉熵能量形式），并指出了本分析未排除的有效结构探测发展方向。

Abstract

We present this as a negative result with an explanatory mechanism, not as a formal upper bound. Predictive coding networks (PCNs) admit a K-way energy probe in which each candidate class is fixed as a target, inference is run to settling, and the per-hypothesis settled energies are compared. The probe appears to read a richer signal source than softmax, since the per-hypothesis energy depends on the entire generative chain. We argue this appearance is misleading under the standard Pinchetti-style discriminative PC formulation. We present an approximate reduction showing that with target-clamped CE-energy training and effectively-feedforward latent dynamics, the K-way energy margin decomposes into a monotone function of the log-softmax margin plus a residual that is not trained to correlate with correctness. The decomposition predicts that the structural probe should track softmax from below. We test this across six conditions on CIFAR-10: extended deterministic training, direct measurement of latent movement during inference, a post-hoc decoder fairness control on a backpropagation network, a matched-budget PC vs BP comparison, a five-point Langevin temperature sweep, and trajectory-integrated MCPC training. In every condition the probe sat below softmax. The gap was stable across training procedures within the discriminative PC family. Final-state and trajectory-integrated training produced probes whose AUROC_2 values differed by less than 10^-3 at deterministic evaluation. The empirical regime is small: single seed, 2.1M-parameter network, 1280 test images. We frame the result as a preprint inviting replication. We discuss conditions under which the decomposition does not apply (bidirectional PC, prospective configuration, generative PC, non-CE energy formulations) and directions for productive structural probing the analysis does not foreclose.

Keywords: predictive coding networks, K-way energy probe, softmax, discriminative PC, energy margin, latent dynamics, CIFAR-10, AUROC

166. ❌ When Verification Fails: How Compositionally Infeasible Claims Escape Rejection

Authors: Muxin Liu, Delip Rao, Grace Kim, Chris Callison-Burch Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10990v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

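Scoring rationale: The paper studies scientific claim verification, which involves applying AI models in science (AI for Science), and it concerns model factuality and truthfulness (Hallucination Mitigation) and explainability (Explainable AI). It tests multiple model families, including large language models (LLMs), so it is moderately relevant to these keywords (5 points each). Other keywords such as MoE, quantization, inference acceleration, and alignment are not addressed and score 0.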

!!! tip deepseek-chat TL;DR

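This paper shows that existing scientific claim verification benchmarks cannot distinguish rigorous verification from a shortcut that checks only the most salient constraint. Using constructed compositionally infeasible claims, it finds that models broadly over-accept such claims, indicating a structural bottleneck in current verification behavior.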

Abstract (Chinese translation)

科学主张验证，即判定主张是否被科学证据所蕴含的任务，是依据证据确立科学发现并防止错误信息的基础。这一过程涉及依据已验证证据对主张中提出的每个约束条件进行评估。在封闭世界假设下，当且仅当所有提出的约束条件均得到正面支持时，一个主张才会被接受。我们证明，现有的验证基准无法区分严格执行此标准的模型与采用一种更简单捷径（称为显著约束检查）的模型，后者仅对最显著的约束应用封闭世界假设的拒绝准则，并在该约束得到支持时即接受主张。由于现有基准通过扰动单个显著元素来构建不可行主张，它们不足以区分严谨的主张验证与简单的显著约束依赖。为了区分二者，我们构建了组合式不可行主张，其中显著约束得到支持但一个非显著约束存在矛盾。在不同模型家族和模态中，那些在现有基准上表现饱和的模型持续过度接受此类主张，证实了此类捷径推理的普遍性。通过模型上下文干预，我们发现不同模型和提示策略在一条共享的ROC曲线上占据不同位置，这表明模型家族间的差距反映的是验证阈值的差异而非底层推理能力的不同，并且组合推理瓶颈是当前验证行为的一种结构性特征，仅靠策略指导无法克服。

Abstract

Scientific claim verification, the task of determining whether claims are entailed by scientific evidence, is fundamental to establishing discoveries in evidence while preventing misinformation. This process involves evaluating each asserted constraint against validated evidence. Under the Closed-World Assumption (CWA), a claim is accepted if and only if all asserted constraints are positively supported. We show that existing verification benchmarks cannot distinguish models enforcing this standard from models applying a simpler shortcut called salient-constraint checking, which applies CWA’s rejection criterion only to the most salient constraint and accepts when that constraint is supported. Because existing benchmarks construct infeasible claims by perturbing a single salient element, they are insufficient for distinguishing between rigorous claim verification and simple salient-constraint reliance. To separate the two, we construct compositionally infeasible claims where the salient constraint is supported but a non-salient constraint is contradicted. Across model families and modalities, models that otherwise saturate existing benchmarks consistently over-accept these claims, confirming the prevalence of such shortcut reasoning. Via model context interventions, we show that different models and prompting strategies occupy distinct positions on a shared ROC curve, indicating that the gap between model families reflects differences in verification threshold rather than underlying reasoning ability, and that the compositional inference bottleneck is a structural property of current verification behavior that strategy guidance alone cannot overcome.

Keywords: scientific claim verification, compositionally infeasible claims, salient-constraint checking, Closed-World Assumption, model evaluation, benchmark limitations, reasoning shortcuts, verification behavior
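The difference between the strict Closed-World Assumption rule and the salient-constraint shortcut described in the abstract can be made concrete with a small sketch. The `Constraint` record, its `salience` score, and the example claim are hypothetical illustrations, not structures from the paper.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    text: str
    supported: bool   # True only if the evidence positively supports it
    salience: float   # hypothetical prominence score within the claim


def strict_cwa_accept(constraints):
    """Accept if and only if every asserted constraint is positively supported."""
    return all(c.supported for c in constraints)


def salient_shortcut_accept(constraints):
    """Shortcut: check only the most salient constraint (requires a non-empty list)."""
    most_salient = max(constraints, key=lambda c: c.salience)
    return most_salient.supported


# Hypothetical compositionally infeasible claim: the salient part is supported,
# but a less prominent constraint is contradicted by the evidence.
claim = [
    Constraint("drug X lowers blood pressure", supported=True, salience=0.9),
    Constraint("in patients under 12", supported=False, salience=0.2),
]

strict_decision = strict_cwa_accept(claim)          # rejects the claim
shortcut_decision = salient_shortcut_accept(claim)  # wrongly accepts it
```

On a benchmark that only perturbs the salient element, both rules return the same decision, which is why such a benchmark cannot tell them apart.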

167. ❌ Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models

Authors: Sameera Horawalavithana, Lauren Phillips, Ian Stewart, Sai Munikoti, Karl Pazdernik Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10985v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

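Scoring rationale: The paper's core subject is how pretrained LLM backbones (the LLAMA series) affect the fine-tuning of vision-language models (VLMs). It is highly relevant to "Large Language Models" (10 points), and "Post-training/SFT" is the core method (10 points). "Pre-training/Domain Adaptation" (5 points) and "Instruction Tuning/Alignment" (5 points) are relevant because the work involves the evolution of LLM backbones and VLM alignment. "Chain of Thought/Reasoning" (5 points) and "Mechanistic Interpretability" (5 points) are relevant because the paper analyzes model reasoning and information processing. Other keywords such as MoE, SLMs, RAG, and quantization are not addressed and score 0.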

!!! tip deepseek-chat TL;DR

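This study systematically evaluates how pretrained LLM backbones (LLAMA-1/2/3) affect downstream vision-language model task performance. It finds that newer LLM backbones do not always improve VLM performance: the effect depends on the task, and newer backbones may solve different questions rather than more questions, which stems from differences in how the models process information.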

Abstract (Chinese translation)

视觉语言模型（Vision-Language Models, VLMs）通过利用强大的预训练大语言模型（Large Language Models, LLMs）作为核心推理主干而迅速发展。随着新一代能力更强、推理能力、指令遵循和泛化性能更优的LLMs不断涌现，迫切需要高效更新现有VLMs以融合这些进步。然而，将新LLMs整合到VLMs中，特别是不断演进的LLMs如何促进多模态推理、对齐以及特定任务性能，目前仍缺乏深入探索。鉴于预训练LLM主干的快速演进，解决这一空白对于VLM的发展至关重要。本研究对预训练LLM主干的变化如何影响下游VLM任务性能进行了受控且系统的研究。通过保持视觉编码器、训练数据和训练后算法在基于LLAMA-1、LLAMA-2和LLAMA-3的VLMs中一致，我们发现较新的LLM主干并不总是带来更好的VLMs，其性能取决于下游VLM任务。例如，在视觉问答任务中，较新的LLM主干倾向于解决不同的问题，而不仅仅是更多的问题；我们的分析表明，这源于模型处理信息方式的差异，包括更好的校准置信度和更稳定的内部表征。我们还发现，某些VLM能力仅在最新一代LLM中出现，而主要依赖视觉理解的任务则从较新的LLM主干中获益甚微。

Abstract

Vision-Language Models (VLMs) have rapidly advanced by leveraging powerful pre-trained Large Language Models (LLMs) as core reasoning backbones. As new and more capable LLMs emerge with improved reasoning, instruction-following, and generalization, there is a pressing need to efficiently update existing VLMs to incorporate these advancements. However, the integration of new LLMs into VLMs, particularly how the evolving LLMs contribute to multimodal reasoning, alignment, and task-specific performance, remains underexplored. Addressing this gap is important for VLM development, given the rapid evolution of pretrained LLM backbones. This study presents a controlled and systematic investigation of how changes in the pretrained LLM backbone affect downstream VLM task performance. By keeping the vision encoder, training data, and post-training algorithm the same across LLAMA-1, LLAMA-2, and LLAMA-3 based VLMs, we find that newer LLM backbones do not always lead to better VLMs; performance depends on the downstream VLM task. For example, in visual question answering tasks, newer LLM backbones tend to solve different questions rather than just more questions, and our analysis shows this is driven by differences in how the models process information, including better calibrated confidence and more stable internal representations. We also find that some VLM capabilities appear only in the newest LLM generation, while tasks that depend mainly on visual understanding see little benefit from a newer LLM backbone.

Keywords: Vision-Language Models, Large Language Models, pretrained LLM backbones, fine-tuning, multimodal reasoning, visual question answering, model performance analysis, LLAMA models

168. ❌ YIELD: A Large-Scale Dataset and Evaluation Framework for Information Elicitation Agents

Authors: Victor De Lima, Grace Hui Yang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10968v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

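Scoring rationale: The paper focuses on Information Elicitation Agents (IEAs), which fall under LLM Agents (highly relevant, 10 points). It trains foundation LLMs on the YIELD dataset, making it relevant to LLMs (8 points), Supervised Fine-tuning (8 points), and Alignment (8 points). The paper mentions releasing fine-tuned model adapters, which suggests that parameter-efficient fine-tuning may have been used (PEFT, 5 points). Other keywords such as MoE, Scaling Laws, RAG, and CoT are not mentioned in the abstract and are unrelated to the paper's topic (0 points).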

!!! tip deepseek-chat TL;DR

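This paper introduces Information Elicitation Agents (IEAs), builds the large-scale YIELD dialogue dataset, and shows in pilot experiments on multiple foundation LLMs that training on YIELD improves alignment with real elicitation behavior.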

Abstract (Chinese translation)

多数对话系统旨在通过用户驱动的交互满足用户需求。然而，现实世界中的诸多场景——如学术访谈、司法程序与新闻调查——涉及更广泛的机构决策流程，需要能够主动从用户处获取信息的智能体。本文提出信息诱导智能体，其核心目标是从用户处提取信息，以支持机构或任务导向的目标。为系统研究这一范式，我们构建了YIELD数据集，该数据集包含2,281段符合伦理标准的人类对话，规模达2,600万词元。此外，我们将信息诱导形式化为有限时域的POMDP（部分可观测马尔可夫决策过程），并提出了针对信息诱导智能体的新型评估指标。在多个基础大语言模型上的初步实验表明，基于YIELD的训练能有效提升模型与真实诱导行为的对齐度，该结论亦通过人工评估验证。YIELD数据集以CC BY 4.0协议发布，其数据、项目代码、评估工具与微调模型适配器可通过以下链接获取：https://github.com/infosenselab/yield。

Abstract

Most conversational agents (CAs) are designed to satisfy user needs through user-driven interactions. However, many real-world settings, such as academic interviewing, judicial proceedings, and journalistic investigations, involve broader institutional decision-making processes and require agents that can elicit information from users. In this paper, we introduce Information Elicitation Agents (IEAs) in which the agent’s goal is to elicit information from users to support the agent’s institutional or task-oriented objectives. To enable systematic research on this setting, we present YIELD, a 26M-token dataset of 2,281 ethically sourced, human-to-human dialogues. Moreover, we formalize information elicitation as a finite-horizon POMDP and propose novel metrics tailored to IEAs. Pilot experiments on multiple foundation LLMs show that training on YIELD improves their alignment with real elicitation behavior and findings are corroborated by human evaluation. We release YIELD under CC BY 4.0. The dataset, project code, evaluation tools, and fine-tuned model adapters are available at: https://github.com/infosenselab/yield.

Keywords: Information Elicitation Agents, Large Language Models, Dialogue Dataset, Supervised Fine-tuning, Alignment, POMDP, Human Evaluation, Model Adapters

169. ❌ A molecular clock for writing systems reveals the quantitative impact of imperial power on cultural evolution

Authors: Hiroki Fukui Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10957v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

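Scoring rationale: The paper studies the cultural evolution of writing systems, using phylogenetics, Bayesian inference, and neural network clustering to analyze how scripts worldwide have evolved and to quantify the impact of imperial power on cultural evolution. Its content is unrelated to the vast majority of keywords (large-model technical principles, training methods, inference optimization, alignment, agents, and so on), because those keywords focus on technical details of deep learning and large language models, whereas the paper sits at the intersection of cultural evolution, historical linguistics, and social science. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper uses neural network clustering as one of its analysis methods, which is an application of AI in science (cultural evolution research), but this is not its core content, so it receives 5 points (moderately relevant).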

!!! tip deepseek-chat TL;DR

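By building a global database of writing systems and applying phylogenetics, Bayesian inference, and neural network clustering, this paper presents what it describes as the first quantitative, global-scale study of how writing systems evolve. It finds a detectable molecular clock, shows that imperial intervention breaks this clock and selectively rewrites deep structural features, and identifies the Spanish Empire and the Empire of Japan as the colonial powers that extinguished the most scripts.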

Abstract (Chinese translation)

文字系统是文化复制因子，其演化过程从未在全球尺度上得到量化研究。我们编制了全球文字数据库（Global Script Database, GSD），涵盖300种文字与符号系统、50个二元结构特征以及跨越5400年的259条谱系关联边。通过应用四种方法——表型分类法、支序分类法、贝叶斯推断与神经网络聚类——我们发现文字系统呈现出可检测的分子钟信号。最优拟合模型（Mk+Gamma严格分子钟）得出的替代速率为q = 0.226 次替代/特征/千年（95%置信区间：0.034-1.22；与松弛分子钟相比ΔBIC = -4.1；与无速率变异的Mk模型相比ΔBIC = -1,364.7）。政治干预会打破这一分子钟规律：预期分化时间的偏离程度与干预强度相关（斯皮尔曼相关系数rho = 0.556，p < 10^{-4}），而基于特征单位的速率分析表明，干预会选择性重构深层结构特征，而非仅仅加速变化进程（速率谱相关性rho = 0.320）。我们识别出30次重大文字系统更替事件，并评估了其破坏性影响。在已有文字存在的地区，天花板效应会抑制独立文字发明（费希尔精确检验比值比OR = 0.054，p < 10^{-6}），而殖民接触可预测文字消亡（考克斯风险比HR = 5.25，p = 0.0006）。西班牙帝国导致了最多文字系统的消亡（接触12种中有6种灭绝，占50%），其次为日本帝国（接触9种中有3种灭绝，占33.3%）。特征编码通过两名独立人工编码者间的评估者间信度检验得到验证（科恩卡帕系数κ = 0.877；人机编码一致性κ = 0.929；弗莱斯卡帕系数κ = 0.911）。

Abstract

Writing systems are cultural replicators whose evolution has never been studied quantitatively at global scale. We compile the Global Script Database (GSD): 300 writing and notation systems, 50 binary structural characters, and 259 phylogenetic edges spanning 5,400 years. Applying four methods – phenetics, cladistics, Bayesian inference, and neural network clustering – we find that scripts exhibit a detectable molecular clock. The best-fitting model (Mk+Gamma strict clock) yields a substitution rate of q = 0.226 substitutions/character/millennium (95% CI: 0.034-1.22; Delta BIC = -4.1 versus relaxed clock; Delta BIC = -1,364.7 versus Mk without rate variation). Political interventions break this clock: deviation from expected divergence times correlates with intervention intensity (Spearman rho = 0.556, p < 10^{-4}), and per-character rate analysis reveals that intervention selectively rewrites deep structural features rather than merely accelerating change (rate profile correlation rho = 0.320). We identify 30 major script replacement events and rank their destructive impact. A ceiling effect suppresses independent invention wherever writing already exists (Fisher’s exact OR = 0.054, p < 10^{-6}), and colonial contact predicts script extinction (Cox HR = 5.25, p = 0.0006). The Spanish Empire extinguished the most scripts (6 of 12 contacted, 50%), followed by the Empire of Japan (3 of 9, 33.3%). Feature coding was validated by inter-rater reliability testing with two independent human coders (Cohen’s kappa = 0.877; human-LLM kappa = 0.929; Fleiss’ kappa = 0.911).

Keywords: writing systems, cultural evolution, molecular clock, phylogenetics, imperial power, script replacement, neural network clustering, Global Script Database
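The abstract validates its feature coding with Cohen's kappa between two coders. As a reminder of what that statistic measures, here is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e). The binary codings below are hypothetical toy data, not values from the Global Script Database.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical binary codings of ten structural characters by two coders.
coder_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
coder_2 = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
kappa = cohens_kappa(coder_1, coder_2)
```

In this toy case the coders agree on 9 of 10 items, but kappa is lower than 0.9 because some of that agreement would be expected by chance. Fleiss' kappa, also reported in the abstract, generalizes the idea to more than two raters and is not covered by this sketch.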

170. ❌ Mem²Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation

Authors: Zihao Cheng, Zeming Liu, Yingyu Shan, Xinyi Wang, Xiangrong Zhu, Yunpu Ma, Hongru Wang, Yuhang Guo, Wei Lin, Yunhong Wang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10923v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

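Scoring rationale: The paper proposes Mem²Evolve, a self-evolution framework for LLM agents whose core innovation is combining experience accumulation with the dynamic creation of assets (tools or expert agents) so that the two co-evolve. It is therefore highly relevant (10 points) to "Large Language Models OR LLMs OR Foundation Models" (the agents are built on LLMs), "Self-Correction OR Self-Improvement OR Self-Reflection" (the core is agent self-evolution and improvement), "LLM Agents OR Autonomous Agents OR Agentic Workflow" (the subject is LLM-driven autonomous agents), and "Tool Use OR Function Calling OR API Tool Use" (agents extend their capabilities by creating and using tools). It is moderately relevant to "Multi-agent Systems OR Agent Coordination" (5 points), because the framework creates expert agents, although the focus is the internal evolution of a single agent rather than coordination among agents. The remaining keywords (such as MoE, quantization, RAG, and CoT) are not mentioned in the abstract or are not central, so they score 0.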

!!! tip deepseek-chat TL;DR

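Existing LLM agent frameworks separate experience accumulation from dynamic asset creation, which limits capability growth and makes evolution unstable. This paper proposes Mem²Evolve, built on a new paradigm of co-evolutionary capability expansion and experience distillation, and reports gains of 11.80% over agents that evolve only through experience and 6.46% over agents that evolve only through asset creation.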

Abstract (Chinese translation)

尽管大型语言模型驱动的智能体能够通过积累经验或动态创建新资产（即工具或专家智能体）实现自我进化，但现有框架通常将这两种进化过程孤立对待。这种分离忽视了它们内在的相互依存关系：前者本质上受限于手动预定义的静态工具集，而后者则在缺乏经验指导的情况下从零开始生成新资产，导致能力增长有限且进化不稳定。为应对这一局限，我们引入了一种协同进化的能力扩展与经验提炼新范式。在此范式指导下，我们提出了 Mem²Evolve 框架，该框架整合了两个核心组件：经验记忆与资产记忆。具体而言，Mem²Evolve 利用积累的经验指导资产的动态创建，从而扩展智能体的能力空间，同时获取新经验以实现协同进化。在6个任务类别和8个基准测试上的广泛实验表明，Mem²Evolve 相较于标准大型语言模型实现了18.53%的性能提升，相较于仅通过经验进化的智能体提升了11.80%，相较于仅通过资产创建进化的智能体提升了6.46%，从而确立为一个显著更高效且稳定的自我进化智能体框架。代码发布于：https://buaa-irip-llm.github.io/Mem2Evolve。

摘要 (Abstract)

While large language model–powered agents can self-evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co-evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose the \textbf{Mem$^{\textbf{2}}$Evolve}, which integrates two core components: \textbf{Experience Memory} and \textbf{Asset Memory}. Specifically, Mem$^{2}$Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent’s capability space while simultaneously acquiring new experience to achieve co-evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem$^{2}$Evolve achieves improvement of 18.53% over standard LLMs, 11.80% over agents evolving solely through experience, and 6.46% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self-evolving agent framework. Code is available at: https://buaa-irip-llm.github.io/Mem2Evolve.

Keywords: LLM agents, self-evolving agents, co-evolution, capability expansion, experience distillation, tool creation, autonomous agents, agent framework

171. ❌ HTAA: Enhancing LLM Planning via Hybrid Toolset Agentization & Adaptation

Authors: Chengrui Huang, Junshuo Zhang, Zhiyuan Ma, Xikun Wang, Ximeng Wang, Menghua Jiang, Gang Zeng, Zhaobing Han, Shen Gao, Shuo Shang Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10917v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

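Scoring rationale: The paper's core contribution is innovation in LLM tool use and agent planning, so it is highly relevant to "Large Language Models", "LLM Agents", and "Tool Use" (10 points each). Its hierarchical coordination is partially related to "Multi-agent Systems" (5 points). Other keywords, such as MoE, quantization, and inference acceleration, are not covered (0 points).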

!!! tip deepseek-chat TL;DR

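To address the inefficiency and error accumulation of large-scale tool calling by LLMs in real-world scenarios, the paper proposes HTAA, a hierarchical framework that uses toolset agentization and asymmetric planner adaptation to significantly raise task success rates and reduce context overhead.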

Abstract translation (Chinese)

使大语言模型能够规模化且可靠地使用数百种工具对于现实应用至关重要，但由于扁平化工具调用架构固有的低效性和错误累积问题，这一目标仍具挑战性。为此，我们提出混合工具集代理化与适配框架（Hybrid Toolset Agentization & Adaptation, HTAA），一种用于可扩展工具使用的分层规划框架。我们提出了一种新颖的工具集代理化范式，将频繁协同使用的工具封装为专门的代理工具（agent tools），从而减少规划器的行动空间并缓解冗余。为确保有效协调，我们设计了非对称规划器适配（Asymmetric Planner Adaptation），这是一种基于轨迹的训练范式，通过后向重建和前向精调，使高层规划器与代理工具对齐。为验证HTAA的性能，我们在真实世界内部数据集InfoVerify上进行实验，该数据集基于中国最大在线大规模网约车平台的POI（兴趣点）验证工作流构建，具有长程可执行工具轨迹。在InfoVerify及广泛使用的基准测试上的实验表明，与强基线相比，HTAA始终获得更高的任务成功率，所需工具调用轨迹更短，并显著降低了上下文开销。此外，在生产部署中，HTAA大幅减少了人工验证工作量与运营成本，证明了其实用效能。

Abstract

Enabling large language models to scale and reliably use hundreds of tools is critical for real-world applications, yet challenging due to the inefficiency and error accumulation inherent in flat tool-calling architectures. To address this, we propose Hybrid Toolset Agentization & Adaptation (HTAA), a hierarchical framework for scalable tool-use planning. We propose a novel toolset agentization paradigm, which encapsulates frequently co-used tools into specialized agent tools, thereby reducing the planner’s action space and mitigating redundancy. To ensure effective coordination, we design Asymmetric Planner Adaptation, a trajectory-based training paradigm that aligns the high-level planner with agent tools via backward reconstruction and forward refinement. To validate the performance of HTAA, we conduct experiments on a real-world internal dataset, InfoVerify, based on the POI validation workflow of China’s largest online large-scale ride-hailing platform, featuring long-horizon executable tool trajectories. Experiments on InfoVerify and widely-used benchmarks show that HTAA consistently achieves higher task success rates, requires shorter tool-calling trajectories, and significantly reduces context overhead compared to strong baselines. Furthermore, in a production deployment, HTAA substantially reduces manual validation effort and operational cost, demonstrating its practical efficacy.

Keywords: LLM planning, tool-use, hierarchical framework, agent tools, toolset agentization, asymmetric planner adaptation, scalable tool-use, real-world applications

172. ❌ ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

Authors: David H. Yang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10898v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

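Scoring rationale: ZoomR addresses the large KV cache memory footprint that arises when LLMs generate long intermediate thoughts in complex reasoning tasks. It proposes a multi-granularity KV retrieval method that adaptively compresses reasoning thoughts into summaries and dynamically selects KV cache entries, significantly reducing memory usage. It is therefore highly relevant to "Large Language Models" (10 points), since the paper explicitly studies LLMs; to "KV Cache Compression" (10 points), since its core innovation is KV cache optimization; and to "Chain of Thought" (10 points), since it targets tasks that require long intermediate reasoning. It is partially related to "Speculative Decoding OR Inference Acceleration" (5 points), since better memory efficiency indirectly helps inference speed. Other keywords, such as MoE, SLMs, alignment, and RAG, are not covered in the paper and score 0.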

!!! tip deepseek-chat TL;DR

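ZoomR proposes a multi-granularity KV retrieval method that combines adaptive compression of reasoning thoughts into summaries with dynamic KV cache selection, addressing the large KV cache memory footprint of LLMs during long-output generation. Experiments show that it reduces inference memory requirements by more than $4\times$ while maintaining competitive performance.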

Abstract translation (Chinese)

大型语言模型（LLMs）在复杂推理任务中展现出卓越性能，但通常需要在得出最终答案前生成冗长的中间思考过程。在生成过程中，LLMs依赖键值（KV）缓存进行自回归解码。然而，KV缓存的内存占用随输出长度增长而增加。先前关于KV缓存优化的研究主要集中于压缩长输入上下文，同时保留完整的解码KV缓存。对于需要生成长输出的任务，这会导致计算和内存成本上升。本文提出ZoomR，一种新颖方法，使LLMs能够自适应地将冗长的推理思考压缩为摘要，并采用动态KV缓存选择策略，该策略在利用这些摘要的同时，策略性地“聚焦”于细粒度细节。通过在解码过程中使用摘要键作为粗粒度索引，ZoomR利用查询仅检索最重要思考的细节。这种分层策略通过避免每一步的全缓存注意力计算，显著降低了内存使用。在数学和推理任务上的实验表明，与基线方法相比，我们的方法实现了具有竞争力的性能，同时将推理内存需求降低了超过$4\times$。这些结果证明，多粒度KV选择能够实现更高效的内存解码，尤其适用于长输出生成场景。

Abstract

Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focuses on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically “zooming in” on fine-grained details. By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than $4\times$. These results demonstrate that a multi-granularity KV selection enables more memory-efficient decoding, especially for long output generation.

Keywords: Large Language Models, KV cache optimization, memory efficient decoding, multi-granularity retrieval, reasoning tasks, inference acceleration, autoregressive decoding, long output generation
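
The coarse-to-fine selection described in the ZoomR abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the data layout (`summary_key`, `keys`, `values`), the function name `zoom_attention`, and the top-k rule are assumptions made for illustration, and the sketch uses summaries only as an index rather than also attending to them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def zoom_attention(query, thoughts, top_k):
    """Two-level KV selection (hypothetical layout, not the paper's code).

    Each thought holds one coarse 'summary_key' plus its fine-grained
    'keys' and 'values'. Only the top_k thoughts, ranked by
    query . summary_key, contribute fine-grained entries to attention.
    """
    ranked = sorted(range(len(thoughts)),
                    key=lambda i: dot(query, thoughts[i]["summary_key"]),
                    reverse=True)
    selected = sorted(ranked[:top_k])  # restore generation order
    keys = [k for i in selected for k in thoughts[i]["keys"]]
    values = [v for i in selected for v in thoughts[i]["values"]]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, selected, len(keys)

# Toy cache: three thoughts holding five fine-grained KV entries in total.
thoughts = [
    {"summary_key": [1.0, 0.0],
     "keys": [[1.0, 0.0], [0.9, 0.1]], "values": [[1.0, 0.0], [1.0, 0.0]]},
    {"summary_key": [0.0, 1.0],
     "keys": [[0.0, 1.0], [0.1, 0.9]], "values": [[0.0, 1.0], [0.0, 1.0]]},
    {"summary_key": [-1.0, 0.0],
     "keys": [[-1.0, 0.0]], "values": [[0.5, 0.5]]},
]
out, selected, n_attended = zoom_attention([1.0, 0.0], thoughts, top_k=1)
print(selected, n_attended)  # attends to 2 of the 5 cached entries
```

In this toy setting the query matches the first thought's summary key, so attention touches only that thought's two entries; the saving in a real system would depend on how summaries are produced and how many thoughts are expanded per step.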

173. ❌ Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Authors: Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, Wei Ping Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10905v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

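Scoring rationale: The paper introduces Audio Flamingo Next (AF-Next), a large audio-language model focused on understanding and reasoning over speech, sound, and music. The core relevant keywords are: "Large Language Models" (the paper explicitly presents a large audio-language model; weight 1.0, relevance 10); "Pre-training" and "Post-training" (the paper describes in detail a curriculum-based training strategy spanning pre-training, mid-training, and post-training; weight 1.0 each, relevance 10); "Context Window Extension" (supports audio inputs up to 30 minutes; weight 1.0, relevance 10); and "Chain of Thought" (introduces the Temporal Audio Chain-of-Thought reasoning paradigm; weight 1.0, relevance 10). Partially relevant keywords are: "Scaling Laws AND Data Quality" (data scaling and quality are involved but not central; weight 1.0, relevance 5); "Instruction Tuning" (the model variants include AF-Next-Instruct, but it is not described in detail; weight 1.0, relevance 5); "System 2 Thinking" (related to reasoning but not explicitly mentioned; weight 1.0, relevance 5); "Mechanistic Interpretability" (Temporal Audio Chain-of-Thought improves interpretability; weight 1.0, relevance 5); and "AI for Science" (audio processing can be viewed as a scientific application, though not bioinformatics or similar; weight 1.0, relevance 5). The remaining keywords are unrelated to the paper (relevance 0). The weighted total is computed as: 10×1 + 0×1 + 0×1 + 5×1 + 10×1 + 10×1 + 5×1 + 0×1 + 0×1 + 0×1 + 10×1 + 0×1 + 10×1 + 5×1 + 0×1 + 0×1 + 0×1 + 0×1 + 0×1 + 0×1 + 0×1 + 0×1 + 5×1 + 0×1 + 0×1 + 0×1 + 5×1 = 75. The author list does not include any of the designated experts.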
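
The weighted total quoted in the rationale can be reproduced with a short sketch. The relevance values are copied from the score table for this paper in table order. The uniform weight of 1.0 is the value the rationale states, whereas the table itself lists a weight of 0.0 for every keyword. The function name `weighted_total` is illustrative and is not taken from the report generator.

```python
# Relevance scores (0-10) for the 27 keywords, in table order.
RELEVANCE = [10, 0, 0, 5, 10, 10, 5, 0, 0, 0, 10, 0, 10, 5,
             0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 5]

def weighted_total(relevance, weights):
    """Sum of relevance x weight over all keywords."""
    if len(relevance) != len(weights):
        raise ValueError("relevance and weights must have the same length")
    return sum(r * w for r, w in zip(relevance, weights))

# Assumption: every keyword carries weight 1.0, as the rationale states.
total_as_stated = weighted_total(RELEVANCE, [1.0] * len(RELEVANCE))

# With the 0.0 weights shown in the table, the total collapses to zero.
total_as_tabled = weighted_total(RELEVANCE, [0.0] * len(RELEVANCE))

print(total_as_stated, total_as_tabled)  # 75.0 0.0
```

Under the stated weights the total is 75, which matches the rationale. Under the tabled weights it is 0.0, which matches the displayed score of 0.0 / 26.6.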

!!! tip deepseek-chat TL;DR

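The paper presents Audio Flamingo Next (AF-Next), a next-generation large audio-language model. By introducing the Temporal Audio Chain-of-Thought reasoning paradigm and supporting long audio inputs, it significantly improves understanding and reasoning over speech, sound, and music, and outperforms comparable models on multiple benchmarks.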

Abstract translation (Chinese)

我们推出Audio Flamingo系列的新一代旗舰模型——Audio Flamingo Next（AF-Next），该模型旨在提升对语音、环境声音与音乐的理解与推理能力。相较于Audio Flamingo 3，AF-Next引入了以下创新：（1）更强大的基础音频-语言模型，显著提升了多样化音频理解任务的准确性；（2）超越现有学术基准的大规模音频理解与推理数据构建策略；（3）支持长达30分钟的复杂长音频输入；（4）提出“时序音频思维链”这一新型推理范式，通过将中间推理步骤显式关联至长音频时间戳，实现细粒度时序对齐并增强可解释性。为实现这些能力，我们首先对Audio Flamingo 3进行了系统性分析，识别其在音频理解与推理方面的关键不足。随后，我们构建并扩展了总时长超100万小时的新大规模数据集，以弥补现有局限，并扩充了原有的AudioSkills-XL、LongAudio-XL、AF-Think与AF-Chat数据集。AF-Next采用分阶段课程学习策略进行训练，涵盖预训练、中期训练与后期训练阶段。在包含挑战性长音频任务在内的20个音频理解与推理基准测试中，大量实验表明AF-Next大幅超越同规模开源模型，并与参数量大得多的开源权重模型及闭源模型保持高度竞争力，部分任务甚至实现反超。除基准性能外，AF-Next展现出强大的实际应用价值，并能良好迁移至未见任务，凸显其鲁棒性与泛化能力。除全部数据、代码与方法外，我们开源了AF-Next的3个变体模型，包括AF-Next-Instruct、AF-Next-Think与AF-Next-Captioner。

Abstract

We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source 3 variants of AF-Next, including AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.

Keywords: Audio-Language Models, Speech Understanding, Music Understanding, Long Audio Input, Temporal Audio Chain-of-Thought, Audio Reasoning, Pre-training, Post-training

174. ❌ AOP-Smart: A RAG-Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis

Authors: Qinjiang Niu, Lu Yan Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10874v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

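Scoring rationale: The paper's core content is: (1) applying large language models (LLMs) to AOP analysis in bioinformatics/toxicology, which falls under AI for Science; (2) proposing AOP-Smart, a framework based on retrieval-augmented generation (RAG), which is the paper's core technical method; and (3) aiming primarily to mitigate LLM hallucination and improve factuality and reliability. It is therefore highly relevant to the four keywords "Large Language Models", "Retrieval-Augmented Generation", "Hallucination Mitigation", and "AI for Science" (core content, 10 points each). The paper does not cover the technical principles or applications of the other keywords, so they score 0.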

!!! tip deepseek-chat TL;DR

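To address hallucination by large language models on Adverse Outcome Pathway knowledge tasks, the study proposes AOP-Smart, a retrieval-augmented generation framework. Experiments show that it significantly improves answer accuracy and consistency across several mainstream LLMs.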

Abstract translation (Chinese)

不良结局路径（Adverse Outcome Pathways，AOPs）是毒理学研究与风险评估中的重要知识框架。近年来，大语言模型（Large Language Models，LLMs）逐渐被应用于AOP相关的问答与机制推理任务。然而，由于幻觉问题的存在，即模型可能生成与事实不符或缺乏依据的内容，其可靠性仍受限制。为解决这一问题，本研究提出了一种面向AOP的检索增强生成（Retrieval-Augmented Generation，RAG）框架——AOP-Smart。该方法基于AOP-Wiki官方XML数据，利用关键事件（Key Events，KEs）、关键事件关系（Key Event Relationships，KERs）及具体AOP信息，针对用户问题检索相关知识，从而提升大语言模型生成结果的可靠性。为评估所提方法的有效性，本研究构建了一个包含20项AOP相关问答任务的测试集，涵盖KE识别、上下游KE检索及复杂AOP检索任务。实验在三种主流大语言模型Gemini、DeepSeek和ChatGPT上进行，并在未使用RAG与使用RAG两种设定下进行对比测试。实验结果显示，未使用RAG时，GPT、DeepSeek和Gemini的准确率分别为15.0%、35.0%和20.0%；使用RAG后，其准确率分别提升至95.0%、100.0%和95.0%。结果表明，AOP-Smart能够显著缓解大语言模型在AOP知识任务中的幻觉问题，并大幅提升其回答的准确性与一致性。

Abstract

Adverse Outcome Pathways (AOPs) are an important knowledge framework in toxicological research and risk assessment. In recent years, large language models (LLMs) have gradually been applied to AOP-related question answering and mechanistic reasoning tasks. However, due to the existence of the hallucination problem, that is, the model may generate content that is inconsistent with facts or lacks evidence, their reliability is still limited. To address this issue, this study proposes an AOP-oriented Retrieval-Augmented Generation (RAG) framework, AOP-Smart. Based on the official XML data from AOP-Wiki, this method uses Key Events (KEs), Key Event Relationships (KERs), and specific AOP information to retrieve relevant knowledge for user questions, thereby improving the reliability of the generated results of large language models. To evaluate the effectiveness of the proposed method, this study constructed a test set containing 20 AOP-related question answering tasks, covering KE identification, upstream and downstream KE retrieval, and complex AOP retrieval tasks. Experiments were conducted on three mainstream large language models, Gemini, DeepSeek, and ChatGPT, and comparative tests were performed under two settings: without RAG and with RAG. The experimental results show that, without using RAG, the accuracies of GPT, DeepSeek, and Gemini were 15.0%, 35.0%, and 20.0%, respectively; after using RAG, their accuracies increased to 95.0%, 100.0%, and 95.0%, respectively. The results indicate that AOP-Smart can significantly alleviate the hallucination problem of large language models in AOP knowledge tasks, and greatly improve the accuracy and consistency of their answers.

Keywords: Adverse Outcome Pathways, Large Language Models, Retrieval-Augmented Generation, Hallucination Mitigation, Toxicological Research, Knowledge Framework, Question Answering, AOP-Smart
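
The retrieval step described in the AOP-Smart abstract (looking up a Key Event and its upstream and downstream neighbours through Key Event Relationships, then passing them to the model as context) might look roughly like the sketch below. The identifiers, titles, edge list, and prompt wording are placeholder assumptions. The sketch does not parse the AOP-Wiki XML and does not reproduce the authors' pipeline.

```python
from collections import defaultdict

def build_index(kers):
    """Index KER edges, given as (upstream_id, downstream_id), both ways."""
    upstream, downstream = defaultdict(list), defaultdict(list)
    for up, down in kers:
        downstream[up].append(down)
        upstream[down].append(up)
    return upstream, downstream

def retrieve_context(ke_id, ke_titles, kers):
    """Collect the queried KE and its direct neighbours as text lines."""
    upstream, downstream = build_index(kers)
    lines = [f"Queried KE: {ke_id} ({ke_titles[ke_id]})"]
    lines += [f"Upstream: {u} ({ke_titles[u]})" for u in upstream[ke_id]]
    lines += [f"Downstream: {d} ({ke_titles[d]})" for d in downstream[ke_id]]
    return "\n".join(lines)

def build_prompt(question, context):
    """Prepend retrieved knowledge so the model answers from it."""
    return f"Use only the context below.\n\n{context}\n\nQuestion: {question}"

# Placeholder data; real titles and edges would come from AOP-Wiki.
ke_titles = {"KE-1": "event A", "KE-2": "event B", "KE-3": "event C"}
kers = [("KE-1", "KE-2"), ("KE-2", "KE-3")]

context = retrieve_context("KE-2", ke_titles, kers)
prompt = build_prompt("Which key events are downstream of KE-2?", context)
print(prompt)
```

Grounding the answer in retrieved edges is what lets the model answer upstream/downstream questions from data rather than from memory, which is the effect the reported accuracy gains are attributed to.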

175. ❌ OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Authors: Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10866v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

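Scoring rationale: The paper's core is using Language World Models (LWMs) to evaluate AI agents on professional tasks, so it is highly relevant to LLMs, AI agents, tool use, and world models (10 points each). It touches on reasoning ability (5 points) and multi-agent systems (5 points), and has some connection to scientific applications (5 points). Other keywords, such as MoE, quantization, and alignment, are not covered in the paper (0 points).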

!!! tip deepseek-chat TL;DR

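The paper introduces the OccuBench benchmark, which uses Language World Models to evaluate AI agents on 100 professional tasks across industries. It finds that models perform differently across industries, that implicit faults are harder than explicit errors, that greater reasoning effort significantly improves performance, and that simulator quality is critical to evaluation reliability.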

Abstract translation (Chinese)

人工智能代理被期望在数百个职业领域（从急诊分诊到核反应堆安全监控，再到海关进口处理）执行专业工作，然而现有基准只能在少数存在公共环境的领域中对代理进行评估。我们推出OccuBench，这是一个涵盖10个行业类别和65个专业领域的100个真实世界专业任务场景的基准，其实现依赖于语言世界模型（Language World Models, LWMs）——通过大语言模型驱动的工具响应生成来模拟特定领域环境。我们的多智能体合成流程自动生成具有可解性保证、难度校准和基于文档的多样性的评估实例。OccuBench从两个互补维度评估智能体：跨专业领域的任务完成能力，以及在受控故障注入（显性错误、隐性数据劣化和混合故障）下的环境鲁棒性。我们评估了来自8个模型系列的15个前沿模型，发现：（1）没有单一模型在所有行业占主导地位，每个模型都具有独特的职业能力图谱；（2）隐性故障（截断数据、缺失字段）比显性错误（超时、500错误）和混合故障更具挑战性，因为它们缺乏明显的错误信号，要求代理独立检测数据劣化；（3）更大规模的模型、更新的代际以及更高的推理投入持续提升性能。GPT-5.2从最小到最大推理投入提升了27.5分；（4）强大的代理不一定是强大的环境模拟器。模拟器质量对于基于LWM的评估可靠性至关重要。OccuBench为人工智能代理在专业职业任务上提供了首个系统性的跨行业评估。

Abstract

AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance. GPT-5.2 improves by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical for LWM-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.

Keywords: AI Agents, Language World Models, Professional Tasks, Benchmark Evaluation, Multi-agent Systems, Tool Response Generation, Occupational Domains, Environmental Robustness
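
The difference between the explicit and implicit faults named in the OccuBench abstract can be made concrete with a small sketch of fault injection on a simulated tool response. The response schema (`status`, `records`, `id`), the mode labels, and the function name `inject_fault` are assumptions for illustration and are not taken from the benchmark's code.

```python
import copy

def inject_fault(response, mode):
    """Return a possibly faulty copy of a tool response (a dict).

    mode is one of 'none', 'explicit', 'implicit' (labels assumed here).
    """
    if mode == "none":
        return copy.deepcopy(response)
    if mode == "explicit":
        # Overt failure: the agent receives a clear error signal.
        return {"status": 500, "error": "Internal Server Error"}
    if mode == "implicit":
        # Silent degradation: status still looks fine, data is damaged.
        degraded = copy.deepcopy(response)
        records = degraded.get("records", [])
        degraded["records"] = records[: len(records) // 2]  # truncated data
        for record in degraded["records"]:
            record.pop("id", None)  # missing field
        return degraded
    raise ValueError(f"unknown fault mode: {mode}")

clean = {
    "status": 200,
    "records": [{"id": i, "name": f"item-{i}"} for i in range(1, 5)],
}
explicit = inject_fault(clean, "explicit")
implicit = inject_fault(clean, "implicit")
print(explicit["status"], implicit["status"], len(implicit["records"]))
```

The implicit case keeps a success status while dropping half the records and a field, so an agent has to check the payload itself to notice the problem. This is consistent with the abstract's finding that implicit faults are harder than explicit ones.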

176. ❌ Speaking to No One: Ontological Dissonance and the Double Bind of Conversational AI

Authors: Hugh Brosnahan, Izabela Lipinska Journal/Source: arxiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10833v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

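Scoring rationale: The paper analyzes, from the perspectives of phenomenology, psychiatry, and cognitive neuroscience, the risk of ontological dissonance arising in interactions between conversational AI and users. It sits at the intersection of AI ethics, human-computer interaction, and psychology, and does not address the technical principles, training methods, inference optimization, or deployment of large models, so it is unrelated to all of the technical keywords.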

!!! tip deepseek-chat TL;DR

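The paper argues that conversational AI produces ontological dissonance, a conflict between the appearance of relational presence and the absence of any real subject, which in affectively vulnerable users can stabilise into delusional experience resembling folie à deux. This account explains why disclaimers often fail and clarifies the ethical and clinical implications for design and use.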

Abstract translation (Chinese)

近期研究表明，在与对话式人工智能系统持续互动的用户中，有少数个体会出现妄想体验的诱发或固化现象。现有解释通常将此类案例归因于个体脆弱性或安全工程设计的缺陷，但这些解释并不完整。本文结合现象学、精神病学与认知神经科学的视角提出：该风险源于交互本身的关系性与本体论结构。对话式人工智能会引发本体论失调——即系统所呈现的关系在场表象与实际上缺乏能够维系这种在场的主体性之间的冲突。这种失调通过交流中的双重束缚得以维持，并因注意力的不对称性而加剧，在情感脆弱性条件下，容易固化为一种技术中介的二联性精神病类似状态。该理论解释了为何明确的免责声明往往无法阻断妄想性卷入，并阐明了对话式人工智能设计与使用中涉及的伦理和临床意义。

Abstract

Recent reports indicate that sustained interaction with conversational artificial intelligence (AI) systems can, in a small subset of users, contribute to the emergence or stabilisation of delusional experience. Existing accounts typically attribute such cases either to individual vulnerability or to failures of safety engineering. These explanations are incomplete. Drawing on phenomenology, psychiatry, and cognitive neuroscience, this paper argues that the risk arises from the relational and ontological structure of the interaction itself. Conversational AI generates ontological dissonance: a conflict between the appearance of relational presence and the absence of any subject capable of sustaining it. Maintained through a communicative double bind and amplified by attentional asymmetries, this dissonance tends, under conditions of affective vulnerability, to stabilise into a technologically mediated analogue of folie a deux. This account explains why explicit disclaimers often fail to disrupt delusional involvement and clarifies the ethical and clinical implications for the design and use of conversational AI.

Keywords: conversational AI, ontological dissonance, delusional experience, folie a deux, phenomenology, psychiatry, ethical implications, relational structure

177. ❌ Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

Authors: Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej Journal/Source: arxiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10799v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

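Scoring rationale: The paper's core is optimizing Polish large language models (LLMs): a dedicated tokenizer addresses the shortcomings of general-purpose models on a specific language, and the work involves pre-training, supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and reinforcement-learning alignment. It is highly relevant to LLMs, pre-training, post-training, alignment, and DPO (10 points each), partially related to context window extension (5 points), and does not cover the other keywords (0 points).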

!!! tip deepseek-chat TL;DR

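By optimizing a Polish-specific tokenizer and applying multi-stage pre-training and post-training alignment (SFT, DPO, and reinforcement learning), the paper significantly improves the efficiency and performance of the Bielik v3 series of large language models on Polish tasks.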

Abstract translation (Chinese)

Bielik v3 PL系列（包含7B与11B参数版本）的开发，标志着语言专用大语言模型优化领域的一个重要里程碑。尽管通用模型常展现出令人印象深刻的多语言能力，但它们普遍存在一个根本性的架构效率问题：即使用通用分词器。这类分词器通常为覆盖广泛语言而设计，却往往无法准确捕捉如波兰语等特定语言的形态学细微特征，从而导致更高的生育率、增加的推理成本以及受限的有效上下文窗口。本报告详述了Bielik v3模型从基于Mistral的通用分词方案转向专用波兰语优化词汇表的过程，探讨了基于FOCUS的嵌入初始化方法、多阶段预训练课程设计，以及后续涉及监督微调、直接偏好优化和采用可验证奖励的群组相对策略优化强化学习的对齐后训练。

Abstract

The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.

Keywords: Polish language modeling, tokenizer optimization, large language models, pretraining curriculum, supervised fine-tuning, direct preference optimization, reinforcement learning alignment, Bielik v3 series
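The "fertility ratio" mentioned in the abstract is the average number of tokens a tokenizer produces per word. The sketch below shows how it could be measured; the greedy longest-match tokenizer and both vocabularies are hypothetical stand-ins, not the Mistral or Bielik tokenizers.

```python
def fertility(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_tokens / n_words


def make_greedy_tokenizer(vocab):
    """Toy longest-match tokenizer; falls back to single characters."""
    max_len = max(len(v) for v in vocab)

    def tokenize(text):
        tokens = []
        for word in text.split():
            i = 0
            while i < len(word):
                # Try the longest candidate piece first, then shrink.
                for size in range(min(max_len, len(word) - i), 0, -1):
                    piece = word[i:i + size]
                    if size == 1 or piece in vocab:
                        tokens.append(piece)
                        i += size
                        break
        return tokens

    return tokenize


# Hypothetical vocabularies, for illustration only.
generic_vocab = {"prz", "ie", "sz", "cz"}
polish_vocab = {"przepraszam", "dziękuję", "bardzo"}

sample = ["przepraszam bardzo", "dziękuję bardzo"]
generic_fertility = fertility(sample, make_greedy_tokenizer(generic_vocab))
polish_fertility = fertility(sample, make_greedy_tokenizer(polish_vocab))
print(generic_fertility, polish_fertility)
```

A lower fertility means the same text occupies fewer tokens, which is why the abstract links it to inference cost and to the effective context window.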

178. ❌ Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V

Authors: Chirag Shinde Journal/Source: arxiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10791v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

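Scoring rationale: The paper proposes two modifications to Transformer attention, a position-agnostic nonlinear pre-projection MLP and a content skip connection, which are architecture-level innovations for large models. Only the keyword "Large Language Models OR LLMs OR Foundation Models" is moderately relevant (5 points): the experiments run on Pythia models, which places the work within LLM technology, but it does not address the specific techniques behind the other keywords. Keywords such as MoE, SFT, RAG, and inference acceleration do not appear in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper inserts a nonlinear pre-projection MLP and a content skip connection into Transformer attention blocks. In frozen-probe experiments on Pythia-160M and 410M, the combination gives +40.6% LAMBADA accuracy and -39% perplexity at the 160M scale, with no added K/V cache overhead.

Abstract translation (Chinese)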

我们对Transformer注意力模块提出了两项互补的改进。首先，在层归一化与Q/K/V投影之间插入一个非线性预投影多层感知机，从而在应用任何位置编码之前，以位置无关的方式构建更丰富的特征。其次，通过一条内容跳跃连接将预投影生成的特征绕过注意力机制进行传递，使得内容信息在有益的情况下能够绕过具有位置感知能力的注意力模块。在Pythia-160M和410M模型上的冻结探针实验中，该组合方案在所有方法中取得了最优结果：在1.6亿参数规模下，LAMBADA准确率提升40.6%，困惑度降低39%。对学习到的跳跃连接权重分析显示，不同模型规模间存在一致规律：较深的Transformer层比较浅的层更显著地激活内容旁路，这表明深层网络能够受益于不经过位置注意力处理的内容信息。所有改进均未增加键值缓存的开销。

Abstract

We propose two complementary modifications to transformer attention blocks. First, a non-linear pre-projection MLP is inserted between layer norm and Q/K/V projections, constructing richer features in a position-agnostic manner before any positional encoding is applied. Second, a content skip connection routes the pre-projection’s features around the attention mechanism, allowing content information to bypass position-aware attention where beneficial. In frozen-probe experiments on Pythia-160M and 410M, the combined approach achieves the strongest results across methods: +40.6% LAMBADA accuracy and -39% perplexity at 160M scale. Learned skip connection weights reveal a consistent pattern across model sizes: later transformer layers activate the content bypass more strongly than earlier layers, suggesting that deeper layers benefit from content information that does not pass through positional attention. All modifications add no K/V cache overhead.

Keywords: Transformer attention, pre-projection MLP, content skip connection, position-agnostic, KV cache, Pythia model, LAMBADA accuracy, perplexity reduction
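The two modifications described in the abstract above can be sketched in a few lines of dependency-free Python. This is a minimal single-head illustration, not the author's implementation: the ReLU nonlinearity, the scalar `skip_weight`, and the omission of positional encoding and causal masking are all assumptions made for brevity.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def layer_norm(x, eps=1e-5):
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_block(x, w_up, w_down, w_q, w_k, w_v, skip_weight):
    """Single-head attention with a pre-projection MLP and a content skip."""
    normed = layer_norm(x)
    # Pre-projection MLP between layer norm and Q/K/V (ReLU is assumed).
    hidden = [[max(0.0, v) for v in row] for row in matmul(normed, w_up)]
    h = matmul(hidden, w_down)
    # Q/K/V are projected from the MLP features; positions are not used here.
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    attn = matmul(weights, v)
    # Content skip: h bypasses attention, scaled by an (assumed scalar) weight.
    return [[a + skip_weight * c for a, c in zip(ar, hr)] for ar, hr in zip(attn, h)]


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 2.0], [3.0, 1.0]]
no_skip = attention_block(x, identity, identity, identity, identity, identity, 0.0)
with_skip = attention_block(x, identity, identity, identity, identity, identity, 1.0)
print(no_skip, with_skip)
```

In this sketch K and V keep their original shapes, which is consistent with the abstract's statement that the modifications add no K/V cache overhead.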

179. ❌ TInR: Exploring Tool-Internalized Reasoning in Large Language Models

Authors: Qiancheng Xu, Yongqi Li, Fan Liu, Hongru Wang, Min Yang, Wenjie Li Journal/Source: arxiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10788v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

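Scoring rationale: The paper studies tool-internalized reasoning in LLMs and is highly relevant to the LLM, SFT, reinforcement learning (scored under the RLHF keyword), reasoning, tool-use, and agent keywords (10 points each); other keywords such as MoE, quantization, and AI for Science are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes TInR-U, a framework that internalizes tool knowledge into the LLM to address the tool-mastery difficulty, tool-size constraints, and inference inefficiency of conventional tool-integrated reasoning; the authors report superior performance in both in-domain and out-of-domain settings.

Abstract translation (Chinese)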

工具集成推理（Tool-Integrated Reasoning, TIR）通过在大语言模型（LLMs）推理过程中扩展外部工具能力，已成为一个前景广阔的研究方向。现有的TIR方法通常在推理时依赖外部工具文档，但这导致了工具掌握困难、工具规模受限以及推理效率低下等问题。为缓解这些问题，我们探索了工具内化推理（Tool-Internalized Reasoning, TInR），旨在通过将工具知识内化至大语言模型中以促进推理。实现这一目标面临显著挑战，包括工具内化与工具-推理协同两方面需求。为此，我们提出了TInR-U——一个用于统一推理与工具使用的工具内化推理框架。TInR-U通过三阶段流程进行训练：1）采用双向知识对齐策略实现工具内化；2）利用高质量推理标注进行监督微调预热；3）结合TInR特定奖励的强化学习。我们在领域内与跨领域场景下对方法进行了全面评估。实验结果表明，TInR-U在两种场景下均实现了优越性能，彰显了其有效性与高效性。

Abstract

Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models’ (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.

Keywords: Tool-Internalized Reasoning, Large Language Models, Supervised Fine-tuning, Reinforcement Learning, Reasoning Capabilities, Tool Knowledge, TInR-U Framework, Bidirectional Knowledge Alignment

180. ❌ Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time, Space, Causality, and Character in Fiction

Authors: Beicheng Bei, Hannah Hyesun Chun, Chen Guo, Arwa Saghiri Journal/Source: arxiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10786v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

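Scoring rationale: The paper studies how well BERT encodes narrative dimensions (time, space, causality, and character), using linear probing and clustering analysis to evaluate the model's representations. Relevance to the keywords: 1) "Large Language Models OR LLMs OR Foundation Models": the paper uses BERT (a foundation model) and mentions using an LLM to accelerate annotation, so it scores 5 (somewhat relevant). 2) "Mechanistic Interpretability OR Explainable AI": the paper examines internal representations through probing, which is interpretability research, so it scores 5 (somewhat relevant). The other keywords (MoE, Scaling Laws, RLHF, RAG, Agents, and so on) are not addressed and score 0 (unrelated).

!!! tip deepseek-chat TL;DR

Using linear probing and clustering analysis, the study asks whether BERT embeddings encode time, space, causality, and character in fictional narrative. A linear probe reaches 94% accuracy versus 47% on variance-matched random embeddings, so BERT does encode meaningful narrative information, but the dimensions do not form discretely separable clusters (ARI = 0.081).

Abstract translation (Chinese)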

叙事理解需要多维度的语义结构。本研究探讨BERT嵌入是否编码虚构叙事语义的四个维度：时间、空间、因果性与人物。通过使用大语言模型加速标注，我们构建了一个在词元级别标注了上述四类叙事范畴及“其他”类别的数据集。对BERT嵌入的线性探测模型（准确率94%）显著优于在方差匹配的随机嵌入上的对照模型（准确率47%），证实BERT编码了有意义的叙事信息。在平衡类别权重后，探测模型的宏平均召回率达到0.83，其中对因果性（召回率=0.75）和空间（召回率=0.66）等稀缺类别的识别也取得中等成效。然而，混淆矩阵分析揭示了“边界泄漏”现象，即稀缺维度被系统性地误分类为“其他”。聚类分析表明，无监督聚类结果与预定义类别的对齐程度接近随机（调整兰德指数=0.081），这意味着叙事维度虽被编码，但并未形成离散可分离的簇。未来工作包括引入仅基于词性标注的基线模型以分离句法模式与叙事编码、扩展数据集，以及进行逐层探测分析。

Abstract

Narrative understanding requires multidimensional semantic structures. This study investigates whether BERT embeddings encode dimensions of fictional narrative semantics – time, space, causality, and character. Using an LLM to accelerate annotation, we construct a token-level dataset labeled with these four narrative categories plus “others.” A linear probe on BERT embeddings (94% accuracy) significantly outperforms a control probe on variance-matched random embeddings (47%), confirming that BERT encodes meaningful narrative information. With balanced class weighting, the probe achieves a macro-average recall of 0.83, with moderate success on rare categories such as causality (recall = 0.75) and space (recall = 0.66). However, confusion matrix analysis reveals “Boundary Leakage,” where rare dimensions are systematically misclassified as “others.” Clustering analysis shows that unsupervised clustering aligns near-randomly with predefined categories (ARI = 0.081), suggesting that narrative dimensions are encoded but not as discretely separable clusters. Future work includes a POS-only baseline to disentangle syntactic patterns from narrative encoding, expanded datasets, and layer-wise probing.

Keywords: BERT embeddings, narrative dimensions, token-level probing, linear probe, clustering analysis, fictional narrative semantics, interpretability, model representation
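The abstract above reports two metrics that are easy to misread: macro-average recall (0.83) and the Adjusted Rand Index (0.081). The sketch below computes both from their standard definitions on made-up labels; the toy data is illustrative only and is not taken from the paper.

```python
from collections import Counter
from math import comb


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so rare classes count equally."""
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index: 1.0 for identical partitions, about 0 for random ones."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels: rare classes leak into "other", as in the reported boundary leakage.
y_true = ["time"] * 2 + ["space"] * 2 + ["other"] * 6
y_pred = ["time", "other", "other", "other"] + ["other"] * 6

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_recall(y_true, y_pred))
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

On the toy labels, accuracy stays high while macro recall drops, because misclassified rare classes weigh as much as the dominant "other" class. The ARI call shows that the index ignores cluster naming and compares only the partitions.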

181. ❌ Who Handles Orientation? Investigating Invariance in Feature Matching

Authors: David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11809v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

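Scoring rationale: The paper addresses feature matching in computer vision, specifically large in-plane rotations in image matching, and learns rotation invariance through data augmentation and training strategy. The scoring keywords all concern large language models, their training and inference techniques, or AI for Science applications; this paper is about learned keypoint description and matching and touches none of them, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper studies at which stage of a sparse matching pipeline rotation invariance is best introduced. Learning it in the descriptor performs similarly to handling it in the matcher while allowing a faster rotation-invariant matcher, and enforcing it does not hurt upright-image performance when training at scale.

Abstract translation (Chinese)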

在图像间寻找匹配关键点是三维计算机视觉的核心问题。然而，现代匹配器在处理大角度平面内旋转时面临困难。一种直接的缓解方法是通过数据增强学习旋转不变性。但旋转不变性应在哪个阶段引入仍不明确。本文在现代稀疏匹配流程的背景下对此进行研究。我们通过在大量三维视觉数据集上进行训练，并在主流图像匹配基准上评估，开展了广泛的实验。令人惊讶的是，我们发现，在描述子阶段引入旋转不变性所达到的性能，与在匹配器阶段处理旋转问题相当。然而，当在描述子中学习旋转不变性时，匹配器能更早实现旋转不变性，从而构建出更快速的旋转不变匹配器。此外，我们发现，在大规模训练时，强制实现旋转不变性并不会损害图像正立状态下的匹配性能。最后，我们通过数据规模研究了旋转不变性的涌现规律，发现增加训练数据量能显著提升对旋转图像的泛化能力。我们发布了两种对平面内旋转具有鲁棒性的匹配器，它们在多模态匹配（WxBS）、极端匹配（HardMatch）和卫星图像匹配（SatAst）等任务上均达到了最先进的性能。代码发布于 https://github.com/davnords/loma。

Abstract

Finding matching keypoints between images is a core problem in 3D computer vision. However, modern matchers struggle with large in-plane rotations. A straightforward mitigation is to learn rotation invariance via data augmentation. However, it remains unclear at which stage rotation invariance should be incorporated. In this paper, we study this in the context of a modern sparse matching pipeline. We perform extensive experiments by training on a large collection of 3D vision datasets and evaluating on popular image matching benchmarks. Surprisingly, we find that incorporating rotation invariance already in the descriptor yields similar performance to handling it in the matcher. However, rotation invariance is achieved earlier in the matcher when it is learned in the descriptor, allowing for a faster rotation-invariant matcher. Further, we find that enforcing rotation invariance does not hurt upright performance when trained at scale. Finally, we study the emergence of rotation invariance through scale and find that increasing the training data size substantially improves generalization to rotated images. We release two matchers robust to in-plane rotations that achieve state-of-the-art performance on e.g. multi-modal (WxBS), extreme (HardMatch), and satellite image matching (SatAst). Code is available at https://github.com/davnords/loma.

Keywords: feature matching, rotation invariance, descriptor learning, sparse matching pipeline, data augmentation, 3D computer vision, image matching benchmarks, state-of-the-art performance

182. ❌ Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

Authors: Xingjian Ran, Shujie Zhang, Weipeng Zhong, Li Luo, Bo Dai Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11808v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

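Scoring rationale: "Pair2Scene: Learning Local Object Relations for Procedural Scene Generation" focuses on 3D indoor scene generation and proposes a procedural generation framework built on local object relations (support and functional relations). The paper notes that current methods may rely on LLMs/VLMs that lack precise spatial reasoning, but its own contributions (learned local rules, scene hierarchies, physics-based algorithms) are unrelated to the scoring keywords, which all revolve around large-model principles, training methods, inference optimization, alignment, and applications. The work belongs to computer graphics and 3D scene generation rather than to large-model or deep-learning methodology, and it is not an AI for Science application.

!!! tip deepseek-chat TL;DR

To address data scarcity and the difficulty of modeling complex spatial relations in 3D indoor scene generation, the paper proposes Pair2Scene, a procedural generation framework that learns local object relations. By combining local rules, scene hierarchies, and physics-based algorithms, it generates complex scenes beyond the training distribution that remain physically and semantically plausible.

Abstract translation (Chinese)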

生成高保真度的三维室内场景仍是一项重大挑战，这源于数据稀缺性以及对复杂空间关系建模的困难。现有方法往往难以扩展到训练分布之外的密集场景，或依赖于缺乏精确空间推理能力的大语言模型/视觉语言模型。基于物体摆放主要依赖局部依赖关系而非信息冗余的全局分布这一观察，本文提出Pair2Scene——一种新颖的程序化生成框架，它将学习到的局部规则与场景层级结构及基于物理的算法相结合。这些规则主要捕捉两类物体间关系：遵循物理层级结构的支撑关系，以及反映语义关联的功能关系。我们通过一个网络对这些规则进行建模，该网络依据锚定物体的位置与几何信息，预测依赖物体的空间位置分布。为此，我们从现有场景数据中构建了3D-Pairs数据集以训练模型。在推理阶段，我们的框架可通过在层级结构内递归应用模型来生成场景，并利用碰撞感知的拒绝采样方法，将局部规则整合为连贯的全局布局。大量实验表明，本框架在生成超出训练数据范围的复杂环境方面优于现有方法，同时保持了物理合理性与语义可信度。

Abstract

Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond training distribution to dense scenes or rely on LLMs/VLMs that lack the ability for precise spatial reasoning. Building on top of the observation that object placement relies mainly on local dependencies instead of information-redundant global distributions, in this paper, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules mainly capture two types of inter-object relations, namely support relations that follow physical hierarchies, and functional relations that reflect semantic links. We model these rules through a network, which estimates spatial position distributions of dependent objects conditioned on position and geometry of the anchor ones. Accordingly, we curate a dataset 3D-Pairs from existing scene data to train the model. During inference, our framework can generate scenes by recursively applying our model within a hierarchical structure, leveraging collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond training data while maintaining physical and semantic plausibility.

Keywords: 3D scene generation, procedural generation, object relations, local dependencies, hierarchical structure, physics-based algorithms, support relations, functional relations
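The collision-aware rejection sampling mentioned in the abstract above can be illustrated with a one-level, 2D sketch. The uniform offset sampler stands in for the paper's learned position distribution, and the box format, object sizes, and function names are hypothetical; the paper applies its model recursively within a scene hierarchy, which is not shown here.

```python
import random


def overlaps(a, b):
    """Axis-aligned 2D boxes given as (x, y, width, depth), with (x, y) the centre."""
    return abs(a[0] - b[0]) * 2 < a[2] + b[2] and abs(a[1] - b[1]) * 2 < a[3] + b[3]


def place_dependents(anchor, sizes, sample_offset, rng, max_tries=200):
    """Place each dependent object near the anchor, rejecting colliding samples."""
    placed = [anchor]
    for width, depth in sizes:
        for _ in range(max_tries):
            dx, dy = sample_offset(rng)
            candidate = (anchor[0] + dx, anchor[1] + dy, width, depth)
            if not any(overlaps(candidate, other) for other in placed):
                placed.append(candidate)
                break
        # An object that never finds a free spot is skipped.
    return placed


def uniform_offset(rng):
    """Stand-in for a learned position distribution: uniform in a 4 x 4 square."""
    return rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)


table = (0.0, 0.0, 1.5, 1.0)   # anchor object
chairs = [(0.5, 0.5)] * 4      # sizes of the dependent objects
layout = place_dependents(table, chairs, uniform_offset, random.Random(0))
print(layout)
```

Rejecting only the colliding samples keeps the local placement distribution intact wherever space is free, which is one plausible reading of how local rules are reconciled into a coherent global layout.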

183. ❌ OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Authors: Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11804v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	5.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

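Scoring rationale: The paper studies Human-Object Interaction Video Generation (HOIVG), a multimodal video generation task that mainly involves computer vision, video synthesis, and conditional generation. Its OmniShow framework unifies text, image, audio, and pose conditions and introduces Unified Channel-wise Conditioning, Gated Local-Context Attention, and Decoupled-Then-Joint Training. The scoring keywords target LLMs and innovations in their underlying techniques, whereas this paper's core is video generation in the computer vision domain. The only loosely related keyword is "Model Merging OR Model Soups OR Weight Averaging", because the paper mentions a "Decoupled-Then-Joint Training strategy… with model merging"; this is only part of the training strategy rather than a core contribution, so it scores 5 (somewhat relevant). Keywords such as AI for Science do not apply: the paper targets e-commerce, short video, and similar scenarios, not scientific research.

!!! tip deepseek-chat TL;DR

The paper studies Human-Object Interaction Video Generation (HOIVG) and proposes OmniShow, a framework that unifies multimodal conditions (text, image, audio, pose) with a multi-stage training strategy to reach what the authors describe as industry-grade generation quality; it also establishes the HOIVG-Bench benchmark.

Abstract translation (Chinese)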

本研究聚焦于人物-物体交互视频生成任务，该任务旨在依据文本、参考图像、音频与姿态等条件合成高质量的人物-物体交互视频。此任务在现实应用中具有重要的实用价值，例如电商演示、短视频制作与互动娱乐等领域的内容自动化生成。然而，现有方法难以同时兼容所有必需的条件。我们提出了OmniShow，一个专为这一实际且富有挑战性的任务设计的端到端框架，能够协调多模态条件并实现工业级的生成性能。为克服可控性与生成质量之间的权衡，我们引入了统一通道级条件注入技术以实现高效的图像与姿态信息注入，并采用门控局部上下文注意力机制确保精确的视听同步。为有效应对数据稀缺问题，我们开发了一种解耦后联合训练策略，通过多阶段训练流程与模型融合技术，高效利用异构子任务数据集。此外，为填补该领域的评估空白，我们建立了HOIVG-Bench，一个专为人物-物体交互视频生成设计的综合性基准测试集。大量实验表明，OmniShow在多种多模态条件设置下均实现了全面的最优性能，为这一新兴的人物-物体交互视频生成任务确立了坚实的基准。

Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.

Keywords: Human-Object Interaction Video Generation, Multimodal Condition, Video Synthesis, OmniShow, Unified Channel-wise Conditioning, Gated Local-Context Attention, Decoupled-Then-Joint Training, HOIVG-Bench
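The abstract above mentions model merging without saying which method is used. One common form is weighted averaging of parameters across checkpoints that share an architecture; the sketch below shows that form only. The checkpoint names, parameter names, and flat-list parameter layout are hypothetical.

```python
def merge_checkpoints(checkpoints, coefficients=None):
    """Merge checkpoints with identical parameter names by weighted averaging."""
    if coefficients is None:
        # Default to a uniform average over all checkpoints.
        coefficients = [1.0 / len(checkpoints)] * len(checkpoints)
    names = checkpoints[0].keys()
    if any(ckpt.keys() != names for ckpt in checkpoints):
        raise ValueError("all checkpoints must share the same parameter names")
    merged = {}
    for name in names:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(c * v for c, v in zip(coefficients, col)) for col in columns]
    return merged


# Hypothetical sub-task checkpoints with flat parameter vectors.
audio_ckpt = {"attn.weight": [1.0, 2.0], "attn.bias": [0.0]}
pose_ckpt = {"attn.weight": [3.0, 4.0], "attn.bias": [2.0]}
merged = merge_checkpoints([audio_ckpt, pose_ckpt])
print(merged)
```

Non-uniform `coefficients` would let one sub-task checkpoint dominate; whether OmniShow uses uniform, weighted, or another merging scheme is not stated in the abstract.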

184. ❌ SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization

Authors: Deming Li, Abhay Yadav, Cheng Peng, Rama Chellappa, Anand Bhattad Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11797v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

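Scoring rationale: "SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization" focuses on 3D reconstruction in computer vision and proposes a diffusion-based multi-view synchronization framework to improve reconstruction quality. Although it uses diffusion models (a deep learning technique), its core content has no direct connection to the scoring keywords, which revolve around large language models, their training and inference techniques, alignment, agents, and AI for Science applications. The paper does not touch language models, MoE, scaling laws, pre-training or post-training, alignment, RAG, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or bioinformatics and cheminformatics. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SyncFix, a framework that uses a multi-view-synchronized diffusion model to fix semantic and geometric inconsistencies in 3D reconstructions and produce higher-quality results.

Abstract translation (Chinese)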

我们提出SyncFix框架，该框架在基于扩散模型的重建场景优化过程中强制执行跨视角一致性。SyncFix将优化过程构建为联合潜在桥匹配问题，通过同步多视角间的失真与洁净表征来修复语义与几何不一致性。这意味着SyncFix学习多视角的联合条件分布，从而在去噪轨迹全程保持一致性。我们的训练仅需图像对数据，但在推理阶段能自然泛化至任意数量的视角。此外，重建质量随视角增加而提升，但在高视角数量时呈现收益递减趋势。定性与定量结果表明，即使在没有洁净参考图像的情况下，SyncFix仍能持续生成高质量重建结果，并超越当前最先进的基线方法。当可获得稀疏参考图像时，SyncFix能实现更高保真度的重建。

Abstract

We present SyncFix, a framework that enforces cross-view consistency during the diffusion-based refinement of reconstructed scenes. SyncFix formulates refinement as a joint latent bridge matching problem, synchronizing distorted and clean representations across multiple views to fix the semantic and geometric inconsistencies. This means SyncFix learns a joint conditional over multiple views to enforce consistency throughout the denoising trajectory. Our training is done only on image pairs, but it generalizes naturally to an arbitrary number of views during inference. Moreover, reconstruction quality improves with additional views, with diminishing returns at higher view counts. Qualitative and quantitative results demonstrate that SyncFix consistently generates high-quality reconstructions and surpasses current state-of-the-art baselines, even in the absence of clean reference images. SyncFix achieves even higher fidelity when sparse references are available.

Keywords: 3D reconstruction, multi-view synchronization, diffusion models, cross-view consistency, latent bridge matching, denoising trajectory, semantic consistency, geometric consistency

185. ❌ LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

Authors: Junhao Chen, Kejun Gao, Yuehan Cui, Mingze Sun, Mingjin Chen, Shaohui Wang, Xiaoxiao Long, Fei Ma, Qi Tian, Ruqi Huang, Hao Zhao Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11792v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

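Scoring rationale: The paper's core contribution is vector animation generation with a large language model backbone (Qwen-VL, a vision-language model), so it is highly relevant to 'Large Language Models' (10). The model is obtained by fine-tuning, which makes it highly relevant to 'Post-training/SFT' (10). Building the large-scale LottieAnimation-660K dataset touches on data quality and scale, giving some relevance to 'Scaling Laws AND Data Quality' (5). Fine-tuning a pretrained model gives some relevance to 'Pre-training/Domain Adaptation' (5). The remaining keywords, such as MoE, SLMs, RLHF, and RAG, are not mentioned in the abstract and are scored 0.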
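
The table above shows a weight of 0.0 for every keyword, which is why the total stays at 0.0 even where relevance is 10/10. The following is a minimal sketch of that arithmetic. It uses the passing formula stated in the report configuration (5.0 + 0.8 × total weight); the per-keyword formula (weight × relevance, capped at the configured maximum of 15) and all function names are assumptions for illustration, not taken from the report generator.

```python
from typing import Iterable, Tuple

MAX_PER_KEYWORD = 15.0  # per-keyword cap stated in the report configuration


def passing_threshold(total_weight: float) -> float:
    """Passing score as stated in the report: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum per-keyword scores for (weight, relevance) pairs.

    ASSUMPTION: score = weight * relevance, capped at MAX_PER_KEYWORD.
    The report does not state its per-keyword formula.
    """
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)


# The four non-zero relevance rows from the LottieGPT table, all with weight 0.0.
lottiegpt_rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 10.0)]

threshold = passing_threshold(27.0)  # 27 keywords at weight 1.0 in the configuration
score = total_score(lottiegpt_rows)
print(f"score={score:.1f} threshold={threshold:.1f} passed={score >= threshold}")
```

With every weight at 0.0 the total is 0.0 regardless of relevance, which matches the 0.0 / 26.6 shown for this paper.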

!!! tip deepseek-chat TL;DR

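The paper presents the first framework for tokenizing and autoregressively generating editable vector animations directly from natural-language or visual prompts. Its model, LottieGPT (a fine-tuned Qwen-VL), addresses the inability of existing raster-space generative models to synthesize vector animation and outperforms prior state-of-the-art models on SVG generation.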

Abstract (Chinese translation)

尽管视频生成技术发展迅速，现有模型仍无法生成矢量动画——这种在互联网上占据主导地位且表现力极强的多媒体形式。矢量动画具有分辨率无关性、文件紧凑、语义结构清晰以及可编辑的参数化运动表示等优势，然而当前的生成模型仅能在栅格空间运行，因此无法合成此类动画。与此同时，近期大语言多模态模型在生成结构化数据（如幻灯片、三维网格、乐高序列和室内布局）方面展现出强大能力，这表明原生矢量动画生成可能成为现实。本研究提出了首个用于矢量动画标记化与自回归生成的框架。我们采用广泛应用的基于JSON的动画标准Lottie，并设计了一个定制的Lottie标记器，将分层几何图元、变换参数以及基于关键帧的运动编码为紧凑且语义对齐的标记序列。为支持大规模训练，我们还构建了LottieAnimation-660K——迄今为止规模最大、多样性最丰富的矢量动画数据集，包含从广泛互联网来源收集的66万个真实世界Lottie动画文件和1500万个静态Lottie图像文件。基于这些组件，我们对Qwen-VL进行微调，创建了LottieGPT：一个原生多模态模型，能够直接从自然语言或视觉提示生成连贯、可编辑的矢量动画。实验表明，我们的标记器在保持结构保真度的同时显著缩短了序列长度，实现了对动态矢量内容有效的自回归学习。LottieGPT在多样化动画风格上展现出强大的泛化能力，并在SVG生成（单帧矢量动画的特例）任务上超越了先前最先进的模型。

Abstract

Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animation and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).

Keywords: vector animation generation, large language models, Lottie tokenizer, autoregressive generation, multimodal model, fine-tuning, Qwen-VL, SVG generation

186. ❌ LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

Authors: Yuqian Yuan, Wenqiao Zhang, Juekai Lin, Yu Zhong, Mingjian Gao, Binhe Yu, Yunqi Cao, Wentong Li, Yueting Zhuang, Beng Chin Ooi Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11789v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

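Scoring rationale: The paper surveys the intersection of large multimodal models (LMMs) and object-centric vision, i.e. the application of large models to vision. Relevant keywords: 'Large Language Models' (LMMs extend LLMs; 8.0); 'Pre-training' and 'Post-training' (the survey covers learning strategies; 5.0 each); 'Instruction Tuning' (related to model alignment; 5.0); 'Chain of Thought' and 'System 2 Thinking' (spatial reasoning and multi-step interaction; 5.0 each); 'Hallucination Mitigation' (precision and reliability; 5.0); 'Mechanistic Interpretability' (interpretability; 5.0); and 'In-context Learning' (learning paradigms; 5.0). Other keywords, such as MoE, quantization, and RAG, do not appear in the abstract and are scored 0.0.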

!!! tip deepseek-chat TL;DR

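This survey reviews work combining large multimodal models with object-centric vision, covering how object-centric approaches address the precision limits of LMMs in object-level understanding, segmentation, editing, and generation. It also summarizes key modeling paradigms, learning strategies, evaluation protocols, and future directions.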

Abstract (Chinese translation)

大型多模态模型（LMMs）在通用视觉—语言理解方面取得了显著进展，但在需要精确物体级定位、细粒度空间推理和可控视觉操作的任务中仍存在局限。具体而言，现有系统往往难以正确识别实例、在交互过程中保持物体身份，以及高精度地定位或修改指定区域。以物体为中心的视觉通过促进对视觉实体的显式表征和操作，为解决这些挑战提供了一个原则性框架，从而将多模态系统从全局场景理解扩展到物体级理解、分割、编辑与生成。本文全面综述了LMMs与以物体为中心的视觉交叉领域的最新进展。我们将相关文献归纳为四大主题：以物体为中心的视觉理解、以物体为中心的指涉分割、以物体为中心的视觉编辑，以及以物体为中心的视觉生成。我们进一步总结了支撑这些能力的关键建模范式、学习策略和评估协议。最后，我们探讨了当前面临的开放挑战与未来方向，包括鲁棒的实例恒常性、细粒度的空间控制、一致的多步交互、统一跨任务建模，以及分布偏移下的可靠基准测试。我们希望本文能为构建可扩展、精确且可信赖的以物体为中心的多模态系统提供一个结构化的视角。

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision–language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.

Keywords: Large Multimodal Models, Object-Centric Vision, Visual Understanding, Referring Segmentation, Visual Editing, Visual Generation, Spatial Reasoning, Multimodal Systems

187. ❌ HDR Video Generation via Latent Alignment with Logarithmic Encoding

Authors: Naomi Ken Korem, Mohamed Oumoumad, Harel Cain, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Yaron Inger, Or Patashnik, Daniel Cohen-Or Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11788v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

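Scoring rationale: The paper studies HDR video generation: it exploits the visual priors of a pretrained generative model, uses a logarithmic encoding to align HDR imagery with the model's latent space, and adapts the model through lightweight fine-tuning. Most keywords do not apply, because the work is an application of generative models in computer vision rather than research on large language models or on core deep-learning techniques. It has moderate relevance (5 each) only to 'Pre-training OR Continual Pre-training OR Domain Adaptation', 'Post-training OR Supervised Fine-tuning OR SFT', and 'PEFT OR LoRA OR Parameter-efficient Fine-tuning', since it involves a pretrained model, fine-tuning, and lightweight adaptation, but none of these is its main subject. Keywords such as LLMs, MoE, Scaling Laws, Alignment, RAG, Reasoning, Agents, and Quantization are not touched on. The paper does not apply AI to a scientific domain (e.g. bioinformatics), and its author list includes none of the designated expert authors.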

!!! tip deepseek-chat TL;DR

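The paper proposes a lightweight adaptation method that combines a pretrained generative video model with a logarithmic encoding to produce high-quality HDR video, without redesigning the model or retraining its encoder.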

Abstract (Chinese translation)

高动态范围（High Dynamic Range, HDR）图像能够丰富且真实地呈现场景辐射亮度，但由于其与生成模型训练时所使用的有界、感知压缩数据不匹配，对生成模型而言仍具挑战性。一种自然的解决方案是为HDR学习新的表示方法，但这会引入额外的复杂性和数据需求。本研究表明，通过利用预训练生成模型已捕获的强大视觉先验，可以一种更简单的方式实现HDR生成。我们观察到，电影制作流程中广泛使用的对数编码将HDR图像映射至一种分布，该分布与这些模型的潜在空间自然对齐，从而可通过轻量级微调直接适配，无需重新训练编码器。为恢复输入中无法直接观察到的细节，我们进一步引入了一种基于相机模拟退化的训练策略，促使模型从其学习到的先验中推断缺失的高动态范围内容。结合这些思路，我们利用经过最小程度适配的预训练视频模型，展示了高质量的HDR视频生成，在多样化场景及复杂光照条件下均取得了优异效果。我们的结果表明，尽管HDR代表了一种根本不同的图像形成机制，但只要所选表示方式与其学习到的先验保持一致，就无需重新设计生成模型即可有效处理HDR内容。

Abstract

High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. A natural solution is to learn new representations for HDR, which introduces additional complexity and data requirements. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured by pretrained generative models. We observe that a logarithmic encoding widely used in cinematic pipelines maps HDR imagery into a distribution that is naturally aligned with the latent space of these models, enabling direct adaptation via lightweight fine-tuning without retraining an encoder. To recover details that are not directly observable in the input, we further introduce a training strategy based on camera-mimicking degradations that encourages the model to infer missing high dynamic range content from its learned priors. Combining these insights, we demonstrate high-quality HDR video generation using a pretrained video model with minimal adaptation, achieving strong results across diverse scenes and challenging lighting conditions. Our results indicate that HDR, despite representing a fundamentally different image formation regime, can be handled effectively without redesigning generative models, provided that the representation is chosen to align with their learned priors.
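
The abstract does not say which logarithmic curve is used. As a rough illustration of why a log encoding maps unbounded scene radiance into a bounded range, here is a minimal sketch using a generic log2 curve. The function names, the 18% mid-grey anchor, and the stop range are illustrative assumptions, not the paper's encoding.

```python
import math

MID_GREY = 0.18     # ASSUMPTION: linear mid-grey anchor
STOPS_BELOW = 8.0   # ASSUMPTION: stops below mid-grey that map to code 0.0
STOPS_ABOVE = 8.0   # ASSUMPTION: stops above mid-grey that map to code 1.0


def log_encode(linear: float) -> float:
    """Map linear radiance (non-negative, unbounded) to a code value in [0, 1]."""
    floor = MID_GREY * 2.0 ** (-STOPS_BELOW)           # avoid log of zero
    stops = math.log2(max(linear, floor) / MID_GREY)   # exposure relative to mid-grey
    code = (stops + STOPS_BELOW) / (STOPS_BELOW + STOPS_ABOVE)
    return min(max(code, 0.0), 1.0)                    # clip out-of-range values


def log_decode(code: float) -> float:
    """Invert log_encode for code values inside [0, 1]."""
    stops = code * (STOPS_BELOW + STOPS_ABOVE) - STOPS_BELOW
    return MID_GREY * 2.0 ** stops


for radiance in (0.01, 0.18, 1.0, 10.0):
    code = log_encode(radiance)
    print(f"linear={radiance:>6} code={code:.3f} decoded={log_decode(code):.3f}")
```

Under these assumptions, radiance values spanning several orders of magnitude land in the same [0, 1] interval as ordinary display-referred data, which is the property the abstract relies on for aligning HDR imagery with a pretrained model's latent space.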

Keywords: HDR video generation, latent alignment, logarithmic encoding, pretrained generative models, lightweight fine-tuning, camera-mimicking degradations, visual priors

188. ❌ Autonomous Diffractometry Enabled by Visual Reinforcement Learning

Authors: J. Oppliger, M. Stifter, A. Rüegg, I. Biało, L. Martinelli, P. G. Freeman, D. Prabhakaran, J. Zhao, Q. Wang, J. Chang Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11773v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

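Scoring rationale: The paper, 'Autonomous Diffractometry Enabled by Visual Reinforcement Learning', uses model-free reinforcement learning to automate crystal alignment on a diffractometer, an AI application in materials science. The keywords target large language models (LLMs) and their training and inference techniques, whereas the paper's core technique is reinforcement learning (specifically visual reinforcement learning), not LLMs. The only plausibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to materials science (automated experimental workflows); it is not a core match, so it is scored 5 (some relevance). The remaining keywords (LLMs, MoE, training methods, reasoning techniques, agent systems, and so on) are not covered and are scored 0.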

!!! tip deepseek-chat TL;DR

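The paper tackles the automation of crystal alignment on diffractometers, which normally relies on humans interpreting diffraction patterns. Using a model-free reinforcement learning framework, an agent learns directly from Laue diffraction patterns to navigate towards high-symmetry orientations, achieving time-efficient alignment across different crystal symmetry classes and providing a computational framework for intelligent diffractometers in materials science.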

Abstract (Chinese translation)

自动化是科学与工业领域进步的基础。然而，对于需要解读抽象视觉信息的任务，实现自动化仍具挑战性。例如，晶体取向调整在很大程度上依赖人类对衍射图样的理解能力。本文介绍了一种无需依赖晶体学和衍射理论即可自主调整单晶取向的系统。通过采用无模型强化学习框架，智能体能够直接从劳厄衍射（Laue diffraction）图案中学习识别并导航至高对称性取向。尽管缺乏人类监督，该智能体仍能发展出类人策略，在不同晶体对称性类别中实现高效的时间优化取向调整。藉此，我们为智能衍射仪提供了一个计算框架。因此，我们的方法推动了材料科学中自动化实验流程的发展。

Abstract

Automation underpins progress across scientific and industrial disciplines. Yet, automating tasks requiring interpretation of abstract visual information remain challenging. For example, crystal alignment strongly relies on humans with the ability to comprehend diffraction patterns. Here we introduce an autonomous system that aligns single crystals without access to crystallography and diffraction theory. Using a model-free reinforcement learning framework, an agent learns to identify and navigate towards high-symmetry orientations directly from Laue diffraction patterns. Despite the absence of human supervision, the agent develops human-like strategies to achieve time-efficient alignment across different crystal symmetry classes. With this, we provide a computational framework for intelligent diffractometers. As such, our approach advances the development of automated experimental workflows in materials science.

Keywords: autonomous diffractometry, visual reinforcement learning, crystal alignment, Laue diffraction patterns, model-free reinforcement learning, automated experimental workflows, materials science, high-symmetry orientations

189. ❌ MosaicMRI: A Diverse Dataset and Benchmark for Raw Musculoskeletal MRI

Authors: Paula Arguello, Berk Tinaz, Mohammad Shahab Sepehri, Maryam Soltanolkotabi, Mahdi Soltanolkotabi Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11762v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

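Scoring rationale: The paper is in medical imaging (MRI): it studies deep learning for MRI reconstruction and introduces a new dataset, MosaicMRI. It is unrelated to nearly all keywords (core large-model techniques, training methods, inference optimization, alignment, agents, and so on), which refer specifically to natural language processing or general-purpose large models. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the paper applies AI to a scientific field (medical imaging). That is not its core contribution, however (the core is the dataset and the MRI reconstruction experiments, not a methodological advance in AI for Science), so it is scored 5 (some relevance).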

!!! tip deepseek-chat TL;DR

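Noting that public MRI datasets focus largely on brain and knee imaging, the paper introduces MosaicMRI, a large and diverse dataset of raw musculoskeletal MRI measurements. Its experiments find that, in low-sample regimes, models trained on combined anatomies significantly outperform anatomy-specific models, pointing to exploitable cross-anatomical correlations.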

Abstract (Chinese translation)

深度学习支撑着磁共振成像（MRI）中从重建、伪影去除到分割的广泛应用。然而，该领域的进展很大程度上由聚焦于脑部和膝关节成像的公共数据集所驱动，这塑造了模型的训练与评估方式。因此，对于这些模型在不同解剖部位间可靠性的深入研究仍然有限。在本研究中，我们推出了MosaicMRI——一个大规模、多样化的全采样原始肌肉骨骼（MSK）磁共振测量数据集，专为训练和评估基于机器学习的方法而设计。MosaicMRI是迄今为止最大的开源原始肌肉骨骼MRI数据集，包含2,671个体积数据和80,156张切片。该数据集在体积方向（如轴向、矢状面）、成像对比度（如质子密度加权PD、T1加权、T2加权）、解剖部位（如脊柱、膝关节、髋关节、踝关节等）以及采集线圈数量方面提供了显著的多样性。我们以VarNet作为加速重建任务的基线，进行了一系列综合实验，以研究模型容量和数据集规模对性能的影响规律。有趣的是，在低样本量情况下，基于多解剖部位组合数据训练的模型显著优于针对单一解剖部位训练的模型，这凸显了解剖多样性的优势以及可利用的跨解剖相关性。我们进一步通过在一个解剖部位（如脊柱）上训练模型并在另一个部位（如膝关节）上测试，评估了模型的鲁棒性和跨解剖泛化能力。值得注意的是，我们发现某些身体部位组（如足部和肘部）彼此间能实现良好的泛化，并强调领域偏移下的性能表现同时取决于训练集规模、解剖部位以及特定扫描协议因素。

Abstract

Deep learning underpins a wide range of applications in MRI, including reconstruction, artifact removal, and segmentation. However, progress has been driven largely by public datasets focused on brain and knee imaging, shaping how models are trained and evaluated. As a result, careful studies of the reliability of these models across diverse anatomical settings remain limited. In this work, we introduce MosaicMRI, a large and diverse collection of fully sampled raw musculoskeletal (MSK) MR measurements designed for training and evaluating machine-learning-based methods. MosaicMRI is the largest open-source raw MSK MRI dataset to date, comprising 2,671 volumes and 80,156 slices. The dataset offers substantial diversity in volume orientation (e.g., axial, sagittal), imaging contrasts (e.g., PD, T1, T2), anatomies (e.g., spine, knee, hip, ankle, and others), and numbers of acquisition coils. Using VarNet as a baseline for accelerated reconstruction task, we perform a comprehensive set of experiments to study scaling behavior with respect to both model capacity and dataset size. Interestingly, models trained on the combined anatomies significantly outperform anatomy-specific models in low-sample regimes, highlighting the benefits of anatomical diversity and the presence of exploitable cross-anatomical correlations. We further evaluate robustness and cross-anatomy generalization by training models on one anatomy (e.g., spine) and testing them on another (e.g., knee). Notably, we identify groups of body parts (e.g., foot and elbow) that generalize well with each other, and highlight that performance under domain shifts depends on both training set size, anatomy, and protocol-specific factors.

Keywords: Musculoskeletal MRI, Deep Learning, Dataset, Reconstruction, Anatomical Diversity, Cross-anatomy Generalization, VarNet, Scaling Behavior

190. ❌ Learning Long-term Motion Embeddings for Efficient Kinematics Generation

Authors: Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh Susskind, Björn Ommer Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11737v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

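Scoring rationale: The paper addresses motion understanding and generation in computer vision, specifically motion embedding learning, conditional flow-matching models, and motion generation. Although it uses deep-learning techniques such as conditional flow matching, its content is unrelated to all of the scoring keywords, which center on large-language-model techniques, training methods, inference optimization, alignment, and agent systems. It involves no language models, MoE, quantization, inference acceleration, alignment, RAG, CoT, or agents, and it does not fall under AI for Science fields such as bioinformatics or cheminformatics.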

!!! tip deepseek-chat TL;DR

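The paper learns a long-term motion embedding and trains a conditional flow-matching model in that space to efficiently generate realistic motion that meets goals specified via text prompts or spatial pokes. The resulting motion distributions outperform those of state-of-the-art video models and task-specific approaches.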

Abstract (Chinese translation)

理解与预测运动是视觉智能的基础组成部分。尽管现代视频模型展现出对场景动态的出色理解能力，但通过完整视频合成来探索多种可能未来仍存在难以克服的效率瓶颈。我们通过直接操作从追踪器模型获取的大规模轨迹数据中学习到的长期运动嵌入，实现了数量级更高效的场景动态建模。该方法能够高效生成长时间、符合现实规律的运动序列，并满足通过文本提示或空间戳指定的目标。为实现这一目标，我们首先学习具有64倍时间压缩因子的高度压缩运动嵌入空间。在此空间中，我们训练条件流匹配模型，使其能根据任务描述生成运动潜在表示。最终获得的运动分布在性能上超越了当前最先进的视频模型与专用任务特定方法。

Abstract

Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.

Keywords: motion embedding, kinematics generation, conditional flow-matching, video synthesis, trajectory learning, temporal compression, motion prediction, scene dynamics

191. ❌ Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions

Authors: Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Lorenzo Sia, Nicolas Richet, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11730v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

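Scoring rationale: The paper studies recognition of ambivalence/hesitancy (A/H) in videos for personalized digital health interventions. Keyword relevance: (1) it explicitly uses LLMs for zero-shot inference (8); (2) its learning setups include unsupervised domain adaptation (5) and supervised learning, mapped here to Post-training/SFT (5); (3) it is an AI for Science application in healthcare (8); (4) other keywords, such as MoE, quantization, and inference acceleration, are not covered (0).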

!!! tip deepseek-chat TL;DR

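The paper studies deep-learning models for recognizing ambivalence/hesitancy in videos to support personalized digital health interventions. Experiments show limited performance for existing approaches, indicating a need for better multimodal fusion models.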

Abstract (Chinese translation)

运用行为科学理论，健康干预措施通过提供系统性框架帮助患者建立并维持改善医疗结果的健康习惯，从而实现行为改变。线下干预模式成本高昂且难以规模化推广，在资源有限地区尤为突出。数字健康干预提供了一种经济高效的替代方案，有望支持独立生活与自我管理能力。近年来，自动化干预技术——特别是基于机器学习的方法——受到广泛关注。矛盾与犹豫心理是导致个体延迟、回避或放弃健康干预的核心因素。这类情绪体现为个体对特定行为同时存在积极与消极评价，或在接受与拒绝参与之间摇摆的微妙冲突状态。其具体表现为跨模态或单一模态内的情感表达不一致，例如语言、面部表情、声音特征及肢体动作之间的不协调。虽然可通过专业培训使专家识别此类心理状态，但将其整合至数字健康干预系统不仅成本高昂且效率有限。因此，自动化的矛盾与犹豫识别技术对于实现个性化、高性价比的数字健康干预至关重要。本文探索深度学习模型在视频多模态矛盾与犹豫识别任务中的应用，具体涵盖三种学习范式：监督学习、面向个性化的无监督域适应，以及基于大语言模型的零样本推理。实验采用近期发布的专用BAH视频数据集进行验证。结果表明现有模型性能有限，提示需要开发更适配的多模态模型以实现精准识别。未来需构建更优的时空建模与多模态融合方法，以有效捕捉模态内及跨模态的情感冲突特征。

Abstract

Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role for individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial, vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities.

Keywords: ambivalence recognition, hesitancy recognition, digital health interventions, multimodal deep learning, unsupervised domain adaptation, large language models, video analysis, behavioral science

192. ❌ The Devil is in the Details – From OCR for Old Church Slavonic to Purely Visual Stemma Reconstruction

Authors: Armin Hoenen Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11724v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
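
The 26.6 in the score line above is the report's passing score. As a minimal sketch, assuming the formula stated in the report's configuration section (5.0 + 0.8 × total weight, with a total keyword weight of 27.0), the threshold and the pass check can be reproduced as follows. The function names are illustrative and are not taken from the report generator itself.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing score as configured in the report header: base + factor * total weight."""
    return round(base + factor * total_weight, 1)


def passes(total_score: float, total_weight: float) -> bool:
    """A paper passes when its total score reaches the passing score."""
    return total_score >= passing_score(total_weight)


if __name__ == "__main__":
    print(passing_score(27.0))  # 27 keywords, each with weight 1.0
    print(passes(0.0, 27.0))    # the paper above has a total score of 0.0
```

With a total score of 0.0, the paper falls below the threshold, which matches the ❌ mark on its score line.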

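Scoring rationale: The paper's core topics are OCR and stemmatology, which makes it an AI for Science application. It explicitly uses LLMs (GPT5 and Gemini3-flash) for OCR comparison and post-processing and tests agentic OCR architectures (including RAG), so it is highly relevant to 'Large Language Models', 'Retrieval-Augmented Generation', and 'LLM Agents' (10 points each). The other keywords, such as MoE, SLMs, training methods, and inference optimization, are not covered and score 0.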

!!! tip deepseek-chat TL;DR

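The study compares multiple OCR systems (including LLMs) on Old Church Slavonic manuscripts and develops a new stemma reconstruction method based purely on image processing, validating its basic functioning on small corpora.

Abstract translation (Chinese)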

人工智能时代为诸多领域和任务带来了新的可能性与潜在风险。细节决定成败，这些细节在构建新流程和执行小型实践实验时尤为凸显。光学字符识别（OCR）与文本谱系学（stemmatology）亦不例外。本研究首先比较了一系列OCR系统（从经典方法、机器学习到大型语言模型（LLM））对约6000个18世纪晚期手写教会斯拉夫语（Church Slavonic）文献字符的识别效果。聚焦于基础字母的正确性，我们对超过10种CS OCR系统（其中包括2种LLM：GPT5和Gemini3-flash）进行了比较。随后，评估了通过LLM进行后处理的效果，并最终测试了不同的智能体OCR架构（专用后处理智能体、智能体流程和检索增强生成（RAG））。实验表明，借助新技术的完善，教会斯拉夫语基础字母的字符错误率（CER）可低至2-3%，但复杂的变音符号仍可能构成挑战。OCR能在多大程度上为下游任务（文本谱系分析）提供良好基础，是本文第二部分的切入点。该部分介绍了一种全新、完全基于图像处理的谱系分析方法。该方法通过自动视觉字形提取、聚类和成对统计比较的流程，生成距离矩阵并最终构建谱系图。此方法已应用于两个小型语料库：一个包含14至16世纪的教会斯拉夫语《马可福音》，另一个包含14至15世纪的法语《玫瑰传奇》。实验证明了该方法的基本可行性。

摘要 (Abstract)

The age of artificial intelligence has brought many new possibilities and pitfalls in many fields and tasks. The devil is in the details, and those come to the fore when building new pipelines and executing small practical experiments. OCR and stemmatology are no exception. The current investigation starts comparing a range of OCR-systems, from classical over machine learning to LLMs, for roughly 6,000 characters of late handwritten church slavonic manuscripts from the 18th century. Focussing on basic letter correctness, more than 10 CS OCR-systems among which 2 LLMs (GPT5 and Gemini3-flash) are being compared. Then, post-processing via LLMs is assessed and finally, different agentic OCR architectures (specialized post-processing agents, an agentic pipeline and RAG) are tested. With new technology elaborated, experiments suggest, church slavonic CER for basic letters may reach as low as 2-3% but elaborated diacritics could still present a problem. How well OCR can prime stemmatology as a downstream task is the entry point to the second part of the article which introduces a new stemmatic method based solely on image processing. Here, a pipeline of automated visual glyph extraction, clustering and pairwise statistical comparison leading to a distance matrix and ultimately a stemma, is being presented and applied to two small corpora, one for the church slavonic Gospel of Mark from the 14th to 16th centuries, one for the Roman de la Rose in French from the 14th and 15th centuries. Basic functioning of the method can be demonstrated.

Keywords: OCR, Old Church Slavonic, LLMs, Agentic OCR, RAG, Stemma Reconstruction, Image Processing, Handwritten Manuscripts
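
The abstract above reports a character error rate (CER) of 2-3% for basic letters. As a rough illustration of how such a figure is commonly computed, the sketch below uses the standard definition: Levenshtein edit distance divided by the reference length. The `strip_diacritics` helper is an assumption about how "basic letter correctness" might be approximated; the paper's actual evaluation protocol is not given in this report.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def strip_diacritics(text: str) -> str:
    """Assumed helper: drop combining marks so only base letters are compared.

    Caveat: NFD also decomposes letters such as 'й' into 'и' plus a breve,
    so this simplification merges some distinct letters.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    reference = "сло\u0301во"   # 'слово' with a combining acute accent
    ocr_output = "слово"        # hypothetical OCR output that lost the accent
    print(cer(reference, ocr_output))
    print(cer(strip_diacritics(reference), strip_diacritics(ocr_output)))
```

Comparing the two printed values shows why a base-letter CER can look much better than a CER that also counts diacritics, which is the distinction the abstract draws.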

193. ❌ BEM: Training-Free Background Embedding Memory for False-Positive Suppression in Real-Time Fixed-Background Camera

Authors: Junwoo Park, Jangho Lee, Sunho Lim Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11714v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

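Scoring rationale: The paper addresses object detection in computer vision and proposes a Background Embedding Memory (BEM) module for suppressing false positives, a conventional deep-learning application in vision. All scoring keywords concern large language models (LLMs), the technical principles of large models, AI for Science, and related topics, none of which this paper touches, so every keyword has a relevance of 0.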

!!! tip deepseek-chat TL;DR

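The paper targets the false positives that pretrained detectors produce under distribution gaps in fixed-background camera scenes. It proposes Background Embedding Memory (BEM), a lightweight, training-free module that re-scores detections by background similarity, suppressing false positives while preserving real-time performance.

Abstract translation (Chinese)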

预训练检测器在基准测试中表现良好，但在实际部署中常因训练数据与目标环境间的分布差异而出现性能下降。COCO类基准强调类别多样性而非实例密度，导致在按类别稀疏性训练下的检测器在密集、单类或少类场景（如监控和交通监测）中表现不佳。在固定摄像头环境中，准静态背景提供了一个稳定、无需标注的先验信息，可在推理过程中加以利用以抑制误检。针对此问题，我们提出背景嵌入记忆模块（Background Embedding Memory, BEM），这是一个轻量级、无需训练、权重冻结的模块，可在推理时附加到预训练检测器上。BEM通过估计干净的背景嵌入，维护原型记忆库，并利用逆相似度排序加权惩罚对检测逻辑值进行重评分，从而在保持召回率的同时有效减少误报。实验表明，背景帧的余弦相似度与物体数量呈负相关，与精确度-置信度曲线下面积（Precision-Confidence AUC, P-AUC）呈正相关，这支持了其作为无需训练的控制信号的可行性。在LLVIP数据集和模拟监控视频流上，基于YOLO和RT-DETR系列模型的测试表明，BEM能持续减少误报，同时保持实时性能。我们的代码公开于https://github.com/Leo-Park1214/Background-Embedding-Memory.git。

摘要 (Abstract)

Pretrained detectors perform well on benchmarks but often suffer performance degradation in real-world deployments due to distribution gaps between training data and target environments. COCO-like benchmarks emphasize category diversity rather than instance density, causing detectors trained under per-class sparsity to struggle in dense, single- or few-class scenes such as surveillance and traffic monitoring. In fixed-camera environments, the quasi-static background provides a stable, label-free prior that can be exploited at inference to suppress spurious detections. To address the issue, we propose Background Embedding Memory (BEM), a lightweight, training-free, weight-frozen module that can be attached to pretrained detectors during inference. BEM estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits with an inverse-similarity, rank-weighted penalty, effectively reducing false positives while maintaining recall. Empirically, background-frame cosine similarity correlates negatively with object count and positively with Precision-Confidence AUC (P-AUC), motivating its use as a training-free control signal. Across YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance. Our code is available at https://github.com/Leo-Park1214/Background-Embedding-Memory.git

Keywords: object detection, false-positive suppression, background embedding, training-free, real-time, fixed-camera, pretrained detectors, surveillance
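
The abstract above describes BEM as re-scoring detection logits with an "inverse-similarity, rank-weighted penalty" against a prototype memory of background embeddings. The abstract does not give the formula, so the following is only a sketch of one plausible reading: a detection whose embedding resembles stored background prototypes receives a larger penalty, and the most similar prototypes count most (weight 1/rank). The function names, the `alpha` parameter, and the weighting scheme are all assumptions, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rescore_logit(logit, embedding, prototypes, alpha=2.0):
    """Subtract a background-similarity penalty from a detection logit.

    Assumed scheme: similarities to all prototypes are sorted in descending
    order and averaged with weights 1/rank; `alpha` scales the penalty.
    """
    if not prototypes:
        return logit  # no background memory yet, leave the score unchanged
    sims = sorted((max(0.0, cosine(embedding, p)) for p in prototypes), reverse=True)
    weights = [1.0 / rank for rank in range(1, len(sims) + 1)]
    background_score = sum(w * s for w, s in zip(weights, sims)) / sum(weights)
    return logit - alpha * background_score


if __name__ == "__main__":
    memory = [[1.0, 0.0], [0.9, 0.1]]   # hypothetical background prototypes
    background_like = [1.0, 0.05]       # embedding close to the background
    object_like = [0.0, 1.0]            # embedding far from the background
    print(rescore_logit(2.0, background_like, memory))
    print(rescore_logit(2.0, object_like, memory))
```

Because the weights of the detector are never touched and only the logits are adjusted at inference time, a scheme of this kind stays training-free, as the abstract emphasises.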

194. ❌ Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models

Authors: Nhan Ho, Luu Le, Thanh-Huy Nguyen, Thien Nguyen, Xiaofeng Liu, Ulas Bagci Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11711v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

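Scoring rationale: The paper focuses on evaluating the occlusion robustness of medical image segmentation models. It belongs to computer vision and medical AI and is unrelated to the vast majority of keywords (LLM techniques, training methods, inference optimization, alignment, agent systems, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper studies medical endoscopy image analysis, an application of AI in biomedicine (related to bioinformatics), but this is the application setting rather than the core contribution, so it receives 5 points (some relevance).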

!!! tip deepseek-chat TL;DR

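The paper addresses the challenge of target structures being partially hidden by surgical instruments or tissue in clinical endoscopy. It proposes the OccSAM-Bench benchmark to systematically evaluate the occlusion robustness of SAM-family segmentation models and finds that model architectures fall into two distinct behaviour patterns under occlusion (occluder-aware and occluder-agnostic), so model selection should depend on the specific clinical intent.

Abstract translation (Chinese)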

遮挡（即目标结构被手术器械或重叠组织部分遮蔽）对于临床内窥镜中的基础分割模型而言，仍是一个关键但尚未被充分探索的挑战。我们提出了OccSAM-Bench，这是一个旨在系统评估SAM系列模型在受控、合成手术遮挡下性能的基准。我们的框架在三个公开息肉数据集上，模拟了两种遮挡类型（即手术器械覆盖和切除），并设定了三个校准的严重程度等级。我们提出了一种新颖的三区域评估方案，将分割性能分解为完整目标、仅可见目标和不可见目标。该指标揭示了标准非模态评估所掩盖的行为，并识别出两种不同的模型原型：遮挡感知型模型（SAM、SAM 2、SAM 3、MedSAM3），其优先描绘可见组织并排斥器械；以及遮挡不感知型模型（MedSAM、MedSAM2），其会自信地预测到被遮挡区域。SAM-Med2D与两者均不一致，且在所有条件下表现不佳。最终，我们的结果表明，不同架构对遮挡的鲁棒性并不一致，模型选择必须由特定的临床意图驱动——是优先考虑保守的可见组织分割，还是对隐藏解剖结构的非模态推理。

摘要 (Abstract)

Occlusion, where target structures are partially hidden by surgical instruments or overlapping tissues, remains a critical yet underexplored challenge for foundation segmentation models in clinical endoscopy. We introduce OccSAM-Bench, a benchmark designed to systematically evaluate SAM-family models under controlled, synthesized surgical occlusion. Our framework simulates two occlusion types (i.e., surgical tool overlay and cutout) across three calibrated severity levels on three public polyp datasets. We propose a novel three-region evaluation protocol that decomposes segmentation performance into full, visible-only, and invisible targets. This metric exposes behaviors that standard amodal evaluation obscures, revealing two distinct model archetypes: Occluder-Aware models (SAM, SAM 2, SAM 3, MedSAM3), which prioritize visible tissue delineation and reject instruments, and Occluder-Agnostic models (MedSAM, MedSAM2), which confidently predict into occluded regions. SAM-Med2D aligns with neither and underperforms across all conditions. Ultimately, our results demonstrate that occlusion robustness is not uniform across architectures, and model selection must be driven by specific clinical intent: whether prioritizing conservative visible-tissue segmentation or the amodal inference of hidden anatomy.

Keywords: occlusion robustness, foundation segmentation models, SAM-family models, surgical occlusion, benchmark evaluation, medical endoscopy, polyp segmentation, amodal inference
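
The three-region evaluation protocol mentioned in the abstract above decomposes segmentation performance into full, visible-only, and invisible targets. A minimal sketch of that idea follows, assuming binary masks and the Dice coefficient as the overlap measure. The exact metric and region definitions used by OccSAM-Bench are not stated in this report, so `three_region_scores` and its handling of the invisible region are illustrative assumptions.

```python
def dice(pred, target):
    """Dice overlap between two flat binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def three_region_scores(pred, gt_full, occluder):
    """Score a prediction against the full, visible-only and invisible target.

    visible   = target pixels not covered by the occluder
    invisible = target pixels hidden behind the occluder
    For the invisible region, only predictions inside the occluder are counted
    (an assumption made for this sketch).
    """
    visible = [g & (1 - o) for g, o in zip(gt_full, occluder)]
    invisible = [g & o for g, o in zip(gt_full, occluder)]
    pred_hidden = [p & o for p, o in zip(pred, occluder)]
    return {
        "full": dice(pred, gt_full),
        "visible": dice(pred, visible),
        "invisible": dice(pred_hidden, invisible),
    }


if __name__ == "__main__":
    gt_full = [1, 1, 1, 1, 0, 0]    # complete target, including the hidden part
    occluder = [0, 0, 1, 1, 1, 0]   # synthetic instrument overlay
    occluder_aware = [1, 1, 0, 0, 0, 0]     # segments visible tissue only
    occluder_agnostic = [1, 1, 1, 1, 0, 0]  # predicts into the occluded region
    print(three_region_scores(occluder_aware, gt_full, occluder))
    print(three_region_scores(occluder_agnostic, gt_full, occluder))
```

In this toy case the occluder-aware prediction scores perfectly on the visible region and zero on the invisible one, while the occluder-agnostic prediction shows the opposite pattern. A single full-target score would hide that difference.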

195. ❌ Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Authors: Efstathios Karypidis, Spyros Gidaris, Nikos Komodakis Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11707v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

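Scoring rationale: The paper focuses on video prediction in computer vision and proposes a hierarchical framework based on semantic representation prediction and visual synthesis. Although it uses a frozen vision foundation model and a latent diffusion model, the work is entirely about video prediction as a computer vision problem and does not involve large language models, innovations in the technical principles of deep learning, applications of large models across domains, or AI for Science. None of the scoring keywords relate to the paper's content.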

!!! tip deepseek-chat TL;DR

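The paper proposes Re2Pix, a hierarchical video prediction framework that first predicts the semantic representation of future scenes and then uses it to guide visual synthesis, addressing visual fidelity and semantic consistency in future video prediction for complex dynamic environments. On autonomous-driving benchmarks it significantly improves temporal semantic consistency, perceptual quality, and training efficiency.

Abstract translation (Chinese)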

精确的未来视频预测需要同时具备高视觉保真度与连贯的场景语义，这在自动驾驶等复杂动态环境中尤为关键。本文提出Re2Pix——一种分层视频预测框架，将预测任务分解为两个阶段：语义表征预测与表征引导的视觉合成。该方法并非直接预测未来的RGB帧，而是首先在冻结的视觉基础模型特征空间中预测未来场景结构，随后以这些预测表征为条件驱动潜在扩散模型，从而渲染出逼真的视频帧。这种分解使模型能够先聚焦于场景动态，再专注于外观生成。一个关键挑战源于训练阶段可用的真实表征与推理阶段使用的预测表征之间的训练-测试失配问题。为解决此问题，我们引入了两种条件策略：嵌套随机丢弃与混合监督，以提升模型对不完美自回归预测的鲁棒性。在具有挑战性的驾驶基准测试上的实验表明，与强大的扩散基线模型相比，所提出的语义优先设计显著提升了时序语义一致性、感知质量及训练效率。实现代码发布于https://github.com/Sta8is/Re2Pix。

摘要 (Abstract)

Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix

Keywords: video prediction, semantic representation, hierarchical framework, latent diffusion model, autonomous driving, scene dynamics, visual synthesis, temporal consistency

196. ❌ LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment

Authors: Dujun Nie, Fengjiao Chen, Qi Lv, Jun Kuang, Xiaoyu Li, Xuezhi Cao, Xunliang Cai Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11689v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

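Scoring rationale: The paper studies vision-to-action alignment, which belongs to multimodal AI and mainly involves visual foundation models, action representation learning, and robotic control. It is unrelated to most keywords (LLMs, MoE, Scaling Laws, RLHF, RAG, etc.), which target pure language models or specific NLP techniques. The only related keyword is "Alignment", but the alignment in this paper is between visual and action spaces rather than the value alignment of language models, so it receives 5 points (some relevance). No other keyword applies, and each scores 0.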

!!! tip deepseek-chat TL;DR

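The paper introduces the LARY benchmark for evaluating how well latent action representations extracted from visual observations generalize to semantic action and robotic control tasks. It finds that general visual foundation models outperform specialized embodied action models, and that latent representation space aligns with physical action space more effectively than pixel space.

Abstract translation (Chinese)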

尽管显性动作数据的匮乏限制了视觉-语言-动作（VLA）模型的发展，人类动作视频却提供了一个可扩展但未标注的数据来源。利用大规模人类视频数据集的一个关键挑战在于将视觉信号转化为独立于本体论的表示形式，即潜在动作。然而，潜在动作表示从视觉观察中推导出鲁棒控制的能力尚未得到严格评估。我们提出了潜在动作表示生成（LARY）基准，这是一个用于评估潜在动作表示在高层语义动作（做什么）和低层机器人控制（如何做）两方面表现的统一框架。该精心构建的数据集包含超过一百万段视频（总计1000小时），涵盖151个动作类别，同时包含62万张图像对和59.5万条运动轨迹，覆盖多样化的具身形态和环境。我们的实验揭示了两个关键发现：（i）未经任何动作监督训练的通用视觉基础模型，其表现持续优于专门的具身潜在动作模型。（ii）基于潜在表示的视觉空间在本质上比基于像素的空间更贴合物理动作空间。这些结果表明，通用视觉表示内在地编码了与物理控制相关的动作知识，并且语义层面的抽象是从视觉到动作的一条本质上比像素级重建更为有效的路径。

摘要 (Abstract)

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models. (ii) Latent-based visual space is fundamentally better aligned to physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.

Keywords: Vision-to-Action Alignment, Latent Action Representation, Visual Foundation Models, Robotic Control, Benchmark Evaluation, Semantic Actions, Physical Action Space, General Visual Representations

197. ❌ Unfolding 3D Gaussian Splatting via Iterative Gaussian Synopsis

Authors: Yuqin Lu, Yang Zhou, Yihua Dai, Guiqing Li, Shengfeng He Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11685v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

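Scoring rationale: The paper focuses on storage optimization and progressive rendering for 3D Gaussian Splatting (3DGS). It proposes a new framework, Iterative Gaussian Synopsis, which builds a multi-level hierarchy through a top-down unfolding scheme and an adaptive pruning mechanism. All scoring keywords relate directly to large language models, the technical principles of deep learning, or AI applications in science, whereas this paper studies 3D reconstruction and rendering in computer graphics and has no direct connection to them. It does not involve large models, deep-learning innovations, or AI applications in scientific fields such as biomedicine, so every keyword scores 0.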

!!! tip deepseek-chat TL;DR

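The paper addresses the storage and deployment challenges of 3D Gaussian Splatting (3DGS) in streaming and resource-constrained environments. It proposes the Iterative Gaussian Synopsis framework, which achieves compact progressive rendering through top-down unfolding and adaptive pruning, substantially reducing storage while maintaining high rendering quality.

Abstract translation (Chinese)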

三维高斯泼溅（3D Gaussian Splatting，简称3DGS）已成为实时高保真新视角合成的先进框架。然而，其巨大的存储需求与固有的非结构化表示方式，为在流式传输与资源受限环境中的部署带来了挑战。现有的细节层次（Level-of-Detail，LOD）策略，尤其是基于自底向上构建的方法，常引入冗余或导致保真度下降。为克服这些局限，我们提出迭代高斯摘要，这是一种通过自顶向下“展开”方案实现紧凑渐进式渲染的新框架。我们的方法从全分辨率3DGS模型出发，利用一种自适应的、可学习的基于掩码的剪枝机制，迭代地生成更粗糙的细节层次。这一过程构建了一个多层级的层次结构，在提升效率的同时保持了视觉质量。我们将捕捉全局场景结构的层次化空间网格，与建模局部细节的共享锚点码本相结合。这种组合产生了一种紧凑而富有表现力的特征表示，旨在最小化冗余并支持高效的、针对特定层级的自适应。展开机制促进了层间可重用性，且仅需极小的数据开销即可实现渐进式细化。实验表明，我们的方法在所有细节层次上均保持了高渲染质量，同时实现了显著的存储缩减。这些结果证明了我们的方法在带宽和内存受限场景下进行实时3DGS渲染的实用性与可扩展性。

摘要 (Abstract)

3D Gaussian Splatting (3DGS) has become a state-of-the-art framework for real-time, high-fidelity novel view synthesis. However, its substantial storage requirements and inherently unstructured representation pose challenges for deployment in streaming and resource-constrained environments. Existing Level-of-Detail (LOD) strategies, particularly those based on bottom-up construction, often introduce redundancy or lead to fidelity degradation. To overcome these limitations, we propose Iterative Gaussian Synopsis, a novel framework for compact and progressive rendering through a top-down “unfolding” scheme. Our approach begins with a full-resolution 3DGS model and iteratively derives coarser LODs using an adaptive, learnable mask-based pruning mechanism. This process constructs a multi-level hierarchy that preserves visual quality while improving efficiency. We integrate hierarchical spatial grids, which capture the global scene structure, with a shared Anchor Codebook that models localized details. This combination produces a compact yet expressive feature representation, designed to minimize redundancy and support efficient, level-specific adaptation. The unfolding mechanism promotes inter-layer reusability and requires only minimal data overhead for progressive refinement. Experiments show that our method maintains high rendering quality across all LODs while achieving substantial storage reduction. These results demonstrate the practicality and scalability of our approach for real-time 3DGS rendering in bandwidth- and memory-constrained scenarios.

Keywords: 3D Gaussian Splatting, Iterative Gaussian Synopsis, Level-of-Detail, progressive rendering, storage reduction, adaptive pruning, real-time rendering, novel view synthesis

198. ❌ Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

Authors: Asbjørn Munk, Stefano Cerri, Vardan Nersesjan, Christian Hedeager Krag, Jakob Ambsdorf, Pablo Rocamora García, Julia Machnio, Peirong Liu, Suhyun Ahn, Nasrin Akbari, Yasmina Al Khalil, Kimberly Amador, Sina Amirrajab, Tal Arbel, Meritxell Bach Cuadra, Ujjwal Baid, Bhakti Baheti, Jaume Banus, Kamil Barbierik, Christoph Brune, Yansong Bu, Baptiste Callard, Yuhan Chen, Cornelius Crijnen, Corentin Dancette, Peter Drotar, Prasad Dutande, Nils D. Forkert, Saurabh Garg, Jakub Gazda, Matej Gazda, Benoît Gérin, Partha Ghosh, Weikang Gong, Pedro M. Gordaliza, Sam Hashemi, Tobias Heimann, Fucang Jia, Jiexin Jiang, Emily Kaczmarek, Chris Kang, Seung Kwan Kang, Mohammad Khazaei, Julien Khlaut, Petros Koutsouvelis, Jae Sung Lee, Yuchong Li, Mengye Lyu, Mingchen Ma, Anant Madabhushi, Klaus H. Maier-Hein, Pierre Manceron, Andrés Martínez Mora, Moona Mazher, Felix Meister, Nataliia Molchanova, Steven A. Niederer, Leonard Nürnberg, Jinah Park, Abdul Qayyum, Jonas Richiardi, Antoine Saporta, Branislav Setlak, Ning Shen, Justin Szeto, Constantin Ulrich, Puru Vaish, Vibujithan Vigneshwaran, Leroy Volmer, Zihao Wang, Siqi Wei, Anthony Winder, Jelmer M. Wolterink, Maxence Wynen, Chang Yang, Si Young Yie, Mostafa Mehdipour Ghazi, Akshay Pai, Espen Jimenez Solem, Sebastian Nørgaard Llambias, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11679v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

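Scoring rationale: The paper focuses on foundation models for medical imaging (brain MRI). It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), because its core subject is brain MRI foundation models; to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points), because it studies self-supervised pretraining and domain adaptation; and to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), because it is a bioinformatics and medical AI application. The other keywords (MoE, SFT, RAG, quantization, etc.) are not mentioned in the abstract and are unrelated (0 points).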

!!! tip deepseek-chat TL;DR

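Through the FOMO25 challenge, the paper evaluates brain MRI foundation models. It finds that self-supervised pretraining improves generalization to clinical data and that small models perform well, while scaling model size and training duration did not yield reliable gains.

Abstract translation (Chinese)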

脑MRI自动分析技术的临床部署面临一个根本性挑战：临床数据具有高度异质性与噪声，而获取高质量标注的成本极其高昂。自监督学习（SSL）可通过利用临床工作流程中产生的大量未标注数据来应对这一挑战，从而训练出能够以最少监督适应域外数据的鲁棒性基础模型。然而，脑MRI基础模型的发展一直受到预训练数据集规模较小以及局限于高质量研究级数据的域内基准测试的限制。为弥补这一差距，我们在MICCAI 2025上组织了FOMO25挑战赛作为卫星活动。FOMO25为参与者提供了一个大型预训练数据集FOMO60K，并在少样本和域外设置下，使用直接来源于临床工作流程的数据对模型进行评估。任务涵盖梗死分类、脑膜瘤分割和脑年龄回归，并考虑了在FOMO60K上训练的模型（方法赛道）以及使用任何数据训练的模型（开放赛道）。通过标准化的容器化流程，我们对来自十六个团队的十九个基础模型进行了评估。结果表明：（a）自监督预训练提升了模型在域偏移下对临床数据的泛化能力，最强的域外训练模型性能超越了域内训练的监督基线模型。（b）没有单一的预训练目标对所有任务均有益处：MAE（掩码自编码器）有利于分割任务，混合重建-对比目标有利于分类任务。（c）小型预训练模型取得了强劲性能，而扩大模型规模和训练时长所带来的改进并未产生可靠的收益。

摘要 (Abstract)

Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust foundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained out-of-domain surpassing supervised baselines trained in-domain. (b) No single pretraining objective benefits all tasks: MAE favors segmentation, hybrid reconstruction-contrastive objectives favor classification, and (c) strong performance was achieved by small pretrained models, and improvements from scaling model size and training duration did not yield reliable benefits.

Keywords: brain MRI, foundation models, self-supervised learning, clinical data, domain adaptation, FOMO25 challenge, pretraining, medical imaging

199. ❌ UNIGEOCLIP: Unified Geospatial Contrastive Learning

Authors: Guillaume Astruc, Eduard Trulls, Jan Hosang, Loic Landrieu, Paul-Edouard Sarlin Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11668v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

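Scoring rationale: UNIGEOCLIP focuses on geospatial multimodal contrastive learning and proposes a contrastive framework that unifies five geospatial modalities (aerial imagery, street-level views, elevation models, text, and geographic coordinates). All keywords target large models, deep learning techniques, or specific AI application areas, whereas the core of this paper is multimodal representation learning in computer vision and geographic information science, not large-model technology. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because geospatial analysis can be viewed as an application of AI to science (earth science); it is not a core match, so it receives 5 points (some relevance). The remaining keywords concern large-model architecture, training, alignment, reasoning, and agents; they have no direct connection to this paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes UNIGEOCLIP, a unified geospatial multimodal contrastive learning framework that uses all-to-all contrastive alignment and a scaled latitude-longitude encoder to outperform single-modality models and coordinate-only baselines on a range of downstream geospatial tasks.

Abstract (Chinese translation)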

随着航空影像、街景视图、高程模型、文本及地理坐标等多源共位地理空间数据的日益丰富，为多模态表征学习提供了独特机遇。本文提出UNIGEOCLIP——一个大规模多模态对比学习框架，能够在统一的嵌入空间中协同对齐五种互补的地理空间模态。与先前融合模态或依赖中心枢纽表征的方法不同，本方法采用全对全对比对齐策略，支持跨任意模态组合的无缝比较、检索与推理。我们进一步提出一种尺度化经纬度编码器，通过捕捉多尺度地理结构以提升空间表征能力。在多项下游地理空间任务中的实验表明，UNIGEOCLIP在性能上持续超越单模态对比模型与纯坐标基线，凸显了整体性多模态地理空间对齐的优势。参考实现已发布于https://gastruc.github.io/unigeoclip。

Abstract

The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment. A reference implementation is available at https://gastruc.github.io/unigeoclip.
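The all-to-all alignment described in the abstract can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE loss averaged over every pair of modalities; the abstract does not state the exact loss, temperature, or weighting, so `info_nce`, `all_to_all_loss`, the modality names, and the temperature value are hypothetical choices for illustration only.

```python
import math
from itertools import combinations

# Hypothetical modality names for the five modalities listed in the abstract.
MODALITIES = ["aerial", "street", "elevation", "text", "coords"]


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` is the positive for row i of `b`."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(columns))


def all_to_all_loss(batch, temperature=0.07):
    """Average the pairwise loss over every unordered pair of modalities.

    `batch` maps a modality name to a list of embeddings, one per location,
    in the same location order for every modality.
    """
    pairs = list(combinations(sorted(batch), 2))
    return sum(info_nce(batch[p], batch[q], temperature) for p, q in pairs) / len(pairs)
```

With five modalities there are ten unordered pairs, so no single modality acts as a pivot; a pivot-based scheme would instead use only the four pairs that include the pivot.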

Keywords: geospatial, multimodal, contrastive learning, embedding space, aerial imagery, street-level views, latitude-longitude encoder, downstream tasks

200. ❌ GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays

Authors: David Wong, Zeynep Isik, Bin Wang, Marouane Tliba, Gorkem Durak, Elif Keles, Halil Ertugrul Aktas, Aladine Chetouani, Cagdas Topel, Nicolo Gennaro, Camila Lopes Vendrami, Tugce Agirlar Trabzonlu, Amir Ali Rahsepar, Laetitia Perronne, Matthew Antalek, Onural Ozturk, Gokcan Okur, Andrew C. Gordon, Ayis Pyrros, Frank H. Miller, Amir Borhani, Hatice Savas, Eric Hart, Elizabeth Krupinski, Ulas Bagci Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11653v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

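Scoring rationale: The paper's core is evaluating the clinical realism and diagnostic accuracy of AI-generated chest X-rays, comparing eye-tracking data from radiologists with 6 state-of-the-art multimodal LLMs. It is highly relevant to 'Large Language Models' (it explicitly evaluates 6 SOTA multimodal LLMs) and to 'AI for Science' (a medical-imaging AI application); the other keywords, such as MoE, SLMs, training methods, inference optimization, and agent systems, are not covered.

!!! tip deepseek-chat TL;DR

Using the GazeVaLM eye-tracking dataset, the study compares expert radiologists with multimodal large language models on assessing the clinical realism and diagnostic accuracy of AI-generated chest X-rays.

Abstract (Chinese translation)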

我们推出GazeVaLM——一个用于研究胸部放射影像真实性评估过程中临床感知的公开眼动追踪数据集。该数据集包含16位专业放射科医师在两种任务条件下（诊断评估与真伪分类，即视觉图灵测试）解读30张真实和30张合成胸部X光片（由基于扩散模型的生成式人工智能生成）时记录的960条眼动数据。针对每幅图像-观察者配对，我们提供原始注视点样本、注视点分布图、扫描路径、显著性密度图、结构化诊断标签及真实性判断结果。我们将实验框架扩展至6个前沿多模态大语言模型，并公开其在相同条件下生成的诊断预测、真实性标签及置信度分数——从而支持在决策和不确定性层面进行直接的人机对比。我们进一步提供了注视一致性分析、观察者间一致性分析，以及放射科医师与大语言模型在诊断准确性和真实性检测方面的基准测试。GazeVaLM可支持注视建模、临床决策、人机对比、生成式图像真实性评估及不确定性量化等领域的研究。通过同步发布视觉注意力数据、临床标签与模型预测结果，我们旨在推动关于专家与人工智能系统如何感知、解读和评估医学影像的可复现研究。数据集可通过 https://huggingface.co/datasets/davidcwong/GazeVaLM 获取。

Abstract

We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions - enabling direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs in diagnostic accuracy and authenticity detection. GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. By jointly releasing visual attention data, clinical labels, and model predictions, we aim to facilitate reproducible research on how experts and AI systems perceive, interpret, and evaluate medical images. The dataset is available at https://huggingface.co/datasets/davidcwong/GazeVaLM.
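The dataset size quoted in the abstract is consistent with one recording per image-observer pair: 16 radiologists times 60 images gives 960. The sketch below checks that arithmetic and shows one simple way inter-observer consistency on the real-fake judgments could be summarized, namely mean pairwise agreement. The abstract does not say which agreement statistic the authors use or how the two conditions map onto recordings, so `mean_pairwise_agreement` and the label strings are assumptions for illustration.

```python
from itertools import combinations

N_RADIOLOGISTS = 16
N_REAL, N_SYNTHETIC = 30, 30


def expected_recordings(observers, images):
    """Number of image-observer pairs, assuming one recording per pair."""
    return observers * images


def mean_pairwise_agreement(judgments):
    """Average fraction of images on which two observers give the same label.

    `judgments` maps an observer id to a list of labels, with the same image
    order for every observer (hypothetical layout, not the released schema).
    """
    rates = []
    for a, b in combinations(sorted(judgments), 2):
        same = sum(x == y for x, y in zip(judgments[a], judgments[b]))
        rates.append(same / len(judgments[a]))
    return sum(rates) / len(rates)


print(expected_recordings(N_RADIOLOGISTS, N_REAL + N_SYNTHETIC))  # 960
```

Raw agreement does not correct for chance, so a chance-corrected statistic may be preferable when the label distribution is skewed.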

Keywords: eye-tracking, chest radiograph, multimodal LLMs, clinical perception, generative AI, human-AI comparison, diagnostic accuracy, authenticity assessment

201. ❌ STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding

Authors: Wenhao Li, Xueying Jiang, Gongjie Zhang, Xiaoqin Zhang, Ling Shao, Shijian Lu Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11637v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

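Scoring rationale: The paper focuses on 4D point cloud video understanding and proposes STS-Mixer, a new framework that combines spatial, temporal, and spectral representations. Although it belongs to computer vision and deep learning, all scoring keywords relate to large language model (LLM) technology, training methods, inference optimization, alignment, agent systems, and similar topics, none of which the paper touches. It studies geometric feature extraction and spatio-temporal analysis of point cloud videos and has no direct connection to the large-model techniques or AI-for-science applications in the keyword list.

!!! tip deepseek-chat TL;DR

To address the difficulty of capturing geometric features in 4D point cloud video understanding, the paper proposes the STS-Mixer framework, which combines spatial, temporal, and spectral representations and achieves superior performance on 3D action recognition and 4D semantic segmentation.

Abstract (Chinese translation)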

四维点云视频捕捉了场景丰富的时空动态特性，在各式四维理解任务中具有独特价值。然而，现有方法大多在时空域中进行处理，难以捕捉四维点云视频内在的几何特征，导致其表征学习与理解能力受限。我们从互补的谱域视角应对上述挑战。通过将四维点云视频转化为图谱信号，可将其分解为多个频带，每个频带分别捕获点云视频不同的几何结构。我们的谱分析表明：分解后的低频信号更多捕捉粗粒度形状，而高频信号则编码更细粒度的几何细节。基于这些发现，我们设计了时空谱混合器（Spatio-Temporal-Spectral Mixer, STS-Mixer）——一个融合点云视频空间、时间与谱表征的统一框架。STS-Mixer 将多频带划分的谱信号与时空信息相结合，以捕获丰富的几何特征与时间动态，同时实现对四维点云视频细粒度与整体性的理解。大量实验表明，STS-Mixer 在三维动作识别与四维语义分割任务中，于多个广泛采用的基准测试上均取得持续优异的性能。代码与模型已发布于 https://github.com/Vegetebird/STS-Mixer。

Abstract

4D point cloud videos capture rich spatial and temporal dynamics of scenes which possess unique values in various 4D understanding tasks. However, most existing methods work in the spatiotemporal domain where the underlying geometric characteristics of 4D point cloud videos are hard to capture, leading to degraded representation learning and understanding of 4D point cloud videos. We address the above challenge from a complementary spectral perspective. By transforming 4D point cloud videos into graph spectral signals, we can decompose them into multiple frequency bands each of which captures distinct geometric structures of point cloud videos. Our spectral analysis reveals that the decomposed low-frequency signals capture more coarse shapes while high-frequency signals encode more fine-grained geometry details. Building on these observations, we design Spatio-Temporal-Spectral Mixer (STS-Mixer), a unified framework that mixes spatial, temporal, and spectral representations of point cloud videos. STS-Mixer integrates multi-band delineated spectral signals with spatiotemporal information to capture rich geometries and temporal dynamics, while enabling fine-grained and holistic understanding of 4D point cloud videos. Extensive experiments show that STS-Mixer achieves superior performance consistently across multiple widely adopted benchmarks on both 3D action recognition and 4D semantic segmentation tasks. Code and models are available at https://github.com/Vegetebird/STS-Mixer.
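The idea of splitting a graph signal into frequency bands can be shown on a toy example. The sketch below uses a path graph, whose Laplacian eigenvectors are known in closed form (the DCT-II basis), so no eigensolver is needed. The paper works on graphs built from point clouds; the path graph, `path_graph_basis`, `split_bands`, and the cutoff are stand-ins chosen for illustration, not the authors' construction.

```python
import math


def path_graph_basis(n):
    """Orthonormal eigenvectors of the n-node path-graph Laplacian.

    Vector k has eigenvalue 2 - 2*cos(pi*k/n), so the list is ordered from
    the lowest graph frequency (k = 0, constant) to the highest.
    """
    basis = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        basis.append(
            [scale * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n)]
        )
    return basis


def split_bands(signal, cutoff):
    """Split a signal into a low band (frequencies < cutoff) and a high band."""
    n = len(signal)
    basis = path_graph_basis(n)
    # Graph Fourier transform: project the signal onto each eigenvector.
    coeffs = [sum(s * u for s, u in zip(signal, vec)) for vec in basis]

    def synthesize(indices):
        return [sum(coeffs[k] * basis[k][i] for k in indices) for i in range(n)]

    return synthesize(range(cutoff)), synthesize(range(cutoff, n))


signal = [0.0, 0.1, 0.0, 1.0, 1.1, 1.0, 0.0, 0.1]
low, high = split_bands(signal, cutoff=3)
```

Here `low` follows the coarse bump in the signal while `high` keeps the small sample-to-sample variations, mirroring the abstract's observation that low frequencies carry coarse shape and high frequencies carry fine detail. The two bands sum back to the original signal.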

Keywords: 4D point cloud video, spatio-temporal-spectral, graph spectral signals, frequency bands, geometric structures, action recognition, semantic segmentation, STS-Mixer

202. ❌ MorphoFlow: Sparse-Supervised Generative Shape Modeling with Adaptive Latent Relevance

Authors: Mokshagna Sai Teja Karanam, Tushar Kataria, Shireen Elhabian Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11636v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

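Scoring rationale: MorphoFlow focuses on 3D anatomical shape modeling using neural implicit representations, an autodecoder, and autoregressive normalizing flows; it belongs to computer vision and medical image analysis. All keywords relate directly to large language models (LLMs) and associated techniques (training, alignment, inference optimization, agents), while this paper involves neither LLMs nor natural language processing. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is applied to biomedicine (anatomy), a subfield of AI for Science; this is not its core content, so it receives 5 points (some relevance). The other keywords are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes MorphoFlow, a sparse-supervised generative shape modeling framework that combines neural implicit representations, an autodecoder, and autoregressive flows to learn compact probabilistic shape representations from sparse surface annotations, achieving high-resolution reconstruction and recovering structured modes of anatomical variation.

Abstract (Chinese translation)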

统计形状建模（SSM）是解剖结构群体水平分析的核心方法，然而现有方法大多依赖于密集标注的分割结果和固定的潜在表示。这些要求限制了建模复杂解剖变异的可扩展性与灵活性。本文提出MorphoFlow——一种稀疏监督的生成式形状建模框架，能够直接从稀疏表面标注中学习紧凑的概率形状表示。MorphoFlow将神经隐式形状表示与自动解码器架构及自回归归一化流相结合，从而在潜在形状空间上学习具有强表达能力的概率密度分布。神经隐式表示实现了对三维解剖结构分辨率无关的建模，而自动解码器架构支持在稀疏监督下直接优化每个实例的潜在编码。自回归流捕获了潜在解剖变异的分布特性，构建了可处理的、基于似然的形状生成模型。为获得紧凑且结构化的潜在表示，我们通过稀疏诱导先验引入自适应潜在相关性加权机制，使模型能够根据各潜在维度与底层解剖变异的相关性调节其贡献度，同时保持生成表达能力。由此构建的潜在空间支持不确定性量化与解剖学合理的形状合成，且无需人工调整潜在维度。在公开腰椎椎体与股骨数据集上的评估表明，本方法能从稀疏输入实现精确的高分辨率重建，并能恢复与群体水平趋势一致的结构化解剖变异模式。

Abstract

Statistical shape modeling (SSM) is central to population level analysis of anatomical variability, yet most existing approaches rely on densely annotated segmentations and fixed latent representations. These requirements limit scalability and reduce flexibility when modeling complex anatomical variation. We introduce MorphoFlow, a sparse supervised generative shape modeling framework that learns compact probabilistic shape representations directly from sparse surface annotations. MorphoFlow integrates neural implicit shape representations with an autodecoder formulation and autoregressive normalizing flows to learn an expressive probabilistic density over the latent shape space. The neural implicit representation enables resolution-agnostic modeling of 3D anatomy, while the autodecoder formulation supports direct optimization of per-instance latent codes under sparse supervision. The autoregressive flow captures the distribution of latent anatomical variability providing a tractable, likelihood-based generative model of shapes. To promote compact and structured latent representations, we incorporate adaptive latent relevance weighting through sparsity-inducing priors, enabling the model to regulate the contribution of individual latent dimensions according to their relevance to the underlying anatomical variation while preserving generative expressivity. The resulting latent space supports uncertainty quantification and anatomically plausible shape synthesis without manual latent dimensionality tuning. Evaluation on publicly available lumbar vertebrae and femur datasets demonstrates accurate high-resolution reconstruction from sparse inputs and recovery of structured modes of anatomical variation consistent with population level trends.
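The phrase "tractable, likelihood-based" refers to the change-of-variables rule that an autoregressive flow makes cheap to evaluate. The sketch below computes the log-density of a latent code under an affine autoregressive flow with a standard normal base distribution. The abstract does not specify the flow architecture, so the affine form, `autoregressive_logprob`, and `toy_conditioner` are hypothetical and serve only to show the computation.

```python
import math


def std_normal_logpdf(z):
    """Log-density of a one-dimensional standard normal at z."""
    return -0.5 * (z * z + math.log(2 * math.pi))


def autoregressive_logprob(x, conditioner):
    """log p(x) under an affine autoregressive flow.

    `conditioner(prefix)` returns (mu, log_sigma) for the next dimension
    given the preceding ones. Each dimension is mapped to
    z_i = (x_i - mu_i) / sigma_i, and the triangular Jacobian contributes
    -log_sigma_i to the log-density.
    """
    logp = 0.0
    for i, xi in enumerate(x):
        mu, log_sigma = conditioner(x[:i])
        z = (xi - mu) * math.exp(-log_sigma)
        logp += std_normal_logpdf(z) - log_sigma
    return logp


def toy_conditioner(prefix):
    """Hypothetical conditioner: mean is half the sum of earlier dimensions, unit scale."""
    return 0.5 * sum(prefix), 0.0
```

In a learned model the conditioner would be a neural network; the likelihood computation itself stays a single pass over the dimensions, which is what makes maximum-likelihood training of the latent prior feasible.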

Keywords: Statistical shape modeling, Sparse supervision, Generative shape modeling, Neural implicit representation, Autoregressive normalizing flows, Adaptive latent relevance, Anatomical variability, 3D anatomy reconstruction

203. ❌ POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

Authors: Haicheng Wang, Yuan Liu, Yikun Liu, Zhemeng Yu, Zhongyin Zhao, Yangxiu You, Zilin Yu, Le Tian, Xiao Zhou, Jie Zhou, Weidi Xie, Yanfeng Wang Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11627v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

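Scoring rationale: The paper studies the scalability challenges of multimodal large language models (MLLMs) in long-video and streaming scenarios and proposes the POINTS-Long model. Highly relevant keywords: (1) 'Large Language Models' (the paper studies MLLMs, which are large models; weight 1.0, relevance 10); (2) 'Context Window Extension' (the paper tackles long visual sequences from long videos, which involves long-context processing; weight 1.0, relevance 10); (3) 'KV Cache Compression' (the paper proposes a dynamically detachable KV-cache design to support streaming visual understanding; weight 1.0, relevance 10). Other keywords, such as MoE, SLMs, Scaling Laws, training methods, reasoning techniques, and AI for Science, are not mentioned in the abstract and are unrelated to the paper; their relevance is 0.

!!! tip deepseek-chat TL;DR

To address the scalability challenge caused by rapidly growing visual token sequences in long-video and streaming scenarios, the paper proposes POINTS-Long, which uses dynamic visual token scaling and a detachable KV-cache design to substantially reduce computation while maintaining high accuracy, enabling adaptive and efficient long-form visual understanding.

Abstract (Chinese translation)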

多模态大语言模型（MLLMs）近期在跨模态理解与生成方面展现出卓越能力。然而，视觉标记序列的快速增长——尤其是在长视频与流式场景中——对其可扩展性与实际部署构成了重大挑战。为此，我们提出了POINTS-Long，一种受人类视觉系统启发、具备动态视觉标记缩放能力的原生双模态MLLM。该模型支持两种互补的感知模式：聚焦模式与待机模式，使用户能够在推理过程中动态权衡效率与精度。在细粒度视觉任务上，聚焦模式保持了最佳性能；而在长篇幅通用视觉理解任务中，待机模式仅需使用1/40至1/10的视觉标记即可保留原始精度的97.7-99.7%。此外，POINTS-Long通过动态可分离的KV缓存（KV-cache）设计，原生支持流式视觉理解，能够高效维护超长视觉记忆。本研究为未来MLLMs的设计提供了新思路，并为自适应、高效的长篇幅视觉理解奠定了基础。

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences, especially in long-video and streaming scenarios, poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.
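To make the 1/40 to 1/10 token range concrete, the sketch below computes the visual token count for a long video at both ends of the range. The frame count and tokens-per-frame figure are hypothetical values chosen for round numbers; the abstract gives only the ratios, and `standby_token_budget` is an illustrative helper, not part of the model.

```python
import math
from fractions import Fraction


def standby_token_budget(frames, tokens_per_frame, keep_ratio):
    """Visual tokens kept when only `keep_ratio` of the full sequence is retained."""
    full = frames * tokens_per_frame
    return math.ceil(full * keep_ratio)


# Hypothetical video: 1000 sampled frames at 256 visual tokens per frame.
FRAMES, TOKENS_PER_FRAME = 1000, 256

full = FRAMES * TOKENS_PER_FRAME
low = standby_token_budget(FRAMES, TOKENS_PER_FRAME, Fraction(1, 40))
high = standby_token_budget(FRAMES, TOKENS_PER_FRAME, Fraction(1, 10))
print(full, low, high)  # 256000 6400 25600
```

Under these assumed numbers, standby mode would process between 6,400 and 25,600 visual tokens instead of 256,000, which is where the efficiency gain for long-form inputs comes from.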

Keywords: Multimodal Large Language Models, MLLMs, long-video understanding, visual token scaling, KV-cache, streaming visual understanding, adaptive efficiency, dual-mode perception

204. ❌ Learning Robustness at Test-Time from a Non-Robust Teacher

Authors: Stefano Bianchettin, Giulio Rossolini, Giorgio Buttazzo Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11590v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

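Scoring rationale: The paper studies test-time adaptation of pretrained models for adversarial robustness, which mainly involves pretraining and domain adaptation concepts, so it has some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points). However, it focuses on computer vision (CIFAR-10, ImageNet) rather than large language models and does not involve the techniques or applications behind the other keywords, which therefore receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how to use a non-robust pretrained teacher model to improve adversarial robustness at test time by adapting on unlabeled target data, and proposes a label-free framework that uses the teacher's predictions as a semantic anchor; experiments show it outperforms existing baselines in optimization stability, parameter sensitivity, and the robustness-accuracy trade-off.

Abstract (Chinese translation)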

当前，预训练模型日益被用作通用主干网络，并在测试时针对目标数据稀缺且无标注的下游环境进行适配。尽管该范式已被证明能有效提升目标域上的洁净准确率，但对抗鲁棒性却鲜受关注，尤其在原始预训练模型并非显式设计为鲁棒模型的情况下。这引出了一个实际问题：一个预训练的非鲁棒模型能否在测试时进行适配，以提升其在目标分布上的对抗鲁棒性？为应对此问题，本研究探讨了将对抗训练策略整合到无监督测试时适配方案中的行为，该场景下仅能获取少量无标注目标样本。研究首先分析了经典对抗训练框架如何扩展至此场景，结果表明，基于蒸馏的直接适配方法仍不稳定，且对超参数调优高度敏感，尤其在教师模型本身不具备鲁棒性时。为克服这些局限，本文提出了一种无标签框架，该框架在适配过程中使用非鲁棒教师模型的预测结果作为洁净目标与对抗目标的语义锚点。我们进一步提供了理论分析，表明相较于经典对抗训练中常用的基于自一致性的正则化方法，我们的框架能产生更稳定的替代方案。实验在CIFAR-10和ImageNet数据集上，通过引入的光度变换进行评估。结果支持了理论分析，表明在所研究的部署后测试时设定中，相较于现有基线方法，所提方案实现了更优的优化稳定性、对参数选择的更低敏感性以及更好的鲁棒性与准确性的权衡。

Abstract

Nowadays, pretrained models are increasingly used as general-purpose backbones and adapted at test-time to downstream environments where target data are scarce and unlabeled. While this paradigm has proven effective for improving clean accuracy on the target domain, adversarial robustness has received far less attention, especially when the original pretrained model is not explicitly designed to be robust. This raises a practical question: can a pretrained, non-robust model be adapted at test-time to improve adversarial robustness on a target distribution? To address this question, this work studies how adversarial training strategies behave when integrated into adaptation schemes for the unsupervised test-time setting, where only a small set of unlabeled target samples is available. It first analyzes how classical adversarial training formulations can be extended to this scenario, showing that straightforward distillation-based adaptations remain unstable and highly sensitive to hyperparameter tuning, particularly when the teacher itself is non-robust. To address these limitations, the work proposes a label-free framework that uses the predictions of a non-robust teacher model as a semantic anchor for both the clean and adversarial objectives during adaptation. We further provide theoretical insights showing that our formulation yields a more stable alternative to the self-consistency-based regularization commonly used in classical adversarial training. Experiments evaluate the proposed approach on CIFAR-10 and ImageNet under induced photometric transformations. The results support the theoretical insights by showing that the proposed approach achieves improved optimization stability, lower sensitivity to parameter choices, and a better robustness-accuracy trade-off than existing baselines in this post-deployment test-time setting.
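The contrast between a self-consistency regularizer and a teacher-anchored objective can be sketched on plain probability vectors. The code below assumes KL divergence for both terms and a TRADES-style regularizer as the self-consistency example; the abstract does not give the exact losses, so `self_consistency_loss`, `teacher_anchored_loss`, and the weighting `beta` are illustrative assumptions, not the paper's formulation.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def self_consistency_loss(student_clean, student_adv, beta=1.0):
    """TRADES-style term: the adversarial output is pulled toward the
    student's own clean output, with no external reference."""
    return beta * kl(student_clean, student_adv)


def teacher_anchored_loss(teacher_clean, student_clean, student_adv, beta=1.0):
    """Label-free sketch: the teacher's clean prediction anchors both the
    clean and the adversarial student outputs."""
    return kl(teacher_clean, student_clean) + beta * kl(teacher_clean, student_adv)


teacher = [0.7, 0.2, 0.1]
drifted = [0.1, 0.2, 0.7]  # student that agrees with itself but not with the teacher
```

In this toy case, `self_consistency_loss(drifted, drifted)` is zero even though the student has moved away from the teacher, while `teacher_anchored_loss(teacher, drifted, drifted)` is positive. That is one intuition for why a fixed anchor might stabilize adaptation, offered here as an illustration and not as the paper's theoretical argument.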

Keywords: pretrained models, test-time adaptation, adversarial robustness, non-robust teacher, label-free framework, optimization stability, CIFAR-10, ImageNet

205. ❌ MLLM-as-a-Judge Exhibits Model Preference Bias

Authors: Shuitsu Koyama, Yuiga Wada, Daichi Yashima, Komei Sugiura Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11589v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

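Scoring rationale: The paper studies model preference bias when multimodal large language models (MLLMs) are used as evaluators; its core concerns evaluation methods for large language models (LLMs), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not involve the specific techniques behind the other keywords (such as MoE, quantization, or inference acceleration) or their application areas (such as bioinformatics), so the other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The study finds that multimodal large language models exhibit model preference bias when used as automatic evaluators, and proposes a simple ensemble method that effectively mitigates this bias.

Abstract (Chinese translation)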

利用多模态大语言模型（MLLMs）进行自动评估，通常被称为“MLLM-as-a-Judge”，已被广泛用于衡量模型性能。若此类MLLM-as-a-Judge方法存在偏见，则可能扭曲模型比较结果及以基准测试驱动的科学进展。然而，目前尚不清楚MLLM-as-a-Judge方法在多大程度上倾向于或排斥特定MLLM生成的文本。在本研究中，我们提出Philautia-Eval方法来探究此类模型特异性偏好偏见。Philautia-Eval通过将偏好倾向与生成质量差异分离，量化了这种偏见的程度。基于从12个MLLM收集的129万条图文-评分配对数据，我们发现代表性MLLM往往表现出自我偏好偏见。此外，实验结果表明特定模型家族内部存在相互偏好偏见，这可能是由共享的连接器组件和重叠的指令微调资源所驱动的。最后，我们引入一种简单的MLLM集成方法Pomms。结果显示，Pomms在保持性能的同时，有效缓解了模型特异性偏好偏见。

Abstract

Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.
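A naive baseline helps show what self-preference bias and a judge ensemble look like in code. The sketch below measures how much higher a judge scores its own model's captions than the other judges do, and averages judges per item as a simple ensemble. This is not Philautia-Eval or Pomms: the abstract does not describe how quality is disentangled from preference or how Pomms aggregates, so `self_preference_gap`, `ensemble_scores`, and the score layout are assumptions.

```python
from statistics import mean


def self_preference_gap(scores, judge):
    """Judge's mean score for its own outputs minus the other judges' mean.

    `scores[j][m]` is the list of scores judge `j` gave to outputs of model
    `m`, with the same items in the same order for every judge.
    """
    own = mean(scores[judge][judge])
    others = [mean(scores[j][judge]) for j in scores if j != judge]
    return own - mean(others)


def ensemble_scores(scores, model):
    """Per-item average over all judges for one model's outputs."""
    per_judge = [scores[j][model] for j in scores]
    return [mean(item) for item in zip(*per_judge)]


# Toy data: two models that also act as judges, two captions each.
toy_scores = {
    "A": {"A": [9, 8], "B": [6, 5]},
    "B": {"A": [7, 6], "B": [8, 7]},
}
print(self_preference_gap(toy_scores, "A"))  # 2.0
print(ensemble_scores(toy_scores, "A"))      # [8, 7]
```

Using the other judges as a quality reference is itself imperfect when judges share a model family, which is the mutual preference effect the abstract reports.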

Keywords: Multimodal Large Language Models, MLLM-as-a-Judge, model preference bias, automatic evaluation, Philautia-Eval, self-preference bias, ensemble method, Pomms

206. ❌ Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions

Authors: Seongyu Kim, Seungwoo Lee, Hyeonggon Ryu, Joon Son Chung, Arda Senocak Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11579v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

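Rationale: This paper studies tactile-driven visual localization of material regions, which belongs to multimodal (visuo-tactile) perception and computer vision. It is unrelated to every scoring keyword, all of which target large models, deep learning techniques, and their applications. The paper does not touch on large models, language models, training techniques, reasoning methods, agent systems, or AI for Science.

!!! tip deepseek-chat TL;DR

The paper addresses tactile localization with a model that learns local visuo-tactile alignment through dense cross-modal feature interactions, substantially outperforming prior visuo-tactile methods.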

Abstract

We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.

Keywords: tactile localization, visuo-tactile alignment, material segmentation, cross-modal feature interaction, tactile saliency maps, multi-material scenes, material-diversity pairing
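
The abstract describes tactile saliency maps produced from dense cross-modal feature interactions. The following is a minimal sketch of that idea, assuming the interaction reduces to cosine similarity between a single tactile embedding and every visual patch feature. The function name `tactile_saliency`, the tensor shapes, and the softmax temperature are hypothetical and are not taken from the paper.

```python
import numpy as np

def tactile_saliency(patch_feats, touch_emb, temperature=0.1):
    """Return an (H, W) saliency map that sums to 1.

    patch_feats: (H, W, D) dense visual features (hypothetical shape).
    touch_emb:   (D,) tactile embedding (hypothetical shape).
    """
    # L2-normalise both modalities so the dot product is a cosine similarity.
    p = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + 1e-8)
    t = touch_emb / (np.linalg.norm(touch_emb) + 1e-8)
    sim = p @ t                      # (H, W) similarity per patch
    logits = sim / temperature
    logits = logits - logits.max()   # numerical stability before exp
    weights = np.exp(logits)
    return weights / weights.sum()   # softmax over all spatial positions
```

Thresholding the returned map would give a rough touch-conditioned material mask; the paper's actual model is trained end to end and is likely more elaborate than this similarity lookup.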

207. ❌ GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth

Authors: Krishna Jaganathan, Patricio Vela Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11585v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

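Rationale: This paper focuses on RGB-D semantic segmentation in computer vision and proposes GeomPrompt, a geometric prompt learning method for missing or degraded depth. Its core techniques are cross-modal adaptation, geometric prompt learning, lightweight module design, and improved segmentation performance. The scoring keywords, however, all center on large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on) or on AI applications in specific scientific fields (such as bioinformatics). The paper's subject matter (RGB-D perception, semantic segmentation, geometric prompts) has no direct connection to these keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

Depth in RGB-D robot perception is often missing or degraded. The paper proposes GeomPrompt, a geometric prompt learning method that synthesizes a geometric prompt from the RGB image to compensate for the missing depth, improving RGB-D semantic segmentation by up to +6.1 mIoU over RGB-only inference on SUN RGB-D.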

Abstract

Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone for the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. We further introduce GeomPrompt-Recovery, an adaptation module that compensates for degraded depth by predicting the fourth channel correction relevant for the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, enabling recovery of the geometric prior useful for segmentation, rather than estimating depth signals. On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt-Recovery consistently improves robustness, yielding gains up to +3.6 mIoU under severe depth corruptions. GeomPrompt is also substantially more efficient than monocular depth baselines, reaching 7.8 ms latency versus 38.3 ms and 71.9 ms. These results suggest that task-driven geometric prompting is an efficient mechanism for cross-modal compensation under missing and degraded depth inputs in RGB-D perception.

Keywords: RGB-D semantic segmentation, Geometric prompt learning, Cross-modal adaptation, Missing depth compensation, Degraded depth recovery, Lightweight module, SUN RGB-D dataset, mIoU improvement

208. ❌ Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models

Authors: Songlong Xing, Weijie Wang, Zhengyu Zhao, Jindong Gu, Philip Torr, Nicu Sebe Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11576v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

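Rationale: The paper studies adversarially robust finetuning of a vision-language model (CLIP). It belongs to computer vision and multimodal learning rather than to large language models (LLMs) or innovations in deep learning fundamentals. Keyword relevance: (1) highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (8 points), because the core contribution is AdvFLYP, an adversarial finetuning paradigm that improves CLIP's robustness; (2) somewhat relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points), because the method borrows the data distribution and learning objective of CLIP pretraining; (3) all other keywords are unrelated (0 points), because the paper does not involve LLMs, MoE, alignment, reasoning, agents, quantization, or similar techniques, nor does it focus on scientific applications.

!!! tip deepseek-chat TL;DR

CLIP-style vision-language models are vulnerable to adversarial attacks in zero-shot settings. The paper proposes AdvFLYP, an adversarial finetuning paradigm that follows the pretraining recipe: it finetunes on web image-text pairs with a contrastive loss, improving adversarial robustness and zero-shot capability on 14 downstream datasets spanning multiple domains.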

Abstract

Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance its adversarial robustness, recent studies finetune the pretrained vision encoder of CLIP with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm AdvFLYP, which follows the training recipe of CLIP’s pretraining process when performing adversarial finetuning to the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created based on image-text pairs collected from the web, and match them with their corresponding texts via a contrastive loss. To alleviate distortion of adversarial image embeddings of noisy web images, we further propose to regularise AdvFLYP by penalising deviation of adversarial image features. We show that logit- and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show the superiority of our paradigm over mainstream practices. Our code and model weights are released at https://github.com/Sxing2/AdvFLYP.

Keywords: vision-language models, adversarial robustness, zero-shot learning, fine-tuning, CLIP, contrastive loss, image-text pairs, domain transferability
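
The abstract says AdvFLYP matches adversarial images with their texts via a contrastive loss, following CLIP's pretraining recipe. As a rough illustration, the sketch below computes a CLIP-style symmetric contrastive loss over a batch of paired embeddings. It assumes the image embeddings already come from adversarially perturbed inputs; the attack itself and the paper's logit-level and feature-level regularisers are omitted. The function name `clip_contrastive_loss` and the temperature of 0.07 are assumptions for illustration.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of (image, text) pairs.

    img_emb, txt_emb: (B, D) arrays; row i of each forms a positive pair.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) pairwise similarities
    idx = np.arange(logits.shape[0])
    loss_i2t = -_log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -_log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

The loss is low when each image embedding is closest to its own caption and high when pairs are mismatched, which is the signal the finetuning would minimise on adversarial images.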

209. ❌ Training-Free Model Ensemble for Single-Image Super-Resolution via Strong-Branch Compensation

Authors: Gengjia Chang, Xining Ge, Weijun Yuan, Zhan Li, Qiurong Song, Luen Zhu, Shuhong Liu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11564v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

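Rationale: This paper focuses on single-image super-resolution (SISR) and proposes a training-free model ensemble method. Although it applies deep learning models (a Transformer-based network and MambaIRv2) to computer vision, every keyword concerns large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on) or AI applications in specific scientific fields (such as bioinformatics). The paper involves no LLM technique, principle, or application and no scientific AI application such as bioinformatics or cheminformatics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a training-free dual-branch ensemble framework that uses strong-branch compensation to improve single-image super-resolution. As an NTIRE 2026 Image Super-Resolution (×4) Challenge solution, it consistently improves over the base branch and slightly exceeds the strong branch alone in PSNR.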

Abstract

Single-image super-resolution has progressed from deep convolutional baselines to stronger Transformer and state-space architectures, yet the corresponding performance gains typically come with higher training cost, longer engineering iteration, and heavier deployment burden. In many practical settings, multiple pretrained models with partially complementary behaviors are already available, and the binding constraint is no longer architectural capacity but how effectively their outputs can be combined without additional training. Rather than pursuing further architectural redesign, this paper proposes a training-free output-level ensemble framework. A dual-branch pipeline is constructed in which a Hybrid attention network with TLC inference provides stable main reconstruction, while a MambaIRv2 branch with geometric self-ensemble supplies strong compensation for high-frequency detail recovery. The two branches process the same low-resolution input independently and are fused in the image space via a lightweight weighted combination, without updating any model parameters or introducing an additional trainable module. As our solution to the NTIRE 2026 Image Super-Resolution ($\times 4$) Challenge, the proposed design consistently improves over the base branch and slightly exceeds the pure strong branch in PSNR at the best operating point under a unified DIV2K bicubic $\times 4$ evaluation protocol. Ablation studies confirm that output-level compensation provides a low-overhead and practically accessible upgrade path for existing super-resolution systems.

Keywords: single-image super-resolution, training-free ensemble, dual-branch pipeline, hybrid attention network, MambaIRv2, output-level compensation, NTIRE 2026 challenge, image reconstruction
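
The abstract describes fusing the two branch outputs in image space with a lightweight weighted combination and reporting PSNR. A minimal sketch of that fusion step follows, assuming both branch outputs are float arrays in [0, 1] with identical shapes. The names `fuse_outputs` and `psnr` and the weight `alpha` are hypothetical; the abstract does not state the weights the authors used, and TLC inference and geometric self-ensemble are not modelled here.

```python
import numpy as np

def fuse_outputs(base_sr, strong_sr, alpha):
    """Weighted image-space fusion of two super-resolved outputs.

    alpha = 0 keeps only the base branch, alpha = 1 only the strong branch.
    No model parameters are involved, so the step is training-free.
    """
    return np.clip((1.0 - alpha) * base_sr + alpha * strong_sr, 0.0, 1.0)

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

In practice one would sweep `alpha` on a validation set and keep the value with the highest PSNR, which is presumably what "best operating point" refers to in the abstract.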

210. ❌ The Impact of Federated Learning on Distributed Remote Sensing Archives

Authors: Anand Umashankar, Karam Tomotaki-Dawoud, Nicolai Schneider Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11562v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

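Rationale: The paper studies federated learning for remote sensing image classification, focusing on training over distributed data, non-IID data challenges, and algorithm comparison. The keywords all concern innovations in large-model or deep learning techniques or specific applications, whereas the paper uses only conventional CNN architectures and federated learning algorithms and touches none of the listed techniques (large models, MoE, quantization, inference acceleration, alignment, RAG, and so on). The only related keyword is "AI for Science", because remote sensing is a scientific application domain; since the paper does not involve bioinformatics or cheminformatics, it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper systematically evaluates three federated learning algorithms (FedAvg, FedProx, and BSP) on multi-label remote sensing image classification under non-IID label distributions. FedProx outperforms FedAvg on deeper architectures, BSP approaches centralized accuracy at high communication cost, and LeNet offers the best accuracy-communication trade-off.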

Abstract

Remote sensing archives are inherently distributed: Earth observation missions such as Sentinel-1, Sentinel-2, and Sentinel-3 have collectively accumulated more than 5 petabytes of imagery, stored and processed across many geographically dispersed platforms. Training machine learning models on such data in a centralized fashion is impractical due to data volume, sovereignty constraints, and geographic distribution. Federated learning (FL) addresses this by keeping data local and exchanging only model updates. A central challenge for remote sensing is the non-IID nature of Earth observation data: label distributions vary strongly by geographic region, degrading the convergence of standard FL algorithms. In this paper, we conduct a systematic empirical study of three FL strategies – FedAvg, FedProx, and bulk synchronous parallel (BSP) – applied to multi-label remote sensing image classification under controlled non-IID label-skew conditions. We evaluate three convolutional neural network (CNN) architectures of increasing depth (LeNet, AlexNet, and ResNet-34) and analyze the joint effect of algorithm choice, model capacity, client fraction, client count, batch size, and communication cost. Experiments on the UC Merced multi-label dataset show that FedProx outperforms FedAvg for deeper architectures under data heterogeneity, that BSP approaches centralized accuracy at the cost of high sequential communication, and that LeNet provides the best accuracy-communication trade-off for the dataset scale considered.

Keywords: Federated Learning, Remote Sensing, Non-IID Data, Image Classification, FedProx, Convolutional Neural Networks, Distributed Training, Communication Cost
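
The abstract compares FedAvg and FedProx, which differ in how clients are kept close to the global model. The sketch below shows the two standard ingredients: FedAvg's sample-size-weighted parameter average on the server, and FedProx's extra proximal gradient term on the client. Model parameters are flattened into one NumPy vector for simplicity; the function names and the value of `mu` are illustrative assumptions, and BSP is not covered.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server step: average client parameters, weighted by local sample count."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coeffs, client_weights))

def fedprox_gradient(local_grad, local_w, global_w, mu=0.01):
    """Client step: gradient of the local loss plus the FedProx proximal term.

    FedProx adds (mu / 2) * ||w - w_global||^2 to each client's objective,
    which contributes mu * (w - w_global) to the gradient and discourages
    drift under non-IID data.
    """
    return local_grad + mu * (local_w - global_w)
```

With `mu = 0` the client update reduces to plain FedAvg local training, which makes the proximal term easy to ablate.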

211. ❌ Progressively Texture-Aware Diffusion for Contrast-Enhanced Sparse-View CT

Authors: Tianqi Wang, Wenchao Du, Hongyu Yang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11559v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

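Rationale: This paper applies diffusion models to medical imaging (sparse-view CT reconstruction), which falls under AI for Science, so it has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). It does not involve large language models (LLMs), innovations in deep learning fundamentals (such as MoE, scaling laws, or finetuning methods), reasoning techniques (such as CoT or agents), model optimization (such as quantization or acceleration), or any other keyword, so all remaining keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a Progressively Texture-aware Diffusion (PTD) model, a coarse-to-fine learning framework that tackles the challenge of recovering reliable image content and visually consistent textures in sparse-view CT reconstruction, achieving better structural similarity and visual quality with only a few sampling steps.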

Abstract

Diffusion-based sparse-view CT (SVCT) imaging has achieved remarkable advancements in recent years, thanks to its more stable generative capability. However, recovering reliable image content and visually consistent textures is still a crucial challenge. In this paper, we present a Progressively Texture-aware Diffusion (PTD) model, a coarse-to-fine learning framework tailored for SVCT. Specifically, PTD comprises a basic reconstructive module PTD$_{\textit{rec}}$ and a conditional diffusion module PTD$_{\textit{diff}}$. PTD$_{\textit{rec}}$ first learns a deterministic mapping to recover the majority of the underlying low-frequency signals (i.e., coarse content with smoothed textures), which serves as the initial estimation to enable fidelity. Moreover, PTD$_{\textit{diff}}$ aims to reconstruct high-fidelity details for coarse prediction, which explores a dual-domain guided conditional diffusion to generate reliable and consistent textures. Extensive experiments on sparse-view CT reconstruction demonstrate that our PTD achieves superior performance in terms of structure similarity and visual appeal with only a few sampling steps, which mitigates the randomness inherent in general diffusion models and enables a better trade-off between visual quality and fidelity of high-frequency details.

Keywords: Sparse-view CT, Diffusion model, Texture-aware, Coarse-to-fine learning, Image reconstruction, Medical imaging, Conditional diffusion, High-fidelity details

212. ❌ Continuous Adversarial Flow Models

Authors: Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11521v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

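Rationale: The paper proposes an adversarial training method for continuous-time flow models, applied mainly to image generation (ImageNet 256px and text-to-image generation). Keyword relevance: (1) the only related keyword is "Post-training OR Supervised Fine-tuning OR SFT", because the paper states that the method is "primarily proposed for post-training existing flow-matching models" and shows that post-training substantially improves performance (for example, FID drops from 8.26 to 3.63), hence 8 points (some relevance, but not the core content); (2) all other keywords are unrelated: the paper focuses on adversarial training of generative (flow) models and does not cover large language models (LLMs), MoE, reasoning techniques, alignment, agents, scientific AI applications, or similar topics, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes training continuous-time flow models with an adversarial objective, introducing a learned discriminator to improve on the training of existing flow-matching models. It substantially improves sample quality on ImageNet image generation (for example, the guidance-free FID of SiT drops from 8.26 to 3.63) and improves results on text-to-image generation.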

Abstract

We propose continuous adversarial flow models, a type of continuous-time flow model trained with an adversarial objective. Unlike flow matching, which uses a fixed mean-squared-error criterion, our approach introduces a learned discriminator to guide training. This change in objective induces a different generalized distribution, which empirically produces samples that are better aligned with the target data distribution. Our method is primarily proposed for post-training existing flow-matching models, although it can also train models from scratch. On the ImageNet 256px generation task, our post-training substantially improves the guidance-free FID of latent-space SiT from 8.26 to 3.63 and of pixel-space JiT from 7.17 to 3.57. It also improves guided generation, reducing FID from 2.06 to 1.53 for SiT and from 1.86 to 1.80 for JiT. We further evaluate our approach on text-to-image generation, where it achieves improved results on both the GenEval and DPG benchmarks.

Keywords: continuous adversarial flow models, flow model, adversarial training, post-training, image generation, text-to-image generation, FID improvement
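
The abstract contrasts the proposed adversarial objective with flow matching's fixed mean-squared-error criterion. For orientation, the sketch below shows that baseline criterion only, not the adversarial method. It assumes the common linear interpolation path between a noise sample `x0` and a data sample `x1`; SiT and JiT may use other interpolants, and `velocity_fn` stands in for a hypothetical velocity network.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x0, x1, t):
    """MSE flow-matching loss on a linear path (an assumed interpolant).

    x0: (B, ...) noise samples, x1: (B, ...) data samples, t: (B,) in [0, 1].
    velocity_fn(x_t, t) is a hypothetical model predicting the velocity.
    """
    # Broadcast t over the non-batch dimensions.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight path at time t
    target = x1 - x0                # constant velocity of that path
    return float(np.mean((velocity_fn(x_t, t) - target) ** 2))
```

According to the abstract, the adversarial variant replaces this fixed regression criterion with feedback from a learned discriminator.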

213. ❌ TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition

Authors: Imtiaz Ul Hassan, Nik Bessis, Ardhendu Behera Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11498v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

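Rationale: The paper focuses on fine-grained human action recognition (FHAR) in computer vision and proposes TAG-Head, a lightweight spatio-temporal graph head that improves standard 3D backbones (such as SlowFast and I3D). The work centers on video understanding, spatio-temporal modeling, graph neural networks, and Transformer encoders, aiming for high performance from RGB-only input with less reliance on extra modalities (such as pose or text). The scoring keywords all relate directly to large language models (LLMs), innovations in deep learning fundamentals (such as MoE, scaling laws, alignment, and reasoning methods), or AI applications in science (such as bioinformatics). The paper involves no LLM technique, LLM training or inference method, agent system, or scientific AI application, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

Fine-grained human action recognition often relies on extra modalities, which raises computational cost. The paper proposes TAG-Head, a lightweight, plug-and-play spatio-temporal graph head that improves standard 3D backbones using RGB input only and sets a new state of the art among RGB-only models on FineGym (Gym99 and Gym288) and HAA500.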

Abstract

Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ by subtle spatio-temporal cues. Many recent systems enhance discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph in which (i) fully connected intra-frame edges resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges connect features at the same spatial location across frames to stabilise motion cues without over-smoothing. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. Extensive evaluations on FineGym (Gym99 and Gym288) and HAA500 show that TAG-Head sets a new state-of-the-art among RGB-only models and surpasses many recent multimodal approaches (video + pose + text) that rely on privileged information. Ablations disentangle the contributions of the Transformer and the graph topology, and complexity analyses confirm low latency. TAG-Head advances FHAR by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity inside a slim, composable graph head. The simplicity of the design enables straightforward adoption in practical systems that favour RGB-only sensors, while delivering performance gains typically associated with heavier or multimodal models. Code will be released on GitHub.
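The graph topology described in the abstract can be illustrated with a small sketch. This is not the authors' code (which had not been released at the time of this report); the function name `build_tag_edges`, the node indexing, and the choice to link only consecutive frames with temporal edges are assumptions made for illustration.

```python
def build_tag_edges(num_frames, num_positions):
    """Return undirected edges (i, j), i < j, over num_frames * num_positions nodes.

    Node index = frame * num_positions + spatial_position (hypothetical layout).
    """
    def node(t, s):
        return t * num_positions + s

    edges = set()
    # (i) Fully connected intra-frame edges: every pair of tokens in a frame.
    for t in range(num_frames):
        for a in range(num_positions):
            for b in range(a + 1, num_positions):
                edges.add((node(t, a), node(t, b)))
    # (ii) Time-aligned temporal edges: same spatial position across frames.
    # Assumption: only consecutive frames are linked.
    for t in range(num_frames - 1):
        for s in range(num_positions):
            edges.add((node(t, s), node(t + 1, s)))
    return edges


edges = build_tag_edges(num_frames=3, num_positions=4)
print(len(edges))
```

Under this layout, a token is never connected to a different spatial position in another frame, which matches the abstract's description of temporal edges as time-aligned.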

Keywords: Fine-grained Action Recognition, Spatio-temporal Graph, Transformer Encoder, RGB-only, Plug-and-play, 3D Backbones, Video Understanding, Lightweight Model

214. ❌ NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild

Authors: Aleksandr Gushchin, Khaled Abud, Ekaterina Shumitskaya, Artem Filippov, Georgii Bychkov, Sergey Lavrushkin, Mikhail Erofeev, Anastasia Antsiferova, Changsheng Chen, Shunquan Tan, Radu Timofte, Dmitry Vatolin, Chuanbiao Song, Zijian Yu, Hao Tan, Jun Lan, Zhiqiang Yang, Yongwei Tang, Zhiqiang Wu, Jia Wen Seow, Hong Vin Koay, Haodong Ren, Feng Xu, Shuai Chen, Ruiyang Xia, Qi Zhang, Yaowen Xu, Zhaofan Zou, Hao Sun, Dagong Lu, Mufeng Yao, Xinlei Xu, Fei Wu, Fengjun Guo, Cong Luo, Hardik Sharma, Aashish Negi, Prateek Shaily, Jayant Kumar, Sachin Chaudhary, Akshay Dudhane, Praful Hambarde, Amit Shukla, Zhilin Tu, Fengpeng Li, Jiamin Zhang, Jianwei Fei, Kemou Li, Haiwei Wu, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Chenfan Qu, Junchi Li Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11487v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

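Scoring rationale: This paper is a report on a computer vision challenge on AI-generated image detection. It mainly covers image detection models, dataset construction, and robustness evaluation, whereas all scoring keywords focus on large language models (LLMs) and related techniques (e.g., training methods, inference optimization, agent systems). The paper does not involve LLMs, deep learning principles, or LLM applications in science, so every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper presents the NTIRE 2026 challenge, which aims to develop robust detection models that distinguish real images from AI-generated ones, and evaluates the solutions of 20 teams on a new dataset covering many generators and image transformations.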

Abstract (Chinese translation)

本文概述了与CVPR 2026 NTIRE研讨会联合举办的“NTIRE 2026野外鲁棒AI生成图像检测挑战赛”。该挑战赛的目标是开发能够在真实场景中区分真实图像与生成图像的检测模型：在实际使用中，图像常经过变换（裁剪、调整大小、压缩、模糊等），因此检测模型应对此类变换具备鲁棒性。本次挑战基于一个新颖的数据集，该数据集包含108,750张真实图像和185,750张AI生成图像，这些图像来自42种生成器，涵盖了各种架构的大量开源和闭源模型，并辅以36种图像变换进行增强。所有方法均在完整测试集（包括经过变换和未变换的图像）上使用ROC AUC进行评估。共有511名参与者注册，20支团队提交了有效的最终解决方案。本报告全面概述了本次挑战，描述了所提出的解决方案，可为研究人员和从业者提升检测模型对现实世界变换的鲁棒性提供有价值的参考。

Abstract

This paper presents an overview of the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, held in conjunction with the NTIRE workshop at CVPR 2026. The goal of this challenge was to develop detection models capable of distinguishing real images from generated ones in realistic scenarios: the images are often transformed (cropped, resized, compressed, blurred) for practical usage, and therefore, the detection models should be robust to such transformations. The challenge is based on a novel dataset consisting of 108,750 real and 185,750 AI-generated images from 42 generators comprising a large variety of open-source and closed-source models of various architectures, augmented with 36 image transformations. Methods were evaluated using ROC AUC on the full test set, including both transformed and untransformed images. A total of 511 participants registered, with 20 teams submitting valid final solutions. This report provides a comprehensive overview of the challenge, describes the proposed solutions, and can be used as a valuable reference for researchers and practitioners in increasing the robustness of the detection models to real-world transformations.
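The abstract states that submissions were ranked by ROC AUC over the full test set. As a minimal sketch of what that metric measures, the snippet below computes ROC AUC as the probability that a randomly chosen positive sample outscores a randomly chosen negative one, with ties counted as half. The label convention (1 = AI-generated, 0 = real) and the toy scores are assumptions for illustration, not challenge data.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) ROC AUC; O(P * N), fine for small examples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: 1 = AI-generated, 0 = real (assumed convention).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))  # 7 of 9 positive/negative pairs are ordered correctly
```

Because the metric depends only on the ranking of scores, it needs no decision threshold, which makes it convenient when detectors are compared across transformed and untransformed images.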

Keywords: AI-generated image detection, robust detection models, image transformations, NTIRE challenge, CVPR workshop, dataset, ROC AUC evaluation, real-world scenarios

215. ❌ Degradation-Aware and Structure-Preserving Diffusion for Real-World Image Super-Resolution

Authors: Yang Ji, Zonghao Chen, Zhihao Xue, Junqin Hu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11470v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

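Scoring rationale: This paper focuses on image super-resolution in computer vision, using diffusion models to handle real-world image degradation. Although it is a deep learning application, all scoring keywords target large language models (LLMs) and related techniques (e.g., MoE, RLHF, RAG, agents), while this work studies diffusion models for image processing and involves no language models, no large-model principles, and no specific AI for Science domain (e.g., bioinformatics). Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a degradation-aware and structure-preserving diffusion framework for real-world image super-resolution. With its Degradation-aware Token Injection and Spatially Asymmetric Noise Injection modules, it achieves competitive no-reference perceptual quality and more realistic restoration on DIV2K and RealSR.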

Abstract (Chinese translation)

现实世界图像超分辨率对扩散模型而言尤为困难，因为真实退化过程复杂、异构且极少被显式建模。我们提出一种面向真实世界超分辨率的退化感知与结构保持扩散框架。具体而言，我们引入退化感知令牌注入技术，该技术从低分辨率输入中编码轻量级退化统计量，并将其与语义条件特征融合，从而实现显式的退化感知重建。我们进一步提出空间非对称噪声注入方法，该方法依据局部边缘强度调整扩散噪声，以在训练过程中更好地保持结构区域。这两个模块均为所采用扩散超分辨率框架的轻量级附加组件，仅需对条件处理流程进行微小改动。在DIV2K和RealSR数据集上的实验表明，相较于近期基线方法，本方法在无参考感知质量方面具有竞争力，能生成视觉上更真实的复原结果，同时保持了良好的感知-失真平衡。消融实验验证了各模块的有效性及其组合时的互补增益。代码与模型已公开于https://github.com/jiyang0315/DASP-SR.git。

Abstract

Real-world image super-resolution is particularly challenging for diffusion models because real degradations are complex, heterogeneous, and rarely modeled explicitly. We propose a degradation-aware and structure-preserving diffusion framework for real-world SR. Specifically, we introduce Degradation-aware Token Injection, which encodes lightweight degradation statistics from low-resolution inputs and fuses them with semantic conditioning features, enabling explicit degradation-aware restoration. We further propose Spatially Asymmetric Noise Injection, which modulates diffusion noise with local edge strength to better preserve structural regions during training. Both modules are lightweight add-ons to the adopted diffusion SR framework, requiring only minor modifications to the conditioning pipeline. Experiments on DIV2K and RealSR show that our method delivers competitive no-reference perceptual quality and visually more realistic restoration results than recent baselines, while maintaining a favorable perception–distortion trade-off. Ablations confirm the effectiveness of each module and their complementary gains when combined. The code and model are publicly available at https://github.com/jiyang0315/DASP-SR.git.

Keywords: real-world image super-resolution, diffusion models, degradation-aware, structure-preserving, degradation-aware token injection, spatially asymmetric noise injection, perceptual quality, perception-distortion trade-off

216. ❌ PACO: Proxy-Task Alignment and Online Calibration for On-the-Fly Category Discovery

Authors: Weidong Tang, Bohan Zhang, Zhixiang Chi, ZiZhang Wu, Yang Wang, Yanan Wu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11484v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

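Scoring rationale: The paper studies On-the-Fly Category Discovery (OCD), an incremental learning problem in computer vision and machine learning that focuses on discovering and classifying categories in online streaming data. The proposed PACO framework involves support-set calibration, tree-structured online decisions, a dynamic prototype memory, and adaptive threshold updates. All scoring keywords explicitly target large language models (LLMs) and related techniques (e.g., MoE, RLHF, RAG, agents), large-model training and optimization methods (e.g., scaling laws, PEFT, quantization), or AI applications in specific scientific fields (e.g., AI for Science). The paper involves no LLMs, no innovations in deep learning principles, and no large-model applications in other domains, and it mentions none of the specific techniques in the keywords. Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

Existing methods for On-the-Fly Category Discovery (OCD) rely on static thresholds and fixed decision boundaries, which makes category formation unstable. The paper proposes the PACO framework, which uses proxy-task alignment and online calibration to provide dynamic threshold updates and tree-structured decisions, and it significantly outperforms existing methods on multiple benchmarks.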

Abstract (Chinese translation)

在线类别发现（OCD）要求模型基于离线支持集进行训练，使其能够识别已知类别，同时从在线流式序列中发现新类别。现有方法主要侧重于离线训练，旨在通过学习支持集上的判别性表征，使得在测试时能够分离出新类别。然而，这些方法在推理阶段的发现机制通常简化为单一阈值。我们认为这一范式存在根本缺陷，因为OCD并非静态分类问题，而是一个动态过程。模型必须持续判断：1）样本是否属于已知类别；2）是否匹配已有的新类别；或3）应创建新类别。此外，先前方法将支持集视为固定知识，在推理过程中接收到新证据时不会更新决策边界，导致类别形成不稳定且不一致。我们的实验证实了这些问题。即使不改变表征，通过适当校准和自适应阈值，也能实现显著改进。受此启发，我们提出PACO——一种支持集校准的树状结构在线决策框架。该框架将推理建模为一系列分层决策，包括已知类别路由、具有新生类别感知的新类别分配，以及在动态原型记忆上执行的“附加”与“创建”操作。此外，我们通过模拟代理发现过程，在离线训练期间初始化阈值以与推理阶段对齐。推理过程中，阈值会基于成熟的新类别原型持续更新。重要的是，PACO无需繁重训练或针对特定数据集的调优，可直接作为推理模块集成到现有OCD流程中。大量实验表明，该方法在七个基准测试中均显著优于当前最优基线。

Abstract

On-the-Fly Category Discovery (OCD) requires a model, trained on an offline support set, to recognize known classes while discovering new ones from an online streaming sequence. Existing methods focus heavily on offline training. They aim to learn discriminative representations on the support set so that novel classes can be separated at test time. However, their discovery mechanism at inference is typically reduced to a single threshold. We argue that this paradigm is fundamentally flawed as OCD is not a static classification problem, but a dynamic process. The model must continuously decide 1) whether a sample belongs to a known class, 2) matches an existing novel category, or 3) should initiate a new one. Moreover, prior methods treat the support set as fixed knowledge. They do not update their decision boundaries as new evidence arrives during inference. This leads to unstable and inconsistent category formation. Our experiments confirm these issues. With properly calibrated and adaptive thresholds, substantial improvements can be achieved, even without changing the representation. Motivated by this, we propose PACO, a support-set-calibrated, tree-structured online decision framework. The framework models inference as a sequence of hierarchical decisions, including known-class routing, birth-aware novel assignment, and attach-versus-create operations over a dynamic prototype memory. Furthermore, we simulate the proxy discovery process to initialize the thresholds during offline training to align with inference. Thresholds are continuously updated during inference using mature novel prototypes. Importantly, PACO requires no heavy training and no dataset-specific tuning. It can be directly integrated into existing OCD pipelines as an inference-time module. Extensive experiments show significant improvements over SOTA baselines across seven benchmarks.
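The three-way decision described in the abstract (known class, existing novel category, or new category) can be sketched as a small routing procedure over a prototype memory. This is only an illustration of the general idea, not PACO itself: the class name `PrototypeMemory`, the cosine-similarity criterion, the fixed thresholds `tau_known` and `tau_novel`, and the running-mean prototype update are all assumptions, and the paper's proxy-task threshold initialization, birth-aware assignment, and online threshold updates are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class PrototypeMemory:
    """Hypothetical memory of known-class and novel-class prototypes."""

    def __init__(self, known, tau_known, tau_novel):
        self.known = dict(known)   # label -> prototype vector
        self.novel = {}            # label -> (running-mean prototype, count)
        self.tau_known = tau_known
        self.tau_novel = tau_novel

    def assign(self, x):
        # 1) Known-class routing.
        if self.known:
            label = max(self.known, key=lambda k: cosine(x, self.known[k]))
            if cosine(x, self.known[label]) >= self.tau_known:
                return label
        # 2) Attach to an existing novel category and update its prototype.
        if self.novel:
            label = max(self.novel, key=lambda k: cosine(x, self.novel[k][0]))
            proto, count = self.novel[label]
            if cosine(x, proto) >= self.tau_novel:
                mean = [(p * count + xi) / (count + 1) for p, xi in zip(proto, x)]
                self.novel[label] = (mean, count + 1)
                return label
        # 3) Create a new novel category.
        label = f"novel_{len(self.novel)}"
        self.novel[label] = (list(x), 1)
        return label


memory = PrototypeMemory({"cat": [1.0, 0.0], "dog": [0.0, 1.0]},
                         tau_known=0.9, tau_novel=0.8)
stream = [[0.98, 0.05], [0.7, 0.7], [0.72, 0.68], [-1.0, 0.1]]
assigned = [memory.assign(x) for x in stream]
print(assigned)
```

In this toy stream, the first sample routes to a known class, the second opens a new category, the third attaches to it, and the fourth opens another, which is the attach-versus-create behaviour the abstract contrasts with a single static threshold.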

Keywords: On-the-Fly Category Discovery, online streaming sequence, dynamic prototype memory, hierarchical decisions, threshold calibration, support-set-calibrated, tree-structured online decision, inference-time module

217. ❌ Beyond Model Design: Data-Centric Training and Self-Ensemble for Gaussian Color Image Denoising

Authors: Gengjia Chang, Xining Ge, Weijun Yuan, Zhan Li, Qiurong Song, Luen Zhu, Shuhong Liu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11468v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

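Scoring rationale: This paper focuses on image denoising in computer vision, using the Restormer architecture to remove Gaussian noise. Its core contributions are a data-centric training strategy (a larger training set and two-stage optimization) and test-time self-ensemble. All scoring keywords relate to large language models, deep learning principles, or AI for Science applications, while this work studies a conventional image processing task and involves no large-model techniques, novel deep learning methods, or scientific AI applications. Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

Using a data-centric training strategy (an expanded dataset and two-stage optimization) together with test-time self-ensemble, the paper improves the Restormer architecture in the NTIRE 2026 Gaussian color image denoising challenge, raising PSNR by up to 3.366 dB.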

Abstract (Chinese translation)

本文介绍了我们针对NTIRE 2026图像去噪挑战赛（固定噪声水平$\sigma = 50$下的高斯彩色图像去噪）提出的解决方案。我们并未提出新的复原主干网络，而是从两个互补的方向重新审视了成熟的Restormer架构的性能边界：更强的以数据为中心的训练策略，以及更完整的测试时能力释放。我们从公开的Restormer $\sigma = 50$基线模型出发，通过使用规模更大、多样性更丰富的公共图像语料库扩展了标准的多数据集训练方案，并将优化过程组织为两个阶段。在推理阶段，我们应用了$\times 8$几何自集成以进一步释放模型潜力。为保持实现一致性，我们保留了TLC风格的局部推理封装器；然而，系统性的消融实验表明，在此设定下其定量贡献可忽略不计。在挑战赛包含100张图像的验证集上，我们的最终提交结果达到了30.762 dB PSNR和0.861 SSIM，相较于公开的Restormer $\sigma = 50$预训练基线模型，PSNR提升高达3.366 dB。消融研究表明，主要的性能增益源于扩展的训练语料库和两阶段优化方案，而自集成则提供了有限但稳定的改进。

Abstract

This paper presents our solution to the NTIRE 2026 Image Denoising Challenge (Gaussian color image denoising at fixed noise level $\sigma = 50$). Rather than proposing a new restoration backbone, we revisit the performance boundary of the mature Restormer architecture from two complementary directions: stronger data-centric training and more complete test-time capability release. Starting from the public Restormer $\sigma = 50$ baseline, we expand the standard multi-dataset training recipe with larger and more diverse public image corpora and organize optimization into two stages. At inference, we apply $\times 8$ geometric self-ensemble to further release model capacity. A TLC-style local inference wrapper is retained for implementation consistency; however, systematic ablation reveals its quantitative contribution to be negligible in this setting. On the challenge validation set of 100 images, our final submission achieves 30.762 dB PSNR and 0.861 SSIM, improving over the public Restormer $\sigma = 50$ pretrained baseline by up to 3.366 dB PSNR. Ablation studies show that the dominant gain originates from the expanded training corpus and the two-stage optimization schedule, and self-ensemble provides marginal but consistent improvement.
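The ×8 geometric self-ensemble mentioned in the abstract is a standard test-time technique: run the model on the eight flip/rotation variants of the input, undo each transform on the output, and average. The sketch below shows the idea on nested lists with no dependencies; the function names and the toy `biased_model` are assumptions for illustration, and a real pipeline would operate on tensors with a trained denoiser.

```python
def rot90(img):
    """Rotate a 2D list 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in img]


def rot(img, k):
    """Apply k clockwise quarter turns."""
    for _ in range(k % 4):
        img = rot90(img)
    return img


def self_ensemble_x8(model, img):
    """Average model outputs over 4 rotations x optional horizontal flip."""
    h, w = len(img), len(img[0])
    acc = [[0.0] * w for _ in range(h)]
    for flip in (False, True):
        for k in range(4):
            x = rot(hflip(img) if flip else img, k)   # forward transform
            y = model(x)
            y = rot(y, 4 - k)                         # undo the rotation
            if flip:
                y = hflip(y)                          # undo the flip
            for i in range(h):
                for j in range(w):
                    acc[i][j] += y[i][j]
    return [[v / 8.0 for v in row] for row in acc]


def biased_model(x):
    """Toy stand-in for a denoiser: adds 1.0 to the top-left pixel only."""
    out = [row[:] for row in x]
    out[0][0] += 1.0
    return out


img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
same = self_ensemble_x8(lambda x: x, img)             # identity model
smoothed = self_ensemble_x8(biased_model, [[0.0, 0.0], [0.0, 0.0]])
print(same, smoothed)
```

With the toy model, the position-dependent bias is spread evenly over the corners of the image, which illustrates why averaging over geometric variants tends to give the small but consistent gain the abstract reports.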

Keywords: Image Denoising, Restormer, Data-Centric Training, Self-Ensemble, Two-Stage Optimization, PSNR, SSIM, NTIRE Challenge

218. ❌ HuiYanEarth-SAR: A Foundation Model for High-Fidelity and Low-Cost Global Remote Sensing Imagery Generation

Authors: Yongxiang Liu, Jie Zhou, Yafei Song, Tianpeng Liu, Li Liu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11444v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

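Scoring rationale: The paper explicitly proposes a "foundation model" for synthetic aperture radar (SAR) image generation, which is highly relevant to the first keyword and therefore scores 10. It is an AI application in remote sensing and has some connection to "AI for Science", but it is not bioinformatics or cheminformatics, so that keyword scores 5. The remaining keywords mainly concern specific LLM techniques (e.g., MoE, SFT, RLHF, RAG, reasoning, agents), model optimization (e.g., quantization, attention mechanisms), or specific application domains (e.g., bioinformatics or cheminformatics). This work focuses on a domain-specific generative foundation model for computer vision and remote sensing and involves none of them, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The study tackles the difficulty of preserving high fidelity in both macroscopic geospatial semantics and microscopic scattering mechanisms when generating global synthetic aperture radar (SAR) imagery. It proposes HuiYanEarth-SAR, the first foundational SAR image generation model based on AlphaEarth and integrated scattering mechanisms, which generates high-fidelity global SAR images from geographic coordinates alone.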

Abstract (Chinese translation)

合成孔径雷达（SAR）影像生成对于深化散射机理研究、建立可信赖的电磁场景模型、从根本上缓解制约该领域发展的数据稀缺瓶颈至关重要。然而，现有方法难以同时保证全局地理空间语义与微观散射机理的高保真度，导致全球尺度影像生成面临严峻挑战。为此，我们提出首个基于AlphaEarth并融合散射机理的SAR基础生成模型——HuiYanEarth-SAR。该模型通过注入地理空间先验以控制宏观结构，并利用隐式散射特征建模确保微观纹理的真实性，实现了仅依据地理坐标即可生成全球任意位置高保真SAR影像的能力。本研究不仅构建了高效的SAR场景模拟器，更从方法论层面搭建了连接地理学、散射机理与人工智能的桥梁，将SAR研究范式从感知理解推进至模拟创造，为构建高可信度的数字孪生地球提供了关键技术支撑。

Abstract

Synthetic Aperture Radar (SAR) imagery generation is essential for deepening the study of scattering mechanisms, establishing trustworthy electromagnetic scene models, and fundamentally alleviating the data scarcity bottleneck that constrains development in this field. However, existing methods find it difficult to simultaneously ensure high fidelity in both global geospatial semantics and microscopic scattering mechanisms, resulting in severe challenges for global generation. To address this, we propose HuiYanEarth-SAR, the first foundational SAR imagery generation model based on AlphaEarth and integrated scattering mechanisms. By injecting geospatial priors to control macroscopic structures and utilizing implicit scattering characteristic modeling to ensure the authenticity of microscopic textures, we achieve the capability of generating high-fidelity SAR images for global locations solely based on geographic coordinates. This study not only constructs an efficient SAR scene simulator but also establishes a bridge connecting geography, scattering mechanisms, and artificial intelligence from a methodological standpoint. It advances SAR research by expanding the paradigm from perception and understanding to simulation and creation, providing key technical support for constructing a high-confidence digital twin of the Earth.

Keywords: Foundation Model, Synthetic Aperture Radar (SAR), Image Generation, Geospatial Priors, Scattering Mechanisms, Global Remote Sensing, Digital Twin, High-Fidelity Simulation

219. ❌ Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding

Authors: Zhenghao Xie, Jing Xiao, Zhenqi Wang, Kexin Ma, Liang Liao, Gui-Song Xia, Mi Wang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11415v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

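Scoring rationale: The paper studies cross-scale observation for remote sensing understanding, proposes a cost-aware HR sampling method, and builds the large-scale GL-10M benchmark. Its content mainly concerns computer vision, remote sensing image processing, and multi-resolution analysis, and is unrelated to the vast majority of the large-model and deep learning keywords (e.g., LLMs, MoE, RLHF, PEFT). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since remote sensing understanding can be viewed as an application of AI to science (Earth observation). The paper does not explicitly mention bioinformatics or cheminformatics, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

Acquiring high-resolution imagery for remote sensing understanding is costly. The paper proposes a cost-aware cross-scale observation method that couples fine-grained HR sampling with cross-patch representation prediction, achieving a better performance-cost trade-off on recognition and retrieval tasks, and it builds GL-10M, a large-scale benchmark of 10 million images.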

Abstract (Chinese translation)

遥感理解本质上需要多分辨率观测，因为不同目标与应用任务对空间细节层级的需求各异。低分辨率影像能够实现高效的全球观测，而高分辨率影像虽能提供关键的局部细节，但其获取成本更高且覆盖范围有限。这催生了一种跨尺度感知策略：基于低分辨率全局感知结果，有选择性地获取高分辨率影像，从而在有限成本下提升任务性能。现有的高分辨率采样方法通常基于孤立的低分辨率图像块进行选择决策，忽略了细粒度块内重要性及跨块上下文关联，导致在稀疏高分辨率观测条件下出现碎片化特征表征与次优场景推理。为解决这一问题，我们将跨尺度遥感理解构建为一个统一的成本感知问题，将细粒度高分辨率采样与跨块表征预测相耦合，从而以更少的高分辨率观测实现更有效的任务推理。此外，我们提出了GL-10M——一个包含千万级空间对齐多分辨率影像的大规模基准数据集，为遥感领域成本约束下的跨尺度推理提供了系统性评估平台。在识别与检索任务上的大量实验表明，我们的方法始终能实现更优的性能-成本平衡。

Abstract

Remote sensing understanding inherently requires multi-resolution observation, since different targets and application tasks demand different levels of spatial detail. While low-resolution (LR) imagery enables efficient global observation, high-resolution (HR) imagery provides critical local details at much higher acquisition cost and limited coverage. This motivates a cross-scale sensing strategy that selectively acquires HR imagery from LR-based global perception to improve task performance under constrained cost. Existing HR sampling methods typically make selection decisions from isolated LR patches, which ignore fine-grained intra-patch importance and cross-patch contextual interactions, leading to fragmented feature representation and suboptimal scene reasoning under sparse HR observations. To address this issue, we formulate cross-scale remote sensing understanding as a unified cost-aware problem that couples fine-grained HR sampling with cross-patch representation prediction, enabling more effective task reasoning with fewer HR observations. Furthermore, we present GL-10M, a large-scale benchmark of 10 million spatially aligned multi-resolution images, enabling systematic evaluation of budget-constrained cross-scale reasoning in remote sensing. Extensive experiments on recognition and retrieval tasks show that our method consistently achieves a superior performance-cost trade-off.

Keywords: remote sensing, cross-scale observation, cost-aware, high-resolution sampling, multi-resolution images, GL-10M benchmark, recognition, retrieval

220. ❌ Online Reasoning Video Object Segmentation

Authors: Jinyuan Liu, Yang Wang, Zeyu Zhao, Weixin Li, Song Wang, Ruize Han Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11411v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

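Scoring rationale: The paper studies Online Reasoning Video Object Segmentation (ORVOS), a computer vision task that involves pixel-level mask prediction in videos and understanding of natural-language queries. It is unrelated to most of the large-model keywords (e.g., LLMs, MoE, SFT, RLHF, RAG), which mainly target language models or general AI techniques, whereas this work focuses on a vision task. However, the paper involves the concept of reasoning, so it has some relevance (5 points) to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" and "System 2 Thinking OR Slow Thinking OR In-depth Reasoning": it handles implicit and temporal reasoning in natural-language queries but does not study these reasoning techniques themselves. Other keywords such as AI for Science are also not relevant, because the paper belongs to computer vision rather than bioinformatics or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper studies Online Reasoning Video Object Segmentation (ORVOS) and proposes a benchmark and a baseline method that process video frames and natural-language queries under strict causality, addressing the limitations of existing offline methods.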

Abstract (Chinese translation)

推理视频目标分割任务旨在根据自然语言查询预测视频中的像素级掩码，这些查询可能包含隐含且时间定位的指代。然而，现有方法均在离线模式下开发与评估，即在推理时可获取完整视频，并能利用未来帧进行回溯性消歧，这与实际部署中要求严格因果性、逐帧决策的场景存在偏差。我们研究在线推理视频目标分割，该任务要求模型仅使用过去和当前帧增量式解析查询，无需重新审视历史预测，同时需处理事件展开过程中的指代转移。为支持评估，我们提出了ORVOSB基准数据集，其包含帧级因果标注与指代转移标签，涵盖210个视频、12,907个标注帧以及跨越五个推理类别的512条查询。我们进一步提出一种基线方法，采用持续更新的分割提示词和结构化时序令牌存储库，以在有限计算下实现长时序推理。实验表明，现有方法在严格因果性与指代转移条件下表现不佳，而我们的基线模型为未来研究奠定了坚实基础。

Abstract

Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually-updated segmentation prompts and a structured temporal token reservoir for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.
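The strictly causal setting described in the abstract can be illustrated with a small streaming loop. This is a minimal sketch and not the paper's baseline: `run_online`, `predict`, and `memory_size` are hypothetical names, and a fixed-size `deque` only stands in for the general idea of bounded memory. The paper's structured temporal token reservoir is not specified in the abstract.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, TypeVar

Frame = TypeVar("Frame")
Pred = TypeVar("Pred")


def run_online(
    frames: Iterable[Frame],
    predict: Callable[[List[Frame], Frame], Pred],
    memory_size: int = 4,
) -> List[Pred]:
    """Emit one prediction per frame using only past and current frames."""
    # Bounded history (hypothetical stand-in for a memory mechanism).
    memory: Deque[Frame] = deque(maxlen=memory_size)
    outputs: List[Pred] = []
    for frame in frames:
        # The predictor sees a copy of the bounded past plus the current
        # frame; it never sees future frames.
        outputs.append(predict(list(memory), frame))
        # Earlier outputs are never revised. Once the buffer is full,
        # appending evicts the oldest stored frame.
        memory.append(frame)
    return outputs


if __name__ == "__main__":
    # Toy frames are integers; the toy predictor records what it could see.
    for item in run_online(range(6), lambda past, cur: (tuple(past), cur), memory_size=2):
        print(item)
```

An offline method would instead receive the full frame list up front, which is exactly what the online setting rules out.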

Keywords: Online Reasoning Video Object Segmentation, video object segmentation, natural-language queries, causal inference, referent shifts, temporal reasoning, benchmark ORVOSB, segmentation prompts

221. ❌ Scene Change Detection with Vision-Language Representation Learning

Authors: Diwei Sheng, Vijayraj Gohil, Satyam Gaba, Zihan Liu, Giles Hamilton-Fletcher, John-Ross Rizzo, Yongqing Liang, Chen Feng Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11402v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

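Scoring rationale: The paper focuses on scene change detection in computer vision, proposing LangSCD, a framework that incorporates vision-language models (VLMs), and building the NYC-CD dataset. Although it involves multimodal learning (vision plus language), all keywords target the technical principles, training methods, inference optimization, alignment, and application paradigms of text-only large language models (LLMs), and have no direct connection to the paper's vision-language cross-modal research. The paper does not discuss LLMs, MoE, scaling laws, training techniques (pre-training, post-training, alignment, PEFT), inference acceleration (RAG, context extension, attention optimization), agents, model compression, interpretability, world models, or scientific AI applications, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

To address the accuracy problems caused by visual complexity in real-world scene change detection, the paper proposes LangSCD, a framework that integrates vision-language models. With language-enhanced semantic reasoning and a geometric-semantic matching module, it achieves state-of-the-art performance on multiple street-view benchmarks, and it introduces NYC-CD, a large-scale dataset with multiclass annotations.

Abstract (Chinese translation)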

场景变化检测（Scene Change Detection, SCD）对于城市监测与导航至关重要，但在实际环境中，由于光照变化、季节更替、视角差异以及复杂的城市布局，该任务仍面临挑战。现有方法主要依赖低层次视觉特征，这限制了其在视觉结构复杂的城市场景中准确识别变化物体的能力。本文提出LangSCD，一种用于场景变化检测的视觉-语言框架，通过引入语言驱动的语义推理，克服了单一模态的局限性。我们的方法引入了一个模块化语言组件，利用视觉-语言模型（Vision-Language Models, VLMs）生成场景变化的文本描述，并通过跨模态特征增强器将其与视觉特征融合。我们进一步提出一个几何-语义匹配模块，通过强化语义一致性与空间完整性来优化预测掩码。现有的真实世界场景变化检测基准仅提供二元变化标注，这对于需要细粒度理解场景动态的下游应用而言是不够的。为解决这一局限，我们引入了NYC-CD数据集，这是一个包含8,122对真实世界图像的大规模数据集，采集自纽约市，并通过半自动流程生成了多类别变化标注。在多个街景基准上的大量实验表明，我们的语言模块与匹配模块持续提升了现有变化检测架构的性能，达到了最先进水平，并凸显了将语言推理与视觉表征相结合对于实现鲁棒场景变化检测的重要价值。

Abstract

Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.

Keywords: Scene Change Detection, Vision-Language Models, Cross-modal Fusion, Semantic Reasoning, Geometric-Semantic Matching, Urban Monitoring, NYC-CD Dataset, Real-world Benchmarks

222. ❌ GS4City: Hierarchical Semantic Gaussian Splatting via City-Model Priors

Authors: Qilin Zhang, Jinyu Zhu, Olaf Wysocki, Benjamin Busam, Boris Jutzi Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11401v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	2.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

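Scoring rationale: GS4City focuses on computer vision and 3D scene understanding, proposing a hierarchical semantic Gaussian Splatting method that incorporates city-model priors for urban scene reconstruction and semantic segmentation. Its core technique combines 3D Gaussian Splatting (3DGS) with city models (CityGML), placing it at the intersection of computer graphics, 3D reconstruction, and geographic information systems (GIS). The paper mentions using "2D foundation models" (vision foundation models such as SAM and DINO) to assist semantic segmentation, so it has a weak connection to the keyword "Large Language Models OR LLMs OR Foundation Models" (2 points), because "foundation models" here refers specifically to vision foundation models rather than language models. The application scenario (city modeling, semantic segmentation) falls under scientific computing or engineering applications in the broad sense of "AI for Science", so it has some relevance to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). The paper does not address the technical principles of large language models (MoE, scaling laws, training and alignment, inference optimization, agents, and so on) or specific applications in bioinformatics or cheminformatics, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes GS4City, which fuses city-model priors with 2D foundation-model predictions to integrate hierarchical building semantics into a 3D Gaussian Splatting scene representation, yielding more accurate semantic segmentation and structured reconstruction of urban scenes.

Abstract (Chinese translation)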

近期基于语义的三维高斯泼溅（3DGS）方法主要依赖二维基础模型，常产生边界模糊且对结构化城市语义支持有限的问题。尽管如CityGML等城市模型将层次化组织的语义与建筑几何共同编码，但这些标签无法直接映射到高斯基元。本文提出GS4City，一种融合城市模型先验的层次化语义高斯泼溅方法，用于城市场景理解。GS4City通过双向光线投射从细节层次（LoD）3 CityGML模型中获取可靠的图像对齐掩码，显式利用父子关系验证并恢复细粒度立面元素。随后，该方法将这些基于几何的掩码与基础模型预测相融合，以建立场景一致的实例对应关系，并在二维身份监督与三维空间正则化的联合约束下，为每个高斯基元学习紧凑的身份编码。在TUM2TWIN和Gold Coast数据集上的实验表明，GS4City能有效将结构化建筑语义融入高斯场景表示，在粗粒度建筑分割任务中优于现有基于二维驱动的语义3DGS基线方法（包括LangSplat和Gaga）达15.8 IoU点，在细粒度语义分割任务中提升达14.2 mIoU点。通过桥接结构化城市模型与逼真的高斯场景表示，GS4City实现了可语义查询且具备结构感知的城市重建。代码发布于https://github.com/Jinyzzz/GS4City。

Abstract

Recent semantic 3D Gaussian Splatting (3DGS) methods primarily rely on 2D foundation models, often yielding ambiguous boundaries and limited support for structured urban semantics. While city models such as CityGML encode hierarchically organized semantics together with building geometry, these labels cannot be directly mapped to Gaussian primitives. We present GS4City, a hierarchical semantic Gaussian Splatting method that incorporates city-model priors for urban scene understanding. GS4City derives reliable image-aligned masks from Level of Detail (LoD) 3 CityGML models via two-pass raycasting, explicitly using parent-child relations to validate and recover fine-grained facade elements. It then fuses these geometry-grounded masks with foundation-model predictions to establish scene-consistent instance correspondences, and learns a compact identity encoding for each Gaussian under joint 2D identity supervision and 3D spatial regularization. Experiments on the TUM2TWIN and Gold Coast datasets show that GS4City effectively incorporates structured building semantics into Gaussian scene representations, outperforming existing 2D-driven semantic 3DGS baselines, including LangSplat and Gaga, by up to 15.8 IoU points in coarse building segmentation and 14.2 mIoU points in fine-grained semantic segmentation. By bridging structured city models and photorealistic Gaussian scene representations, GS4City enables semantically queryable and structure-aware urban reconstruction. Code is available at https://github.com/Jinyzzz/GS4City.
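The abstract reports gains in IoU and mIoU points. As a reminder of how these metrics are commonly computed, here is a minimal sketch over flat label sequences. The function names are hypothetical, and the paper's actual evaluation protocol (for example, which classes are ignored or how views are aggregated) is not stated in the abstract and may differ.

```python
from typing import Dict, Sequence


def per_class_iou(pred: Sequence[int], target: Sequence[int]) -> Dict[int, float]:
    """Intersection over union for every class label seen in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious: Dict[int, float] = {}
    for cls in sorted(set(pred) | set(target)):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        # union > 0 here because cls occurs in at least one of the sequences.
        ious[cls] = inter / union
    return ious


def mean_iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Unweighted mean of the per-class IoU values (mIoU)."""
    ious = per_class_iou(pred, target)
    return sum(ious.values()) / len(ious)


if __name__ == "__main__":
    # Toy labels only; these are not values from the paper.
    target = [0, 0, 1, 1, 2, 2]
    pred = [0, 1, 1, 1, 2, 0]
    print(per_class_iou(pred, target))
    print(mean_iou(pred, target))
```

Note that in this sketch a class that appears only in the prediction still counts toward the mean with an IoU of 0. A gain quoted in "points" usually means a difference between two such scores expressed in percent.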

Keywords: 3D Gaussian Splatting, Semantic Segmentation, City Models, Urban Scene Understanding, Hierarchical Semantics, CityGML, Scene Reconstruction, Foundation Models

223. ❌ EagleVision: A Multi-Task Benchmark for Cross-Domain Perception in High-Speed Autonomous Racing

Authors: Zakhar Yagudin, Murad Mebrahtu, Ren Jin, Jiaqi Huang, Yujia Yue, Dzmitry Tsetserukou, Jorge Dias, Majid Khonji Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11400v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

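Scoring rationale: The paper focuses on perception tasks in high-speed autonomous racing (3D detection and trajectory prediction). It belongs to computer vision and robotics rather than core research on large language models or deep learning principles. It involves domain adaptation and the use of AI in a specific science/engineering application (autonomous driving), so it has some relevance to "Pre-training OR Continual Pre-training OR Domain Adaptation" and "AI for Science OR Bioinformatics OR Cheminformatics" (5 points each). All other keywords are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the extreme perception challenges of high-speed autonomous racing, the paper proposes EagleVision, a multi-task benchmark for evaluating how well 3D detection and trajectory prediction generalize across domains (urban, simulator, real racing). Experiments show that urban pretraining and pretraining on real racing data improve model performance in the target domain.

Abstract (Chinese translation)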

高速自主赛车带来了极端的感知挑战，包括巨大的相对速度以及与传统城市驾驶数据集之间的显著领域偏移。现有基准测试未能充分捕捉这些高动态条件。我们提出了EagleVision，一个基于激光雷达的统一多任务基准，用于高速赛车场景中的三维检测与轨迹预测。该基准为印第安纳自动驾驶挑战赛数据集（14,893帧）和A2RL真实竞赛数据集（1,163帧）提供了新标注的三维边界框，同时包含12,000帧仿真器生成的标注数据，所有数据均在统一的评估协议下标准化。通过以数据集为中心的迁移框架，我们量化了城市、仿真和真实赛车领域间的跨领域泛化能力。城市数据预训练相比从头训练提升了检测性能（NDS 0.72对比0.69），而在真实赛车数据上进行中间预训练实现了向A2RL数据集的最佳迁移（NDS 0.726），优于仅使用仿真器数据的适应方法。在轨迹预测任务中，基于Indy数据训练的模型在A2RL测试序列上表现优于领域内训练的A2RL模型（FDE 0.947对比1.250），凸显了运动分布覆盖度在跨领域预测中的关键作用。EagleVision为系统研究极端高速动态下的感知泛化能力提供了基础。数据集与基准测试已公开于https://avlab.io/EagleVision。

Abstract

High-speed autonomous racing presents extreme perception challenges, including large relative velocities and substantial domain shifts from conventional urban-driving datasets. Existing benchmarks do not adequately capture these high-dynamic conditions. We introduce EagleVision, a unified LiDAR-based multi-task benchmark for 3D detection and trajectory prediction in high-speed racing, providing newly annotated 3D bounding boxes for the Indy Autonomous Challenge dataset (14,893 frames) and the A2RL Real competition dataset (1,163 frames), together with 12,000 simulator-generated annotated frames, all standardized under a common evaluation protocol. Using a dataset-centric transfer framework, we quantify cross-domain generalization across urban, simulator, and real racing domains. Urban pretraining improves detection over scratch training (NDS 0.72 vs. 0.69), while intermediate pretraining on real racing data achieves the best transfer to A2RL (NDS 0.726), outperforming simulator-only adaptation. For trajectory prediction, Indy-trained models surpass in-domain A2RL training on A2RL test sequences (FDE 0.947 vs. 1.250), highlighting the role of motion-distribution coverage in cross-domain forecasting. EagleVision enables systematic study of perception generalization under extreme high-speed dynamics. The dataset and benchmark are publicly available at https://avlab.io/EagleVision
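The trajectory-prediction results above are quoted as FDE (final displacement error). A minimal sketch of the usual definition, together with the related ADE (average displacement error), is shown below for a single 2D trajectory. The function names and toy coordinates are hypothetical, and the benchmark's exact protocol (units, prediction horizon, averaging over sequences) is not given in the abstract.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def final_displacement_error(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Euclidean distance between the last predicted and last true position."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    return math.dist(pred[-1], truth[-1])


def average_displacement_error(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Mean Euclidean distance over all time steps of the trajectory."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


if __name__ == "__main__":
    # Toy trajectories only; these are not values from the benchmark.
    truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    pred = [(0.0, 0.0), (1.0, 1.0), (5.0, 4.0)]
    print(final_displacement_error(pred, truth))
    print(average_displacement_error(pred, truth))
```

Lower is better for both metrics, which is why 0.947 versus 1.250 in the abstract favors the Indy-trained models.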

Keywords: autonomous racing, 3D detection, trajectory prediction, cross-domain generalization, LiDAR, high-speed perception, benchmark, domain adaptation

224. ❌ Video-based Heart Rate Estimation with Angle-guided ROI Optimization and Graph Signal Denoising

Authors: Gan Pei, Junhao Ning, Boqiu Shen, Yan Zhu, Menghan Hu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11395v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

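Scoring rationale: The paper focuses on video-based heart rate estimation, a computer vision and signal processing task involving remote photoplethysmography (rPPG), angle-guided ROI optimization, and graph signal denoising. It is unrelated to most keywords (LLMs, MoE, SFT, RLHF, RAG, CoT, agents, and so on), which concern large language models and related techniques (training, alignment, reasoning, agents), whereas this paper does not use or mention any language model or deep learning model. The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because rPPG is a scientific application in biomedical monitoring, but the paper leans toward signal processing rather than AI models themselves, so it receives 5 points (some relevance). The weighted total is computed as 5.0 (only one keyword scores). The author list does not include any of the designated experts.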
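The pass mark used throughout this report follows the formula given in the configuration section (5.0 + 0.8 × total weight, with a total weight of 27.0). The sketch below reproduces that threshold and a weighted total. It assumes that each keyword's score is simply weight × relevance, which is consistent with the table rows where a weight of 0.0 yields a score of 0.0 but is not confirmed by the report; the per-keyword cap from the configuration is not modeled, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def passing_score(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Pass mark from the report configuration: base + slope * total weight."""
    return base + slope * total_weight


def weighted_total(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum of weight * relevance over (weight, relevance) rows.

    ASSUMPTION: score = weight * relevance; the report does not state
    the exact per-keyword formula.
    """
    return sum(weight * relevance for weight, relevance in rows)


if __name__ == "__main__":
    threshold = passing_score(27.0)

    # One keyword with relevance 5.0 and 26 keywords with relevance 0.0.
    with_table_weights = [(0.0, 5.0)] + [(0.0, 0.0)] * 26  # weights as listed in the table
    with_unit_weights = [(1.0, 5.0)] + [(1.0, 0.0)] * 26   # weights as listed in the configuration

    for name, rows in [("table weights", with_table_weights), ("unit weights", with_unit_weights)]:
        total = weighted_total(rows)
        print(name, round(total, 1), "pass" if total >= threshold else "fail")
```

Under this assumption the paper stays far below the pass mark with either set of weights: a single keyword at relevance 5.0 contributes at most 5.0 against a threshold of 26.6.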

!!! tip deepseek-chat TL;DR

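The paper proposes a method that improves video-based heart rate estimation through angle-guided ROI optimization and graph signal denoising. It is validated on three public datasets, with MAE reduced by 20.38% on average.

Abstract (Chinese translation)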

远程光电容积描记术（rPPG）能够通过面部视频实现非接触式心率测量，但其性能会因说话、摇头等面部运动而显著下降。为解决这一问题，我们提出了两个即插即用模块。角度引导的感兴趣区域自适应优化模块通过量化ROI-相机角度来优化受运动影响的信号并捕捉全局运动，而多区域联合图信号去噪模块则利用图信号处理技术对区域内与区域间的ROI信号进行联合建模，以抑制运动伪影。这些模块兼容基于反射模型的rPPG方法，并在三个公共数据集上得到验证。结果表明，联合使用这两个模块能显著降低平均绝对误差（MAE），较基线平均下降20.38%，消融研究也证实了各模块的有效性。本工作证明了角度引导优化与基于图的去噪技术在运动场景中提升rPPG性能的潜力。

Abstract

Remote photoplethysmography (rPPG) enables non-contact heart rate measurement from facial videos, but its performance is significantly degraded by facial motions such as speaking and head shaking. To address this issue, we propose two plug-and-play modules. The Angle-guided ROI Adaptive Optimization module quantifies ROI-Camera angles to refine motion-affected signals and capture global motion, while the Multi-region Joint Graph Signal Denoising module jointly models intra- and inter-regional ROI signals using graph signal processing to suppress motion artifacts. The modules are compatible with reflection model-based rPPG methods and validated on three public datasets. Results show that joint use markedly reduces MAE, with an average decrease of 20.38% over the baseline, while ablation studies confirm the effectiveness of each module. The work demonstrates the potential of angle-guided optimization and graph-based denoising to enhance rPPG performance in motion scenarios.
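The headline result is a relative reduction in mean absolute error (MAE). The arithmetic behind such a figure can be sketched as follows; the heart-rate values are made up for illustration, the function names are hypothetical, and the abstract does not say how the 20.38% average was aggregated across datasets.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Average of |prediction - ground truth| over all samples."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline == 0:
        raise ValueError("baseline error must be non-zero")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical heart rates in beats per minute; not data from the paper.
    truth = [70.0, 80.0, 90.0]
    baseline_pred = [75.0, 86.0, 86.0]
    improved_pred = [74.0, 84.0, 87.0]

    mae_base = mean_absolute_error(baseline_pred, truth)
    mae_new = mean_absolute_error(improved_pred, truth)
    print(f"baseline MAE {mae_base:.2f}, improved MAE {mae_new:.2f}")
    print(f"relative reduction {relative_reduction(mae_base, mae_new):.1%}")
```

A relative reduction is always measured against the baseline error, so the same percentage can correspond to very different absolute improvements on different datasets.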

Keywords: remote photoplethysmography, heart rate estimation, ROI optimization, graph signal denoising, motion artifacts, facial videos, signal processing

225. ❌ Beyond Reconstruction: Reconstruction-to-Vector Diffusion for Hyperspectral Anomaly Detection

Authors: Jijun Xiang, Jiayi Wang, Pengxiang Wang, Cheng Chen, Nian Wang, Tao Wang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11390v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

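Scoring rationale: The paper focuses on hyperspectral anomaly detection (HAD) and proposes a new method, R2VD, involving physical prior extraction, manifold purification, a diffusion transformer, and vector dynamics inference. All keywords relate directly to large language models (LLMs), deep learning principles, or general AI methods, whereas this paper is a specific application in computer vision and remote sensing and does not address LLMs, MoE, scaling laws, training and alignment, inference optimization, agents, model compression, or general scientific AI. The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because HAD can be viewed as an AI application in remote sensing science; however, the paper does not explicitly mention bioinformatics or cheminformatics, and its core is methodological innovation rather than a cross-domain scientific application, so it receives 5 points (some relevance). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper targets the problem that existing hyperspectral anomaly detection models rely on scalar reconstruction, which causes sub-pixel anomalies to vanish and biases training. It proposes a new Reconstruction-to-Vector Diffusion (R2VD) paradigm that, through physics-guided manifold purification and vector dynamics inference, achieves state-of-the-art detection performance on eight datasets.

Abstract (Chinese translation)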

尽管高光谱异常检测（HAD）在复杂场景中识别稀疏目标方面表现出色，现有模型仍受限于标量化的“以重建为终点”范式。这种对模糊标量残差的依赖，持续导致空间下采样过程中亚像素级异常消失，同时当未净化的异常污染训练权重时，会引发严重的确认偏误。本文提出重建到向量扩散（Reconstruction-to-Vector Diffusion, R2VD），其从根本上将重建重新定义为流形净化的起点，从而建立一种新颖的残差引导生成动力学范式。我们的框架引入了一个四阶段流程：（1）物理先验提取（Physical Prior Extraction, PPE）阶段，通过双流统计指导缓解早期确认偏误；（2）引导流形净化（Guided Manifold Purification, GMP）阶段，利用全上下文自编码器（OmniContext Autoencoder, OCA）提取净化后的残差图，同时保留脆弱的亚像素拓扑结构；（3）残差分数建模（Residual Score Modeling, RSM）阶段，由物理光谱防火墙（Physical Spectral Firewall, PSF）保护的扩散变换器（Diffusion Transformer, DiT）有效隔离跨光谱泄漏；（4）向量动态推理（Vector Dynamics Inference, VDI）阶段，通过评估高维向量干扰模式而非传统标量误差，稳健地将目标与背景解耦。在八个数据集上的综合评估证实，R2VD确立了新的性能标杆，实现了卓越的目标可检测性与背景抑制能力。

Abstract

While Hyperspectral Anomaly Detection (HAD) excels at identifying sparse targets in complex scenes, existing models remain trapped in a scalar “reconstruction-as-endpoint” paradigm. This reliance on ambiguous scalar residuals consistently triggers sub-pixel anomaly vanishing during spatial downsampling, alongside severe confirmation bias when unpurified anomalies corrupt training weights. In this paper, we propose Reconstruction-to-Vector Diffusion (R2VD), which fundamentally redefines reconstruction as a manifold purification origin to establish a novel residual-guided generative dynamics paradigm. Our framework introduces a four-stage pipeline: (1) a Physical Prior Extraction (PPE) stage that mitigates early confirmation bias via dual-stream statistical guidance; (2) a Guided Manifold Purification (GMP) stage utilizing an OmniContext Autoencoder (OCA) to extract purified residual maps while preserving fragile sub-pixel topologies; (3) a Residual Score Modeling (RSM) stage where a Diffusion Transformer (DiT), guarded by a Physical Spectral Firewall (PSF), effectively isolates cross-spectral leakage; and (4) a Vector Dynamics Inference (VDI) stage that robustly decouples targets from backgrounds by evaluating high-dimensional vector interference patterns instead of conventional scalar errors. Comprehensive evaluations on eight datasets confirm that R2VD establishes a new state-of-the-art, delivering exceptional target detectability and background suppression.

Keywords: Hyperspectral Anomaly Detection, Reconstruction-to-Vector Diffusion, Manifold Purification, Diffusion Transformer, Physical Spectral Firewall, Vector Dynamics Inference, Sub-pixel Anomaly, State-of-the-art

226. ❌ ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines

Authors: Nafiseh Ghaffar Nia, Vinesh Appadurai, Suchithra V., Chinmay Rane, Daniel Pittman, James Carr, Adrienne Kline Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11389v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

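Scoring rationale: The paper focuses on medical image analysis (cardiac MRI view classification) using deep learning (the ConvFormer3D-TAP architecture) but does not involve large language models (LLMs) or related techniques. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some relevance, because the paper applies AI in the biomedical domain (related to bioinformatics), but this is not its core content, so it receives 5 points. The other keywords concern large-model techniques, training methods, inference optimization, agent systems, and so on, and are unrelated, so they score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the clinical challenge of automatic cardiac MRI view classification. It proposes ConvFormer3D-TAP, which reaches 96% validation accuracy on a dataset of 150,974 sequences, providing a reliable front-end tool for cardiac MRI workflows.

Abstract (Chinese translation)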

标准心脏电影磁共振成像（cine cardiac MRI）视图的可靠识别至关重要，因为每个视图决定了可视化哪些心脏解剖结构以及可执行哪些定量分析。无论是人工阅片者还是自动化深度学习系统，错误的视图识别均可能导致误差传播至分割、容积评估、应变分析和瓣膜评估等后续环节。然而，在临床常规实践中，由于扫描设备厂商、采集协议、运动伪影及平面定位的差异，准确的视图分类仍具挑战性。本文提出ConvFormer3D-TAP——一种专为电影序列设计的时空架构，该架构将三维卷积标记化（3D convolutional tokenization）与多尺度自注意力机制相结合。模型通过掩码时空重建和不确定性加权的多片段融合进行训练，以增强对心脏时相和模糊时间片段的鲁棒性。该设计捕捉了互补信息：通过卷积先验提取局部解剖结构特征，并通过分层注意力机制捕获长程心动周期动态特征。在一个包含150,974例临床采集的电影序列数据集中（涵盖六个标准心脏电影MRI视图），ConvFormer3D-TAP取得了96%的验证准确率，各类别F1分数均≥0.94，且具有强校准性（预期校准误差ECE=0.025；布里尔分数Brier=0.040）。误差分析表明，残余混淆主要集中在解剖结构相邻的长轴视图与左室流出道/主动脉瓣（LVOT/AV）视图对之间，这与固有的定位重叠性一致。这些结果支持将ConvFormer3D-TAP作为可扩展的前端模块，用于端到端心脏MRI工作流中的视图路由、筛选与质量控制。

Abstract

Reliable recognition of standard cine cardiac MRI views is essential because each view determines which cardiac anatomy is visualized and which quantitative analyses can be performed. Incorrect view identification, whether by a human reader or an automated deep learning system, can propagate errors into segmentation, volumetric assessment, strain analysis, and valve evaluation. However, accurate view classification remains challenging under routine clinical variability in scanner vendor, acquisition protocol, motion artifacts, and plane prescription. We present ConvFormer3D-TAP, a cine-specific spatiotemporal architecture that integrates 3D convolutional tokenization with multiscale self-attention. The model is trained using masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion to enhance robustness across cardiac phases and ambiguous temporal segments. The design captures complementary cues: local anatomical structure through convolutional priors and long-range cardiac-cycle dynamics through hierarchical attention. On a cohort of 150,974 clinically acquired cine sequences spanning six standard cine cardiac MRI views, ConvFormer3D-TAP achieved 96% validation accuracy with per-class F1-scores >= 0.94 and strong calibration (ECE = 0.025; Brier = 0.040). Error analysis shows that residual confusions are concentrated in anatomically adjacent long-axis and LVOT/AV view pairs, consistent with intrinsic prescription overlap. These results support ConvFormer3D-TAP as a scalable front-end for view routing, filtering and quality control in end-to-end cMRI workflows.
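The calibration figures in the abstract (ECE and Brier score) can be made concrete with a small sketch. It uses one common formulation: equal-width confidence bins for ECE and the multiclass sum-of-squares form of the Brier score. The function names, the bin count, and the toy inputs are assumptions, and the paper's exact binning and Brier variant are not stated in the abstract.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between accuracy and mean confidence per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and equally long")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean over samples of the squared distance to the one-hot label."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("inputs must be non-empty and equally long")
    total = 0.0
    for row, label in zip(probs, labels):
        total += sum((p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(row))
    return total / len(probs)


if __name__ == "__main__":
    # Toy predictions only; these are not values from the paper.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
    print(brier_score([[0.8, 0.2], [0.3, 0.7]], [0, 1]))
```

Both metrics are lower-is-better, and both are 0 for a model that is always fully confident and always correct.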

Keywords: cardiac MRI view classification, ConvFormer3D-TAP, spatiotemporal architecture, 3D convolutional tokenization, self-attention, uncertainty-weighted fusion, cine cardiac MRI, clinical workflow

227. ❌ LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization

Authors: Jianshi Wu, Minghang Zhu, Dunqiang Liu, Wen Li, Sheng Ao, Siqi Shen, Chenglu Wen, Cheng Wang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11355v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

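Scoring rationale: The paper studies LiDAR-based robot relocalization and proposes LEADER, a deep learning framework focused on point-cloud geometric feature extraction and reliability modeling. All scoring keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on) or specific scientific AI applications (such as bioinformatics). The paper's core is computer vision and robot localization; it uses deep learning for point-cloud processing but does not involve language models, the technical principles of large models, or AI applications in biology or chemistry. Therefore all keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes LEADER, a robust LiDAR relocalization framework that uses a geometric encoder and a reliability loss to significantly reduce position error on the Oxford RobotCar and NCLT datasets, outperforming existing methods.

Abstract (Chinese translation)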

激光雷达重定位技术因其能在复杂三维环境中提供精确的六自由度姿态估计而受到日益广泛的关注。近年来，基于学习的回归方法通过直接预测全局姿态而无需显式存储地图，提供了高效的解决方案。然而，这些方法由于对所有预测点进行同等处理，易受噪声和异常值影响，在具有挑战性的场景中往往表现不佳。本文提出LEADER，一个基于激光雷达的鲁棒重定位框架，其通过一个简单而有效的几何编码器得到增强。具体而言，我们首先提出一种鲁棒的基于投影的几何编码器架构，该架构能捕获多尺度几何特征，以增强几何表示的描述性。随后，我们构建了一种截断相对可靠性损失，以建模逐点模糊性并减轻不可靠预测的影响。在Oxford RobotCar和NCLT数据集上的大量实验表明，LEADER优于现有最先进方法，在位置误差上分别实现了相对于现有技术24.1%和73.9%的相对降低。源代码发布于https://github.com/JiansW/LEADER。

Abstract

LiDAR relocalization has attracted increasing attention as it can deliver accurate 6-DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes due to their equal treatment of all predicted points, which is vulnerable to noise and outliers. In this paper, we propose LEADER, a robust LiDAR-based relocalization framework enhanced by a simple, yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture which captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predictions. Extensive experiments on the Oxford RobotCar and NCLT datasets demonstrate that LEADER outperforms state-of-the-art methods, achieving 24.1% and 73.9% relative reductions in position error over existing techniques, respectively. The source code is released on https://github.com/JiansW/LEADER.

Keywords: LiDAR relocalization, 6-DoF pose estimation, geometric encoder, Robust Projection-based Geometric Encoder, Truncated Relative Reliability loss, point-wise ambiguity, Oxford RobotCar, NCLT datasets

228. ❌ LoGo-MR: Screening Breast MRI for Cancer Risk Prediction by Efficient Omni-Slice Modeling

Authors: Xin Wang, Yuan Gao, George Yiasemis, Antonio Portaluri, Zahra Aghdam, Muzhen He, Luyi Han, Yaofei Duan, Chunyao Lu, Xinglong Liang, Tianyu Zhang, Vivien van Veldhuizen, Yue Sun, Tao Tan, Ritse Mann, Jonas Teuwen Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11348v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

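Scoring rationale: The paper focuses on medical image analysis (breast cancer MRI screening), using a 2.5D CNN and a Transformer-MIL framework for cancer risk prediction, an application of AI in biomedicine. It is unrelated to the vast majority of large-model keywords (LLM, MoE, SFT, RLHF, and so on), which refer specifically to natural language processing or general-purpose large-model techniques. It has some connection to 'Explainable AI' only (the paper discusses interpretability; relevance 5/10) and is highly relevant to 'AI for Science' (a bioinformatics/medical AI application; relevance 10/10).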
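
The score table above can be read as a weighted sum. The sketch below is one hedged reading, assuming each row's score is weight × relevance and the paper's score is the sum of the row scores; the pass mark follows the formula in the report header (5.0 + 0.8 × total weight). The function names are hypothetical, and the report's actual scoring code is not shown in this document.

```python
from typing import List, Tuple

# Each row: (keyword, weight, relevance on a 0-10 scale).
Row = Tuple[str, float, float]


def row_score(weight: float, relevance: float) -> float:
    """Assumed rule: a row's score is its weight times its relevance."""
    return weight * relevance


def total_score(rows: List[Row]) -> float:
    """Sum the per-row scores into the paper's overall score."""
    return sum(row_score(w, r) for _, w, r in rows)


def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


# The two rows of this paper's table that have non-zero relevance.
rows = [
    ("Mechanistic Interpretability OR Explainable AI", 0.0, 5.0),
    ("AI for Science OR Bioinformatics OR Cheminformatics", 0.0, 10.0),
]
score = total_score(rows)
passed = score >= pass_mark(27.0)
```

Under this reading, a row with weight 0.0 contributes nothing even at 10.0/10 relevance, which is consistent with the 0.0 total listed for this paper.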

!!! tip deepseek-chat TL;DR

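The study proposes LoGo-MR, a framework that combines 2.5D local-global modeling with transformer-enhanced multiple-instance learning to predict 1- to 5-year breast cancer risk from breast MRI efficiently. On a large screening cohort it outperforms existing methods and provides interpretable risk localization.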

Abstract

Efficient and explainable breast cancer (BC) risk prediction is critical for large-scale population-based screening. Breast MRI provides functional information for personalized risk assessment. Yet effective modeling remains challenging as fully 3D CNNs capture volumetric context at high computational cost, whereas lightweight 2D CNNs fail to model inter-slice continuity. Importantly, breast MRI modeling for short- and long-term BC risk stratification remains underexplored. In this study, we propose LoGo-MR, a 2.5D local-global structural modeling framework for five-year BC risk prediction. Aligned with clinical interpretation, our framework first employs neighbor-slice encoding to capture subtle local cues linked to short-term risk. It then integrates transformer-enhanced multiple-instance learning (MIL) to model distributed global patterns related to long-term risk and provide interpretable slice importance. We further apply this framework across axial, sagittal, and coronal planes as LoGo3-MR to capture complementary volumetric information. This multi-plane formulation enables voxel-level risk saliency mapping, which may assist radiologists in localizing risk-relevant regions during breast MRI interpretation. Evaluated on a large breast MRI screening cohort (~7.5K), our method outperforms 2D/3D baselines and existing SOTA MIL methods, achieving AUCs of 0.77-0.69 for 1- to 5-year prediction and improving C-index by ~6% over 3D CNNs. LoGo3-MR further improves overall performance with interpretable localization across three planes, and validation across seven backbones shows consistent gains. These results highlight the clinical potential of efficient MRI-based BC risk stratification for large-scale screening. Code will be released publicly.
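
The slice-level aggregation described in the abstract can be illustrated with generic attention-based multiple-instance pooling. This is a minimal sketch, not the LoGo-MR implementation: the slice features, the scoring vector `w`, and the function names are hypothetical, and the paper's transformer-enhanced MIL is more elaborate. It only shows how per-slice attention weights can yield both a volume-level feature and a slice-importance score.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)  # subtract the max before exponentiating
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mil_pool(
    slices: List[List[float]], w: List[float]
) -> Tuple[List[float], List[float]]:
    """Pool per-slice feature vectors into one bag-level vector.

    slices: one feature vector per MRI slice (hypothetical encoder outputs).
    w: hypothetical learned scoring vector; a slice's logit is dot(w, feature).
    Returns (bag_feature, slice_importance).
    """
    logits = [sum(wi * fi for wi, fi in zip(w, f)) for f in slices]
    importance = softmax(logits)
    dim = len(slices[0])
    bag = [sum(a * f[d] for a, f in zip(importance, slices)) for d in range(dim)]
    return bag, importance


# Three toy slices with two features each.
slices = [[0.1, 0.0], [0.9, 0.2], [0.2, 0.1]]
bag, importance = attention_mil_pool(slices, w=[2.0, 1.0])
```

The `importance` list sums to 1, so it can be read as the share of the volume-level prediction attributed to each slice.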

Keywords: breast cancer risk prediction, breast MRI screening, 2.5D modeling, transformer-enhanced MIL, interpretable localization, multi-plane analysis, clinical AI application, medical imaging analysis

229. ❌ Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

Authors: Dongxu Wei, Qi Xu, Zhiqi Li, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Zhaopeng Cui, Peidong Liu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11331v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

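Scoring rationale: The paper focuses on 3D scene generation and proposes a new 3D Representation Autoencoder (3DRAE) and 3D Diffusion Transformer (3DDiT); it belongs to computer vision and 3D generation. All scoring keywords relate directly to large language models (LLMs), deep learning technique fundamentals, or AI applications in science, whereas the paper does not address LLMs, MoE, scaling laws, training techniques, alignment, reasoning, agents, compression, hallucination mitigation, interpretability, world models, or AI for science. The relevance of every keyword is therefore 0.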

!!! tip deepseek-chat TL;DR

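The paper addresses the representation redundancy and limited spatial consistency that existing 3D scene generation methods incur by relying on 2D representations. It proposes generating 3D scenes directly in an implicit 3D latent space, which yields efficient and spatially consistent 3D scene generation.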

Abstract

3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views, at any resolution and aspect ratio, with fixed complexity and rich semantics. Then we introduce 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without requiring a per-trajectory diffusion sampling pass, which is common in 2D-based approaches.

Keywords: 3D scene generation, 3D latent representation, 3D Representation Autoencoder, 3D Diffusion Transformer, view-decoupled representation, spatial consistency, diffusion modeling, implicit 3D latent space

230. ❌ Empowering Video Translation using Multimodal Large Language Models

Authors: Bingzheng QU, Kehai Chen, Xuefeng Bai, Min Zhang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11283v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

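Scoring rationale: The paper studies the application of multimodal large language models (MLLMs) to video translation, which is applied large-model research. Because the paper centers on MLLMs, it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10/10). It does not cover the technical specifics of the other keywords (MoE, SLMs, training methods, inference acceleration, agent systems, and so on), so those keywords score 0. Although the paper concerns an AI application, it is not in a scientific domain such as bioinformatics or cheminformatics, so 'AI for Science' also scores 0.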

!!! tip deepseek-chat TL;DR

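The paper systematically surveys how multimodal large language models empower video translation, proposes a three-role taxonomy (Semantic Reasoner, Expressive Performer, Visual Synthesizer), and discusses current challenges and future research directions.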

Abstract

Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLMs-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we provide the first comprehensive overview of MLLMs-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline promising future research directions for MLLMs-powered video translation.

Keywords: Multimodal Large Language Models, Video Translation, Semantic Reasoner, Expressive Performer, Visual Synthesizer, Multimodal Understanding, Temporal Reasoning, Lip Synchronization

231. ❌ Variational Latent Entropy Estimation Disentanglement: Controlled Attribute Leakage for Face Recognition

Authors: Ünsal Öztürk, Vedrana Krivokuća Hahn, Sushil Bhattacharjee, Sébastien Marcel Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11250v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

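Scoring rationale: The paper studies attribute disentanglement in face recognition embeddings and proposes a post-hoc method (VLEED) based on a variational autoencoder and mutual information estimation. Its core is representation learning, privacy protection, and fairness in computer vision, not large language models or innovations in deep learning fundamentals. All scoring keywords relate directly to large language models (LLMs, MoE, RLHF, RAG, and so on) or to specific large-model techniques (quantization, inference acceleration, and so on), none of which the paper addresses. Although the paper uses deep learning (a VAE), that is a basic building block rather than the contribution, and it involves none of the large-model-specific techniques or AI-for-science applications among the scoring keywords.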

!!! tip deepseek-chat TL;DR

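The paper proposes VLEED, a post-hoc method that disentangles sensitive attributes such as gender and ethnicity from face recognition embeddings in order to balance privacy, fairness, and recognition performance. Experiments show that it offers a wider range of privacy-utility tradeoffs than existing methods and can reduce recognition bias across demographic groups.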

Abstract

Face recognition embeddings encode identity, but they also encode other factors such as gender and ethnicity. Depending on how these factors are used by a downstream system, separating them from the information needed for verification is important for both privacy and fairness. We propose Variational Latent Entropy Estimation Disentanglement (VLEED), a post-hoc method that transforms pretrained embeddings with a variational autoencoder and encourages a distilled representation where the categorical variable of interest is separated from identity-relevant information. VLEED uses a mutual information-based objective realised through the estimation of the entropy of the categorical attribute in the latent space, and provides stable training with fine-grained control over information removal. We evaluate our method on IJB-C, RFW, and VGGFace2 for gender and ethnicity disentanglement, and compare it to various state-of-the-art methods. We report verification utility, predictability of the disentangled variable under linear and nonlinear classifiers, and group disparity metrics based on false match rates. Our results show that VLEED offers a wide range of privacy-utility tradeoffs over existing methods and can also reduce recognition bias across demographic groups.
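
The link between attribute entropy and mutual information that the abstract mentions can be illustrated with the identity I(Z; A) = H(A) - H(A | Z). The sketch below is a plug-in estimate built from a hypothetical attribute classifier's posteriors; it is an assumption made for illustration, not VLEED's estimator, and every name in it is made up for this example.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def mutual_information_estimate(posteriors: List[List[float]]) -> float:
    """Plug-in estimate of I(Z; A) = H(A) - H(A | Z).

    posteriors: for each latent sample z, a hypothetical classifier's
    distribution over the attribute classes, p(a | z).
    """
    n = len(posteriors)
    k = len(posteriors[0])
    # Marginal p(a) is the average of the per-sample posteriors.
    marginal = [sum(p[c] for p in posteriors) / n for c in range(k)]
    # H(A | Z) is the average entropy of the per-sample posteriors.
    conditional = sum(entropy(p) for p in posteriors) / n
    return entropy(marginal) - conditional


leaky = [[0.99, 0.01], [0.01, 0.99]]   # attribute easy to predict from z
private = [[0.5, 0.5], [0.5, 0.5]]     # attribute unpredictable from z
mi_leaky = mutual_information_estimate(leaky)
mi_private = mutual_information_estimate(private)
```

When the posteriors carry no information about the attribute, the conditional entropy equals the marginal entropy and the estimate drops to zero, which is the regime a disentangled embedding aims for.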

Keywords: face recognition, disentanglement, variational autoencoder, privacy, fairness, attribute leakage, mutual information, demographic bias

232. ❌ Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models

Authors: Kexin Ma, Jing Xiao, Chaofeng Chen, Geyong Min, Guibo Zhu, Jinqiao Wang, Liang Liao Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11240v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

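Scoring rationale: The paper focuses on token pruning for Large Vision-Language Models (LVLMs), which belongs to large-model efficiency optimization. Relevance to 'Large Language Models' is 8/10 because LVLMs are a visual extension of large language models. Relevance to 'Quantization/Model Compression' and 'Speculative Decoding/Inference Acceleration' is 5/10 because token pruning is one approach to model compression and inference acceleration. Other keywords such as MoE, SFT, and RAG are unrelated to the paper and score 0.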

!!! tip deepseek-chat TL;DR

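The paper proposes DeSAP, a decoupled similarity-aware pruning method for task-aware visual token pruning in large vision-language models. On LLaVA-1.5-7B, retaining only 11.1% of visual tokens, it achieves a 10× FLOPs reduction and a 2.3× prefill speedup while maintaining 98.1% of the original performance.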

Abstract

Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods typically rely on individual attention sources from different LVLM components, resulting in incomplete and suboptimal pruning decisions due to biased attention distributions. To address this problem, we propose DeSAP, a novel Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder. Specifically, DeSAP introduces a decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance for pruning. By integrating decoupled similarity with visual saliency signals derived from visual attention, DeSAP performs token pruning under the guidance of both task-related and visual cues, enabling robust pruning even under aggressive pruning ratios. Extensive experiments across diverse benchmarks and architectures show that DeSAP consistently outperforms SOTA methods in both accuracy and efficiency. On LLaVA-1.5-7B, DeSAP achieves a 10 times FLOPs reduction and a 2.3 times prefill speedup by retaining only 11.1% of visual tokens, while maintaining 98.1% of the original performance.
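
As a rough illustration of score-based visual token pruning, the sketch below keeps the top-scoring tokens after mixing a task-relevance score with a visual saliency score. It is a toy under stated assumptions, not DeSAP itself: the linear mix, the `alpha` weight, and the function name are hypothetical, and the abstract does not specify how the two signals are combined. The count of 576 visual tokens for LLaVA-1.5 is likewise an assumption made here; under it, an 11.1% keep ratio retains 64 tokens.

```python
from typing import List


def prune_tokens(
    relevance: List[float],
    saliency: List[float],
    keep_ratio: float,
    alpha: float = 0.5,
) -> List[int]:
    """Return the indices of the visual tokens to keep, in original order.

    relevance: per-token similarity to the text prompt (task-related cue).
    saliency: per-token visual attention score (visual cue).
    alpha: hypothetical mixing weight between the two cues.
    """
    if len(relevance) != len(saliency):
        raise ValueError("score lists must have the same length")
    n = len(relevance)
    k = max(1, round(n * keep_ratio))  # always keep at least one token
    combined = [alpha * r + (1.0 - alpha) * s for r, s in zip(relevance, saliency)]
    top = sorted(range(n), key=lambda i: combined[i], reverse=True)[:k]
    return sorted(top)  # preserve the tokens' original spatial order


# Toy scores for 576 tokens (assumed LLaVA-1.5 visual token count).
n_tokens = 576
relevance = [(i % 7) / 7.0 for i in range(n_tokens)]
saliency = [(i % 5) / 5.0 for i in range(n_tokens)]
kept = prune_tokens(relevance, saliency, keep_ratio=0.111)
```

Returning the kept indices in ascending order leaves the surviving tokens in their original positions, so positional information downstream is not scrambled.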

Keywords: Large Vision-Language Models, Token Pruning, Computational Efficiency, Decoupled Similarity, Task-Aware Pruning, Visual Encoder, FLOPs Reduction, Inference Acceleration

233. ❌ Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

Authors: Tencent Hunyuan Team Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11244v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

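Scoring rationale: The paper studies the application of multimodal large language models (MLLMs) to video understanding and generation and proposes MTSS, a new structured video captioning paradigm. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8/10) because MLLMs extend LLMs to the multimodal domain and the paper's core contribution is built on MLLMs. The paper does not cover or mention the other keywords, such as MoE, SLMs, scaling laws, training techniques (pre-training, SFT, RLHF, and so on), inference optimization (RAG, attention optimization), agents, or model compression, so they score 0. It also does not involve AI applications in a specific scientific domain (such as bioinformatics), so 'AI for Science' scores 0 as well.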

!!! tip deepseek-chat TL;DR

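The paper targets the low representational fidelity and poor scalability that result when existing video captioning methods treat a video as a single narrative paragraph. It proposes MTSS, a structured video captioning paradigm that uses Stream Factorization and Relational Grounding to decouple a video into complementary streams and then reconnect them. Experiments show that MTSS improves video understanding performance, narrows the performance gap between MLLMs of different sizes, and substantially improves identity consistency, audio-visual alignment, and temporal controllability in video generation.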

Abstract

Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. Extensive experiments demonstrate that MTSS consistently enhances video understanding across various models, achieving an average reduction of 25% in the total error rate on Video-SALMONN-2 and an average performance gain of 67% on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: a 45% boost in cross-shot identity consistency, a 56% boost in audio-visual alignment, and a 71% boost in temporal controllability.
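
To make the idea of factorized streams with explicit links concrete, here is a hypothetical data-structure sketch. The four stream names (Reference, Shot, Event, Global) come from the abstract; every field name, the `dangling_links` check, and the example content are assumptions made for illustration and are not taken from the MTSS specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reference:
    ref_id: str        # identity handle shared across streams
    description: str   # appearance or voice description


@dataclass
class Shot:
    shot_id: int
    start_s: float     # shot start time in seconds
    end_s: float       # shot end time in seconds
    camera: str


@dataclass
class Event:
    shot_id: int            # temporal link to a Shot
    actor_ids: List[str]    # identity links to References
    text: str


@dataclass
class SceneScript:
    global_summary: str
    references: List[Reference] = field(default_factory=list)
    shots: List[Shot] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)

    def dangling_links(self) -> List[str]:
        """List events whose identity or temporal links point at nothing."""
        ref_ids = {r.ref_id for r in self.references}
        shot_ids = {s.shot_id for s in self.shots}
        problems = []
        for i, e in enumerate(self.events):
            if e.shot_id not in shot_ids:
                problems.append(f"event {i}: unknown shot {e.shot_id}")
            for a in e.actor_ids:
                if a not in ref_ids:
                    problems.append(f"event {i}: unknown reference {a}")
        return problems


script = SceneScript(
    global_summary="Two friends argue in a kitchen.",
    references=[Reference("anna", "woman in a red coat"),
                Reference("ben", "man with glasses")],
    shots=[Shot(0, 0.0, 4.0, "wide"), Shot(1, 4.0, 7.5, "close-up")],
    events=[Event(0, ["anna", "ben"], "Anna enters and greets Ben."),
            Event(1, ["anna"], "Anna raises her voice.")],
)
```

Because each event refers to identities and shots by ID, editing one reference's description touches a single record instead of forcing a rewrite of the whole caption, which is the locality benefit the abstract attributes to factorized descriptions.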

Keywords: Multimodal Large Language Models, Video Captioning, Stream Factorization, Relational Grounding, Video Understanding, Video Generation, Structured Descriptions, Multi-Stream Scene Script

234. ❌ Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection

Authors: Jiaqi Wu, Zhen Wang, Enhao Huang, Kangqing Shen, Yulin Wang, Yang Yue, Yifan Pu, Gao Huang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11234v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

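Scoring rationale: The paper studies text-guided multispectral object detection, focusing on cross-modal fusion of RGB and infrared images; it belongs to computer vision. All scoring keywords relate directly to large models, deep learning technique fundamentals, or AI applications in science, whereas the paper does not involve any large-model technique (LLM, MoE, SFT, RLHF, and so on), model optimization method (quantization, inference acceleration), or concrete AI-for-science application (such as bioinformatics). Although the paper uses text guidance, that guidance is limited to semantic alignment for a vision task and is unrelated to large-model techniques. The relevance of every keyword is therefore 0.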

!!! tip deepseek-chat TL;DR

该论文针对文本引导的多光谱目标检测中RGB与红外图像粒度不对称和跨模态差异利用不足的问题，提出了一个语义桥接融合框架，通过双支持建模和双向语义对齐模块，显著提升了多光谱基准测试上的检测性能。

摘要翻译

文本引导的多光谱目标检测利用文本语义来指导RGB与红外图像间的语义感知跨模态交互，以实现更鲁棒的感知能力。然而，现有方法仍存在明显局限：（1）当前方法通常仅将文本作为辅助语义增强信号，未能充分发挥其引导作用以弥合RGB与IR模态间固有的粒度不对称性；（2）传统基于数据驱动的注意力融合机制倾向于强调稳定的共识特征，而忽视了可能具有价值的跨模态差异信息。为解决上述问题，我们提出一种面向多光谱目标检测的双支撑建模语义桥接融合框架。具体而言，文本被用作共享语义桥梁，在统一的类别条件下对齐RGB与IR模态的响应，同时将重校准的热模态语义先验映射至RGB分支以实现语义级映射融合。我们进一步将RGB-IR交互证据形式化为包含潜在判别性线索的常规共识支撑与互补差异支撑，并通过动态重校准将其作为结构化归纳偏置引入融合过程。此外，我们设计了双向语义对齐模块以实现闭环式的视觉-文本引导增强。大量实验证明了所提出融合框架的有效性及其在多光谱基准数据集上优越的检测性能。代码发布于https://github.com/zhenwang5372/Bridging-RGB-IR-Gap。

Abstract

Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at https://github.com/zhenwang5372/Bridging-RGB-IR-Gap.

Keywords: text-guided multispectral object detection, RGB-IR fusion, semantic bridge, consensus and discrepancy modeling, cross-modal interaction, bidirectional semantic alignment, thermal semantic prior, dynamic recalibration

235. ❌ Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection

Authors: You Su, Yonghong Song, Jingqi Chen, Zehan Wen | Journal/Source: arxiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11231v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

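Scoring rationale: The paper focuses on remote sensing change detection in computer vision and proposes the Seg2Change adapter, which applies open-vocabulary semantic segmentation models to open-vocabulary change detection. It is entirely unrelated to most keywords (such as LLM, MoE, reasoning, alignment, and compression). It has only a moderate connection (5 points) to 'Pre-training OR Continual Pre-training OR Domain Adaptation', because it involves adapting a model to a new task, and a fairly high connection (8 points) to 'AI for Science OR Bioinformatics OR Cheminformatics', because remote sensing change detection is an application of AI in science (earth science, environmental monitoring), though not bioinformatics or cheminformatics.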
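The score table for this entry pairs non-zero relevance values (5.0 and 8.0) with a score of 0.0. The sketch below works through that arithmetic. The pass-mark formula (5.0 + 0.8 × total weight) is the one stated in the report's configuration section. The per-row formula, weight × relevance, is an assumption made here for illustration, since the report does not state how a row score is derived. The function names are hypothetical.

```python
# Pass-mark constants taken from the report's configuration section.
BASE = 5.0
PER_WEIGHT = 0.8


def pass_mark(total_weight: float) -> float:
    """Pass mark = 5.0 + 0.8 x total keyword weight."""
    return BASE + PER_WEIGHT * total_weight


def row_score(weight: float, relevance: float) -> float:
    """ASSUMED row formula: weight x relevance (relevance is on a 0-10 scale)."""
    return weight * relevance


def total_score(rows) -> float:
    """Sum of row scores over (weight, relevance) pairs."""
    return sum(row_score(w, r) for w, r in rows)


# This entry: 25 keywords with relevance 0, plus the two non-zero rows.
# Every row in the table carries a weight of 0.0.
rows = [(0.0, 0.0)] * 25 + [(0.0, 5.0), (0.0, 8.0)]

threshold = pass_mark(27.0)   # 27 keywords listed in the configuration
score = total_score(rows)
print(f"score={score:.1f} threshold={threshold:.1f} passed={score >= threshold}")
```

Under that assumption, a row weight of 0.0 cancels any relevance value, which is consistent with the 0.0 total shown for this entry.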

!!! tip deepseek-chat TL;DR

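The paper addresses open-vocabulary change detection in remote sensing imagery by proposing the Seg2Change adapter, which adapts open-vocabulary semantic segmentation models to the change detection task, and achieves state-of-the-art performance gains on two datasets.

Abstract (Chinese translation)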

变化检测是遥感领域的一项基础任务，旨在量化人类活动与生态动态对地表覆盖变化的影响。现有变化检测方法受限于训练数据集中预定义的类别，这制约了其在真实场景中的可扩展性。近年来，针对遥感影像已涌现出许多先进的开放词汇语义分割模型。然而，目前仍缺乏一个有效的框架，能够将这些模型直接应用于开放词汇变化检测——这是一项融合视觉与语言以检测任意类别变化的新型任务。为应对这些挑战，我们首先构建了一个类别无关的变化检测数据集，命名为CA-CDD。进一步，我们设计了一个类别无关的变化检测头，用于检测任意类别的变迁并将其索引至具体类别。在此基础上，我们提出了Seg2Change，这是一个适配器，旨在将开放词汇语义分割模型适配至变化检测任务。无需复杂修饰，这一简洁而有效的框架在开放词汇变化检测上取得了最先进的性能（在WHU-CD数据集上提升9.52 IoU，在SECOND数据集上提升5.50 mIoU）。我们的代码发布于https://github.com/yogurts-sy/Seg2Change。

Abstract

Change detection is a fundamental task in remote sensing, aiming to quantify the impacts of human activities and ecological dynamics on land-cover changes. Existing change detection methods are limited to predefined classes in training datasets, which constrains their scalability in real-world scenarios. In recent years, numerous advanced open-vocabulary semantic segmentation models have emerged for remote sensing imagery. However, there is still a lack of an effective framework for directly applying these models to open-vocabulary change detection (OVCD), a novel task that integrates vision and language to detect changes across arbitrary categories. To address these challenges, we first construct a category-agnostic change detection dataset, termed CA-CDD. Further, we design a category-agnostic change head to detect the transitions of arbitrary categories and index them to specific classes. Based on them, we propose Seg2Change, an adapter designed to adapt open-vocabulary semantic segmentation models to change detection task. Without bells and whistles, this simple yet effective framework achieves state-of-the-art OVCD performance (+9.52 IoU on WHU-CD and +5.50 mIoU on SECOND). Our code is released at https://github.com/yogurts-sy/Seg2Change.
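The gains quoted in the abstract are in IoU and mIoU. As a reminder of what those metrics measure, here is a minimal sketch of intersection-over-union for binary change masks and its per-class mean. The function names and toy inputs are illustrative and are not taken from the Seg2Change code. Benchmarks typically accumulate these counts over a whole dataset rather than per image, which this sketch does not attempt.

```python
def iou(pred, target):
    """Intersection-over-union of two equally sized binary masks (flat 0/1 lists)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention chosen here: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else inter / union


def mean_iou(pred_labels, target_labels, num_classes):
    """Mean of per-class IoU over classes present in prediction or target."""
    scores = []
    for c in range(num_classes):
        p = [int(x == c) for x in pred_labels]
        t = [int(x == c) for x in target_labels]
        if any(p) or any(t):
            scores.append(iou(p, t))
    return sum(scores) / len(scores) if scores else 1.0


# Toy example: 4 pixels, one true change pixel, two predicted.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
print(iou(pred_mask, true_mask))  # intersection 1, union 2

# Toy semantic example with 3 classes.
print(mean_iou([0, 1, 1, 2], [0, 1, 2, 2], num_classes=3))
```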

Keywords: change detection, remote sensing, open-vocabulary semantic segmentation, adapter, Seg2Change, category-agnostic, vision-language integration, land-cover changes

236. ❌ NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: AI Flash Portrait (Track 3)

Authors: Ya-nan Guan, Shaonan Zhang, Hang Guo, Yawen Wang, Xinying Fan, Tianqu Zhuang, Jie Liang, Hui Zeng, Guanyi Qin, Lishen Qu, Tao Dai, Shu-Tao Xia, Lei Zhang, Radu Timofte, Bin Chen, Yuanbo Zhou, Hongwei Wang, Qinquan Gao, Tong Tong, Yanxin Qian, Lizhao You, Jingru Cong, Lei Xiong, Shuyuan Zhu, Zhi-Qiang Zhong, Kan Lv, Yang Yang, Kailing Tang, Minjian Zhang, Zhipei Lei, Zhe Xu, Liwen Zhang, Dingyong Gou, Yanlin Wu, Cong Li, Xiaohui Cui, Jiajia Liu, Guoyi Xu, Yaoxin Jiang, Yaokun Shi, Jiachen Tu, Liqing Wang, Shihang Li, Bo Zhang, Biao Wang, Haiming Xu, Xiang Long, Xurui Liao, Yanqiao Zhai, Haozhe Li, Shijun Shi, Jiangning Zhang, Yong Liu, Kai Hu, Jing Xu, Xianfang Zeng, Yuyang Liu, Minchen Wei | Journal/Source: arxiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11230v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

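Scoring rationale: The paper focuses on image restoration in computer vision, specifically low-light portrait restoration, an application of deep learning to image processing. All scoring keywords relate to large language models (LLMs) and associated techniques (such as training methods, inference optimization, alignment, and agent systems) or to specific scientific AI applications (such as bioinformatics). Neither the abstract nor the title mentions LLMs, MoE, SLMs, scaling laws, pre-training, post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agent systems, tool use, multi-agent systems, quantization, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or scientific AI applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper presents Track 3 (AI Flash Portrait) of the NTIRE 2026 3rd Restore Any Image Model challenge, which targets the balance among noise suppression, detail preservation, and illumination and color reproduction in real-world low-light portrait restoration. By establishing a new benchmark and providing a dataset and evaluation protocol, the challenge attracted broad participation.

Abstract (Chinese translation)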

本文全面综述了NTIRE 2026第三届“任意图像修复模型”（RAIM）挑战赛，并特别聚焦于赛道三：AI闪拍人像。尽管深度学习在图像修复领域已取得显著进展，但现有模型在真实世界低光照人像场景中仍面临重大挑战。具体而言，这些模型难以在噪声抑制、细节保留以及真实的光照与色彩还原之间实现最佳平衡。为弥补这一差距，本挑战赛旨在为真实世界低光照人像修复建立一个新颖的基准。我们采用一种融合客观定量指标与严格主观评估协议的混合评价体系，对所提出的算法进行了全面评估。本次竞赛提供了一个包含800组真实拍摄的低光照人像数据集。每组数据包含一张1K分辨率的低光照输入图像、一张1K真实值（GT）图像和一张1K人物掩码。本挑战赛引起了学术界与工业界的广泛关注，吸引了超过100支参赛团队，并收到了超过3000份有效提交结果。本报告详细阐述了挑战赛的设立动机、数据集构建过程、评估指标以及竞赛的各个阶段。该赛道发布的数据集与基线代码均公开于同一GitHub仓库（https://github.com/zsn1434/AI_Flash-BaseLine/tree/main），官方挑战赛网页托管于CodaBench平台（https://www.codabench.org/competitions/12885/）。

Abstract

In this paper, we present a comprehensive overview of the NTIRE 2026 3rd Restore Any Image Model (RAIM) challenge, with a specific focus on Track 3: AI Flash Portrait. Despite significant advancements in deep learning for image restoration, existing models still encounter substantial challenges in real-world low-light portrait scenarios. Specifically, they struggle to achieve an optimal balance among noise suppression, detail preservation, and faithful illumination and color reproduction. To bridge this gap, this challenge aims to establish a novel benchmark for real-world low-light portrait restoration. We comprehensively evaluate the proposed algorithms utilizing a hybrid evaluation system that integrates objective quantitative metrics with rigorous subjective assessment protocols. For this competition, we provide a dataset containing 800 groups of real-captured low-light portrait data. Each group consists of a 1K-resolution low-light input image, a 1K ground truth (GT), and a 1K person mask. This challenge has garnered widespread attention from both academia and industry, attracting over 100 participating teams and receiving more than 3,000 valid submissions. This report details the motivation behind the challenge, the dataset construction process, the evaluation metrics, and the various phases of the competition. The released dataset and baseline code for this track are publicly available from the same GitHub repository (https://github.com/zsn1434/AI_Flash-BaseLine/tree/main), and the official challenge webpage is hosted on CodaBench (https://www.codabench.org/competitions/12885/).

Keywords: image restoration, low-light portrait, deep learning, benchmark, real-world dataset, noise suppression, detail preservation, illumination reproduction

237. ❌ H-SPAM: Hierarchical Superpixel Anything Model

Authors: Julien Walther, Rémi Giraud, Michaël Clément | Journal/Source: arxiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11218v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

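Scoring rationale: H-SPAM focuses on superpixel segmentation in computer vision and proposes a unified framework for generating accurate, regular, and perfectly nested hierarchical superpixels. Although the paper uses deep features and pretrained models, its core content is entirely unrelated to the scoring keywords, which all center on large language models, the principles of deep learning techniques, and their applications in science. The paper does not touch on large language models, MoE, small models, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science.

!!! tip deepseek-chat TL;DR

The paper proposes H-SPAM, a hierarchical superpixel segmentation framework that builds an accurate, regular, and perfectly nested superpixel hierarchy through a two-phase region merging process. On standard benchmarks it significantly outperforms existing hierarchical methods and performs on par with state-of-the-art non-hierarchical methods.

Abstract (Chinese translation)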

超像素通过将像素分组为连贯区域，提供了一种紧凑的图像表示方法。现有方法在生成不规则超像素形状时，其分割精度已进入平台期。此外，大多数现有方法仅生成单一固定尺度的分割结果，这限制了其在需要多尺度表示的视觉处理流程中的应用。本文提出H-SPAM（分层超像素通用模型），这是一个能够生成精确、规整且完全嵌套的分层超像素的统一框架。该框架以精细分割为起点，在深度特征和外部物体先验的引导下，通过两阶段区域合并过程构建层次结构：第一阶段保持物体内部一致性，第二阶段允许受控的物体间合并。该层次结构还可通过视觉注意力图或用户输入进行调节，以在层级中更长久地保留重要区域。在标准基准测试上的实验表明，H-SPAM在准确性和规整度上显著优于现有分层方法，同时与当前最先进的非分层方法性能相当。代码与预训练模型已开源：https://github.com/waldo-j/hspam。

Abstract

Superpixels offer a compact image representation by grouping pixels into coherent regions. Recent methods have reached a plateau in terms of segmentation accuracy by generating noisy superpixel shapes. Moreover, most existing approaches produce a single fixed-scale partition that limits their use in vision pipelines that would benefit multi-scale representations. In this work, we introduce H-SPAM (Hierarchical Superpixel Anything Model), a unified framework for generating accurate, regular, and perfectly nested hierarchical superpixels. Starting from a fine partition, guided by deep features and external object priors, H-SPAM constructs the hierarchy through a two-phase region merging process that first preserves object consistency and then allows controlled inter-object grouping. The hierarchy can also be modulated using visual attention maps or user input to preserve important regions longer in the hierarchy. Experiments on standard benchmarks show that H-SPAM strongly outperforms existing hierarchical methods in both accuracy and regularity, while performing on par with most recent state-of-the-art non-hierarchical methods. Code and pretrained models are available: https://github.com/waldo-j/hspam.

Keywords: hierarchical superpixels, image segmentation, region merging, deep features, object priors, multi-scale representation, computer vision, pretrained models

238. ❌ 3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

Authors: Stefan Schulz, Fernando Edelstein, Hannah Dröge, Matthias B. Hullin, Markus Plack | Journal/Source: arxiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11211v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

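Scoring rationale: 3DTV focuses on real-time view synthesis in computer vision, using a lightweight approach that combines geometry with learning for sparse-view interpolation, with applications in AR/VR and telepresence. All scoring keywords relate to large language models, the principles of deep learning techniques, or scientific AI applications, whereas this paper belongs to traditional computer vision/graphics and involves no large-model technique, training method, inference optimization, or scientific AI application. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes 3DTV, a feedforward network for real-time sparse-view interpolation. By combining lightweight geometry with learning, it achieves high-quality, efficient free-viewpoint rendering without scene-specific optimization, suited to AR/VR and interactive applications.

Abstract (Chinese translation)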

实时自由视点渲染需要在多相机冗余性与交互应用的延迟约束之间取得平衡。为解决这一挑战，我们将轻量级几何与学习方法相结合，提出了3DTV——一种用于实时稀疏视点插值的前馈网络。基于Delaunay三角剖分的三元组选择机制确保每个目标视角具备充分的角覆盖范围。在此基础上，我们引入了位姿感知深度模块，该模块通过从粗到细的深度金字塔估计，实现高效的特征重投影与遮挡感知融合。与需要针对特定场景进行优化的方法不同，3DTV以前馈方式运行且无需重新训练，使其在增强现实/虚拟现实（AR/VR）、远程呈现和交互应用中具备实用性。我们在具有挑战性的多视角视频数据集上的实验表明，3DTV始终在质量与效率间实现良好平衡，性能优于近期的实时新视角合成基线方法。关键的是，3DTV避免了显式代理几何的使用，从而能够在多样化场景中实现鲁棒渲染。这使其成为低延迟多视角流传输与交互渲染的实用解决方案。项目页面：https://stefanmschulz.github.io/3DTV_webpage/

Abstract

Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this challenge by combining lightweight geometry with learning and propose 3DTV, a feedforward network for real-time sparse-view interpolation. A Delaunay-based triplet selection ensures angular coverage for each target view. Building on this, we introduce a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending. Unlike methods that require scene-specific optimization, 3DTV runs feedforward without retraining, making it practical for AR/VR, telepresence, and interactive applications. Our experiments on challenging multi-view video datasets demonstrate that 3DTV consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines. Crucially, 3DTV avoids explicit proxies, enabling robust rendering across diverse scenes. This makes it a practical solution for low-latency multi-view streaming and interactive rendering. Project Page: https://stefanmschulz.github.io/3DTV_webpage/

Keywords: real-time view synthesis, feedforward network, sparse-view interpolation, depth pyramid, occlusion-aware blending, multi-view video, AR/VR, interactive rendering

239. ❌ LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment: Methods and Results

Authors: Xin Li, Daoli Xu, Wei Luo, Guoqiang Xiang, Haoran Li, Chengyu Zhuang, Zhibo Chen, Jian Guan, Weping Li, Weixia Zhang, Wei Sun, Zhihua Wang, Dandan Zhu, Chengguang Zhu, Ayush Gupta, Rachit Agarwal, Shouvik Das, Biplab Ch Das, Amartya Ghosh, Kanglong Fan, Wen Wen, Shuyan Zhai, Tianwu Zhi, Aoxiang Zhang, Jianzhao Liu, Yabin Zhang, Jiajun Wang, Yipeng Sun, Kaiwei Lian, Banghao Yin | Journal/Source: arxiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11207v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

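Scoring rationale: The paper is a challenge report on human-oriented semantic image quality assessment (SeIQA). It focuses on image quality assessment in computer vision, covering dataset construction, benchmarking, and competition results. All scoring keywords relate to large models, the principles of deep learning techniques, or AI for Science, and this paper touches on none of them, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper presents the LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment. By building the SeIQA dataset and running a competition, it establishes a new benchmark for semantic image quality assessment and reports the state-of-the-art performance that participating teams achieved on the dataset.

Abstract (Chinese translation)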

本文综述了面向人类语义图像质量评估的LoViF 2026挑战赛。该挑战旨在提出一个新方向，即如何从人类视角评估语义信息的损失，以推动语义编码、语义处理及面向语义的优化等新兴领域的发展。与现有质量评估数据集不同，我们构建了一个面向人类的语义质量评估数据集，称为SeIQA数据集。该数据集为本次竞赛划分为三部分：（一）训练数据：510对退化图像及其对应的真实参考图像；（二）验证数据：80对退化图像及其对应的真实参考图像；（三）测试数据：160对退化图像及其对应的真实参考图像。本挑战赛的主要目标是为面向人类的语义图像质量评估建立一个全新且强有力的基准。本次竞赛共有58支队伍注册，其中6支队伍在最终测试阶段提交了有效解决方案与技术报告。这些提交方案在SeIQA数据集上实现了最先进的性能表现。

Abstract

This paper reviews the LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment. This challenge aims to raise a new direction, i.e., how to evaluate the loss of semantic information from the human perspective, intending to promote the development of some new directions, like semantic coding, processing, and semantic-oriented optimization, etc. Unlike existing datasets of quality assessment, we form a dataset of human-oriented semantic quality assessment, termed the SeIQA dataset. This dataset is divided into three parts for this competition: (i) training data: 510 pairs of degraded images and their corresponding ground truth references; (ii) validation data: 80 pairs of degraded images and their corresponding ground-truth references; (iii) testing data: 160 pairs of degraded images and their corresponding ground-truth references. The primary objective of this challenge is to establish a new and powerful benchmark for human-oriented semantic image quality assessment. There are a total of 58 teams registered in this competition, and 6 teams submitted valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the SeIQA dataset.

Keywords: Human-oriented Semantic Image Quality Assessment, SeIQA dataset, Semantic coding, Semantic processing, Benchmark, Challenge, State-of-the-art performance, Image quality evaluation

240. ❌ MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration

Authors: Jiahui Peng, He Yao, Jingwen Li, Yanzhou Su, Sibo Ju, Yujie Lu, Jin Ye, Hongchun Lu, Xue Li, Lincheng Jiang, Min Zhu, Junlong Cheng | Journal/Source: arxiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11197v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

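Scoring rationale: MedP-CLIP focuses on vision-language models (VLMs) for the medical domain; its core contributions are a region-aware prompt integration mechanism and the integration of medical prior knowledge. It is highly relevant (10 points) to 'Pre-training OR Continual Pre-training OR Domain Adaptation', because the paper explicitly describes pre-training on a large-scale medical dataset and involves domain adaptation (medical images). It is also highly relevant (10 points) to 'AI for Science OR Bioinformatics OR Cheminformatics', because it applies directly to bioinformatics/medical AI, a subfield of AI for Science. The other keywords mainly concern LLM-specific techniques (such as MoE, RLHF, and quantization), reasoning methods (such as CoT), or agent systems, which are unrelated to the core content of this medical VLM paper and therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes MedP-CLIP, a region-aware medical vision-language model that integrates medical prior knowledge and a feature-level region prompt mechanism. Pre-trained on a large-scale dataset of more than 6.4 million medical images, it significantly improves performance on medical tasks such as zero-shot recognition and interactive segmentation.

Abstract (Chinese translation)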

对比语言-图像预训练（CLIP）通过大规模文本-图像对齐，在全局图像理解和零样本迁移方面展现出卓越性能。然而，医学图像分析的核心往往在于对特定解剖结构或病灶区域的细粒度理解。因此，精确理解由医学专家或感知模型提供的感兴趣区域（RoI）信息变得至关重要。为满足这一需求，我们提出了MedP-CLIP——一个区域感知的医学视觉-语言模型（VLM）。该模型创新性地融合了医学先验知识，并设计了特征级区域提示整合机制，使其在聚焦局部区域时能灵活响应多种提示形式（如点、边界框、掩码），同时保持全局上下文感知能力。我们在精心构建的大规模数据集（包含超过640万张医学图像和9730万个区域级标注）上对模型进行预训练，使其具备跨疾病、跨模态的细粒度空间语义理解能力。实验表明，MedP-CLIP在多种医学任务（包括零样本识别、交互式分割及赋能多模态大语言模型）中均显著优于基线方法。该模型为医学人工智能提供了一个可扩展、即插即用的视觉主干网络，实现了整体图像理解与精准区域分析的结合。

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, the core of medical image analysis often lies in the fine-grained understanding of specific anatomical structures or lesion regions. Therefore, precisely comprehending region-of-interest (RoI) information provided by medical professionals or perception models becomes crucial. To address this need, we propose MedP-CLIP, a region-aware medical vision-language model (VLM). MedP-CLIP innovatively integrates medical prior knowledge and designs a feature-level region prompt integration mechanism, enabling it to flexibly respond to various prompt forms (e.g., points, bounding boxes, masks) while maintaining global contextual awareness when focusing on local regions. We pre-train the model on a meticulously constructed large-scale dataset (containing over 6.4 million medical images and 97.3 million region-level annotations), equipping it with cross-disease and cross-modality fine-grained spatial semantic understanding capabilities. Experiments demonstrate that MedP-CLIP significantly outperforms baseline methods in various medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. This model provides a scalable, plug-and-play visual backbone for medical AI, combining holistic image understanding with precise regional analysis.

Keywords: MedP-CLIP, medical vision-language model, region-aware, pre-training, medical image analysis, zero-shot transfer, fine-grained understanding, large-scale dataset

241. ❌ Towards Adaptive Open-Set Object Detection via Category-Level Collaboration Knowledge Mining

Authors: Yuqi Ji, Junjie Ke, Lihuo He, Lizhi Wang, Xinbo Gao | Journal/Source: arxiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.11195v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

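Scoring rationale: The paper focuses on adaptive open-set object detection (AOOD) in computer vision and proposes a method based on category-level collaboration knowledge mining, which uses a clustering-based memory bank and adaptive feature assignment to address cross-domain generalization and adaptation to novel categories. It is unrelated to most large-model and deep-learning keywords and has only some connection (5 points) to 'Pre-training OR Continual Pre-training OR Domain Adaptation', because it involves cross-domain adaptation and feature transfer, though the rationale does not treat that as the paper's core content. No other keyword is covered, so each scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes an adaptive open-set object detection method based on category-level collaboration knowledge mining. Using a clustering-based memory bank and an adaptive feature assignment strategy, it improves on existing methods by 1.1-5.5 mAP across multiple benchmarks.

Abstract (Chinese translation)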

现有目标检测器在适应新兴未知类别时常难以实现跨域泛化。自适应开放集目标检测通过利用源域基类数据进行训练，并在无目标域标注的情况下同时适应目标域的基类与未知类别，以应对这一挑战。然而，当前AOOD方法仍受限于较弱的跨域表征能力、未知类别间的模糊性以及源域特征偏差。为解决这些问题，我们提出一种类别级协作知识挖掘策略，该策略利用跨域的类间与类内关联。具体而言，我们构建基于聚类的记忆库，用于编码类别原型、辅助特征及类内差异信息，并通过无监督聚类迭代更新以增强类别级知识表征。我们进一步设计基类到未知类的选择度量机制，用于发掘与未知类别相关的源域特征，并以此初始化未知类别分类器。此外，自适应特征分配策略将习得的类别级知识迁移至目标域，并通过异步更新记忆库以缓解源域偏差。在多个基准数据集上的大量实验表明，本方法以1.1-5.5 mAP的显著优势持续超越现有最先进的AOOD方法。

Abstract

Existing object detectors often struggle to generalize across domains while adapting to emerging novel categories. Adaptive open-set object detection (AOOD) addresses this challenge by training on base categories in the source domain and adapting to both base and novel categories in the target domain without target annotations. However, current AOOD methods remain limited by weak cross-domain representations, ambiguity among novel categories, and source-domain feature bias. To address these issues, we propose a category-level collaboration knowledge mining strategy that exploits both inter-class and intra-class relationships across domains. Specifically, we construct a clustering-based memory bank to encode class prototypes, auxiliary features, and intra-class disparity information, and iteratively update it via unsupervised clustering to enhance category-level knowledge representation. We further design a base-to-novel selection metric to discover source-domain features related to novel categories and use them to initialize novel-category classifiers. In addition, an adaptive feature assignment strategy transfers the learned category-level knowledge to the target domain and asynchronously updates the memory bank to alleviate source-domain bias. Extensive experiments on multiple benchmarks show that our method consistently surpasses state-of-the-art AOOD methods by 1.1-5.5 mAP.

Keywords: Adaptive Open-Set Object Detection, AOOD, Category-Level Collaboration, Knowledge Mining, Cross-Domain Adaptation, Clustering Memory Bank, Novel Category Discovery, Feature Transfer

242. ❌ Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

Authors: Shivam Sharma, Sankalp Nagaonkar, Ashish Choithani, Ashutosh Trivedi Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11177v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

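Scoring rationale: The paper studies how internal reasoning traces (thought streams) in Gemini vision-language models affect video scene understanding; its core is the analysis of the reasoning process. Highly relevant keywords: 'Large Language Models' (it studies Gemini models), 'Chain of Thought' (it studies reasoning traces directly), and 'System 2 Thinking' (it analyzes in-depth reasoning). Moderately relevant: 'Hallucination Mitigation' (it identifies compression-step hallucination) and 'Mechanistic Interpretability' (it analyzes what the models think about). Other keywords, such as MoE, SLMs, training methods, and agent systems, are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper studies how internal reasoning traces (thought streams) in Gemini vision-language models affect video scene understanding. It finds that quality gains plateau after the first few hundred tokens and identifies a compression-step hallucination phenomenon.

Abstract (Chinese translation)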

我们对内部推理轨迹（称为思维流）如何影响视觉语言模型的视频场景理解进行了基准测试。使用谷歌Gemini 2.5 Flash和Flash Lite的四种配置，对从100小时视频中提取的场景进行分析，我们提出了三个问题：更多思考是否会产生更好的输出？增益在何处停止？以及这些模型实际在思考什么？我们引入了三项评估指标。内容充实度衡量思维流中有多少是有用的场景内容而非元评论。思维-最终覆盖度衡量思维流转化为最终输出的忠实程度。主导实体分析识别模型关注的主体、动作和场景。GPT-5作为独立评判者。我们发现，额外思考带来的质量增益很快达到平台期，大部分改进发生在前几百个标记内。Flash Lite在质量与标记使用量之间提供了最佳平衡。严格的推理预算会导致模型在最终输出中添加其从未推理过的内容，这是一种压缩步骤幻觉。尽管属于不同模型层级，Flash和Flash Lite产生的思维流相似，但风格不同：Flash会讨论其推理过程，而Lite则侧重于描述场景。

Abstract

We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google’s Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.

Keywords: thought streams, reasoning traces, video scene understanding, vision-language models, Gemini, hallucination, model evaluation, content analysis

243. ❌ Precision Synthesis of Multi-Tracer PET via VLM-Modulated Rectified Flow for Stratifying Mild Cognitive Impairment

Authors: Tuo Liu, Shuijin Lin, Shaozhen Yan, Haifeng Wang, Jie Lu, Jianhua Ma, Chunfeng Lian Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11176v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

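Scoring rationale: The paper addresses medical image generation (PET synthesis) and Alzheimer's disease diagnosis, which falls under AI for Science (bioinformatics / medical image analysis), so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It uses a domain-adapted vision-language model (BiomedCLIP) and a generative model (rectified flow), which gives it some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points) because domain adaptation is involved. The remaining keywords concern LLM techniques such as reasoning, alignment, and optimization and have no direct connection to the paper's medical image generation and diagnosis application, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes DIReCT++, which combines a domain-adapted vision-language model with a rectified flow generative model to synthesize multi-tracer PET images from MRI. This enables precise stratification of mild cognitive impairment and provides a scalable tool for the early diagnosis of Alzheimer's disease.

Abstract (Chinese translation)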

阿尔茨海默病（AD）的生物学定义依赖于多模态神经影像学，然而正电子发射断层扫描（PET）的临床应用受到成本和辐射暴露的限制，阻碍了在临床前或前驱期的早期筛查。尽管生成模型通过从磁共振成像（MRI）合成PET图像提供了一种有前景的替代方案，但实现针对特定受试者的精确度仍是主要挑战。本文提出DIReCT$++$，一种领域信息校正流模型，用于从MRI结合基础临床信息合成多示踪剂PET图像。我们的方法整合了三维校正流架构以捕捉复杂的跨模态和跨示踪剂关系，并采用领域适应的视觉-语言模型（BiomedCLIP），利用临床评分和影像学知识提供文本引导的个性化生成。在多中心数据集上的广泛评估表明，DIReCT$++$不仅能生成具有卓越保真度和泛化性的合成PET图像（$^{18}$F-AV-45和$^{18}$F-FDG），还能准确复现疾病特异性模式。关键在于，将这些合成的PET图像与MRI结合，能够实现对轻度认知障碍（MCI）的精确个性化分层，从而推动了一种可扩展、数据高效的工具，用于AD的早期诊断和预后预测。源代码将在https://github.com/ladderlab-xjtu/DIReCT-PLUS发布。

Abstract

The biological definition of Alzheimer’s disease (AD) relies on multi-modal neuroimaging, yet the clinical utility of positron emission tomography (PET) is limited by cost and radiation exposure, hindering early screening at preclinical or prodromal stages. While generative models offer a promising alternative by synthesizing PET from magnetic resonance imaging (MRI), achieving subject-specific precision remains a primary challenge. Here, we introduce DIReCT$++$, a Domain-Informed ReCTified flow model for synthesizing multi-tracer PET from MRI combined with fundamental clinical information. Our approach integrates a 3D rectified flow architecture to capture complex cross-modal and cross-tracer relationships with a domain-adapted vision-language model (BiomedCLIP) that provides text-guided, personalized generation using clinical scores and imaging knowledge. Extensive evaluations on multi-center datasets demonstrate that DIReCT$++$ not only produces synthetic PET images ($^{18}$F-AV-45 and $^{18}$F-FDG) of superior fidelity and generalizability but also accurately recapitulates disease-specific patterns. Crucially, combining these synthesized PET images with MRI enables precise personalized stratification of mild cognitive impairment (MCI), advancing a scalable, data-efficient tool for the early diagnosis and prognostic prediction of AD. The source code will be released on https://github.com/ladderlab-xjtu/DIReCT-PLUS.

Keywords: PET synthesis, Alzheimer’s disease, rectified flow, vision-language model, domain adaptation, mild cognitive impairment, medical imaging, generative models

244. ❌ NeuVolEx: Implicit Neural Features for Volume Exploration

Authors: Haill An, Suhyeon Kim, Donghyuk Choo, Younhyun Jung Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11172v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

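Scoring rationale: The paper presents NeuVolEx, a neural volume exploration method for direct volume rendering (DVR). It uses features from implicit neural representations (INRs) to explore volumetric data, covering image-based transfer function design and viewpoint recommendation. The scoring keywords all target large language models, their associated techniques, or AI-for-Science applications, whereas this paper belongs to computer graphics and visualization and addresses none of those topics, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes NeuVolEx, a neural volume exploration method that improves volume data exploration by using the feature representations learned while training an implicit neural representation. It achieves accurate region-of-interest classification under sparse user supervision and supports unsupervised clustering to identify complementary viewpoints.

Abstract (Chinese translation)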

直接体绘制（Direct Volume Rendering, DVR）旨在帮助用户识别和检查体数据中的感兴趣区域（Regions of Interest, ROIs），而支持有效ROI分类与聚类的特征表示在体数据探索中起着基础性作用。现有方法通常依赖于显式的局部特征表示或从原始体数据中学习到的隐式卷积特征表示。然而，显式局部特征表示在捕捉更广泛的几何模式与空间关联性方面存在局限，而隐式卷积特征表示在实际应用中（通常用户监督有限）未必能确保稳健的性能。与此同时，隐式神经表示（Implicit Neural Representations, INRs）因其能够紧凑地参数化连续体场，近期在面向体数据压缩的DVR中展现出巨大潜力。本文提出NeuVolEx，一种神经体数据探索方法，将INRs的作用扩展至体压缩之外。与先前仅关注INR输出的压缩方法不同，NeuVolEx利用INR训练过程中学习到的特征表示，作为体数据探索的稳健基础。为了更好地使这些特征表示适应探索任务，我们为基础INR增加了结构编码器与多任务学习方案，以提升ROI表征的空间一致性。我们在两个基础体数据探索任务上验证了NeuVolEx：基于图像的传递函数（Transfer Function, TF）设计与视点推荐。NeuVolEx能够在稀疏用户监督下实现基于图像的TF设计中的准确ROI分类，并支持无监督聚类以识别能够揭示不同ROI簇的紧凑互补视点。在多种模态和ROI复杂度的不同体数据集上的实验表明，NeuVolEx在效能与可用性上均优于现有方法。

Abstract

Direct volume rendering (DVR) aims to help users identify and examine regions of interest (ROIs) within volumetric data, and feature representations that support effective ROI classification and clustering play a fundamental role in volume exploration. Existing approaches typically rely on either explicit local feature representations or implicit convolutional feature representations learned from raw volumes. However, explicit local feature representations are limited in capturing broader geometric patterns and spatial correlations, while implicit convolutional feature representations do not necessarily ensure robust performance in practice, where user supervision is typically limited. Meanwhile, implicit neural representations (INRs) have recently shown strong promise in DVR for volume compression, owing to their ability to compactly parameterize continuous volumetric fields. In this work, we propose NeuVolEx, a neural volume exploration approach that extends the role of INRs beyond volume compression. Unlike prior compression methods that focus on INR outputs, NeuVolEx leverages feature representations learned during INR training as a robust basis for volume exploration. To better adapt these feature representations to exploration tasks, we augment a base INR with a structural encoder and a multi-task learning scheme that improve spatial coherence for ROI characterization. We validate NeuVolEx on two fundamental volume exploration tasks: image-based transfer function (TF) design and viewpoint recommendation. NeuVolEx enables accurate ROI classification under sparse user supervision for image-based TF design and supports unsupervised clustering to identify compact complementary viewpoints that reveal different ROI clusters. Experiments on diverse volume datasets with varying modalities and ROI complexities demonstrate NeuVolEx improves both effectiveness and usability over prior methods.

Keywords: Direct volume rendering, Implicit neural representations, Volume exploration, ROI classification, Transfer function design, Viewpoint recommendation, Multi-task learning, Spatial coherence

245. ❌ Development and evaluation of CADe systems in low-prevalence setting: The RARE25 challenge for early detection of Barrett’s neoplasia

Authors: Tim J. M. Jaspers, Francisco Caetano, Cris H. B. Claessens, Carolus H. J. Kusters, Rixta A. H. van Eijck van Heslinga, Floor Slooter, Jacques J. Bergman, Peter H. N. De With, Martijn R. Jong, Albert J. de Groof, Fons van der Sommen Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11171v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

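Scoring rationale: The paper focuses on computer-aided detection (CADe) systems for early neoplasia in Barrett's esophagus, which belongs to medical imaging AI. It covers dataset construction, model evaluation, and analysis of clinical applicability, but it does not involve large language models (LLMs), new deep-learning principles, training or optimization methods (such as MoE, SFT, RLHF, or PEFT), inference acceleration, agent systems, or large models applied to science. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the study applies AI to biomedicine (specifically medical image analysis). Since that is not the core contribution, it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

Through the RARE25 challenge, the paper evaluates computer-aided detection systems for early neoplasia in Barrett's esophagus in a low-prevalence setting. It finds that although several methods discriminate well, positive predictive values remain low, which highlights the difficulty of low-prevalence detection and the risk of overestimating clinical utility when prevalence is ignored.

Abstract (Chinese translation)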

巴雷特食管早期瘤变计算机辅助检测（CADe）是一个低患病率监测问题，其中临床相关发现极为罕见。尽管许多CADe系统在平衡或富集数据集上报告了强劲性能，但它们在真实患病率下的表现仍未得到充分表征。RARE25挑战赛通过引入一个大规模、考虑患病率的瘤变检测基准来填补这一空白。该基准包含一个公共训练集和一个反映真实世界发病率的隐藏测试集。评估方法采用强调高灵敏度并考虑患病率的特定操作点指标。来自七个国家的十一个团队提交了方案，这些方案采用了多样化架构、预训练、集成学习和校准策略。虽然部分方法取得了出色的判别性能，但阳性预测值仍然较低，这凸显了低患病率检测的困难，以及忽略患病率时可能高估临床实用性的风险。尽管正常样本占主导地位，所有方法仍依赖于全监督分类，这表明目前缺乏如异常检测或单类学习等与患病率无关的方法。通过发布公共数据集和可复现的评估框架，RARE25旨在支持开发对患病率变化具有鲁棒性且适用于临床监测工作流程的CADe系统。

Abstract

Computer-aided detection (CADe) of early neoplasia in Barrett’s esophagus is a low-prevalence surveillance problem in which clinically relevant findings are rare. Although many CADe systems report strong performance on balanced or enriched datasets, their behavior under realistic prevalence remains insufficiently characterized. The RARE25 challenge addresses this gap by introducing a large-scale, prevalence-aware benchmark for neoplasia detection. It includes a public training set and a hidden test set reflecting real-world incidence. Methods were evaluated using operating-point-specific metrics emphasizing high sensitivity and accounting for prevalence. Eleven teams from seven countries submitted approaches using diverse architectures, pretraining, ensembling, and calibration strategies. While several methods achieved strong discriminative performance, positive predictive values remained low, highlighting the difficulty of low-prevalence detection and the risk of overestimating clinical utility when prevalence is ignored. All methods relied on fully supervised classification despite the dominance of normal findings, indicating a lack of prevalence-agnostic approaches such as anomaly detection or one-class learning. By releasing a public dataset and a reproducible evaluation framework, RARE25 aims to support the development of CADe systems robust to prevalence shift and suitable for clinical surveillance workflows.

Keywords: Computer-aided detection, Barrett’s esophagus, neoplasia detection, low-prevalence setting, RARE25 challenge, clinical surveillance, prevalence-aware benchmark, positive predictive value
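
The abstract's observation that positive predictive value stays low at realistic prevalence follows from Bayes' rule. The sketch below is illustrative only: the sensitivity, specificity, and prevalence values are assumed numbers chosen for the example, not results reported by RARE25.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """PPV = TP / (TP + FP) for a population with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0.0:
        raise ValueError("no positive predictions: PPV is undefined")
    return true_pos / (true_pos + false_pos)


# Assumed operating point (hypothetical, not taken from the challenge).
SENSITIVITY = 0.90
SPECIFICITY = 0.90

if __name__ == "__main__":
    # Same detector, three assumed prevalence levels.
    for prevalence in (0.50, 0.10, 0.01):
        ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
        print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}")
```

Under these assumed numbers, the same detector has a PPV of 0.9 on a balanced set but only about 0.08 at 1% prevalence, which is why a benchmark that ignores prevalence can overstate clinical utility.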

246. ❌ Do Instance Priors Help Weakly Supervised Semantic Segmentation?

Authors: Anurag Das, Anna Kukleva, Xinting Hu, Yuki M. Asano, Bernt Schiele Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11170v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

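Scoring rationale: The paper studies weakly supervised semantic segmentation in computer vision. It uses the Segment Anything Model (SAM) as a foundation model, but SAM is a vision segmentation model rather than a language model. The paper deals with image segmentation, weakly supervised learning, and pseudo-label generation, which are unrelated to the scoring keywords (all of which target large language models and related techniques), so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the SeSAM framework, which combines the Segment Anything Model (SAM) with weak labels (such as coarse masks, scribbles, and points) to address the high cost of dense pixel-level annotation in semantic segmentation. It substantially improves weakly supervised semantic segmentation while reducing annotation cost.

Abstract (Chinese translation)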

语义分割需要密集的像素级标注，其获取成本高昂且耗时。为解决此问题，我们提出SeSAM框架，该框架利用基础分割模型——即Segment Anything Model（SAM）——结合粗掩码、涂鸦和点等弱标签进行工作。SAM最初设计用于基于实例的分割，无法直接应用于语义分割任务。在本研究中，我们识别了SAM面临的具体挑战，并确定了适配组件使其能够基于弱标签实现按类别分割。具体而言，SeSAM将类别掩码分解为连通组件，沿物体骨架采样点提示，通过弱标签覆盖范围筛选SAM掩码，并利用伪标签迭代优化标注，从而使SAM生成的掩码能有效用于语义分割。通过与半监督学习框架结合，SeSAM平衡了真实标签、基于SAM的伪标签以及高置信度伪标签，显著提升了分割质量。在多种基准数据集和弱标注类型上的大量实验表明，SeSAM始终优于弱监督基线方法，同时相较于精细标注大幅降低了标注成本。

Abstract

Semantic segmentation requires dense pixel-level annotations, which are costly and time-consuming to acquire. To address this, we present SeSAM, a framework that uses a foundational segmentation model, i.e. Segment Anything Model (SAM), with weak labels, including coarse masks, scribbles, and points. SAM, originally designed for instance-based segmentation, cannot be directly used for semantic segmentation tasks. In this work, we identify specific challenges faced by SAM and determine appropriate components to adapt it for class-based segmentation using weak labels. Specifically, SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks using weak-label coverage, and iteratively refines labels using pseudo-labels, enabling SAM-generated masks to be effectively used for semantic segmentation. Integrated with a semi-supervised learning framework, SeSAM balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, significantly improving segmentation quality. Extensive experiments across multiple benchmarks and weak annotation types show that SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.

Keywords: weakly supervised semantic segmentation, Segment Anything Model (SAM), pseudo-labels, annotation cost reduction, semi-supervised learning, instance priors, class-based segmentation, SeSAM framework
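
The abstract says SeSAM "selects SAM masks using weak-label coverage" but does not define the criterion. The sketch below shows one plausible reading, stated as an assumption: coverage is the fraction of weak-label pixels inside a candidate mask, candidates below a threshold are rejected, and ties go to the smaller mask. The names `coverage`, `select_mask`, and `min_coverage` are hypothetical and do not come from the paper.

```python
from typing import FrozenSet, Iterable, Optional, Tuple

Pixel = Tuple[int, int]  # (row, col)


def coverage(mask: FrozenSet[Pixel], weak_label: FrozenSet[Pixel]) -> float:
    """Fraction of weak-label pixels that fall inside the candidate mask."""
    if not weak_label:
        return 0.0
    return len(mask & weak_label) / len(weak_label)


def select_mask(candidates: Iterable[FrozenSet[Pixel]],
                weak_label: FrozenSet[Pixel],
                min_coverage: float = 0.5) -> Optional[FrozenSet[Pixel]]:
    """Return the candidate that covers the weak label best, or None.

    Ties are broken in favour of the smaller mask, so a mask that swallows
    the whole image does not win just because it contains everything.
    """
    best = None
    best_key = None
    for mask in candidates:
        cov = coverage(mask, weak_label)
        if cov < min_coverage:  # assumed rejection threshold
            continue
        key = (cov, -len(mask))
        if best_key is None or key > best_key:
            best, best_key = mask, key
    return best


if __name__ == "__main__":
    scribble = frozenset({(1, 1), (1, 2), (1, 3)})
    tight = frozenset({(1, 1), (1, 2), (1, 3), (2, 2)})
    loose = frozenset((r, c) for r in range(5) for c in range(5))
    partial = frozenset({(1, 1)})
    chosen = select_mask([partial, loose, tight], scribble)
    print("chose the tight mask:", chosen == tight)
```

The tie-break on mask size is a design choice for this sketch. A real pipeline might instead combine coverage with SAM's own mask confidence.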

247. ❌ RADA: Region-Aware Dual-encoder Auxiliary learning for Barely-supervised Medical Image Segmentation

Authors: Shuang Zeng, Boxu Xie, Lei Zhu, Xinliang Zhang, Jiakui Hu, Zhengjian Yao, Yuanwei Li, Yuxing Lu, Yanye Lu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11164v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

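Scoring rationale: The paper focuses on deep learning for medical image segmentation, specifically an auxiliary learning method for sparsely annotated settings. The keywords all relate directly to large-model principles, training methods, inference optimization, alignment, and agent systems, whereas this paper addresses a conventional computer vision segmentation task and does not involve large models or related techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because medical image segmentation is an application area of bioinformatics / AI for science. The paper does not apply large models in that area, so it receives 8 points (some relevance, but not core).

!!! tip deepseek-chat TL;DR

The paper proposes RADA, a region-aware dual-encoder auxiliary learning framework that addresses sparse annotation in medical image segmentation. It achieves state-of-the-art performance on multiple datasets while substantially reducing the annotation burden.

Abstract (Chinese translation)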

深度学习极大推动了医学图像分割的发展，但其成功严重依赖于全监督学习，而全监督学习需要密集标注，这对于三维容积扫描而言成本高昂且耗时。极稀疏监督学习通过每个容积仅使用少量已标注切片来减轻标注负担。现有方法通常通过几何连续性将稀疏标注传播至未标注切片以生成伪标签，但该策略缺乏语义理解，往往导致伪标签质量低下。此外，医学图像分割本质上是像素级的视觉理解任务，其准确性根本上取决于局部细粒度视觉特征的质量。受此启发，我们提出RADA——一种新颖的区域感知双编码器辅助学习流程，该方法引入基于Alpha-CLIP预训练的双编码器框架，从原始图像及有限标注中提取细粒度的区域特异性视觉特征。该框架将图像级细粒度视觉特征与文本级语义引导相结合，提供连接图像级语义与像素级分割的区域感知语义监督。RADA集成于三视图训练框架中，在LA2018、KiTS19和LiTS数据集上以极稀疏标注设置达到了最先进的性能，展现了跨多样数据集的强大泛化能力。

Abstract

Deep learning has greatly advanced medical image segmentation, but its success relies heavily on fully supervised learning, which requires dense annotations that are costly and time-consuming for 3D volumetric scans. Barely-supervised learning reduces annotation burden by using only a few labeled slices per volume. Existing methods typically propagate sparse annotations to unlabeled slices through geometric continuity to generate pseudo-labels, but this strategy lacks semantic understanding, often resulting in low-quality pseudo-labels. Furthermore, medical image segmentation is inherently a pixel-level visual understanding task, where accuracy fundamentally depends on the quality of local, fine-grained visual features. Inspired by this, we propose RADA, a novel Region-Aware Dual-encoder Auxiliary learning pipeline which introduces a dual-encoder framework pre-trained on Alpha-CLIP to extract fine-grained, region-specific visual features from the original images and limited annotations. The framework combines image-level fine-grained visual features with text-level semantic guidance, providing region-aware semantic supervision that bridges image-level semantics and pixel-level segmentation. Integrated into a triple-view training framework, RADA achieves SOTA performance under extremely sparse annotation settings on LA2018, KiTS19 and LiTS, demonstrating robust generalization across diverse datasets.

Keywords: Medical Image Segmentation, Barely-supervised Learning, Dual-encoder Framework, Region-aware Semantic Supervision, Sparse Annotations, Alpha-CLIP Pre-training, Triple-view Training, Pseudo-label Generation

248. ❌ Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks

Authors: Camile Lendering, Erkut Akdag, Egor Bondarev Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11162v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

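Scoring rationale: The paper studies industrial defect segmentation, using SAM to generate pseudo-labels and then training a student model. It is unrelated to most large-model keywords, but it is relevant to 'Self-Correction' (it uses an online self-correction mechanism to handle pseudo-label noise) and somewhat relevant to 'AI for Science' (an industrial inspection application). No other keyword applies.

!!! tip deepseek-chat TL;DR

The paper proposes the Boxes2Pixels framework, a noise-robust box-to-pixel distillation method that uses SAM as a noisy teacher to train a compact student model, substantially improving accuracy and recall in industrial defect segmentation.

Abstract (Chinese translation)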

精确的缺陷分割对于工业检测至关重要，但密集的像素级标注往往难以获取。一种常见的解决方案是利用基础分割模型（如Segment Anything Model，简称SAM）将廉价的边界框转换为伪掩码。然而，这些伪标签在工业表面存在系统性噪声，常常错误地生成背景结构，同时遗漏稀疏的缺陷。
为应对这一局限，本文提出了一种抗噪声的框到像素蒸馏框架Boxes2Pixels，该框架将SAM视为一个带噪声的教师模型，而非真实监督的来源。边界框通过SAM离线转换为伪掩码，随后训练一个紧凑的学生模型，其具备以下特点：（一）在冻结的DINOv2特征上使用分层解码器以保持语义稳定性；（二）引入辅助的二值定位头，将稀疏前景发现与类别预测解耦；（三）采用单侧在线自校正机制，当学生模型置信度高时放宽对背景的监督，以针对教师模型的假阴性错误。
在人工标注的风力涡轮机检测基准数据集上，所提出的Boxes2Pixels方法在相同的弱监督条件下，相比最强基线模型，将异常平均交并比（mIoU）提升了6.97，二值交并比（IoU）提升了9.71。此外，在线自校正机制使二值召回率提高了18.56，同时该模型使用的可训练参数减少了80%。代码公开于https://github.com/CLendering/Boxes2Pixels。

Abstract

Accurate defect segmentation is critical for industrial inspection, yet dense pixel-level annotations are rarely available. A common workaround is to convert inexpensive bounding boxes into pseudo-masks using foundation segmentation models such as the Segment Anything Model (SAM). However, these pseudo-labels are systematically noisy on industrial surfaces, often hallucinating background structure while missing sparse defects. To address this limitation, a noise-robust box-to-pixel distillation framework, Boxes2Pixels, is proposed that treats SAM as a noisy teacher rather than a source of ground-truth supervision. Bounding boxes are converted into pseudo-masks offline by SAM, and a compact student is trained with (i) a hierarchical decoder over frozen DINOv2 features for semantic stability, (ii) an auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (iii) a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, targeting teacher false negatives. On a manually annotated wind turbine inspection benchmark, the proposed Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Moreover, online self-correction increases the binary recall by +18.56, while the model employs 80% fewer trainable parameters. Code is available at https://github.com/CLendering/Boxes2Pixels.

Keywords: defect segmentation, industrial inspection, Segment Anything Model, noisy teacher, self-correction, weak supervision, wind turbine inspection, pseudo-masks
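
The one-sided online self-correction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `one_sided_bce`, the threshold `tau`, and the choice to drop (rather than down-weight) the relaxed pixels are all assumptions made for illustration.

```python
import math

def one_sided_bce(student_probs, teacher_mask, tau=0.9):
    """Mean binary cross-entropy with one-sided relaxation.

    Pixels the teacher labels as background (0) are skipped when the student
    predicts foreground with probability >= tau, because they may be teacher
    false negatives. Teacher foreground pixels are always supervised.
    The rule and the threshold are hypothetical.
    """
    eps = 1e-7
    total, count = 0.0, 0
    for p, t in zip(student_probs, teacher_mask):
        if t == 0 and p >= tau:
            continue  # relax background supervision for confident student pixels
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0

probs = [0.95, 0.10, 0.80, 0.20]   # student foreground probabilities
mask = [0, 0, 1, 1]                # noisy teacher pseudo-mask
relaxed = one_sided_bce(probs, mask, tau=0.9)
strict = one_sided_bce(probs, mask, tau=1.1)  # tau > 1 disables the relaxation
```

In this toy input the first pixel is a confident student detection that the teacher marked as background, so the relaxed loss no longer penalizes it.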

249. ❌ rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training

Authors: Tianyang Dai, Ming Chang, Yan Chen, Yang Hu Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11156v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

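Scoring rationale: The paper presents a video quality assessment framework for remote photoplethysmography (rPPG), a biomedical AI application. It explicitly uses a multimodal large language model (MLLM) for scene-level analysis, so it is moderately related to the “Large Language Models” keyword (5/10). As an application of AI to biomedical signal processing, it is also highly related to “AI for Science” (8/10). The remaining keywords (MoE, SLMs, scaling laws, training techniques, inference optimization, agent systems, and so on) are not addressed in the paper and receive 0.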
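
The score table above can be read with a small sketch of the scoring arithmetic. The passing-score formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report configuration. The per-keyword rule in `keyword_score` is an assumption, since the report does not state how weight and relevance are combined. Under that assumed rule, a weight of 0.0 yields a score of 0.0 whatever the relevance, which matches the rows shown.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword maximum from the report configuration

def passing_score(total_weight):
    # Formula stated in the report configuration.
    return 5.0 + 0.8 * total_weight

def keyword_score(weight, relevance):
    # Assumed rule (not stated in the report): scale relevance (0-10)
    # to the per-keyword maximum, then multiply by the keyword weight.
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD

threshold = passing_score(27.0)      # 27 keywords, each with weight 1.0
rows = [(0.0, 5.0), (0.0, 8.0)]      # (weight, relevance) pairs as shown in the table
total = sum(keyword_score(w, r) for w, r in rows)
```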

!!! tip deepseek-chat TL;DR

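The paper proposes rPPG-VQA, a framework that assesses how suitable a video is for remote photoplethysmography training through signal-level and scene-level analysis, and uses the resulting quality scores to filter training data, substantially improving the accuracy of unsupervised rPPG models.

Abstract (Chinese translation)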

无监督远程光电容积描记术（rPPG）有望利用未标记视频数据，但其潜力受到一个关键挑战的制约：在低质量“野外”视频上进行训练会严重降低模型性能。此过程中缺失的关键步骤是在将视频用于任务前，评估其对rPPG模型学习的适用性。现有的视频质量评估（VQA）方法主要针对人类感知设计，并不直接适用于上述目的。在本工作中，我们提出了rPPG-VQA，一个用于评估视频对rPPG适用性的新颖框架。我们整合了信号级和场景级分析，并设计了一个双分支评估架构。信号级分支通过采用多方法共识机制的稳健信噪比（SNR）估计来评估视频的生理信号质量，而场景级分支则利用多模态大语言模型（MLLM）来识别如运动和光照不稳定等干扰因素。此外，我们提出了一种两阶段自适应采样（TAS）策略，该策略利用质量评分来筛选出最优的训练数据集。实验表明，通过在我们框架筛选的大规模“野外”视频上进行训练，我们可以开发出无监督rPPG模型，这些模型在标准基准测试上的准确性实现了显著提升。我们的代码可在 https://github.com/Tianyang-Dai/rPPG-VQA 获取。

Abstract

Unsupervised remote photoplethysmography (rPPG) promises to leverage unlabeled video data, but its potential is hindered by a critical challenge: training on low-quality “in-the-wild” videos severely degrades model performance. An essential step missing here is to assess the suitability of the videos for rPPG model learning before using them for the task. Existing video quality assessment (VQA) methods are mainly designed for human perception and not directly applicable to the above purpose. In this work, we propose rPPG-VQA, a novel framework for assessing video suitability for rPPG. We integrate signal-level and scene-level analyses and design a dual-branch assessment architecture. The signal-level branch evaluates the physiological signal quality of the videos via robust signal-to-noise ratio (SNR) estimation with a multi-method consensus mechanism, and the scene-level branch uses a multimodal large language model (MLLM) to identify interferences like motion and unstable lighting. Furthermore, we propose a two-stage adaptive sampling (TAS) strategy that utilizes the quality score to curate optimal training datasets. Experiments show that by training on large-scale, “in-the-wild” videos filtered by our framework, we can develop unsupervised rPPG models that achieve a substantial improvement in accuracy on standard benchmarks. Our code is available at https://github.com/Tianyang-Dai/rPPG-VQA.

Keywords: remote photoplethysmography, video quality assessment, unsupervised training, multimodal large language model, signal-to-noise ratio, adaptive sampling, physiological signal, in-the-wild videos
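
As a rough illustration of the signal-level branch, the sketch below estimates a spectral SNR for a pulse-like trace. It is a simplified single-method estimate, not the paper's multi-method consensus mechanism. The function name `band_snr_db`, the 0.7 to 4.0 Hz band, and the 0.15 Hz half-width around the spectral peak are assumptions.

```python
import cmath
import math

def band_snr_db(signal, fs, band=(0.7, 4.0), half_width=0.15):
    """SNR in dB: power near the dominant in-band frequency vs. the rest of the band."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    freqs, power = [], []
    # Naive DFT, evaluated only for bins inside the assumed heart-rate band.
    for k in range(1, n // 2):
        f = k * fs / n
        if band[0] <= f <= band[1]:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(centered))
            freqs.append(f)
            power.append(abs(coeff) ** 2)
    peak_f = freqs[power.index(max(power))]
    sig = sum(p for f, p in zip(freqs, power) if abs(f - peak_f) <= half_width)
    noise = sum(p for f, p in zip(freqs, power) if abs(f - peak_f) > half_width)
    return 10.0 * math.log10(sig / max(noise, 1e-12))

fs = 30.0  # frames per second (assumed)
clean = [math.sin(2 * math.pi * 1.2 * i / fs) for i in range(300)]
noisy = [c + 0.8 * math.sin(2 * math.pi * 2.5 * i / fs) for i, c in enumerate(clean)]
snr_clean = band_snr_db(clean, fs)
snr_noisy = band_snr_db(noisy, fs)
```

A quality filter could then keep only videos whose estimated SNR exceeds a chosen cutoff. The cutoff would be another tunable assumption.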

250. ❌ Naka-GS: A Bionics-inspired Dual-Branch Naka Correction and Progressive Point Pruning for Low-Light 3DGS

Authors: Runyu Zhu, SiXun Dong, Zhiqiang Zhang, Qingxia Ye, Zhihua Xu Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11142v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

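Scoring rationale: The paper addresses 3D Gaussian Splatting (3DGS) reconstruction under low-light conditions and proposes a bionics-inspired dual-branch correction and progressive point pruning method. Its core lies in computer vision and 3D reconstruction, covering image enhancement, geometric initialization, and point cloud processing. The scoring keywords all target large language models (LLMs), innovations in deep learning methodology, or AI for Science applications, none of which this paper covers. It does not discuss large models, deep learning methodology as such, or AI applications in science (such as bioinformatics or cheminformatics), nor any of the specific techniques named in the keywords (MoE, RLHF, RAG, quantization, and so on). All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes NAKA-GS, a bionics-inspired framework that uses a Naka-guided chroma-correction network and a lightweight point preprocessing module to address degraded image quality and poor geometric initialization in low-light 3D Gaussian Splatting reconstruction, substantially improving restoration quality and optimization efficiency.

Abstract (Chinese translation)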

低光照条件会严重阻碍三维复原与重建，其通过降低图像可见度、引入色彩失真以及污染下游优化所需的几何先验来实现这一影响。我们提出了NAKA-GS，一种受仿生学启发的低光照三维高斯泼溅框架，能够联合提升光度复原与几何初始化质量。我们的方法始于一个Naka引导的色彩校正网络，该网络结合了物理先验的低光照增强、双分支输入建模、频率解耦校正以及掩码引导优化，以抑制亮区色彩伪影和边缘结构误差。增强后的图像随后被输入到一个前馈式多视图重建模型中，以生成稠密的场景先验。为了进一步改进高斯初始化，我们引入了一个轻量级的点云预处理模块（Point Preprocessing Module, PPM），该模块执行坐标对齐、体素池化以及距离自适应的渐进式剪枝，以在保留代表性结构的同时去除噪声点和冗余点。NAKA-GS在不引入沉重推理开销的情况下，提升了低光照三维重建的复原质量、训练稳定性和优化效率。所提出的方法在NTIRE三维复原与重建（3DRR）挑战赛中进行了展示，并以显著优势超越了基线方法。代码可在https://github.com/RunyuZhu/Naka-GS获取。

Abstract

Low-light conditions severely hinder 3D restoration and reconstruction by degrading image visibility, introducing color distortions, and contaminating geometric priors for downstream optimization. We present NAKA-GS, a bionics-inspired framework for low-light 3D Gaussian Splatting that jointly improves photometric restoration and geometric initialization. Our method starts with a Naka-guided chroma-correction network, which combines physics-prior low-light enhancement, dual-branch input modeling, frequency-decoupled correction, and mask-guided optimization to suppress bright-region chromatic artifacts and edge-structure errors. The enhanced images are then fed into a feed-forward multi-view reconstruction model to produce dense scene priors. To further improve Gaussian initialization, we introduce a lightweight Point Preprocessing Module (PPM) that performs coordinate alignment, voxel pooling, and distance-adaptive progressive pruning to remove noisy and redundant points while preserving representative structures. Without introducing heavy inference overhead, NAKA-GS improves restoration quality, training stability, and optimization efficiency for low-light 3D reconstruction. The proposed method was presented in the NTIRE 3D Restoration and Reconstruction (3DRR) Challenge, and outperformed the baseline methods by a large margin. The code is available at https://github.com/RunyuZhu/Naka-GS

Keywords: low-light 3D reconstruction, 3D Gaussian Splatting, bionics-inspired framework, Naka correction, progressive point pruning, photometric restoration, geometric initialization, Point Preprocessing Module
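
The voxel pooling step named in the abstract can be sketched as follows. This is a generic centroid-per-voxel reduction, not the paper's Point Preprocessing Module. The function name `voxel_pool` and the fixed `voxel_size` are assumptions, and coordinate alignment and distance-adaptive pruning are left out.

```python
from collections import defaultdict

def voxel_pool(points, voxel_size):
    """Replace all points that fall into the same voxel with their centroid."""
    buckets = defaultdict(list)
    for x, y, z in points:
        # Integer voxel index along each axis.
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append((x, y, z))
    pooled = []
    for pts in buckets.values():
        n = len(pts)
        pooled.append(tuple(sum(axis) / n for axis in zip(*pts)))
    return pooled

cloud = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (1.5, 0.2, 0.0), (1.6, 0.4, 0.1)]
pooled = voxel_pool(cloud, voxel_size=1.0)  # four points collapse into two centroids
```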

251. ❌ Sparse Hypergraph-Enhanced Frame-Event Object Detection with Fine-Grained MoE

Authors: Wei Bao, Yuehan Wang, Tianhang Zhou, Siqi Li, Yue Gao Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11140v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

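Scoring rationale: The paper targets multimodal object detection in computer vision and proposes a framework that combines sparse hypergraphs with a fine-grained MoE. It is highly related only to the keyword “Mixture of Experts OR MoE OR Sparse Models” (10/10), because one of its core contributions is the Fine-Grained Mixture of Experts (FG-MoE) module and it exploits sparsity. The other keywords concern large language models, training methods, inference techniques, alignment, agents, and similar topics unrelated to the paper's visual detection focus, so they receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes Hyper-FEOD, a framework that uses sparse hypergraph-enhanced cross-modal fusion and a fine-grained mixture-of-experts module to address computational efficiency and feature fusion in RGB-event multimodal object detection, achieving a superior accuracy-efficiency trade-off on mainstream benchmarks.

Abstract (Chinese translation)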

将基于帧的RGB相机与事件流相融合，为在复杂动态条件下实现鲁棒的目标检测提供了一种前景广阔的解决方案。然而，这两种模态固有的异构性与数据冗余性常导致难以承受的计算开销或次优的特征融合效果。本文提出Hyper-FEOD，一个高性能、高效率的检测框架，它通过两个核心组件协同优化多模态交互。首先，我们引入稀疏超图增强跨模态融合模块，该模块利用事件流固有的稀疏性构建事件引导的活动图。通过仅对选定的运动关键稀疏令牌进行高阶超图建模，S-HCF模块能够捕捉RGB与事件数据之间复杂的非局部依赖关系，同时克服传统超图计算在复杂度上的瓶颈。其次，我们设计了一个细粒度专家混合增强模块，以应对不同图像区域多样化的语义需求。该模块采用专为物体边界、内部纹理和背景定制的超图专家，利用像素级空间门控机制自适应地路由并增强特征。结合负载均衡损失与零初始化策略，FG-MoE模块确保了训练的稳定性与特征的精准优化，且不干扰预训练主干网络的特征分布。在主流RGB-Event基准测试上的实验结果表明，Hyper-FEOD实现了卓越的精度-效率权衡，其性能优于现有先进方法，同时保持了适用于实时边缘部署的轻量化特性。

Abstract

Integrating frame-based RGB cameras with event streams offers a promising solution for robust object detection under challenging dynamic conditions. However, the inherent heterogeneity and data redundancy of these modalities often lead to prohibitive computational overhead or suboptimal feature fusion. In this paper, we propose Hyper-FEOD, a high-performance and efficient detection framework, which synergistically optimizes multi-modal interaction through two core components. First, we introduce Sparse Hypergraph-enhanced Cross-Modal Fusion (S-HCF), which leverages the inherent sparsity of event streams to construct an event-guided activity map. By performing high-order hypergraph modeling exclusively on selected motion-critical sparse tokens, S-HCF captures complex non-local dependencies between RGB and event data while overcoming the traditional complexity bottlenecks of hypergraph computation. Second, we design a Fine-Grained Mixture of Experts (FG-MoE) Enhancement module to address the diverse semantic requirements of different image regions. This module employs specialized hypergraph experts tailored for object boundaries, internal textures, and backgrounds, utilizing a pixel-level spatial gating mechanism to adaptively route and enhance features. Combined with a load-balancing loss and zero-initialization strategy, FG-MoE ensures stable training and precise feature refinement without disrupting the pre-trained backbone’s distribution. Experimental results on mainstream RGB-Event benchmarks demonstrate that Hyper-FEOD achieves a superior accuracy-efficiency trade-off, outperforming state-of-the-art methods while maintaining a lightweight footprint suitable for real-time edge deployment.

Keywords: object detection, RGB-Event fusion, sparse hypergraph, Mixture of Experts, multi-modal interaction, computational efficiency, real-time edge deployment, feature refinement
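
Pixel-level gating with a load-balancing term can be sketched as below. The abstract does not give the form of FG-MoE's loss, so this sketch substitutes the auxiliary loss popularized by Switch Transformer (number of experts times the sum over experts of routed fraction times mean gate probability) together with top-1 routing. The function names and the three-expert toy input are assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route_pixels(gate_logits):
    """gate_logits holds one list of expert logits per pixel.

    Returns the top-1 expert index per pixel and the auxiliary
    load-balancing loss (1.0 when routing is perfectly uniform).
    """
    probs = [softmax(row) for row in gate_logits]
    num_experts = len(gate_logits[0])
    n = len(probs)
    assign = [p.index(max(p)) for p in probs]
    frac = [assign.count(e) / n for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in probs) / n for e in range(num_experts)]
    loss = num_experts * sum(f * m for f, m in zip(frac, mean_prob))
    return assign, loss

# Hypothetical experts: 0 = boundary, 1 = texture, 2 = background.
balanced_assign, balanced_loss = route_pixels([[2.0, 0.0, 0.0],
                                               [0.0, 2.0, 0.0],
                                               [0.0, 0.0, 2.0]])
collapsed_assign, collapsed_loss = route_pixels([[2.0, 0.0, 0.0]] * 3)
```

The loss grows when most pixels are routed to one expert, which is the failure mode a load-balancing term is meant to discourage.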

252. ❌ ViserDex: Visual Sim-to-Real for Robust Dexterous In-hand Reorientation

Authors: Arjun Bhardwaj, Maximum Wilder-Smith, Mayank Mittal, Vaishakh Patil, Marco Hutter Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11138v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

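Scoring rationale: The paper addresses dexterous robotic hand manipulation, using 3D Gaussian Splatting (3DGS) for visual sim-to-real transfer and training the control policy with reinforcement learning. Its content centers entirely on robot perception and control and does not involve large language models (LLMs), innovations in deep learning methodology, or AI applications in science. The scoring keywords all concern large models, deep learning techniques, or AI for Science, whereas this work studies robot vision and control systems, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a 3D Gaussian Splatting-based visual sim-to-real framework for monocular RGB in-hand object reorientation. Through domain randomization in the Gaussian representation space and curriculum-based reinforcement learning, it yields a robust reorientation system that is trained on consumer-grade hardware and validated on a physical multi-fingered hand.

Abstract (Chinese translation)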

手内物体重定向需要精确估计物体姿态以应对复杂的任务动态。虽然RGB传感为姿态跟踪提供了丰富的语义线索，但现有解决方案依赖于多相机设置或昂贵的光线追踪技术。本文提出一种用于单目RGB手内重定向的仿真到现实框架，该框架集成3D高斯溅射（3D Gaussian Splatting, 3DGS）以弥合视觉仿真到现实的差距。我们的核心洞见在于高斯表示空间中进行域随机化：通过对3D高斯施加物理一致、预渲染的数据增强，我们为物体姿态估计生成具有照片级真实感的随机化视觉数据。操作策略通过基于课程学习的强化结合师生蒸馏进行训练，从而实现对复杂行为的高效学习。值得注意的是，感知与控制模型均可独立在消费级硬件上训练，无需大型计算集群。实验表明，在具有挑战性的视觉环境中，使用3DGS数据训练的姿态估计器性能优于传统渲染数据训练的模型。我们在配备RGB相机的实体多指手上验证了该系统，即使在挑战性光照条件下也能实现对五种不同物体的鲁棒重定向。我们的研究结果凸显了高斯溅射技术作为纯RGB灵巧操作的实用路径。硬件部署视频及补充材料请访问项目网站：https://rffr.leggedrobotics.com/works/viserdex/

Abstract

In-hand object reorientation requires precise estimation of the object pose to handle complex task dynamics. While RGB sensing offers rich semantic cues for pose tracking, existing solutions rely on multi-camera setups or costly ray tracing. We present a sim-to-real framework for monocular RGB in-hand reorientation that integrates 3D Gaussian Splatting (3DGS) to bridge the visual sim-to-real gap. Our key insight is performing domain randomization in the Gaussian representation space: by applying physically consistent, pre-rendering augmentations to 3D Gaussians, we generate photorealistic, randomized visual data for object pose estimation. The manipulation policy is trained using curriculum-based reinforcement learning with teacher-student distillation, enabling efficient learning of complex behaviors. Importantly, both perception and control models can be trained independently on consumer-grade hardware, eliminating the need for large compute clusters. Experiments show that the pose estimator trained with 3DGS data outperforms those trained using conventional rendering data in challenging visual environments. We validate the system on a physical multi-fingered hand equipped with an RGB camera, demonstrating robust reorientation of five diverse objects even under challenging lighting conditions. Our results highlight Gaussian splatting as a practical path for RGB-only dexterous manipulation. For videos of the hardware deployments and additional supplementary materials, please refer to the project website: https://rffr.leggedrobotics.com/works/viserdex/

Keywords: in-hand reorientation, 3D Gaussian Splatting, sim-to-real, monocular RGB, domain randomization, reinforcement learning, dexterous manipulation, pose estimation

253. ❌ Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

Authors: Yueying Li, Fengxiang Wang, Yan Li, Mingshuo Chen, Mengying Zhao, Long Lan Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11122v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

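Scoring rationale: The paper proposes DualComp, a framework for compressing visual tokens when multimodal large language models (MLLMs) process ultra-high-resolution remote sensing imagery. Core relevance: (1) it applies large models (MLLMs) to a scientific domain (Earth observation), which relates to “Large Language Models” and “AI for Science”; (2) it improves inference efficiency through token compression, which relates to “Quantization/Model Compression” and “Inference Acceleration”. Other keywords (such as MoE, SFT, and RAG) are not mentioned in the abstract and receive 0.

!!! tip deepseek-chat TL;DR

To address the efficiency bottleneck caused by the large number of visual tokens when multimodal large language models process ultra-high-resolution remote sensing imagery, the paper proposes DualComp, a task-adaptive dual-stream token compression framework that lowers computational cost while improving the accuracy of remote sensing interpretation.

Abstract (Chinese translation)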

多模态大语言模型（MLLMs）在地球观测领域展现出巨大潜力。然而，处理超高分辨率（Ultra-High-Resolution, UHR）图像时产生的大量视觉标记带来了极高的计算开销，严重制约了其推理效率。现有的视觉标记压缩方法主要采用静态且均匀的压缩策略，忽视了遥感解译任务中固有的“语义-几何二象性”。具体而言，对象语义任务关注目标的抽象语义，受益于对背景的激进剪枝；而场景几何任务则高度依赖于空间拓扑结构的完整性。为应对这一挑战，我们提出了DualComp——一种任务自适应的双流标记压缩框架。在轻量级预训练路由器的动态引导下，DualComp将特征处理解耦为两个专用通路。在对象语义流中，空间连续语义聚合器（Spatially-Contiguous Semantic Aggregator, SCSA）利用尺寸自适应聚类技术，在保护小目标的同时聚合冗余背景信息。在场景几何流中，指令引导结构恢复器（Instruction-Guided Structure Recoverer, IGSR）引入贪心路径追踪拓扑补全机制以重建空间骨架。在超高分辨率遥感基准XLRS-Bench上的实验表明，DualComp能够以极低计算成本实现高保真度的遥感解译，在效率与精度上同时获得提升。

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated immense potential in Earth observation. However, the massive visual tokens generated when processing Ultra-High-Resolution (UHR) imagery introduce prohibitive computational overhead, severely bottlenecking their inference efficiency. Existing visual token compression methods predominantly adopt static and uniform compression strategies, neglecting the inherent “Semantic-Geometric Duality” in remote sensing interpretation tasks. Specifically, object semantic tasks focus on the abstract semantics of objects and benefit from aggressive background pruning, whereas scene geometric tasks critically rely on the integrity of spatial topology. To address this challenge, we propose DualComp, a task-adaptive dual-stream token compression framework. Dynamically guided by a lightweight pre-trained router, DualComp decouples feature processing into two dedicated pathways. In the object semantic stream, the Spatially-Contiguous Semantic Aggregator (SCSA) utilizes size-adaptive clustering to aggregate redundant background while protecting small objects. In the scene geometric stream, the Instruction-Guided Structure Recoverer (IGSR) introduces a greedy path-tracing topology completion mechanism to reconstruct spatial skeletons. Experiments on the UHR remote sensing benchmark XLRS-Bench demonstrate that DualComp accomplishes high-fidelity remote sensing interpretation at an exceptionally low computational cost, achieving simultaneous improvements in both efficiency and accuracy.

Keywords: Multimodal Large Language Models, Visual Token Compression, Ultra-High-Resolution Remote Sensing, Semantic-Geometric Duality, Inference Efficiency, Task-Adaptive Framework, Earth Observation, Computational Overhead
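
To make the idea of visual token compression concrete, the sketch below shows a generic score-based scheme: keep the highest-scoring tokens and merge the rest into one averaged background token. It is not DualComp's SCSA or IGSR. The function name `compress_tokens`, the saliency scores, and the single merged token are assumptions.

```python
def compress_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens; average the remainder into one token."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])      # preserve the original spatial order
    rest_idx = order[keep:]
    out = [tokens[i] for i in kept_idx]
    if rest_idx:
        dim = len(tokens[0])
        merged = [sum(tokens[i][d] for i in rest_idx) / len(rest_idx)
                  for d in range(dim)]
        out.append(merged)               # one token summarizing the low-score background
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.9, 0.1, 0.8, 0.2]            # hypothetical per-token saliency
compressed = compress_tokens(tokens, scores, keep=2)
```

A uniform scheme like this is the kind of static strategy the abstract argues against for geometry-sensitive tasks, where discarding spatial layout can hurt.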

254. ❌ Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning

Authors: Linjie Li, Huiyu Xiao, Jiarui Cao, Zhenyu Wu, Yang Ji Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11112v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

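Scoring rationale: The paper studies knowledge distillation for class-incremental learning (CIL) and proposes a quantum-gated task-interaction knowledge distillation framework. Its core is the use of pretrained models in incremental learning, so it is moderately related to “Pre-training OR Continual Pre-training OR Domain Adaptation” (5/10), since it involves adapting pretrained models across sequential tasks. It does not involve large language models (LLMs), MoE, small language models, scaling laws, alignment, reasoning, agents, compression, or scientific AI applications, so all other keywords receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a quantum-gated task-interaction knowledge distillation framework that addresses catastrophic forgetting caused by entangled multi-task subspaces when pretrained models are used for class-incremental learning, and reports state-of-the-art performance in experiments.

Abstract (Chinese translation)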

类增量学习旨在从连续任务流中持续积累知识，并构建覆盖所有已见类别的统一分类器。尽管预训练模型在类增量学习中展现出良好性能，但其仍受多任务子空间纠缠问题的困扰——当任务路由参数校准不佳或任务级表征被僵化固定时，会导致灾难性遗忘。为解决该问题，我们提出一种新颖的量子门控任务交互知识蒸馏框架，通过量子门控机制引导任务间知识迁移。具体而言，我们引入量子门控任务调制门机制来建模任务嵌入间的关联依赖关系，动态捕捉流式任务在联合训练与推理过程中的样本-任务相关性。在量子门控输出的引导下，我们基于这些任务嵌入级关联权重执行从旧适配器到新适配器的任务交互知识蒸馏，使模型能够弥合独立任务子空间之间的表征鸿沟。大量实验表明，该量子门控知识蒸馏框架能有效缓解遗忘问题，并取得最先进的性能表现。

Abstract

Class-incremental learning (CIL) aims to continuously accumulate knowledge from a stream of tasks and construct a unified classifier over all seen classes. Although pretrained models (PTMs) have shown promising performance in CIL, they still struggle with the entanglement of multi-task subspaces, leading to catastrophic forgetting when task routing parameters are poorly calibrated or task-level representations are rigidly fixed. To address this issue, we propose a novel Quantum-Gated Task-interaction Knowledge Distillation (QKD) framework that leverages quantum gating to guide inter-task knowledge transfer. Specifically, we introduce a quantum-gated task modulation mechanism to model the relational dependencies among task embeddings, dynamically capturing the sample-to-task relevance for both joint training and inference across streaming tasks. Guided by the quantum gating outputs, we perform task-interaction knowledge distillation from old to new adapters using these task-embedding-level correlation weights, enabling the model to bridge the representation gaps between independent task subspaces. Extensive experiments demonstrate that QKD effectively mitigates forgetting and achieves state-of-the-art performance.

Keywords: Class-incremental learning, Knowledge distillation, Pretrained models, Quantum gating, Task interaction, Catastrophic forgetting, Task modulation
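
Distillation from old adapters to a new adapter under per-task correlation weights can be sketched with a weighted KL divergence. This sketch uses plain classical weights in place of the paper's quantum gating outputs, which the abstract does not specify. The function names, the temperature, and the example logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    # KL(p || q) for discrete distributions with strictly positive q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def weighted_distill_loss(old_logits_per_task, new_logits, task_weights, temperature=2.0):
    """Sum over old tasks of weight * KL(old adapter output || new adapter output)."""
    q = softmax(new_logits, temperature)
    return sum(w * kl_div(softmax(old, temperature), q)
               for w, old in zip(task_weights, old_logits_per_task))

old_outputs = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # logits from two old adapters
weights = [0.7, 0.3]                              # stand-in for gating outputs
loss_mixed = weighted_distill_loss(old_outputs, [1.0, 2.0, 3.0], weights)
loss_same = weighted_distill_loss([[1.0, 2.0, 3.0]], [1.0, 2.0, 3.0], [1.0])
```

Only the second old adapter disagrees with the new one here, so the loss comes entirely from the 0.3-weighted term.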

255. ❌ Efficient Transceiver Design for Aerial Image Transmission and Large-scale Scene Reconstruction

Authors: Zeyi Ren, Jialin Dong, Wei Zuo, Yikun Wang, Bingyang Cheng, Sheng Zhou, Zhisheng Niu Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11098v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

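Scoring rationale: The paper sits at the intersection of wireless communication and 3D scene reconstruction. It proposes a deep learning-based end-to-end transceiver design that integrates 3D Gaussian Splatting (3DGS) into the training process. Its core concern is balancing communication efficiency against reconstruction quality, an application of deep learning to a specific engineering problem. The scoring keywords all relate to large language models (LLMs), and the paper involves no LLM technique, architecture, training method, or application. The only marginally related keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, because the paper applies deep learning to a scientific/engineering problem (3D scene reconstruction); this is not a core match, so it receives 5/10 (moderately related). The other keywords (LLMs, MoE, alignment, reasoning, agents, and so on) are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a deep learning-based end-to-end transceiver design that integrates 3D Gaussian Splatting into training, achieving efficient wireless image transmission and high-quality large-scale 3D scene reconstruction in low-altitude intelligent networks.

Abstract (Chinese translation)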

低空智能网络（LAIN）中的大规模三维场景重建对高效的无线图像传输提出了极高要求。然而，现有方案难以在严重的导频开销与维持重建保真度所需的传输精度之间取得平衡。为兼顾效率与可靠性，本文提出一种新颖的基于深度学习的端到端收发器设计，将三维高斯泼溅（3DGS）直接集成至训练过程中。通过结合3DGS渲染损失联合优化通信模块，我们的方法显式提升了场景恢复质量。此外，这一任务驱动框架支持采用稀疏导频方案，在显著降低传输开销的同时，于低空信道条件下保持稳健的图像恢复能力。基于真实航拍数据集的大量实验表明，所提出的端到端设计显著优于现有基线方案，实现了更优的传输性能与精确的三维场景重建。

Abstract

Large-scale three-dimensional (3D) scene reconstruction in low-altitude intelligent networks (LAIN) demands highly efficient wireless image transmission. However, existing schemes struggle to balance severe pilot overhead with the transmission accuracy required to maintain reconstruction fidelity. To strike a balance between efficiency and reliability, this paper proposes a novel deep learning-based end-to-end (E2E) transceiver design that integrates 3D Gaussian Splatting (3DGS) directly into the training process. By jointly optimizing the communication modules via the combined 3DGS rendering loss, our approach explicitly improves scene recovery quality. Furthermore, this task-driven framework enables the use of a sparse pilot scheme, significantly reducing transmission overhead while maintaining robust image recovery under low-altitude channel conditions. Extensive experiments on real-world aerial image datasets demonstrate that the proposed E2E design significantly outperforms existing baselines, delivering superior transmission performance and accurate 3D scene reconstructions.

Keywords: 3D scene reconstruction, wireless image transmission, deep learning, end-to-end transceiver, 3D Gaussian Splatting, low-altitude intelligent networks, sparse pilot scheme, task-driven framework

Authors: Rongjia Yu, Tong Jia, Hao Wang, Xiaofang Li, Xiao Yang, Zinuo Zhang, Cuiwei Liu Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11097v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

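Scoring rationale: The paper focuses on monocular depth estimation in computer vision and proposes a cross-modal diffusion model that incorporates polarization information. Its core techniques are diffusion models, variational autoencoders (VAEs), and multimodal fusion, which belong to computer vision and deep learning. All scoring keywords concern large language model (LLM) techniques, AI for Science applications, or innovations in the technical principles of large models, and the paper touches none of these areas. It neither uses nor discusses any large language model, AI for Science application, or technique on the scoring list, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes CDPR, a cross-modal diffusion model that incorporates polarization information to improve the robustness and accuracy of monocular depth estimation in difficult scenes such as textureless surfaces, transparent objects, and specular reflections.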

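Abstract translation

Monocular depth estimation is a fundamental yet challenging task in computer vision, and it is especially difficult under complex conditions such as textureless surfaces, transparent objects, and specular reflections. Recent diffusion-based methods have significantly improved performance by recasting depth prediction as a denoising process in latent space. However, existing methods rely only on RGB inputs, which often lack sufficient cues in challenging regions. In this work, we propose CDPR (Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation), a novel diffusion-based framework that integrates physics-based polarization priors to make estimation more robust. Specifically, we encode RGB and polarization (AoLP/DoLP) images into a shared latent space with a pre-trained variational autoencoder (VAE) and fuse the multimodal information dynamically through a learnable confidence-aware gating mechanism. The fusion module adaptively suppresses noisy signals in the polarization inputs while preserving informative cues, particularly near reflective or transparent surfaces, and supplies the fused latent representation for the subsequent monocular depth estimation. Beyond depth estimation, we further verify that the framework generalizes to surface normal prediction with minimal modification, demonstrating its extensibility to general polarization-guided dense prediction tasks. Experiments on synthetic and real-world datasets show that CDPR significantly outperforms RGB-only baselines in challenging regions while maintaining competitive performance in standard scenes.

Abstract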

Monocular depth estimation is a fundamental yet challenging task in computer vision, especially under complex conditions such as textureless surfaces, transparency, and specular reflections. Recent diffusion-based approaches have significantly advanced performance by reformulating depth prediction as a denoising process in the latent space. However, existing methods rely solely on RGB inputs, which often lack sufficient cues in challenging regions. In this work, we present CDPR - Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation - a novel diffusion-based framework that integrates physically grounded polarization priors to enhance estimation robustness. Specifically, we encode both RGB and polarization (AoLP/DoLP) images into a shared latent space via a pre-trained Variational Autoencoder (VAE), and dynamically fuse multi-modal information through a learnable confidence-aware gating mechanism. This fusion module adaptively suppresses noisy signals in polarization inputs while preserving informative cues, particularly around reflective or transparent surfaces, and provides the integrated latent representation for subsequent monocular depth estimation. Beyond depth estimation, we further verify that our framework can be easily generalized to surface normal prediction with minimal modification, showcasing its scalability to general polarization-guided dense prediction tasks. Experiments on both synthetic and real-world datasets validate that CDPR significantly outperforms RGB-only baselines in challenging regions while maintaining competitive performance in standard scenes.

Keywords: Monocular Depth Estimation, Diffusion Models, Polarization, Cross-modal Fusion, Variational Autoencoder, Confidence-aware Gating, Surface Normal Prediction, Dense Prediction

257. ❌ Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization

Authors: Jinsung Lee, Jaemin Oh, Namhun Kim, Dongwon Kim, Byung-Jun Yoon, Suha Kwak Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11089v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

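Scoring rationale: The paper focuses on image tokenization in computer vision and proposes a regularization method based on state-space models (SSMs) to improve the latent-space representations of vision models. Its core content covers image encoding, latent-space optimization, and generative (diffusion) models, all computer vision techniques; it does not involve large language models (LLMs), innovations in deep learning principles, or applications of large models in other domains. All scoring keywords relate to large language models, deep learning principles, or large-model applications, whereas the paper is pure computer vision research and has no connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes a new state-space-model-based regularization method for image tokenization that makes the latent space more compact and easier for generative models to handle, improving generation quality in diffusion models while preserving reconstruction fidelity.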

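Abstract translation

Image tokenizers are a core component of modern vision models, which typically operate in latent space. An ideal latent space must be both compact and generation-friendly: it should capture the essential content of an image compactly while remaining easy to model with generative methods. This work introduces a novel regularization method that aligns the latent space with these two goals. The core idea is to guide the tokenizer to mimic the hidden-state dynamics of state-space models (SSMs), thereby transferring their key property, frequency awareness, to the latent features. Building on a theoretical analysis of SSMs, the regularizer forces fine spatial structure and frequency-domain cues to be encoded into compact latent features, which makes more effective use of representational capacity and improves modelability for generative models. Experiments show that the method improves generation quality in diffusion models with only a minimal loss in reconstruction fidelity.

Abstract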

Image tokenizers are central to modern vision models as they often operate in latent spaces. An ideal latent space must be simultaneously compact and generation-friendly: it should capture image’s essential content compactly while remaining easy to model with generative approaches. In this work, we introduce a novel regularizer to align latent spaces with these two objectives. The key idea is to guide tokenizers to mimic the hidden state dynamics of state-space models (SSMs), thereby transferring their critical property, frequency awareness, to latent features. Grounded in a theoretical analysis of SSMs, our regularizer enforces encoding of fine spatial structures and frequency-domain cues into compact latent features; leading to more effective use of representation capacity and improved generative modelability. Experiments demonstrate that our method improves generation quality in diffusion models while incurring only minimal loss in reconstruction fidelity.

Keywords: image tokenization, latent space, state-space models, regularization, generative models, diffusion models, compact representation, frequency awareness

258. ❌ LDEPrompt: Layer-importance guided Dual Expandable Prompt Pool for Pre-trained Model-based Class-Incremental Learning

Authors: Linjie Li, Zhenyu Wu, Huiyu Xiao, Yang Ji Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11091v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

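Scoring rationale: The paper studies class-incremental learning based on pre-trained models and proposes a new prompt pool method (LDEPrompt). Relevance to the keywords is as follows: (1) It has some connection to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points), because it involves pre-trained models and incremental learning (a form of continual learning), but this is not its core. (2) It is highly relevant to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (8 points), because prompt-based learning is an important parameter-efficient fine-tuning (PEFT) method and the paper's core innovation, dynamic expansion and optimization of the prompt pool, falls within PEFT. (3) The remaining keywords (LLMs, MoE, SFT, RAG, and so on) are not addressed in the paper and receive 0 points. The paper does not mention any of the designated expert authors.

!!! tip deepseek-chat TL;DR

The paper addresses problems such as fixed prompt pools and manual prompt selection in class-incremental learning based on pre-trained models, and proposes a layer-importance guided dual expandable prompt pool (LDEPrompt) that achieves state-of-the-art performance on multiple benchmarks.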

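Abstract translation

Prompt-based class-incremental learning methods typically build a prompt pool made up of multiple trainable key-prompts and select the most suitable prompt embeddings through instance-level matching, an approach that has shown good results. However, existing methods have several limitations, including fixed prompt pools, manual selection of prompt embeddings, and a prompt selection process that depends heavily on the pre-trained backbone. To address these issues, we propose the Layer-importance guided Dual Expandable Prompt Pool (LDEPrompt), which enables adaptive layer selection and supports dynamic freezing and expansion of the prompt pool. Extensive experiments on widely used class-incremental learning benchmarks show that LDEPrompt achieves state-of-the-art performance, confirming its effectiveness and scalability.

Abstract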

Prompt-based class-incremental learning methods typically construct a prompt pool consisting of multiple trainable key-prompts and perform instance-level matching to select the most suitable prompt embeddings, which has shown promising results. However, existing approaches face several limitations, including fixed prompt pools, manual selection of prompt embeddings, and strong reliance on the pretrained backbone for prompt selection. To address these issues, we propose a Layer-importance guided Dual Expandable Prompt Pool (LDEPrompt), which enables adaptive layer selection as well as dynamic freezing and expansion of the prompt pool. Extensive experiments on widely used class-incremental learning benchmarks demonstrate that LDEPrompt achieves state-of-the-art performance, validating its effectiveness and scalability.

Keywords: class-incremental learning, prompt-based learning, pre-trained models, prompt pool, parameter-efficient fine-tuning, layer-importance guidance, dynamic expansion, state-of-the-art performance
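
The instance-level matching over a prompt pool that the abstract describes can be sketched roughly as follows. This is a generic key-query matching step of the kind used by prompt-pool methods, not LDEPrompt itself: the cosine similarity measure, the function names, and the idea that `query` comes from a frozen backbone are assumptions for illustration, and the layer-importance guidance and pool expansion are not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_prompts(query, keys, top_k):
    """Return the indices of the top_k pool keys most similar to the query feature."""
    ranked = sorted(range(len(keys)), key=lambda i: cosine(query, keys[i]), reverse=True)
    return ranked[:top_k]

# Hypothetical pool of four 2-D keys and one query feature.
pool_keys = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.9, 0.0]]
print(select_prompts([1.0, 0.0], pool_keys, top_k=2))
```

The selected indices would then pick the prompt embeddings attached to the input; with a fixed pool this lookup is all that changes between tasks, which is the limitation the paper says it targets.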

259. ❌ KL Divergence Between Gaussians: A Step-by-Step Derivation for the Variational Autoencoder Objective

Authors: Andrés Muñoz, Rodrigo Ramele Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11744v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

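Scoring rationale: The paper derives the closed-form expression for the KL divergence between Gaussian distributions and applies it to the mathematical foundations of the variational autoencoder (VAE). Its content is purely mathematical derivation and applied probability theory; it does not involve large models, deep learning principles, AI applications, or AI for Science. All scoring keywords relate to large-model techniques, training methods, inference optimization, AI applications, and similar topics, whereas this paper is a basic mathematical derivation and is unrelated to them.

!!! tip deepseek-chat TL;DR

The paper derives in detail the closed-form expression for the Kullback-Leibler divergence between Gaussian distributions and explains its regularizing role in training variational autoencoders.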

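Abstract translation

Kullback-Leibler (KL) divergence is a fundamental concept in information theory that quantifies the difference between two probability distributions. Within the framework of variational autoencoders (VAEs), it serves as a central regularization term that imposes structure on the latent space and thereby gives the model its generative capability. This paper derives in detail the closed-form expression for the KL divergence between Gaussian distributions, a case of particular importance in practical VAE implementations. Starting from the general definition for continuous random variables, we first derive the expression for the univariate case and then extend it to the multivariate case under the assumption of a diagonal covariance matrix. Finally, we discuss the mathematical meaning of each term in the resulting expression and its effect on the model's training dynamics.

Abstract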

Kullback-Leibler (KL) divergence is a fundamental concept in information theory that quantifies the discrepancy between two probability distributions. In the context of Variational Autoencoders (VAEs), it serves as a central regularization term, imposing structure on the latent space and thereby enabling the model to exhibit generative capabilities. In this work, we present a detailed derivation of the closed-form expression for the KL divergence between Gaussian distributions, a case of particular importance in practical VAE implementations. Starting from the general definition for continuous random variables, we derive the expression for the univariate case and extend it to the multivariate setting under the assumption of diagonal covariance. Finally, we discuss the interpretation of each term in the resulting expression and its impact on the training dynamics of the model.

Keywords: KL divergence, Gaussian distributions, Variational Autoencoder, VAE, latent space, regularization term, closed-form expression, training dynamics
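
For reference, the closed-form result this kind of derivation arrives at can be written in a few lines. The sketch below uses the standard textbook formulas for univariate Gaussians and for a diagonal Gaussian against a standard normal prior; the function names and the log-variance parameterization are assumptions for illustration and are not taken from the paper.

```python
import math

def kl_gaussian_1d(mu_q, var_q, mu_p=0.0, var_p=1.0):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for univariate Gaussians.

    Formula: 0.5 * [ log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1 ].
    """
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def kl_diag_gaussian_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    With a diagonal covariance the divergence is the sum of per-dimension terms
    0.5 * (mu^2 + sigma^2 - log sigma^2 - 1).
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

# A unit-variance Gaussian shifted by one standard deviation is 0.5 nats from N(0, 1).
print(kl_gaussian_1d(1.0, 1.0))
```

In the diagonal formula, the `mu^2` term pulls the means toward zero and the `sigma^2 - log sigma^2 - 1` term is minimized at unit variance, which is the usual reading of this term as a regularizer on the latent space.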

260. ❌ GPU Acceleration of Sparse Fully Homomorphic Encrypted DNNs

Authors: Lara D’Agata, Carlos Agulló-Domingo, Óscar Vera-López, Kaustubh Shivdikar, Ardhi W. B. Yudha, Ferhat Yaman, David Kaeli, José L. Abellán, Ian Colbert, José Cano Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11659v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

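Scoring rationale: The paper studies GPU acceleration of fully homomorphic encryption (FHE) for deep neural networks (DNNs), in particular the optimization of sparse matrix multiplication. Its relevance to the scoring keywords is limited: it has moderate relevance only to "Mixture of Experts OR MoE OR Sparse Models" (5 points, because it involves sparse-model techniques) and "Speculative Decoding OR Inference Acceleration" (5 points, because it involves GPU acceleration and performance optimization). The other keywords mainly target large language model (LLM) techniques, training, alignment, reasoning, and applications, whereas this paper focuses on hardware acceleration of FHE and DNNs and does not involve LLMs or related techniques, so most are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new method for accelerating sparse fully homomorphic encrypted DNN matrix multiplication on AMD GPUs; using the FIDESlib library, it reduces time complexity from cubic to semi-linear and delivers up to a 3.0× speedup over the CPU counterpart.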

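Abstract translation

Fully homomorphic encryption (FHE) has recently attracted wide attention both as a cryptographic primitive and as a systems challenge. Given recent advances in accelerated computing, FHE shows strong potential, with applications ranging from machine learning to information security. Taking a hardware perspective, we target matrix multiplication (matmul), the most computationally intensive operation in deep neural networks, and adapt it to run on AMD GPUs. We propose a new optimized method that improves the runtime and complexity of ciphertext matmul by using FIDESlib, a recent open-source FHE library designed for GPUs. By exploiting sparsity in both operands, our sparse matmul implementation outperforms its CPU counterpart by up to $3.0\times$ and reduces time complexity from cubic to semi-linear, an improvement over existing FHE matmul implementations.

Abstract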

Fully homomorphic encryption (FHE) has recently attracted significant attention as both a cryptographic primitive and a systems challenge. Given the latest advances in accelerated computing, FHE presents a promising opportunity for progress, with applications ranging from machine learning to information security. We target the most computationally intensive operation in deep neural networks from a hardware perspective, matrix multiplication (matmul), and adapt it for execution on AMD GPUs. We propose a new optimized method that improves the runtime and complexity of ciphertext matmul by using FIDESlib, a recent open-source FHE library designed specifically for GPUs. By exploiting sparsity in both operands, our sparse matmul implementation outperforms its CPU counterpart by up to $3.0\times$ and reduces the time complexity from cubic to semi-linear, demonstrating an improvement over existing FHE matmul implementations.

Keywords: Fully Homomorphic Encryption, GPU Acceleration, Sparse Matrix Multiplication, Deep Neural Networks, AMD GPUs, FIDESlib, Time Complexity Reduction, Performance Optimization
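
To make "exploiting sparsity in both operands" concrete, here is a plaintext sketch of a sparse-times-sparse product that only touches pairs of nonzero entries. It is not the paper's method and does not use FIDESlib or any encryption: the dict-of-dicts storage and the function name are assumptions for illustration, and in an FHE setting each multiply and add would be a far more expensive ciphertext operation.

```python
def sparse_matmul(a, b):
    """Multiply sparse matrices stored as {row: {col: value}}.

    Work is proportional to the number of matching nonzero pairs (a[i][k], b[k][j]),
    instead of the n^3 scalar products of a dense triple loop.
    """
    result = {}
    for i, a_row in a.items():
        for k, a_ik in a_row.items():
            b_row = b.get(k)
            if not b_row:
                continue  # row k of B is entirely zero, so a[i][k] contributes nothing
            out_row = result.setdefault(i, {})
            for j, b_kj in b_row.items():
                out_row[j] = out_row.get(j, 0) + a_ik * b_kj
    return result

# Hypothetical 3x3 operands with three and two nonzeros respectively.
a = {0: {0: 1, 2: 2}, 1: {1: 3}}
b = {0: {1: 4}, 2: {0: 5}}
print(sparse_matmul(a, b))
```

Skipping zero entries matters more under encryption than in plaintext, since every avoided product saves a ciphertext multiplication; how the paper detects or encodes the sparsity pattern is not stated in the abstract.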

261. ❌ Universality of first-order methods on random and deterministic matrices

Authors: Nicola Gorini, Chris Jones, Dmitriy Kunisky, Lucas Pesenti Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11729v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

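Scoring rationale: The paper analyzes the dynamics of general first-order methods (GFOM) on random and deterministic matrices, at the intersection of numerical linear algebra, random matrix theory, and optimization algorithms. It does not involve large models, deep learning, AI applications, or any technique among the scoring keywords (such as LLMs, MoE, training methods, inference acceleration, or AI alignment). All keywords relate to large-model techniques, whereas this paper is a purely mathematical theoretical analysis, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the asymptotic dynamics of general first-order methods on random and deterministic matrices, uses traffic distribution theory to analyze the limiting behavior of deterministic matrices such as Walsh-Hadamard, and designs a new approximate message passing algorithm that unifies earlier variants and generalizes to new input types.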

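Abstract translation

General first-order methods (GFOM) are a flexible class of iterative algorithms that update a state vector through matrix-vector multiplications and entrywise nonlinearities. A long line of research has sought to understand the dynamics of GFOM at large n, focusing mainly on "highly random" input matrices and on approximate message passing (AMP), a special case of GFOM whose state vector is asymptotically Gaussian. However, two questions have long remained open: how to construct iterative algorithms that preserve this Gaussianity on more structured inputs, and why existing AMP algorithms can be as effective on some deterministic matrices as on random ones.
We analyze diagrammatic expansions of GFOM through the limiting traffic distribution of the input matrix (the collection of limiting values of all permutation-invariant polynomials in the matrix entries) and obtain the following results:
We compute the traffic distribution of the first non-trivial deterministic matrices, including minor variants of the Walsh-Hadamard matrix and of the discrete sine and cosine transform matrices. This determines the limiting dynamics of GFOM on these inputs and resolves parts of long-standing conjectures of Marinari, Parisi, and Ritort (1994).
We design a new AMP iteration that unifies several earlier AMP variants and generalizes to new input types; its limiting dynamics are Gaussian conditional on certain latent random variables. These asymptotic dynamics hold for a large, natural class of traffic distributions (covering both random and deterministic input matrices), and the analysis of the algorithm yields a simple combinatorial interpretation of the Onsager correction, answering questions recently posed by Wang, Zhong, and Fan (2022).

Abstract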

General first-order methods (GFOM) are a flexible class of iterative algorithms which update a state vector by matrix-vector multiplications and entrywise nonlinearities. A long line of work has sought to understand the large-n dynamics of GFOM, mostly focusing on “very random” input matrices and the approximate message passing (AMP) special case of GFOM whose state is asymptotically Gaussian. Yet, it has long remained unknown how to construct iterative algorithms that retain this Gaussianity for more structured inputs, or why existing AMP algorithms can be as effective for some deterministic matrices as they are for random matrices. We analyze diagrammatic expansions of GFOM via the limiting traffic distribution of the input matrix, the collection of all limiting values of permutation-invariant polynomials in the matrix entries, to obtain the following results:
We calculate the traffic distribution for the first non-trivial deterministic matrices, including (minor variants of) the Walsh-Hadamard and discrete sine and cosine transform matrices. This determines the limiting dynamics of GFOM on these inputs, resolving parts of longstanding conjectures of Marinari, Parisi, and Ritort (1994).
We design a new AMP iteration which unifies several previous AMP variants and generalizes to new input types, whose limiting dynamics are Gaussian conditional on some latent random variables. The asymptotic dynamics hold for a large and natural class of traffic distributions (encompassing both random and deterministic input matrices) and the algorithm’s analysis gives a simple combinatorial interpretation of the Onsager correction, answering questions posed recently by Wang, Zhong, and Fan (2022).

Keywords: first-order methods, random matrices, deterministic matrices, traffic distribution, approximate message passing, asymptotic dynamics, Walsh-Hadamard matrices, Onsager correction
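
The basic GFOM step described in the abstract, a matrix-vector product followed by an entrywise nonlinearity, can be sketched on a deterministic Walsh-Hadamard input as below. This is a simplified illustration under stated assumptions: it uses the plain Sylvester construction rather than the "minor variants" the paper analyzes, each step depends only on the previous iterate, and the Onsager correction that distinguishes AMP is omitted. The function names are hypothetical.

```python
import math

def walsh_hadamard(k):
    """Sylvester construction of the 2^k x 2^k Walsh-Hadamard matrix, scaled to be orthogonal."""
    h = [[1.0]]
    for _ in range(k):
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(len(h))
    return [[v * scale for v in row] for row in h]

def gfom(matrix, x0, nonlinearity, steps):
    """Iterate x_{t+1} = f(A x_t): one matrix-vector product, then an entrywise nonlinearity."""
    x = list(x0)
    history = [x]
    for _ in range(steps):
        ax = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in matrix]
        x = [nonlinearity(v) for v in ax]
        history.append(x)
    return history

# Three tanh iterations on the 8 x 8 Walsh-Hadamard matrix from a fixed start vector.
trajectory = gfom(walsh_hadamard(3), [0.1 * i for i in range(8)], math.tanh, steps=3)
print(trajectory[-1])
```

The large-n analysis in the paper concerns how the entries of such iterates are distributed as the dimension grows; the sketch only shows the mechanics of the iteration at a fixed small size.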

262. ❌ Inter-Layer Hessian Analysis of Neural Networks with DAG Architectures

Authors: Maxim Bolshim, Alexander Kugaevskikh Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11639v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

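Scoring rationale: The paper focuses on the theoretical analysis of the Hessian of neural networks, especially networks with DAG architectures, and proposes a decomposition framework and diagnostic metrics. Its content relates to deep learning principles, but it concentrates on optimization theory, second-derivative analysis, and the mathematical properties of network architectures rather than on large-model applications or specific large-model techniques (such as LLM training, alignment, or inference optimization). The only relevant keyword is "Mechanistic Interpretability OR Explainable AI": by analyzing the structure of the Hessian, the paper reveals inter-layer interactions inside a neural network, which falls under model interpretability and mechanistic interpretability. Because this is not the paper's core content, it receives 5 points (some relevance). The other keywords are not addressed in the title or abstract and are unrelated to the paper's topic, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes an analytical framework that decomposes the Hessian of a neural network with an arbitrary DAG architecture into a Gauss-Newton component and a tensor component, and introduces diagnostic metrics that reveal structural curvature interactions between layers, providing a new theoretical tool for understanding network optimization behavior.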

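Abstract translation

Modern automatic differentiation frameworks (JAX, PyTorch) return the Hessian of the loss function as a single monolithic tensor and do not reveal the internal structure of inter-layer interactions. This paper proposes an analytical formalism that explicitly decomposes the full Hessian into blocks indexed by the directed acyclic graph (DAG) of an arbitrary architecture. The canonical decomposition $H = H^{GN} + H^T$ separates the Gauss-Newton component (the convex part) from the tensor component (the residual curvature that gives rise to saddle points). For piecewise-linear activations (ReLU), the tensor component of the input Hessian vanishes (almost everywhere, $H^{T}_{v,w} \equiv 0$ and $H^{f}_{v,w} = H^{GN}_{v,w} \succeq 0$), whereas the full parametric Hessian contains residual terms that cannot be reduced to the generalized Gauss-Newton matrix (GGN). Building on this decomposition, we introduce a set of diagnostic metrics (inter-layer resonance $\mathcal{R}$, geometric coupling $\mathcal{C}$, stable rank $\mathcal{D}$, GN-Gap) that can be estimated stochastically in $O(P)$ time and that reveal structural curvature interactions between layers. The theoretical analysis explains why resonance decays exponentially in plain networks and why it is preserved under skip connections; the empirical validation covers fully connected multilayer perceptrons (Experiments 1–5) and convolutional architectures (ResNet-18, about 11 million parameters, Experiment 6). When the architecture reduces to a single node, all definitions collapse to the standard Hessian $\nabla^2_\theta\mathcal{L}(\theta)\in\mathbb{R}^{p\times p}$.

Abstract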

Modern automatic differentiation frameworks (JAX, PyTorch) return the Hessian of the loss function as a monolithic tensor, without exposing the internal structure of inter-layer interactions. This paper presents an analytical formalism that explicitly decomposes the full Hessian into blocks indexed by the DAG of an arbitrary architecture. The canonical decomposition $H = H^{GN} + H^T$ separates the Gauss–Newton component (convex part) from the tensor component (residual curvature responsible for saddle points). For piecewise-linear activations (ReLU), the tensor component of the input Hessian vanishes ($H^{T}_{v,w} \equiv 0$ a.e., $H^{f}_{v,w} = H^{GN}_{v,w} \succeq 0$); the full parametric Hessian contains residual terms that do not reduce to the GGN. Building on this decomposition, we introduce diagnostic metrics (inter-layer resonance $\mathcal{R}$, geometric coupling $\mathcal{C}$, stable rank $\mathcal{D}$, GN-Gap) that are estimated stochastically in $O(P)$ time and reveal structural curvature interactions between layers. The theoretical analysis explains exponential decay of resonance in vanilla networks and its preservation under skip connections; empirical validation spans fully connected MLPs (Exp. 1–5) and convolutional architectures (ResNet-18, ${\sim}11$M parameters, Exp. 6). When the architecture reduces to a single node, all definitions collapse to the standard Hessian $\nabla^2_\theta\mathcal{L}(\theta)\in\mathbb{R}^{p\times p}$.

Keywords: Hessian analysis, neural networks, DAG architectures, Gauss-Newton decomposition, inter-layer interactions, curvature diagnostics, ReLU activations, parametric Hessian
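
As a small illustration of a Gauss-Newton-plus-residual split in the spirit of the $H = H^{GN} + H^T$ decomposition in the abstract, the sketch below uses a two-parameter scalar network $f(a, b) = a b x$ with squared loss. The toy model, the function names, and the identification of the residual term with the paper's tensor component are assumptions for illustration; the paper's DAG-indexed block formalism and its diagnostic metrics are not reproduced here.

```python
def loss(a, b, x, y):
    """Squared loss of a two-layer scalar linear network f(a, b) = a * b * x."""
    return 0.5 * (a * b * x - y) ** 2

def gauss_newton_part(a, b, x):
    """J^T J with J = [df/da, df/db] = [b*x, a*x]; always positive semidefinite."""
    j = [b * x, a * x]
    return [[j[0] * j[0], j[0] * j[1]], [j[1] * j[0], j[1] * j[1]]]

def residual_part(a, b, x, y):
    """Residual curvature r * Hessian(f), with r = f - y and Hessian(f) = [[0, x], [x, 0]]."""
    r = a * b * x - y
    return [[0.0, r * x], [r * x, 0.0]]

def full_hessian(a, b, x, y):
    """Hessian of the loss with respect to (a, b), assembled as the sum of both parts."""
    gn, res = gauss_newton_part(a, b, x), residual_part(a, b, x, y)
    return [[gn[i][j] + res[i][j] for j in range(2)] for i in range(2)]

print(gauss_newton_part(2.0, 3.0, 1.0))   # [[9.0, 6.0], [6.0, 4.0]]
print(full_hessian(2.0, 3.0, 1.0, 1.0))   # [[9.0, 11.0], [11.0, 4.0]]
```

In this toy case the residual term is purely off-diagonal, so it only affects the block that couples the two layers, and it turns a positive semidefinite Gauss-Newton matrix into an indefinite full Hessian (determinant 36 - 121 < 0), which matches the abstract's description of residual curvature as the source of saddle points.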

263. ❌ Computation of Least Trimmed Squares: A Branch-and-Bound framework with Hyperplane Arrangement Enhancements

Authors: Xiang Meng, Andrés Gómez, Rahul Mazumder Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11584v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

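Scoring rationale: The paper studies computational optimization of the penalized least trimmed squares (LTS) regression problem in robust statistics. It belongs to traditional optimization algorithms and statistical computing and has no direct connection to any keyword related to large models, deep learning, or AI applications. Its content centers on traditional computational methods such as mixed-integer optimization (MIO) formulations, branch-and-bound algorithms, and hyperplane arrangements, and it does not involve neural networks, language models, AI agents, or AI for Science applications.

!!! tip deepseek-chat TL;DR

The paper proposes a new mixed-integer optimization formulation and a branch-and-bound algorithm for efficiently computing penalized least trimmed squares regression, significantly improving exact robust regression on large-sample, low-dimensional data.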

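Abstract translation

This work studies computational aspects of a key problem in robust statistics, the penalized least trimmed squares (LTS) regression problem. Penalized LTS is a robust estimator that reduces the effect of outliers in the data by limiting the influence of large residuals. Although statistically attractive, the penalized LTS problem is NP-hard, and existing mixed-integer optimization (MIO) formulations scale poorly because of weak relaxations and worst-case complexity that is exponential in the number of observations. We propose a new MIO formulation that embeds hyperplane arrangement logic into a perspective reformulation and explicitly enforces structural properties of optimal solutions. We prove that, when the number of features is fixed, the resulting branch-and-bound tree has size polynomial in the sample size. In addition, we develop a tailored branch-and-bound algorithm that uses first-order methods with dual bounds to solve node relaxations efficiently. Computational experiments on synthetic and real datasets show substantial improvements over existing MIO approaches: on synthetic instances with 5000 samples and 20 features, our tailored solver reaches a 1% optimality gap within 1 minute, whereas competing approaches fail to do so within 1 hour. These improvements make exact robust regression possible at significantly larger sample sizes in low-dimensional settings.

Abstract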

We study computational aspects of a key problem in robust statistics – the penalized least trimmed squares (LTS) regression problem, a robust estimator that mitigates the influence of outliers in data by capping residuals with large magnitudes. Although statistically attractive, penalized LTS is NP-hard, and existing mixed-integer optimization (MIO) formulations scale poorly due to weak relaxations and exponential worst-case complexity in the number of observations. We propose a new MIO formulation that embeds hyperplane arrangement logic into a perspective reformulation, explicitly enforcing structural properties of optimal solutions. We show that, if the number of features is fixed, the resulting branch-and-bound tree is of polynomial size in the sample size. Moreover, we develop a tailored branch-and-bound algorithm that uses first-order methods with dual bounds to solve node relaxations efficiently. Computational experiments on synthetic and real datasets demonstrate substantial improvements over existing MIO approaches: on synthetic instances with 5000 samples and 20 features, our tailored solver reaches a 1% gap in 1 minute while competing approaches fail to do so within one hour. These gains enable exact robust regression at significantly larger sample sizes in low-dimensional settings.

Keywords: robust statistics, least trimmed squares, mixed-integer optimization, branch-and-bound, hyperplane arrangement, regression, outlier detection, computational optimization
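
As a rough illustration of the "capping" idea in the abstract above, the sketch below evaluates a capped squared-error loss with NumPy. It assumes the objective has the form sum_i min(r_i^2, cap), which is one common way to write a penalized trimmed loss; the paper's exact formulation, its MIO model, and its branch-and-bound solver are not reproduced here, and the function name and toy data are hypothetical.

```python
import numpy as np

def capped_squared_loss(X, y, beta, cap):
    """Sum of squared residuals, each capped at `cap` (assumed objective form)."""
    r2 = (y - X @ beta) ** 2
    # Each observation contributes at most `cap`, so a gross outlier
    # cannot dominate the objective.
    return float(np.minimum(r2, cap).sum())

# Toy data (hypothetical): y = 2x with one gross outlier in the last row.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 100.0])
beta = np.array([2.0])

plain = float(((y - X @ beta) ** 2).sum())        # ordinary least-squares loss
capped = capped_squared_loss(X, y, beta, cap=1.0)  # capped loss at the same beta
```

At the true slope the outlier contributes 8464 to the ordinary loss but only 1 to the capped loss, which is why a capped objective is not pulled toward the outlier. Choosing which observations hit the cap is the combinatorial part that makes the problem hard.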

264. ❌ Human Centered Non Intrusive Driver State Modeling Using Personalized Physiological Signals in Real World Automated Driving

Authors: David Puertas-Ramirez, Raul Fernandez-Matellan, David Martin Gomez, Jesus G. Boticario Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11549v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

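Scoring rationale: The paper models driver state in real-world automated driving from personalized physiological signals collected with a wearable sensor, using a deep learning model built on pre-trained ResNet50. Although it uses deep learning, every keyword targets either large language models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on) or specific AI-for-science applications such as bioinformatics. The paper's core is a computer-vision architecture (ResNet) applied to physiological signals for driver monitoring, with no direct connection to LLM techniques, innovations in large-model principles, or AI for Science. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper studies driver state modeling in an SAE Level 2 automated vehicle, using personalized physiological signals from a non-intrusive wearable sensor and a deep learning model based on pre-trained ResNet50. Personalized models (92.68% average accuracy) substantially outperform generalized models (54% accuracy), which underscores the need for future automated driving systems to adapt to each driver's physiological profile.

Abstract (Chinese translation)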

在具备部分或有条件驾驶自动化（SAE 2-3级）的车辆中，驾驶员仍需负责监督系统并响应接管请求。因此，可靠的驾驶员状态监控对于实现安全的人机协作至关重要。然而，现有的大多数驾驶员监控系统依赖于忽略个体生理差异的通用模型。本研究探讨了在真实世界自动驾驶场景中，利用非侵入式生理传感技术实现个性化驾驶员状态建模的可行性。我们在SAE 2级车辆中开展实验，使用Empatica E4可穿戴传感器采集多模态生理信号，包括皮肤电活动、心率、温度和运动数据。为适配专为图像设计的深度学习架构，我们将生理信号转化为二维表征，并采用基于预训练ResNet50特征提取器的多模态架构进行处理。针对四位驾驶员的实验表明，与驾驶员注意力相关的生理模式存在显著的个体间差异。个性化模型平均准确率达到92.68%，而基于多用户训练的通用模型准确率降至54%，这揭示了跨用户泛化能力的严重局限性。这些结果强调了未来自动驾驶车辆需要具备自适应、个性化的驾驶员监控系统，并意味着自动驾驶系统应适配每位驾驶员独特的生理特征。

Abstract

In vehicles with partial or conditional driving automation (SAE Levels 2-3), the driver remains responsible for supervising the system and responding to take-over requests. Therefore, reliable driver monitoring is essential for safe human-automation collaboration. However, most existing Driver Monitoring Systems rely on generalized models that ignore individual physiological variability. In this study, we examine the feasibility of personalized driver state modeling using non-intrusive physiological sensing during real-world automated driving. We conducted experiments in an SAE Level 2 vehicle using an Empatica E4 wearable sensor to capture multimodal physiological signals, including electrodermal activity, heart rate, temperature, and motion data. To leverage deep learning architectures designed for images, we transformed the physiological signals into two-dimensional representations and processed them using a multimodal architecture based on pre-trained ResNet50 feature extractors. Experiments across four drivers demonstrate substantial interindividual variability in physiological patterns related to driver awareness. Personalized models achieved an average accuracy of 92.68%, whereas generalized models trained on multiple users dropped to an accuracy of 54%, revealing substantial limitations in cross-user generalization. These results underscore the necessity of adaptive, personalized driver monitoring systems for future automated vehicles and imply that autonomous systems should adapt to each driver’s unique physiological profile.

Keywords: driver monitoring, personalized modeling, physiological signals, automated driving, deep learning, ResNet50, wearable sensor, human-automation collaboration

265. ❌ TempusBench: An Evaluation Framework for Time-Series Forecasting

Authors: Denizalp Goktas, Gerardo Riaño-Briceño, Alif Abdullah, Aryan Nair, Chenkai Shen, Beatriz de Lucio, Alexandra Magnusson, Farhan Mashrur, Ahmed Abdulla, Shawrna Sen, Mahitha Thippireddy, Gregory Schwartz, Amy Greenwald Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11529v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

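Scoring rationale: The paper focuses on time-series foundation models (TSFMs) for forecasting. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points) because it explicitly discusses foundation models applied to time series. It is moderately relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points) because it mentions pre-training corpora and domain adaptation issues. It is moderately relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points) because time-series forecasting is widely used in scientific fields. Other keywords such as MoE, SLMs, RLHF, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the lack of a comprehensive evaluation framework for time-series foundation models (TSFMs), the paper introduces TempusBench, an open-source evaluation framework comprising new datasets, benchmark tasks, a standardized evaluation pipeline, and visualization tools.

Abstract (Chinese translation)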

基础模型已彻底改变了自然语言处理和计算机视觉领域，而关于时间序列基础模型（TSFMs）的快速增长的文献正试图在预测领域复制这一成功。尽管近期开源模型展现了TSFMs的潜力，但该领域仍缺乏一个全面且被学界广泛接受的模型评估框架。我们发现至少有四个主要问题阻碍了此类框架的发展。首先，当前的评估框架由通常基于过时数据集（如M3）的基准预测任务构成，其中许多数据集缺乏清晰的元数据，且与用于预训练TSFMs的语料库存在重叠。其次，现有框架仅针对狭义定义的基准预测任务（如预测时间范围长度或领域）评估模型，却忽视了非平稳性和季节性等核心统计特性。第三，领域特定模型（如XGBoost）常在不公平条件下进行比较，因为现有框架缺乏一个适用于所有模型的系统且一致的超参数调优规范。第四，缺乏用于解释比较性能的可视化工具。为解决这些问题，我们推出了TempusBench，一个面向TSFMs的开源评估框架。TempusBench包含：1）未包含在现有TSFM预训练语料库中的新数据集；2）超越现有任务的一系列新颖基准任务；3）采用标准化超参数调优协议的模型评估流程；4）基于TensorBoard的可视化界面。我们在GitHub上提供了代码访问：https://github.com/Smlcrm/TempusBench。

Abstract

Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive and community-accepted model evaluation framework. We see at least four major issues impeding progress on the development of such a framework. First, current evaluation frameworks consist of benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pre-train TSFMs. Second, existing frameworks evaluate models along a narrowly defined set of benchmark forecasting tasks such as forecast horizon length or domain, but overlook core statistical properties such as non-stationarity and seasonality. Third, domain-specific models (e.g., XGBoost) are often compared unfairly, as existing frameworks neglect a systematic and consistent hyperparameter tuning convention for all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open-source evaluation framework for TSFMs. TempusBench consists of 1) new datasets which are not included in existing TSFM pretraining corpora, 2) a set of novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and 4) a tensorboard-based visualization interface. We provide access to our code on GitHub: https://github.com/Smlcrm/TempusBench.

Keywords: time-series foundation models, forecasting, evaluation framework, benchmark tasks, hyperparameter tuning, visualization tools, open-source

266. ❌ Generative Path-Finding Method for Wasserstein Gradient Flow

Authors: Chengyu Liu, Xiang Zhou Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11519v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

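Scoring rationale: The paper "Generative Path-Finding Method for Wasserstein Gradient Flow" proposes GenWGP, a generative path-finding framework for computing the evolution paths of Wasserstein gradient flows (WGFs). The work sits at the intersection of computational mathematics and machine learning. Its core contribution is a geometric method based on generative flows (normalizing flows) for efficiently computing the path of a high-dimensional probability distribution from its initial state to equilibrium. The paper is unrelated to nearly all keywords (LLMs, MoE, RLHF, RAG, CoT, and so on), which refer specifically to large language models and related techniques (training, alignment, reasoning, applications), whereas this paper addresses gradient-flow computation in mathematical physics and involves no language models or natural language processing. The only plausibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the work applies AI to scientific computing (specifically, mathematical-physics simulation). The connection is weak (5 points) because the paper does not explicitly involve bioinformatics or cheminformatics and is primarily methodological rather than a typical AI-for-science application.

!!! tip deepseek-chat TL;DR

The paper proposes GenWGP, a geometric framework based on generative flows for efficiently computing the evolution paths of high-dimensional Wasserstein gradient flows. It avoids the curse of dimensionality and the time-step constraints of traditional methods, and in experiments it matches high-fidelity reference solutions with only a small number of discretization points.

Abstract (Chinese translation)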

Wasserstein梯度流（WGFs）描述了概率分布在Wasserstein空间中作为自由能泛函最速下降动力学的演化过程。计算从任意初始分布到平衡态的完整路径具有挑战性，尤其是在高维空间中。欧拉方法受维度灾难困扰，而现有基于粒子或生成映射的拉格朗日方法则无法通过时间步长调整自然提升效率。我们提出GenWGP，一种用于Wasserstein梯度路径的生成式路径寻找框架。GenWGP通过学习一个生成流，通过最小化编码完整轨迹及其终端平衡条件的路径损失，将质量从初始密度输运至未知的平衡分布。该损失函数源自一个几何作用量泛函，其动机来源于相互作用扩散系统经验分布的Dawson Gartner大偏差理论。我们构建了物理时间参数化下的有限时域作用量，以及基于Wasserstein弧长的重参数化不变几何作用量。利用归一化流，GenWGP计算出一条通向平衡态的几何曲线，同时强制相邻网络层之间保持近似恒定的内蕴速度，使得离散化分布在路径上依Wasserstein度量保持近乎等距。这避免了精细的时间步长约束，并实现了基本独立于时间或几何离散化的稳定训练。在Fokker Planck方程和聚集类型问题上的实验表明，GenWGP仅需约十几个离散点即可匹配或超越高精度参考解，同时捕捉复杂的动力学行为。

Abstract

Wasserstein gradient flows (WGFs) describe the evolution of probability distributions in Wasserstein space as steepest descent dynamics for a free energy functional. Computing the full path from an arbitrary initial distribution to equilibrium is challenging, especially in high dimensions. Eulerian methods suffer from the curse of dimensionality, while existing Lagrangian approaches based on particles or generative maps do not naturally improve efficiency through time step tuning. We propose GenWGP, a generative path-finding framework for Wasserstein gradient paths. GenWGP learns a generative flow that transports mass from an initial density to an unknown equilibrium distribution by minimizing a path loss that encodes the full trajectory and its terminal equilibrium condition. The loss is derived from a geometric action functional motivated by Dawson-Gärtner large deviation theory for empirical distributions of interacting diffusion systems. We formulate both a finite horizon action under physical time parametrization and a reparameterization invariant geometric action based on Wasserstein arclength. Using normalizing flows, GenWGP computes a geometric curve toward equilibrium while enforcing approximately constant intrinsic speed between adjacent network layers, so that discretized distributions remain nearly equidistant in the Wasserstein metric along the path. This avoids delicate time stepping constraints and enables stable training that is largely independent of temporal or geometric discretization. Experiments on Fokker-Planck and aggregation-type problems show that GenWGP matches or exceeds high fidelity reference solutions with only about a dozen discretization points while capturing complex dynamics.

Keywords: Wasserstein gradient flow, generative path finding, normalizing flows, geometric action, high-dimensional probability distributions, Fokker-Planck equation, aggregation dynamics, computational efficiency
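
For orientation, the standard textbook form of a Wasserstein gradient flow and its Fokker-Planck special case can be written as below. The symbols ($\rho$ for the density, $\mathcal{F}$ for the free energy, $V$ for a confining potential) are notation assumed here for illustration and are not taken from the paper, whose geometric action and path loss are not reproduced.

```latex
% Wasserstein gradient flow of a free energy functional F[rho]
\partial_t \rho = \nabla \cdot \left( \rho \, \nabla \frac{\delta \mathcal{F}}{\delta \rho} \right)

% Special case: potential energy plus entropy
\mathcal{F}[\rho] = \int V \rho \, dx + \int \rho \log \rho \, dx ,
\qquad
\frac{\delta \mathcal{F}}{\delta \rho} = V + \log \rho + 1

% Substituting gives the Fokker-Planck equation
\partial_t \rho = \nabla \cdot ( \rho \, \nabla V ) + \Delta \rho
```

In this special case the equilibrium is the Gibbs density $\rho_\infty \propto e^{-V}$, which is the kind of terminal state a path from an arbitrary initial distribution is meant to reach.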

267. ❌ Machine-learning modeling of magnetization dynamics in quasi-equilibrium and driven metallic spin systems

Authors: Gia-Wei Chern, Yunhao Fan, Sheng Zhang, Puhan Zhang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11513v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

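Scoring rationale: The paper focuses on machine-learning modeling and simulation of magnetization dynamics in metallic spin systems, in particular with the Behler-Parrinello architecture, and is a computational-science application in physics and materials science. Its core is the application of ML force-field methods to a specific class of physical simulations (Landau-Lifshitz-Gilbert simulations); it does not involve large language models (LLMs), model training techniques (pre-training, fine-tuning, alignment), inference optimization, agent systems, or general AI methods. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies machine learning to a scientific problem (spin dynamics in condensed matter physics) and so falls under "AI for Science", but it is not centered on bioinformatics or cheminformatics, hence 5 points (moderately relevant). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper reviews machine-learning force-field methods for large-scale simulation of magnetization dynamics in metallic spin systems. The methods reproduce non-collinear magnetic orders and predict voltage-driven domain-wall motion, laying a foundation for quantum-accurate multiscale modeling of nonequilibrium spin dynamics.

Abstract (Chinese translation)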

本文综述了机器学习力场方法在金属自旋体系大规模朗道-利夫希茨-吉尔伯特模拟中的最新进展。我们将最初为量子分子动力学开发的贝勒-帕里内洛机器学习架构进行推广，构建出可扩展且可迁移的机器学习模型，该模型能够捕捉巡游磁体中电子介导交换场对局域磁环境的复杂依赖关系。该框架的核心是基于群论双谱形式构建的对称性感知磁描述符。利用这些机器学习力场，朗道-利夫希茨-吉尔伯特模拟在三角晶格上精确复现了标志性的非共线磁序（如120°态和四面体态），并成功捕捉了方晶格双交换模型在热淬火下混合相态中涌现的复杂自旋织构。我们进一步讨论了一种广义势理论，该理论将贝勒-帕里内洛形式拓展至同时包含保守与非保守电子转矩，从而使机器学习模型能够从非平衡格林函数技术等计算量巨大的微观方法中学习非平衡交换场。这一拓展实现了对电压驱动畴壁运动的定量精确预测，并为非平衡自旋动力学及自旋电子学功能的量子精度多尺度建模奠定了理论基础。

Abstract

We review recent advances in machine-learning (ML) force-field methods for large-scale Landau-Lifshitz-Gilbert (LLG) simulations of metallic spin systems. We generalize the Behler-Parrinello (BP) ML architecture – originally developed for quantum molecular dynamics – to construct scalable and transferable ML models capable of capturing the intricate dependence of electron-mediated exchange fields on the local magnetic environment characteristic of itinerant magnets. A central ingredient of this framework is the implementation of symmetry-aware magnetic descriptors based on group-theoretical bispectrum formalisms. Leveraging these ML force fields, LLG simulations faithfully reproduce hallmark non-collinear magnetic orders – such as the $120^\circ$ and tetrahedral states – on the triangular lattice, and successfully capture the complex spin textures emerging in the mixed-phase states of a square-lattice double-exchange model under thermal quench. We further discuss a generalized potential theory that extends the BP formalism to incorporate both conservative and nonconservative electronic torques, thereby enabling ML models to learn nonequilibrium exchange fields from computationally demanding microscopic approaches such as nonequilibrium Green’s-function techniques. This extension yields quantitatively accurate predictions of voltage-driven domain-wall motion and establishes a foundation for quantum-accurate, multiscale modeling of nonequilibrium spin dynamics and spintronic functionalities.

Keywords: machine learning, Landau-Lifshitz-Gilbert simulation, magnetic dynamics, spin systems, Behler-Parrinello architecture, exchange fields, domain-wall motion, nonequilibrium spin dynamics
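
To make the Landau-Lifshitz-Gilbert (LLG) simulations mentioned in the abstract above more concrete, here is a minimal single-spin sketch in Python with NumPy. It uses the standard Landau-Lifshitz form of the LLG equation for a unit spin. In the reviewed approach the effective field would be predicted by the trained ML force-field model; the constant field, the parameter values, and the function name `llg_step` below are placeholders assumed purely for illustration.

```python
import numpy as np

def llg_step(spin, field, dt, gamma=1.0, alpha=0.1):
    """One explicit Euler step of LLG (Landau-Lifshitz form) for a unit spin."""
    precession = np.cross(spin, field)       # S x H
    damping = np.cross(spin, precession)     # S x (S x H)
    dsdt = -gamma / (1.0 + alpha**2) * (precession + alpha * damping)
    new_spin = spin + dt * dsdt
    # Renormalize so the spin stays a unit vector despite Euler drift.
    return new_spin / np.linalg.norm(new_spin)

spin = np.array([1.0, 0.0, 0.0])
field = np.array([0.0, 0.0, 1.0])  # placeholder for an ML-predicted exchange field
for _ in range(5000):
    spin = llg_step(spin, field, dt=0.01, alpha=0.5)
```

With damping switched on, the spin precesses around the field and relaxes toward it. A lattice simulation would apply the same update to every site, with a site-dependent field recomputed at each step.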

268. ❌ The Price of Ignorance: Information-Free Quotation for Data Retention in Machine Unlearning

Authors: Bin Han, Di Feng, Zexin Fang, Jie Wang, Hans D. Schotten Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11511v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

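Scoring rationale: The paper studies pricing mechanisms for data retention in machine unlearning, which belongs to privacy protection, data management, and mechanism design. It has no direct connection to any scoring keyword, all of which focus on the principles, applications, or subfields of large models and deep learning. The paper does not address large-model architecture, training, inference, alignment, applications, or any related technique.

!!! tip deepseek-chat TL;DR

The paper studies how, under regulations such as GDPR, a mobile network operator can retain data efficiently without knowing users' privacy preferences by using an information-free quotation mechanism, and shows that the mechanism achieves near-optimal welfare.

Abstract (Chinese translation)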

当用户依据《通用数据保护条例》（GDPR）及类似法规行使数据删除权时，移动网络运营商面临一种权衡：过度的机器遗忘会降低模型精度并产生再训练成本，而现有的数据保留定价机制要求服务器知晓每位用户的隐私偏好与精度偏好——这在推动遗忘的法规框架下本身并不可行。我们提出：在缺乏此类私有信息的情况下运行机制会产生何种福利代价？我们设计了一种无需信息的递增报价机制，服务器逐步广播更高的价格，用户自主选择数据供给，无需知晓任何用户参数。在完全信息条件下，该协议存在唯一的子博弈完美纳什均衡，其特征表现为单周期销售。我们形式化定义了“无知代价”——即最优个性化定价（知晓一切信息）与我们的无信息报价机制（一无所知）之间的福利差距——并证明其效率排序遵循三种区间规律。通过对七种机制进行数值评估及5000次蒙特卡洛模拟，结果表明该代价接近于零：无信息机制能达到信息密集型基准方案≥99%的福利水平，同时提供抗噪声保证及相当的公平性。

Abstract

When users exercise data deletion rights under the General Data Protection Regulation (GDPR) and similar regulations, mobile network operators face a tradeoff: excessive machine unlearning degrades model accuracy and incurs retraining costs, yet existing pricing mechanisms for data retention require the server to know every user’s private privacy and accuracy preferences, which is infeasible under the very regulations that motivate unlearning. We ask: what is the welfare cost of operating without this private information? We design an information-free ascending quotation mechanism where the server broadcasts progressively higher prices and users self-select their data supply, requiring no knowledge of users’ parameters. Under complete information, the protocol admits a unique subgame-perfect Nash equilibrium characterized by single-period selling. We formalize the Price of Ignorance – the welfare gap between optimal personalized pricing (which knows everything) and our information-free quotation (which knows nothing) – and prove a three-regime efficiency ordering. Numerical evaluation across seven mechanisms and 5000 Monte Carlo runs shows that this price is near zero: the information-free mechanism achieves >=99% of the welfare of its information-intensive benchmarks, while providing noise-robust guarantees and comparable fairness.

Keywords: machine unlearning, data retention, GDPR, pricing mechanism, information-free quotation, welfare, privacy, subgame-perfect Nash equilibrium
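
The ascending quotation idea in the abstract above (the server broadcasts progressively higher prices and users self-select) can be sketched as a toy simulation. This is a heavily simplified illustration, not the paper's mechanism: it assumes each user makes a binary sell decision against a single private reservation price, whereas the paper lets users choose a data supply. The function name, the integer price grid, and the user values are all hypothetical.

```python
def ascending_quotation(reservation_prices, start, step, max_price):
    """Simulate a rising price broadcast; returns the price at which each user sold."""
    sold_at = {}
    price = start
    while price <= max_price and len(sold_at) < len(reservation_prices):
        # The server only broadcasts `price`. The comparison below stands in
        # for each user's private decision, made on the user's side.
        for user, threshold in reservation_prices.items():
            if user not in sold_at and price >= threshold:
                sold_at[user] = price
        price += step
    return sold_at

# Hypothetical private thresholds; user "c" never sees a high enough price.
sold_at = ascending_quotation({"a": 3, "b": 7, "c": 12}, start=1, step=2, max_price=10)
```

The point of the design is that the server's logic depends only on the public price schedule, so it needs no knowledge of any user's parameters.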

269. ❌ CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation

Authors: Yanting Li, Zhuoyang Jiang, Enyan Dai, Lei Wang, Wen-Cai Ye, Li Liu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11483v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

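Scoring rationale: The paper focuses on molecular generation, which falls under AI for Science (specifically bioinformatics/cheminformatics), and is highly relevant to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). The other keywords mainly concern LLM principles, training methods, inference optimization, agent systems, and so on. This paper uses a diffusion language model to generate molecular sequences and does not address core LLM architecture, training and alignment, or inference acceleration, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study addresses the difficulty of simultaneously satisfying heterogeneous constraints in goal-directed molecular generation, such as protein-ligand compatibility and multi-objective drug-like properties. It proposes CAGenMol, a condition-aware discrete diffusion framework that couples diffusion with reinforcement learning to optimize non-differentiable objectives while preserving chemical validity. Experiments show improvements over existing methods in binding affinity, drug-likeness, and success rate.

Abstract (Chinese translation)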

目标导向的分子生成需要满足蛋白质-配体兼容性和多目标类药性等异质约束，然而现有方法往往孤立地优化这些约束，未能协调相互冲突的目标（例如亲和力与安全性），并且在保持结构有效性的同时难以在不可微分的化学空间中导航。为解决这些挑战，我们提出了CAGenMol，一种基于分子序列的条件感知离散扩散框架，该框架将分子设计构建为由异质结构与性质信号引导的条件去噪过程。通过将离散扩散与强化学习相结合，该模型使生成轨迹与不可微分目标对齐，同时保持化学有效性和多样性。扩散语言模型的非自回归特性进一步实现了在推理阶段对分子片段的迭代优化。在结构条件、性质条件及双条件基准测试上的实验表明，本方法在结合亲和力、类药性和成功率方面均优于现有先进方法，凸显了该框架的有效性。

Abstract

Goal-directed molecular generation requires satisfying heterogeneous constraints such as protein–ligand compatibility and multi-objective drug-like properties, yet existing methods often optimize these constraints in isolation, failing to reconcile conflicting objectives (e.g., affinity vs. safety), and struggle to navigate the non-differentiable chemical space without compromising structural validity. To address these challenges, we propose CAGenMol, a condition-aware discrete diffusion framework over molecular sequences that formulates molecular design as conditional denoising guided by heterogeneous structural and property signals. By coupling discrete diffusion with reinforcement learning, the model aligns the generation trajectory with non-differentiable objectives while preserving chemical validity and diversity. The non-autoregressive nature of the diffusion language model further enables iterative refinement of molecular fragments at inference time. Experiments on structure-conditioned, property-conditioned, and dual-conditioned benchmarks demonstrate consistent improvements over state-of-the-art methods in binding affinity, drug-likeness, and success rate, highlighting the effectiveness of our framework.

Keywords: molecular generation, diffusion language model, goal-directed design, condition-aware, reinforcement learning, non-differentiable objectives, drug discovery, bioinformatics

270. ❌ Structural Consequences of Policy-Based Interventions on the Global Supply Chain Network

Authors: Lea Karbevska, Liming Xu, Zehui Dai, Sara AlMahri, Alexandra Brintrup Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11479v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

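Scoring rationale: The paper studies the structural impact of trade policies (Country Plus One, Friendshoring, Reshoring) on the global electric vehicle supply chain network, and belongs to supply chain management, international trade, and policy analysis. All scoring keywords concern large models, deep learning principles, or AI applications in science, whereas this paper uses no AI/ML methods and does not involve large-model techniques, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The study analyzes how three trade policies (Country Plus One, Friendshoring, and Reshoring) affect the structure of the global electric vehicle supply chain network. It finds that Friendshoring, contrary to expectations, promotes globalization by increasing supply links among friendly countries; Country Plus One increases network density; and Reshoring faces challenges in the EV sector because of the high number of irreplaceable products.

Abstract (Chinese translation)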

随着全球政治紧张局势加剧以及美国对国际贸易加征关税的预期增强，经济自主性与供应链韧性议题日益凸显。新冠疫情与持续不断的乌克兰战争所造成的冲击，进一步突显了供应链韧性的重要性。面对从地缘政治不稳定到产品供应不确定性等一系列挑战，各国政府正日益关注采取新的贸易政策。本研究探讨了其中若干政策对全球电动汽车供应链网络的影响，尤其聚焦于其对国家集群及更广泛的国际贸易结构产生的作用。具体而言，我们分析了三项关键政策：“Country Plus One”策略、友岸外包以及本土回流。研究结果表明，与预期相反，友岸外包通过增加友好国家间的供应联系数量，反而推动了更高程度的全球化，并可能提升交易成本。“Country Plus One”策略同样通过冗余联系提高了网络密度，而本土回流政策则因电动汽车领域存在大量不可替代产品而面临挑战。此外，这些政策的影响在不同行业间存在差异；例如，在“Country Plus One”策略中，矿产品受到的影响小于其在友岸外包政策中所受的影响。

Abstract

As global political tensions rise and the anticipation of additional tariffs from the United States on international trade increases, the issues of economic independence and supply chain resilience become more prominent. The importance of supply chain resilience has been further underscored by disruptions caused by the COVID-19 pandemic and the ongoing war in Ukraine. In light of these challenges, ranging from geopolitical instability to product supply uncertainties, governments are increasingly focused on adopting new trade policies. This study explores the impact of several of these policies on the global electric vehicle (EV) supply chain network, with a particular focus on their effects on country clusters and the broader structure of international trade. Specifically, we analyse three key policies: Country Plus One, Friendshoring, and Reshoring. Our findings show that Friendshoring, contrary to expectations, leads to greater globalisation by increasing the number of supply links across friendly countries, potentially raising transaction costs. The Country Plus One policy similarly enhances network density through redundant links, while the Reshoring policy creates challenges in the EV sector due to the high number of irreplaceable products. Additionally, the effects of these policies vary across industries; for instance, mining goods are less affected under the Country Plus One policy than under the Friendshoring policy.

Keywords: supply chain network, trade policies, electric vehicle, globalization, network density, reshoring, friendshoring, Country Plus One
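
The abstract reports that Country Plus One raises network density through redundant links. As a minimal sketch of what such a metric measures, the Python snippet below computes the density of a simple directed graph, |E| / (n(n-1)). The toy country indices and edge lists are hypothetical and not taken from the paper's data, and the paper may use a different density definition.

```python
def directed_density(num_nodes, edges):
    """Density of a simple directed graph: |E| / (n * (n - 1))."""
    # Drop self-loops and duplicate edges before counting.
    unique_edges = {(u, v) for u, v in edges if u != v}
    if num_nodes < 2:
        return 0.0
    return len(unique_edges) / (num_nodes * (num_nodes - 1))


# Hypothetical toy network: countries 0..3, edge (u, v) means u supplies v.
baseline = [(0, 1), (0, 2), (0, 3)]
# Illustrative redundancy: every buyer adds one backup supplier.
with_backup = baseline + [(1, 2), (2, 3), (3, 1)]

print(directed_density(4, baseline))
print(directed_density(4, with_backup))
```

Adding backup links leaves the node set unchanged but doubles the edge count in this toy case, which is the sense in which redundancy raises density.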

271. ❌ Learning How Much to Think: Difficulty-Aware Dynamic MoEs for Graph Node Classification

Authors: Jiajun Zhou, Yadong Li, Xuanze Chen, Chen Ma, Chuang Zhao, Shanqing Yu, Qi Xuan Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11473v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

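Scoring rationale: The paper's core contribution is D2MoE, a dynamic Mixture-of-Experts (MoE) framework for graph neural networks (GNNs) aimed at graph node classification. It is therefore highly relevant to the keyword 'Mixture of Experts OR MoE OR Sparse Models' (relevance 10/10), since MoE is its core architecture. The paper applies MoE to graph learning, but it does not directly address language models, pre-training or post-training techniques, alignment, reasoning, agents, efficiency optimization (such as quantization or speculative decoding), or AI for Science. All other keywords are unrelated to the paper and receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the uneven resource allocation caused by static routing in Mixture-of-Experts (MoE) architectures for GNN node classification. It proposes D2MoE, a dynamic difficulty-aware MoE framework that uses node-level predictive entropy to concentrate expert resources on hard nodes and reduce overhead on easy ones. It reports state-of-the-art performance on multiple benchmarks together with substantially lower memory consumption and training time.

Abstract Translation

Mixture-of-Experts (MoE) architectures offer a scalable path for Graph Neural Networks (GNNs) in node classification, but they typically rely on static, rigid routing strategies that impose a uniform expert budget or coarse-grained expert switching on all nodes. This limitation ignores differences in how hard nodes are to discriminate, causing under-fitting on hard nodes and redundant computation on easy ones. To address this, we propose D2MoE, a novel framework that shifts the focus from static expert selection to node-level allocation of expert resources. Using predictive entropy as a real-time proxy for difficulty, D2MoE applies a difficulty-driven top-p routing mechanism that adaptively concentrates expert resources on hard nodes while lowering overhead on easy ones, which yields continuous, fine-grained scaling of the expert budget for node classification. Experiments on 13 benchmark datasets show that D2MoE achieves consistently leading performance, exceeding the leading baselines by up to 7.92% in accuracy on heterophilous graphs. Notably, on large-scale graphs it reduces memory consumption by up to 73.07% and training time by 46.53% compared with the best-performing graph MoE method, which supports its efficiency claims.

Abstract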

Mixture-of-Experts (MoE) architectures offer a scalable path for Graph Neural Networks (GNNs) in node classification tasks but typically rely on static and rigid routing strategies that enforce a uniform expert budget or coarse-grained expert toggles on all nodes. This limitation overlooks the varying discriminative difficulty of nodes and leads to under-fitting for hard nodes and redundant computation for easy ones. To resolve this issue, we propose D2MoE, a novel framework that shifts the focus from static expert selection to node-wise expert resource allocation. By using predictive entropy as a real-time proxy for difficulty, D2MoE employs a difficulty-driven top-p routing mechanism to adaptively concentrate expert resources on hard nodes while reducing overhead for easy ones, achieving continuous and fine-grained expert budget scaling for node classification. Experiments on 13 benchmarks demonstrate that D2MoE achieves consistent state-of-the-art performance, surpassing leading baselines by up to 7.92% in accuracy on heterophilous graphs. Notably, on large-scale graphs, it reduces memory consumption by up to 73.07% and training time by 46.53% compared to the best-performing Graph MoE, thereby validating its superior efficiency.

Keywords: Mixture-of-Experts (MoE), Graph Neural Networks (GNNs), node classification, difficulty-aware routing, dynamic expert allocation, predictive entropy, heterophilous graphs, efficiency improvement
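
To make the idea of entropy-driven top-p routing concrete, here is a minimal Python sketch. It is not the authors' implementation: the linear mapping from normalized entropy to the threshold `p`, and the parameters `p_min` and `p_max`, are assumptions chosen for illustration. The sketch only shows how a higher-entropy (harder) node can end up with more active experts.

```python
import math


def normalized_entropy(probs):
    """Predictive entropy scaled to [0, 1]; assumes at least two classes."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def top_p_experts(gate_probs, p):
    """Smallest set of experts whose cumulative gate mass reaches p."""
    order = sorted(range(len(gate_probs)), key=lambda i: gate_probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += gate_probs[i]
        if mass >= p:
            break
    return chosen


def route(class_probs, gate_probs, p_min=0.3, p_max=0.9):
    """Hypothetical rule: harder nodes get a larger top-p threshold."""
    difficulty = normalized_entropy(class_probs)
    p = p_min + (p_max - p_min) * difficulty
    return top_p_experts(gate_probs, p)


gates = [0.5, 0.3, 0.15, 0.05]
print(route([0.98, 0.01, 0.01], gates))    # confident (easy) node
print(route([1 / 3, 1 / 3, 1 / 3], gates))  # uncertain (hard) node
```

In this toy setup the confident node activates a single expert while the maximally uncertain node activates three, so the expert budget scales with difficulty rather than being fixed.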

272. ❌ Active Bayesian Inference for Robust Control under Sensor False Data Injection Attacks

Authors: Axel Andersson, György Dán Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11410v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

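Scoring rationale: The paper studies robust control under sensor false data injection attacks in cyber-physical systems, using Bayesian inference, active probing, and a POMDP formulation. It belongs to control engineering and cybersecurity. All scoring keywords concern large models, deep learning techniques, or AI for Science applications, none of which the paper addresses, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an active Bayesian inference framework for detecting and recovering from sensor false data injection attacks in cyber-physical systems. Experiments show that it significantly outperforms baseline methods under both single-sensor and multi-sensor attacks.

Abstract Translation

This paper presents a framework for bridging the gap between sensor attack detection and recovery in cyber-physical systems. The framework models modern, complex perception pipelines as bipartite graphs and combines them with anomaly detector alerts to define a Bayesian network for inferring which sensors are compromised. An active probing strategy exploits system nonlinearities to maximize the distinguishability of attack hypotheses, while compromised sensors are selectively disabled to keep state estimation reliable. We propose a threshold-based probing strategy and show its effectiveness through a simplified partially observable Markov decision process (POMDP) formulation. Experiments on an inverted pendulum under single-sensor and multi-sensor attacks show that the method significantly outperforms outlier-robust and prediction-based baselines, especially under prolonged attacks.

Abstract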

We present a framework for bridging the gap between sensor attack detection and recovery in cyber-physical systems. The proposed framework models modern-day, complex perception pipelines as bipartite graphs, which combined with anomaly detector alerts defines a Bayesian network for inferring compromised sensors. An active probing strategy exploits system nonlinearities to maximize distinguishability between attack hypotheses, while compromised sensors are selectively disabled to maintain reliable state estimation. We propose a threshold-based probing strategy and show its effectiveness via a simplified partially observable Markov decision process (POMDP) formulation. Experiments on an inverted pendulum under single and multi-sensor attacks show that our method significantly outperforms outlier-robust and prediction-based baselines, especially under prolonged attacks.

Keywords: cyber-physical systems, sensor false data injection attacks, Bayesian inference, active probing, anomaly detection, state estimation, POMDP, inverted pendulum
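
The combination of Bayesian inference over attack hypotheses with a threshold-based probing rule can be illustrated with a small Python sketch. The hypotheses, prior, alert likelihoods, and the 0.9 threshold below are hypothetical values, and the rule shown (keep probing until one hypothesis is sufficiently likely) is a simplified reading of the abstract, not the paper's policy.

```python
def bayes_update(prior, likelihoods):
    """Posterior over hypotheses after one observation (Bayes' rule)."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


def should_probe(posterior, threshold=0.9):
    """Hypothetical rule: probe while no hypothesis is confident enough."""
    return max(posterior.values()) < threshold


# Hypothetical hypotheses: no attack, sensor s1 compromised, sensor s2 compromised.
prior = {"none": 0.8, "s1": 0.1, "s2": 0.1}
# Assumed probability of the observed detector alert under each hypothesis.
alert_likelihood = {"none": 0.05, "s1": 0.9, "s2": 0.2}

belief = bayes_update(prior, alert_likelihood)
print(belief, should_probe(belief))
belief = bayes_update(belief, alert_likelihood)
print(belief, should_probe(belief))
```

After one alert the belief in `s1` is still below the threshold, so the sketch keeps probing; a second consistent alert pushes it above the threshold, at which point a controller could disable that sensor.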

273. ❌ Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning

Authors: Ajinkya Mohgaonkar, Lukas Gosch, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar, Stephan Günnemann Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11416v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

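Scoring rationale: The paper focuses on robustness certification for neural networks and ensemble models, specifically defenses against label-flipping attacks. Its core contributions are the EnsembleCert certification framework and the ScaLabelCert method, which belong to machine learning security rather than to large-model techniques. All keywords concern large models, deep learning techniques, or specific application areas (such as AI for Science), whereas this paper studies certification of supervised learning models and is unrelated to them.

!!! tip deepseek-chat TL;DR

The paper proposes the EnsembleCert framework and the ScaLabelCert method. ScaLabelCert gives the first exact, polynomial-time computable certificate for neural networks against label-flipping attacks, and EnsembleCert aggregates such white-box certificates into ensemble-level guarantees for partition-aggregation ensembles that match or significantly outperform existing black-box certificates.

Abstract Translation

Label-flipping attacks, which tamper with training labels to induce misclassification at inference time, remain a major threat to supervised learning models. This motivates robustness certificates, which give formal guarantees on a model's robustness under adversarial label corruption. Existing certification frameworks rely on ensemble techniques such as smoothing or partition aggregation but treat the underlying base classifiers as black boxes, which makes the resulting guarantees overly conservative. We propose EnsembleCert, the first certification framework for partition-aggregation ensembles that uses white-box knowledge of the base classifiers. Specifically, it aggregates the white-box certificates of the individual partitions to compute ensemble-level guarantees in polynomial time, giving tighter guarantees than black-box approaches. To extract white-box knowledge from the base classifiers efficiently, we develop ScaLabelCert, which exploits the equivalence between sufficiently wide neural networks and kernel methods based on the neural tangent kernel, and yields the first exact, polynomial-time computable certificate for neural networks against label-flipping attacks. EnsembleCert matches or significantly outperforms existing partition-based black-box certificates. On CIFAR-10, for example, the method certifies up to +26.5% more label flips in median over the test set than the existing black-box approach while requiring 100 times fewer partitions, challenging the prevailing view that strong certified robustness requires heavy partitioning.

Abstract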

Label-flipping attacks, which corrupt training labels to induce misclassifications at inference, remain a major threat to supervised learning models. This drives the need for robustness certificates that provide formal guarantees about a model’s robustness under adversarially corrupted labels. Existing certification frameworks rely on ensemble techniques such as smoothing or partition-aggregation, but treat the corresponding base classifiers as black boxes, yielding overly conservative guarantees. We introduce EnsembleCert, the first certification framework for partition-aggregation ensembles that utilizes white-box knowledge of the base classifiers. Concretely, EnsembleCert yields tighter guarantees than black-box approaches by aggregating per-partition white-box certificates to compute ensemble-level guarantees in polynomial time. To extract white-box knowledge from the base classifiers efficiently, we develop ScaLabelCert, a method that leverages the equivalence between sufficiently wide neural networks and kernel methods using the neural tangent kernel. ScaLabelCert yields the first exact, polynomial-time calculable certificate for neural networks against label-flipping attacks. EnsembleCert is either on par with, or significantly outperforms, the existing partition-based black-box certificates. For example, on CIFAR-10, our method can certify up to +26.5% more label flips in median over the test set compared to the existing black-box approach while requiring 100 times fewer partitions, thus challenging the prevailing notion that heavy partitioning is a necessity for strong certified robustness.

Keywords: robustness certification, label-flipping attacks, neural networks, partition-aggregation ensembles, white-box certificates, neural tangent kernel, exact certification, adversarial robustness
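
For context on the black-box baseline that the abstract contrasts with, the following Python sketch shows a standard vote-gap certificate in the style of deep partition aggregation. It is not EnsembleCert or ScaLabelCert. It assumes each training sample belongs to exactly one partition, so a single label flip can change at most one base classifier's vote, and that ties are broken toward the smaller class index.

```python
from collections import Counter


def vote_gap_certificate(votes, num_classes):
    """Return (predicted class, number of label flips certified as harmless).

    votes: one predicted class per base classifier (one per partition).
    """
    counts = Counter(votes)
    n = [counts.get(c, 0) for c in range(num_classes)]
    # Majority vote; ties go to the smaller class index.
    pred = max(range(num_classes), key=lambda c: (n[c], -c))
    # Strongest competitor, with +1 if it would win a tie against pred.
    runner_up = max(
        n[c] + (1 if c < pred else 0) for c in range(num_classes) if c != pred
    )
    # Each flipped vote shrinks the gap by at most 2.
    return pred, (n[pred] - runner_up) // 2


print(vote_gap_certificate([0] * 7 + [1] * 3, num_classes=2))
print(vote_gap_certificate([1] * 7 + [0] * 3, num_classes=2))
```

Because this argument treats every base classifier as fully breakable by a single flip in its partition, it is conservative; the abstract's point is that white-box knowledge about each base classifier can tighten the bound.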

274. ❌ GlobalCY I: A JAX Framework for Globally Defined and Symmetry-Aware Neural Kähler Potentials

Authors: Abdul Rahman Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11404v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

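Scoring rationale: The paper applies neural network models (neural Kähler-potential models) to improve geometric modeling in mathematical physics, specifically Calabi–Yau geometry, and is therefore an application of AI to science. Its content is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). It is only partly related to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to science (mathematical physics) but not to bioinformatics or cheminformatics, so it receives a relevance of 5/10.

!!! tip deepseek-chat TL;DR

The paper presents GlobalCY, a JAX framework for building globally defined and symmetry-aware neural Kähler-potential models on projective hypersurface Calabi–Yau geometries. It targets the failure of local-input models on geometry-sensitive diagnostics in hard quartic regimes and finds that the globally defined invariant model performs best on the hard Cefalú cases.

Abstract Translation

We present GlobalCY, a JAX-based framework for building globally defined and symmetry-aware neural Kähler-potential models on projective hypersurface Calabi–Yau geometries. The central problem is that neural Kähler-potential models based on local inputs can train successfully and still fail geometry-sensitive diagnostic tests in hard quartic regimes, especially near singular and near-singular members of the Cefalú family. To study this, we compare three model families (a local-input baseline, a globally defined invariant model, and a symmetry-aware global model) on the hard Cefalú cases $λ=0.75$ and $λ=1.0$, using a fixed multi-seed protocol and a geometry-aware diagnostic suite. In this benchmark, the globally defined invariant model is the strongest family overall, outperforming the local baseline in both cases on the two clearest geometric comparison metrics: negative-eigenvalue frequency and projective-invariance drift. The gains are largest at $λ=0.75$, while $λ=1.0$ remains more difficult. The current symmetry-aware model improves projective-invariance drift relative to the local baseline but does not yet surpass the plain global invariant model. These results indicate that global invariant structure is a meaningful architectural constraint for learned Kähler-potential modeling in hard quartic Calabi–Yau settings.

Abstract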

We present GlobalCY, a JAX-based framework for globally defined and symmetry-aware neural Kähler-potential models on projective hypersurface Calabi–Yau geometries. The central problem is that local-input neural Kähler-potential models can train successfully while still failing the geometry-sensitive diagnostics that matter in hard quartic regimes, especially near singular and near-singular members of the Cefalú family. To study this, we compare three model families – a local-input baseline, a globally defined invariant model, and a symmetry-aware global model – on the hard Cefalú cases $λ=0.75$ and $λ=1.0$ using a fixed multi-seed protocol and a geometry-aware diagnostic suite. In this benchmark, the globally defined invariant model is the strongest overall family, outperforming the local baseline on the two clearest geometric comparison metrics, negative-eigenvalue frequency and projective-invariance drift, in both cases. The gains are strongest at $λ=0.75$, while $λ=1.0$ remains more difficult. The current symmetry-aware model improves projective-invariance drift relative to the local baseline, but does not yet surpass the plain global invariant model. These results show that global invariant structure is a meaningful architectural constraint for learned Kähler-potential modeling in hard quartic Calabi–Yau settings.

Keywords: JAX framework, neural Kähler potentials, Calabi-Yau geometries, global invariant structure, projective hypersurface, geometry-aware diagnostics, Cefalú family, symmetry-aware model
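
One common way to make a model's inputs independent of the choice of homogeneous coordinates is to feed it the normalized bilinears z_i z̄_j / |z|², which do not change under z → λz for any nonzero complex λ. The plain-Python sketch below checks this numerically. It is an assumption for illustration only: the abstract does not state which invariant features GlobalCY uses, and the `drift` helper is a hypothetical stand-in for a projective-invariance diagnostic.

```python
def invariant_features(z):
    """Normalized bilinears z_i * conj(z_j) / |z|^2 for homogeneous coordinates z."""
    norm_sq = sum(abs(zi) ** 2 for zi in z)
    return [zi * zj.conjugate() / norm_sq for zi in z for zj in z]


def rescale(z, lam):
    """Same projective point, different homogeneous representative."""
    return [lam * zi for zi in z]


def drift(features_a, features_b):
    """Hypothetical diagnostic: largest change in any feature."""
    return max(abs(a - b) for a, b in zip(features_a, features_b))


# Hypothetical point in homogeneous coordinates on CP^3.
z = [1 + 2j, 0.5 - 1j, -0.3j, complex(2.0, 0.0)]
lam = 0.7 - 1.3j

print(drift(invariant_features(z), invariant_features(rescale(z, lam))))
print(drift(z, rescale(z, lam)))  # raw coordinates do change
```

The first printed value is at the level of floating-point round-off, while the raw coordinates move substantially, which is the kind of gap a projective-invariance diagnostic is meant to expose.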

275. ❌ Learning Discrete Diffusion of Graphs via Free-Energy Gradient Flows

Authors: Dario Rancati, Jan Maas, Francesco Locatello Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11311v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

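Scoring rationale: The paper studies diffusion models on discrete graphs. It belongs to graph machine learning and focuses on a mathematical framework for discrete diffusion processes (gradient flows, the JKO scheme) and on computational methods. All scoring keywords target large language models (LLMs) and related techniques (training, alignment, reasoning, applications, and so on), whereas this paper does not involve language models, text processing, or any LLM-related technique. It has no direct connection to any keyword, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a theoretical framework and a computational method based on free-energy gradient flows for learning diffusion dynamics on discrete graphs, and shows through numerical experiments that the method recovers the underlying functional from synthetic data.

Abstract Translation

Diffusion models on continuous spaces have recently made substantial progress through the mathematical framework of gradient flows, using the Wasserstein-2 (${W}_2$) metric via the Jordan-Kinderlehrer-Otto (JKO) scheme. Although diffusion models on discrete spaces based on continuous-time Markov chains are increasingly popular, a parallel theoretical framework based on gradient flows has been hard to establish because of the inherent difficulty of carrying the ${W}_2$ distance over to such settings. In this work, we propose the first computational approach that addresses these challenges. By introducing a suitable metric $W_K$ on the simplex of probability distributions, we can interpret widely used discrete diffusion paths, such as the discrete heat equation, as gradient flows of specific free-energy functionals. Building on this insight, we propose a new method for learning diffusion dynamics on discrete spaces that recovers the underlying functional directly from the first-order optimality conditions of the JKO scheme. The resulting method optimizes a simple quadratic loss, trains extremely fast, does not require individual sample trajectories, and needs only a numerical preprocessing step that computes $W_K$-geodesics. We validate the method with extensive numerical experiments on synthetic data, showing that it recovers the underlying functional for a variety of graph classes.

Abstract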

Diffusion-based models on continuous spaces have seen substantial recent progress through the mathematical framework of gradient flows, leveraging the Wasserstein-2 (${W}_2$) metric via the Jordan-Kinderlehrer-Otto (JKO) scheme. Despite the increasing popularity of diffusion models on discrete spaces using continuous-time Markov chains, a parallel theoretical framework based on gradient flows has remained elusive due to intrinsic challenges in translating the ${W}_2$ distance directly into these settings. In this work, we propose the first computational approach addressing these challenges, leveraging an appropriate metric $W_K$ on the simplex of probability distributions, which enables us to interpret widely used discrete diffusion paths, such as the discrete heat equation, as gradient flows of specific free-energy functionals. Through this theoretical insight, we introduce a novel methodology for learning diffusion dynamics over discrete spaces, which recovers the underlying functional directly by leveraging first-order optimality conditions for the JKO scheme. The resulting method optimizes a simple quadratic loss, trains extremely fast, does not require individual sample trajectories, and only needs a numerical preprocessing computing $W_K$-geodesics. We validate our method through extensive numerical experiments on synthetic data, showing that we can recover the underlying functional for a variety of graph classes.

Keywords: discrete diffusion, graphs, gradient flows, free-energy functionals, JKO scheme, Wasserstein metric, learning diffusion dynamics, synthetic data
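
The abstract names the discrete heat equation as an example of a discrete diffusion path. As a minimal sketch of that path only (not of the paper's learning method or of the $W_K$ metric), the Python snippet below integrates dp/dt = -Lp with forward Euler on a small graph, where L is the graph Laplacian. The three-node path graph, the step size, and the step count are hypothetical choices.

```python
def heat_step(p, adjacency, dt):
    """One forward-Euler step of dp/dt = -L p with L = D - A."""
    n = len(p)
    new_p = []
    for i in range(n):
        # Net inflow to node i from its neighbours.
        flow = sum(adjacency[i][j] * (p[j] - p[i]) for j in range(n))
        new_p.append(p[i] + dt * flow)
    return new_p


def simulate(p0, adjacency, dt=0.1, steps=200):
    """Iterate the heat step; dt must be small relative to node degrees."""
    p = list(p0)
    for _ in range(steps):
        p = heat_step(p, adjacency, dt)
    return p


# Hypothetical path graph 0 - 1 - 2, with all mass starting on node 0.
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
final = simulate([1.0, 0.0, 0.0], path_graph)
print(final, sum(final))
```

Total probability mass is conserved at every step because the adjacency matrix is symmetric, and on this connected graph the distribution relaxes toward uniform.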

276. ❌ BRIDGE and TCH-Net: Heterogeneous Benchmark and Multi-Branch Baseline for Cross-Domain IoT Botnet Detection

Authors: Ammar Bhilwarawala, Likhamba Rongmei, Harsh Sharma, Arya Jena, Kaushal Singh, Jayashree Piri, Raghunath Dey Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11324v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

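Scoring rationale: The paper focuses on intrusion detection and network security for the Internet of Things (IoT) and proposes TCH-Net, a multi-branch neural network for cross-domain IoT botnet detection. Its core contributions are a heterogeneous multi-dataset benchmark (BRIDGE) and a new deep learning model, covering feature engineering, dataset integration, model architecture design, and evaluation protocol. However, the content has no direct connection to any scoring keyword (which center on large language models, deep learning techniques, training optimization, inference acceleration, alignment, AI agents, and so on). The paper does not mention language models, pre-training, fine-tuning, alignment, inference optimization, AI agents, or AI for Science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the weak cross-domain generalization of IoT botnet detection. It proposes BRIDGE, the first heterogeneous multi-dataset benchmark, and TCH-Net, a multi-branch neural network, which together improve cross-environment detection performance and establish a community generalization baseline.

Abstract Translation

IoT botnet detection has advanced, yet most existing systems are validated on a single dataset and rarely generalize across environments. Heterogeneous feature spaces make multi-dataset training impractical: preserving semantic interpretability prevents integration, and forcing integration violates data integrity. No prior work has addressed both problems with a formally specified, reproducible methodology. This paper fills that gap. We propose BRIDGE (Benchmark Reference for IoT Domain Generalisation Evaluation), the first formally specified heterogeneous multi-dataset benchmark for IoT intrusion detection. It builds a 46-feature semantic canonical vocabulary grounded in CICFlowMeter nomenclature and unifies five datasets (CICIDS-2017, CIC-IoT-2023, Bot-IoT, Edge-IIoTset, and N-BaIoT) through genuine-equivalence-only feature mapping, explicit zero-filling, and per-dataset coverage from 15% to 93%. A leave-one-dataset-out (LODO) protocol makes the generalization gap precisely measurable: all five evaluated architectures achieve a mean LODO F1 between 0.39 and 0.47, and we establish the first community generalization baseline (mean LODO F1 = 0.5577), a result that shifts the focus from single-benchmark optimization toward cross-environment generalization. We propose TCH-Net, a multi-branch network that fuses a three-path Temporal branch (residual convolutional-BiGRU, stride-downsampled BiGRU, pre-LayerNorm Transformer), a provenance-conditioned Contextual branch, and a Statistical branch through Cross-Branch Gated Attention Fusion (CB-GAF), which uses learnable sigmoid gates for dynamic feature-wise mixing. Across five random seeds, TCH-Net achieves F1 = 0.8296 ± 0.0028, AUC = 0.9380 ± 0.0025, and MCC = 0.6972 ± 0.0056, outperforming all twelve baselines (p < 0.05, Wilcoxon test) and recording the highest overall LODO F1. The BRIDGE benchmark and the full code are available at https://github.com/Ammar-ss/TCH-Net.

Abstract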

IoT botnet detection has advanced, yet most published systems are validated on a single dataset and rarely generalise across environments. Heterogeneous feature spaces make multi-dataset training practically impossible without discarding semantic interpretability or introducing data integrity violations. No prior work has addressed both problems with a formally specified, reproducible methodology. This paper does. We introduce BRIDGE (Benchmark Reference for IoT Domain Generalisation Evaluation), the first formally specified heterogeneous multi-dataset benchmark for IoT intrusion detection, unifying CICIDS-2017, CIC-IoT-2023, Bot-IoT, Edge-IIoTset, and N-BaIoT through a 46-feature semantic canonical vocabulary grounded in CICFlowMeter nomenclature, with genuine-equivalence-only feature mapping, explicit zero-filling, and per-dataset coverage from 15% to 93%. A leave-one-dataset-out (LODO) protocol makes the generalisation gap precisely measurable: all five evaluated architectures achieve mean LODO F1 between 0.39 and 0.47, and we establish the first community generalisation baseline at mean LODO F1 = 0.5577, a result that shifts the agenda from single-benchmark optimisation toward cross-environment generalisation. We propose TCH-Net, a multi-branch network fusing a three-path Temporal branch (residual convolutional-BiGRU, stride-downsampled BiGRU, pre-LayerNorm Transformer), a provenance-conditioned Contextual branch, and a Statistical branch via Cross-Branch Gated Attention Fusion (CB-GAF) with learnable sigmoid gates for dynamic feature-wise mixing. Across five random seeds, TCH-Net achieves F1 = 0.8296 +/- 0.0028, AUC = 0.9380 +/- 0.0025, and MCC = 0.6972 +/- 0.0056, outperforming all twelve baselines (p < 0.05, Wilcoxon) and recording the highest LODO F1 overall. BRIDGE and the full pipeline are at https://github.com/Ammar-ss/TCH-Net.

Keywords: IoT botnet detection, cross-domain generalization, heterogeneous benchmark, multi-branch network, TCH-Net, BRIDGE, intrusion detection, domain adaptation
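
The leave-one-dataset-out (LODO) protocol described in the abstract can be sketched generically in Python. This is an illustrative skeleton, not the BRIDGE pipeline: the `fit` and `score` callables and the toy datasets are placeholders, and a real evaluation would plug in a model and an F1 metric.

```python
def lodo_splits(datasets):
    """Yield (held_out_name, train_rows, test_rows) for each dataset in turn."""
    names = list(datasets)
    for held_out in names:
        # Train on every dataset except the held-out one.
        train = [row for name in names if name != held_out for row in datasets[name]]
        yield held_out, train, list(datasets[held_out])


def mean_lodo_score(datasets, fit, score):
    """Average a user-supplied score over all held-out datasets."""
    results = {}
    for held_out, train, test in lodo_splits(datasets):
        model = fit(train)
        results[held_out] = score(model, test)
    return sum(results.values()) / len(results), results


# Hypothetical toy data: rows are (feature, label) pairs.
toy = {"A": [(0, 0), (1, 1)], "B": [(0, 0)], "C": [(1, 1), (1, 0)]}
for name, train, test in lodo_splits(toy):
    print(name, len(train), len(test))
```

Because the held-out dataset never contributes training rows, the averaged score measures cross-environment generalization rather than in-distribution fit.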

277. ❌ Beyond Fixed False Discovery Rates: Post-Hoc Conformal Selection with E-Variables

Authors: Meiyi Zhu, Osvaldo Simeone Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11305v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

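Scoring rationale: The paper studies hypothesis testing and false discovery rate control in statistics and proposes a post-hoc conformal selection method based on e-variables. Its content falls within statistics, hypothesis testing, and statistical inference, and involves concepts such as conformal prediction, e-variables, and false discovery rate control. All scoring keywords concern large models, deep learning, or AI techniques and their applications, whereas this paper does not discuss model training, inference, optimization, alignment, or related applications. Genomics and neuroimaging are mentioned only as examples of application areas; the paper does not cover AI techniques for those fields. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

Conventional conformal selection requires the false discovery rate to be fixed in advance. The paper proposes a post-hoc conformal selection method based on e-variables that lets the user balance the number of selections against the false discovery rate after seeing the data, with finite-sample reliability guarantees.

Abstract Translation

Conformal selection (CS) uses calibration data to identify test inputs whose unobserved outcomes are likely to satisfy a pre-specified minimal quality requirement, while controlling the false discovery rate (FDR). Existing methods fix the target FDR level before observing data, so the user cannot adapt the balance between the number of selected test inputs and the FDR to the available data and to downstream needs and constraints. In genomics or neuroimaging, for example, researchers often inspect the distribution of test statistics and decide how aggressively to pursue candidates based on the observed strength of evidence and the available follow-up resources. To overcome this limitation, the paper proposes post-hoc CS (PH-CS), which generates a path of candidate selection sets, each paired with a data-driven estimate of the false discovery proportion (FDP). PH-CS lets the user choose any operating point on this path by maximizing a user-specified utility, freely trading off selection size against FDR. Building on conformal e-variables and the e-Benjamini-Hochberg (e-BH) procedure, PH-CS is proved to provide a finite-sample post-hoc reliability guarantee: the ratio between the estimated FDP level and the true FDP is, on average, upper bounded by $1$, so that the average estimated FDP is, to first order, a valid upper bound on the true FDR. PH-CS is further extended to control quality defined in terms of a general risk. Experiments on synthetic and real-world datasets show that, unlike CS, PH-CS consistently satisfies user-imposed utility constraints while producing reliable FDP estimates and maintaining competitive FDR control.

Abstract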

Conformal selection (CS) uses calibration data to identify test inputs whose unobserved outcomes are likely to satisfy a pre-specified minimal quality requirement, while controlling the false discovery rate (FDR). Existing methods fix the target FDR level before observing data, which prevents the user from adapting the balance between number of selected test inputs and FDR to downstream needs and constraints based on the available data. For example, in genomics or neuroimaging, researchers often inspect the distribution of test statistics, and decide how aggressively to pursue candidates based on observed evidence strength and available follow-up resources. To address this limitation, we introduce post-hoc CS (PH-CS), which generates a path of candidate selection sets, each paired with a data-driven false discovery proportion (FDP) estimate. PH-CS lets the user select any operating point on this path by maximizing a user-specified utility, arbitrarily balancing selection size and FDR. Building on conformal e-variables and the e-Benjamini-Hochberg (e-BH) procedure, PH-CS is proved to provide a finite-sample post-hoc reliability guarantee whereby the ratio between estimated FDP level and true FDP is, on average, upper bounded by $1$, so that the average estimated FDP is, to first order, a valid upper bound on the true FDR. PH-CS is extended to control quality defined in terms of a general risk. Experiments on synthetic and real-world datasets demonstrate that, unlike CS, PH-CS can consistently satisfy user-imposed utility constraints while producing reliable FDP estimates and maintaining competitive FDR control.

Keywords: conformal selection, false discovery rate, e-variables, post-hoc selection, statistical inference, hypothesis testing, FDR control, e-BH procedure
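
The e-BH procedure that the abstract builds on can be written in a few lines of Python: sort the e-values in decreasing order, find the largest k such that the k-th largest e-value is at least n / (αk), and select the top k. The `selection_path` helper is a hypothetical illustration of pairing each top-k set with the level n / (k · e_(k)) at which e-BH would select it; it is an assumption made for this sketch and should not be read as the paper's PH-CS construction. The e-values are made up.

```python
def ebh_select(e_values, alpha):
    """e-BH: select the k* largest e-values, k* = max{k : k * e_(k) / n >= 1 / alpha}."""
    n = len(e_values)
    order = sorted(range(n), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for k in range(1, n + 1):
        if k * e_values[order[k - 1]] / n >= 1.0 / alpha:
            k_star = k
    return sorted(order[:k_star])


def selection_path(e_values):
    """Hypothetical path: each top-k set with the level n / (k * e_(k)), capped at 1."""
    n = len(e_values)
    order = sorted(range(n), key=lambda i: e_values[i], reverse=True)
    path = []
    for k in range(1, n + 1):
        e_k = e_values[order[k - 1]]
        level = min(1.0, n / (k * e_k)) if e_k > 0 else 1.0
        path.append((sorted(order[:k]), level))
    return path


# Made-up e-values for four test inputs.
e_vals = [40.0, 1.0, 20.0, 0.5]
print(ebh_select(e_vals, alpha=0.2))
for selected, level in selection_path(e_vals):
    print(selected, level)
```

Fixing `alpha` in advance picks one point on this path; exposing the whole path is what would let a user trade selection size against the reported level after seeing the data.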

278. ❌ THEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture

作者: Augustus Haoyang Li 期刊/来源: arxiv 发布日期: 2026-04-13 arXiv链接: http://arxiv.org/abs/2604.11284v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

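Scoring rationale: THEIA studies a purely neural modular architecture that learns Kleene three-valued logic, focusing on modular network design, compositional generalization, and mechanistic interpretability. Against the scoring keyword list: 1) the paper does not cover large language models (LLMs), MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, inference acceleration, quantization, speculative decoding, hallucination mitigation, world models, model merging, in-context learning, or other LLM-specific techniques. 2) It does not address a specific AI for Science application domain (such as bioinformatics). 3) The only point of contact is "Mechanistic Interpretability OR Explainable AI": the paper uses mechanistic probing and activation patching to analyze the internal representations and workings of the modular architecture, which relates to explainable AI but is not the core focus (the core is how architecture design affects compositional generalization), hence a relevance of 5 (some relevance). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how the modular neural architecture THEIA uses structured inductive bias to learn Kleene three-valued logic end-to-end and generalize compositionally, reaching 99.97% accuracy on a 500-step sequential task, while mechanistic analysis reveals that modular and monolithic architectures follow different strategies.

Abstract (Chinese translation)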

我们提出THEIA——一种模块化神经架构，它能够端到端地学习完整的克林三值逻辑（K3）而无需任何外部符号求解器，并探究何种架构先验能够在不确定性下实现组合泛化。THEIA通过专用引擎处理四个数学领域（算术、序关系、集合隶属、命题逻辑），这些引擎最终汇聚于一个逻辑模块。该模型在输入空间约3.4×10^13的200万样本数据集上训练，在匹配设置下以9.2±3.5分钟覆盖12/12条克林K3规则（比参数量相当的Transformer快5.6倍）。在模3序列组合实验中，模型从5步训练泛化至500步评估的准确率达到99.97%±0.02%——这一结果关键依赖于结构化归纳偏置：将四引擎主干替换为扁平多层感知器（MLP）会导致长度泛化在50步内坍缩至随机水平（0.80M参数和参数量匹配的2.75M变体均失败），而在相同协议下训练的预层归一化Transformer基线（3,582,147参数）在500步达到99.24%（附录D）。机制探针分析表明模块化会引发延迟判定：上游引擎编码领域特定变量时暂不决定最终真值（探针准确率≤74%的不确定性上限），判定仅出现在逻辑引擎边界——通过激活修补得到因果性验证（986组匹配对100%翻转率，在n=5次随机种子中复现；聚合准确率100.0%）。Transformer基线通过性质不同的表征轨迹（先压缩后扩展）达到同等正确率，这表明模块化与整体式架构实现了不同的组合策略。

Abstract

We present THEIA, a modular neural architecture that learns complete Kleene three-valued logic (K3) end-to-end without any external symbolic solver, and investigate what architectural prior enables compositional generalization under uncertainty. THEIA processes four mathematical domains (arithmetic, order, set membership, propositional logic) through dedicated engines that converge in a final logic module. Trained on a 2M-sample dataset with input space ~3.4x10^13, it achieves 12/12 Kleene K3 rule coverage across 5 seeds in 9.2 +/- 3.5 minutes (5.6x faster than a parameter-comparable Transformer under matched settings). A mod-3 sequential composition experiment generalizes from 5-step training to 500-step evaluation at 99.97% +/- 0.02% – a result that critically depends on structured inductive bias: replacing the four-engine backbone with a flat MLP collapses length generalization to chance by 50 steps regardless of capacity (both 0.80M and parameter-matched 2.75M variants fail), while a pre-LN TF8LTuned Transformer baseline (3,582,147 params) trained under the identical protocol reaches 99.24% at 500 steps (Appendix D). Mechanistic probing reveals that modularity induces a delayed verdict: upstream engines encode domain-specific variables without committing to the final truth value (probe accuracy <= 74% uncertainty-only ceiling), with the verdict emerging only at the Logic Engine boundary – causally confirmed by activation patching (100% flip rate on 986 matched pairs, replicated across n=5 seeds; 100.0% aggregate). The Transformer baseline reaches equivalent correctness through a qualitatively different representational trajectory (contraction then expansion), suggesting that modular and monolithic architectures implement distinct compositional strategies.

Keywords: modular neural architecture, Kleene three-valued logic, compositional generalization, inductive bias, mechanistic probing, activation patching, end-to-end learning, uncertainty

279. ❌ Representation-Aligned Multi-Scale Personalization for Federated Learning

Authors: Wenfei Liang, Wee Peng Tay Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11278v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

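Scoring rationale: "Representation-Aligned Multi-Scale Personalization for Federated Learning" focuses on personalization and resource adaptation in federated learning (FL) and proposes the FRAMP framework. Its core contribution is architectural innovation in federated learning (client-specific model generation, representation alignment) rather than large models (LLMs) or the underlying principles of deep learning techniques. Every scoring keyword targets LLM techniques (LLM, MoE, SFT, RLHF, RAG, and so on) or a specific application area (such as AI for Science), whereas this paper involves no LLM-related content and is not applied to a scientific domain (such as bioinformatics). All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

To address heterogeneous client resources and heterogeneous data in federated learning, the paper proposes a unified framework, FRAMP, which generates client-specific models and aligns their representations, improving personalized adaptation and generalization.

Abstract (Chinese translation)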

在联邦学习（FL）中，适应具有多样化资源约束的客户端仍是一个重大挑战。一种广泛采用的方法是使用共享的全尺寸模型，每个客户端从中提取与其计算预算匹配的子模型。然而，无论采用何种具体的评分策略，这些方法都依赖于相同的全局主干网络，限制了客户端间的结构多样性和表征适应性。本文提出FRAMP，一个用于个性化和资源自适应联邦学习的统一框架。FRAMP不依赖固定的全局模型，而是通过紧凑的客户端描述符生成客户端专属模型，从而实现对数据特征和计算预算的细粒度适应。每个客户端训练一个定制的轻量子模型，并将其学习到的表征与其他客户端对齐，以保持全局语义一致性。在视觉和图基准测试上的大量实验表明，FRAMP在广泛的客户端设置中提升了泛化能力和适应性。

Abstract

In federated learning (FL), accommodating clients with diverse resource constraints remains a significant challenge. A widely adopted approach is to use a shared full-size model, from which each client extracts a submodel aligned with its computational budget. However, regardless of the specific scoring strategy, these methods rely on the same global backbone, limiting both structural diversity and representational adaptation across clients. This paper presents FRAMP, a unified framework for personalized and resource-adaptive federated learning. Instead of relying on a fixed global model, FRAMP generates client-specific models from compact client descriptors, enabling fine-grained adaptation to both data characteristics and computational budgets. Each client trains a tailored lightweight submodel and aligns its learned representation with others to maintain global semantic consistency. Extensive experiments on vision and graph benchmarks demonstrate that FRAMP enhances generalization and adaptivity across a wide range of client settings.

Keywords: Federated Learning, Personalization, Resource Adaptation, Representation Alignment, Client-specific Models, Generalization, FRAMP

280. ❌ Sheaf Diffusion with Adaptive Local Structure for Spatio-Temporal Forecasting

Authors: Abeer Mostafa, Raneen Younis, Zahra Ahmadi Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11275v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

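Scoring rationale: The paper focuses on architectural innovation in spatio-temporal graph neural networks (ST-Sheaf GNN), using sheaf theory to model local structure and improve spatio-temporal forecasting. Its core is the application of graph neural networks (GNNs) and deep learning to spatio-temporal systems, a specific subfield of deep learning. The content is a poor match for the keyword list: 1) the paper involves no large language models (LLMs), small language models (SLMs), or related techniques (fine-tuning, alignment, reasoning, agents, and so on); 2) it does not discuss general LLM topics such as model scaling, efficiency optimization (quantization, speculative decoding), or interpretability; 3) the only marginally related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper evaluates on real-world benchmarks from multiple domains that may include science-related areas (such as environmental science or traffic forecasting); this is not the paper's core and the link is weak, hence a relevance of 5 (some relevance). All other keywords are entirely unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a sheaf-theoretic spatio-temporal graph neural network (ST-Sheaf GNN) that models higher-order interactions by dynamically learning local restriction maps, mitigates oversmoothing in deep GNNs, and achieves state-of-the-art performance on multiple real-world spatio-temporal forecasting benchmarks.

Abstract (Chinese translation)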

时空系统常对局部扰动表现出高度异构且非直观的响应，这限制了传统消息传递方法在局部异构性下建模高阶交互的有效性。本文将时空预测重新定义为在局部结构化空间上学习信息流的问题，而非传播全局对齐的节点表征。我们提出了一种时空层扩散图神经网络（ST-Sheaf GNN），该网络将图拓扑嵌入到通过学习的线性限制映射连接的层理论向量空间中。与先前依赖静态或全局共享变换的研究不同，我们的模型学习随时间演化并适应局部时空模式的动态限制映射，从而实现显著更具表达力的交互。通过对潜在局部结构进行显式建模，所提框架有效缓解了深度GNN架构中的过度平滑现象。我们在多个领域的多样化现实时空预测基准数据集上评估了该框架。实验结果表明其达到了最先进的性能，凸显了层理论拓扑表征作为时空图学习强大基础的有效性。代码发布于：https://anonymous.4open.science/r/ST-SheafGNN-6523/。

Abstract

Spatio-temporal systems often exhibit highly heterogeneous and non-intuitive responses to localized disruptions, limiting the effectiveness of conventional message passing approaches in modeling higher-order interactions under local heterogeneity. This paper reformulates spatio-temporal forecasting as the problem of learning information flow over locally structured spaces, rather than propagating globally aligned node representations. We introduce a spatio-temporal sheaf diffusion graph neural network (ST-Sheaf GNN) that embeds graph topology into sheaf-theoretic vector spaces connected by learned linear restriction maps. Unlike prior work that relies on static or globally shared transformations, our model learns dynamic restriction maps that evolve over time and adapt to local spatio-temporal patterns to enable substantially more expressive interactions. By explicitly modeling latent local structure, the proposed framework efficiently mitigates the oversmoothing phenomenon in deep GNN architectures. We evaluate our framework on a diverse set of real-world spatio-temporal forecasting benchmarks spanning multiple domains. Experimental results demonstrate state-of-the-art performance, highlighting the effectiveness of sheaf-theoretic topological representations as a powerful foundation for spatio-temporal graph learning. The code is available at: https://anonymous.4open.science/r/ST-SheafGNN-6523/.

Keywords: spatio-temporal forecasting, graph neural network, sheaf theory, local structure, restriction maps, oversmoothing mitigation, topological representations
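
To make "restriction maps" concrete, here is a minimal sketch of one sheaf diffusion step in the simplest possible setting: one-dimensional stalks, so that each restriction map is a single scalar per edge endpoint. The sheaf Laplacian is formed from the edge disagreement `r_u * x_u - r_v * x_v`. The fixed, hand-picked maps and the function names are assumptions for illustration only; the paper learns dynamic, higher-dimensional maps, which this sketch does not attempt.

```python
def sheaf_laplacian_apply(x, edges):
    """Apply the sheaf Laplacian to node signal x.

    edges is a list of (u, v, r_u, r_v), where r_u and r_v are scalar
    restriction maps from nodes u and v into the shared edge space.
    """
    out = [0.0] * len(x)
    for u, v, r_u, r_v in edges:
        disagreement = r_u * x[u] - r_v * x[v]
        out[u] += r_u * disagreement
        out[v] -= r_v * disagreement
    return out


def diffusion_step(x, edges, step=0.1):
    """One explicit Euler step of x <- x - step * L x."""
    lx = sheaf_laplacian_apply(x, edges)
    return [xi - step * li for xi, li in zip(x, lx)]


# With identity maps the operator reduces to the ordinary graph Laplacian,
# so diffusion pulls neighbouring values together.
trivial_edges = [(0, 1, 1.0, 1.0)]
smoothed = diffusion_step([1.0, 3.0], trivial_edges, step=0.5)

# With a sign-flipping map, opposite values already "agree" on the edge,
# so diffusion leaves them untouched instead of averaging them out.
signed_edges = [(0, 1, 1.0, -1.0)]
preserved = diffusion_step([1.0, -1.0], signed_edges, step=0.5)
```

The second case hints at why non-trivial restriction maps can counteract oversmoothing: the stable states of the diffusion are no longer restricted to constant signals.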

281. ❌ AbLWR: A Context-Aware Listwise Ranking Framework for Antibody-Antigen Binding Affinity Prediction via Positive-Unlabeled Learning

Authors: Fan Xu, Zhi-an Huang, Haohuai He, Yidong Song, Wei Liu, Dongxu Zhang, Yao Hu, Kay Chen Tan Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11272v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

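Scoring rationale: The paper focuses on a bioinformatics application, antibody-antigen binding affinity prediction, using deep learning techniques (such as multi-head self-attention) and a PU learning mechanism. It belongs to AI for Science and is therefore highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10). However, it does not involve LLMs, MoE, SLMs, scaling laws, pre-training, post-training, alignment, RLHF, PEFT, RAG, context extension, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or the other keywords, which mainly concern LLM principles or applications; this is a conventional applied deep learning study, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes AbLWR, a context-aware listwise ranking framework that uses positive-unlabeled learning and multi-head self-attention to significantly improve antibody-antigen binding affinity prediction, raising Precision@1 by more than 10% in randomized cross-validation experiments.

Abstract (Chinese translation)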

抗体-抗原结合亲和力的精准预测是治疗设计的基石，但其发展仍受限于严重的标签稀疏性和抗原变异的复杂性。本文提出AbLWR（抗体-抗原结合亲和力列表排序）框架，将传统的亲和力回归任务重新定义为列表排序问题。为缓解标签稀疏性，AbLWR引入PU（正样本-未标注样本）学习机制，通过双层级对比目标和元优化标签细化来学习稳健表征。此外，我们采用同源抗原采样策略应对抗原变异问题，其中多头自注意力机制（MHSA, Multi-Head Self-Attention）显式建模训练列表内的样本间关系，以捕捉细微的亲和力差异。大量实验表明，AbLWR显著优于现有先进基线模型，在随机交叉验证实验中Precision@1（P@1）指标提升超过10%。值得注意的是，针对流感病毒和IL-33的案例研究验证了其实用价值，证明其在区分细微病毒突变方面具有稳健的排序一致性，并能有效筛选出适合湿实验验证的优质候选抗体。

Abstract

Accurate prediction of antibody-antigen binding affinity is fundamental to therapeutic design, yet remains constrained by severe label sparsity and the complexity of antigenic variations. In this paper, we propose AbLWR (Antibody-antigen binding affinity List-Wise Ranking), a novel framework that reformulates the conventional affinity regression task as a listwise ranking problem. To mitigate label sparsity, AbLWR incorporates a PU (Positive-Unlabeled) learning mechanism leveraging a dual-level contrastive objective and meta-optimized label refinement to learn robust representations. Furthermore, we address antigenic variation by employing a homologous antigen sampling strategy where Multi-Head Self-Attention (MHSA) explicitly models inter-sample relationships within training lists to capture subtle affinity nuances. Extensive experiments demonstrate that AbLWR significantly outperforms state-of-the-art baselines, improving the Precision@1 (P@1) by over 10% in randomized cross-validation experiments. Notably, case studies on Influenza and IL-33 validate its practical utility, demonstrating robust ranking consistency in distinguishing subtle viral mutations and efficiently prioritizing top-tier candidates for wet-lab screening.

Keywords: antibody-antigen binding affinity, listwise ranking, positive-unlabeled learning, multi-head self-attention, bioinformatics, therapeutic design, affinity prediction, deep learning

282. ❌ Mycelium-Index: A Streaming Approximate Nearest Neighbor Index with Myelial Edge Decay, Traffic-Driven Reinforcement, and Adaptive Living Hierarchy

Authors: Anton Pakhunov Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11274v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

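Scoring rationale: The paper studies a streaming approximate nearest neighbor index for high-dimensional vector spaces. Although inspired by biological mycelium, its core content is index data structures, algorithmic optimization, and performance evaluation, which is entirely unrelated to all scoring keywords (which focus on large models, deep learning principles, and their applications). The paper involves no language models, training methods, alignment techniques, inference optimization, agent systems, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper proposes a streaming approximate nearest neighbor index inspired by biological mycelium that, through dynamic topology adjustment and algorithmic optimizations, maintains high recall while substantially reducing memory use and improving query performance.

Abstract (Chinese translation)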

本文提出菌丝索引（mycelium-index），一种面向高维向量空间的流式近似最近邻（ANN）索引，其设计灵感来源于生物菌丝的自适应生长模式。该系统通过菌丝边缘衰减与强化机制、流量驱动的动态层级结构，以及结合冷节点的O(1)旁路删除与枢纽节点O(k)束搜索修复的混合删除策略，持续调整其拓扑结构。在SIFT-1M数据集上使用FreshDiskANN的100%全更新基准协议进行评估，实验结果表明：菌丝索引在召回率@5上达到0.927 +/- 0.028，处于FreshDiskANN约0.95召回率的测量置信区间内，同时内存使用量减少5.7倍（88 MB对比>500 MB），并实现4.7倍更高的每秒查询率（QPS）（2,795对比约600）。在静态索引测试中，当ef=192时，菌丝索引以5.2倍更低的内存消耗（163 MB对比854 MB）达到了与HNSW M=16相当的召回率（0.962对比0.965）。包括NEON SIMD距离计算、向量化节点存储（Vec-backed node storage）和位集访问跟踪（bitset visited tracking）在内的性能优化，累计带来了2.7倍的QPS提升。对十种流式修复机制的系统性研究发现：几何启发式方法在高维空间中普遍失效，而拓扑机制则能成功——我们将这一规律称为高维ANN图的拓扑修复不变性。

Abstract

We present mycelium-index, a streaming approximate nearest neighbor (ANN) index for high-dimensional vector spaces, inspired by the adaptive growth patterns of biological mycelium. The system continuously adapts its topology through myelial edge decay and reinforcement, a traffic-driven living hierarchy, and hybrid deletion combining O(1) bypass for cold nodes with O(k) beam-search repair for hub nodes. Experimental evaluation on SIFT-1M demonstrates that mycelium achieves 0.927 +/- 0.028 recall@5 under FreshDiskANN’s 100%-turnover benchmark protocol – within the measurement confidence interval of FreshDiskANN’s ~0.95 – while using 5.7x less RAM (88 MB vs. >500 MB) and achieving 4.7x higher QPS (2,795 vs. ~600). On the static index, at ef=192, mycelium matches HNSW M=16 recall (0.962 vs. 0.965) at 5.2x less RAM (163 MB vs. 854 MB). Performance optimizations including NEON SIMD distance computation, Vec-backed node storage, and bitset visited tracking yield a cumulative 2.7x QPS improvement. A systematic study of ten streaming repair mechanisms finds that geometric heuristics universally fail in high dimensions, while topological mechanisms succeed – a principle we term the topological repair invariance of high-dimensional ANN graphs.

Keywords: approximate nearest neighbor, streaming index, mycelium-inspired, high-dimensional vectors, topological repair, memory efficiency, query performance, ANN graphs
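
The headline metric, recall@k, compares the ids an approximate index returns with the exact nearest neighbours found by brute force. The sketch below shows that computation on a toy dataset, assuming the common definition (size of the overlap divided by k); the abstract does not spell out its exact evaluation code, and the function names and data here are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: ids of the k vectors closest to the query."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: squared_distance(query, vectors[i]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate result recovered."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


# Toy data: five 2-D points and a query at the origin.
toy_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (3.0, 3.0)]
ground_truth = exact_top_k((0.0, 0.0), toy_vectors, k=3)
# A made-up approximate answer that misses one true neighbour.
toy_recall = recall_at_k([0, 2, 4], ground_truth, k=3)
```

The reported speed and memory ratios are consistent with the raw figures quoted in the abstract: 2,795 / 600 ≈ 4.7 and 854 / 163 ≈ 5.2.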

283. ❌ Signal-Aware Conditional Diffusion Surrogates for Transonic Wing Pressure Prediction

Authors: Víctor Francés-Belda, Carlos Sanmiguel Vila, Rodrigo Castellanos Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11263v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

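Scoring rationale: The paper focuses on surrogate modeling of transonic wing pressure using a conditional denoising diffusion probabilistic model (a deep generative model), an application of AI to scientific computing (specifically computational fluid dynamics/aerodynamics). Its content is entirely unrelated to the vast majority of keywords (which mainly concern large language models, training and alignment techniques, inference optimization, agents, and so on). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the study is a concrete application of AI in science (aerodynamics), though not in bioinformatics or cheminformatics proper, hence a relevance of 5 (some relevance).

!!! tip deepseek-chat TL;DR

The study proposes a signal-aware training approach based on a conditional denoising diffusion model to predict surface pressure distributions on the NASA Common Research Model wing under varying flight conditions; relative to deterministic baselines, it lowers mean absolute error and better reconstructs suction peaks, shock structures, and control surface discontinuities.

Abstract (Chinese translation)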

精确高效的气动表面压力场代理模型对于加速飞行器设计与分析至关重要，然而基于逐点损失训练的确定性回归器常会平滑尖锐的非线性特征。本研究提出了一种条件去噪扩散概率模型，用于预测NASA通用研究模型机翼在不同马赫数、攻角及四个控制面偏转条件下的表面压力分布。该框架通过主成分表示处理非结构化表面数据，将其作为压力场的一种非截断、可逆的线性重参数化方法，从而实现了全连接架构。通过将重构损失在扩散过程中反向传播，推导出信号感知的训练目标，生成一种依赖于时间步长的加权策略，提升了强压力梯度区域的重建保真度。通过对重复条件生成过程进行分析，研究了随机采样特性，并引入局部可靠性指数与全局可靠性指数两项诊断指标，以关联采样诱导的分布宽度与重构误差。相较于所考虑的确定性基线方法，所提出的模型降低了平均绝对误差，并改善了对吸力峰值、激波结构及控制面不连续区域的重建效果。采样诱导的分布宽度与代理误差表现出强相关性，支持将其解释为定性可靠性指标而非经过校准的不确定性量化结果。

Abstract

Accurate and efficient surrogate models for aerodynamic surface pressure fields are essential for accelerating aircraft design and analysis, yet deterministic regressors trained with pointwise losses often smooth sharp nonlinear features. This work presents a conditional denoising diffusion probabilistic model for predicting surface pressure distributions on the NASA Common Research Model wing under varying conditions of Mach number, angle of attack, and four control surface deflections. The framework operates on unstructured surface data through a principal component representation used as a non-truncated, reversible linear reparameterization of the pressure field, enabling a fully connected architecture. A signal-aware training objective is derived by propagating a reconstruction loss through the diffusion process, yielding a timestep-dependent weighting that improves fidelity in regions with strong pressure gradients. The stochastic sampling process is analyzed through repeated conditional generations, and two diagnostic metrics are introduced, the Local Reliability Index and Global Reliability Index, to relate sampling-induced spread to reconstruction error. Relative to the considered deterministic baselines, the proposed formulation reduces mean absolute error and improves the reconstruction of suction peaks, shock structures, and control surface discontinuities. The sampling-induced spread exhibits strong correspondence with surrogate error, supporting its interpretation as a qualitative reliability indicator rather than calibrated uncertainty quantification.

Keywords: conditional denoising diffusion probabilistic model, surrogate model, aerodynamic surface pressure prediction, NASA Common Research Model, signal-aware training, unstructured surface data, reliability index, shock structure reconstruction
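
The phrase "non-truncated, reversible linear reparameterization" can be unpacked with a short worked identity. The notation is assumed here rather than taken from the paper: $x$ is a pressure field on the surface mesh, $\mu$ the training mean, and $V$ the square matrix whose columns are all retained orthonormal principal directions.

```latex
% Forward map to principal-component coordinates and its exact inverse
z = V^{\top}(x - \mu), \qquad x = \mu + V z, \qquad V^{\top} V = V V^{\top} = I .
% With truncation to k leading directions V_k, the inverse is only approximate:
\hat{x} = \mu + V_k V_k^{\top}(x - \mu), \qquad V_k V_k^{\top} \neq I \ \text{in general}.
```

Under this reading, keeping every component means the diffusion model works in coordinates $z$ without discarding any part of the field, so sharp features such as shocks are not lost to a low-rank projection before training.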

284. ❌ Trustworthy Feature Importance Avoids Unrestricted Permutations

Authors: Emanuele Borgonovo, Francesco Cappelli, Xuefei Lu, Elmar Plischke, Cynthia Rudin Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11253v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

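Scoring rationale: The paper focuses on statistical improvements to feature importance methods, in particular the extrapolation errors caused by unrestricted permutations, and proposes conditional model reliance, Knockoffs, and restricted ALE plot designs. Its content belongs to traditional machine learning/statistical learning and has no direct connection to any of the provided keywords on large models, deep learning, or AI for Science, so all keywords score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the extrapolation errors that unrestricted permutations cause in feature importance methods and proposes three new approaches (conditional model reliance, Knockoffs, and restricted ALE plot designs); theoretical and numerical results show that these strategies reduce or eliminate extrapolation errors.

Abstract (Chinese translation)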

采用无限制置换的特征重要性方法因外推误差而存在缺陷；此类误差存在于所有非平凡的变量重要性方法中。我们提出三种新方法：条件模型依赖度、高斯变换的Knockoffs方法以及受限ALE图设计。理论与数值结果表明，我们的策略能够减少或消除外推误差。

Abstract

Feature importance methods using unrestricted permutations are flawed due to extrapolation errors; such errors appear in all non-trivial variable importance approaches. We propose three new approaches: conditional model reliance and Knockoffs with Gaussian transformation, and restricted ALE plot designs. Theoretical and numerical results show our strategies reduce/eliminate extrapolation.

Keywords: feature importance, unrestricted permutations, extrapolation errors, conditional model reliance, Knockoffs, ALE plots, variable importance, statistical methods
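
A toy example helps show why unrestricted permutations extrapolate. Below, two features are perfectly dependent (`x2 == x1` on every observed row) and a hypothetical model `x1 - x2` predicts the constant target exactly on the data. Shuffling `x2` on its own creates rows that never occur in the data, and the model's behaviour on those rows shows up as "importance". The data, model, and function names are illustrative assumptions and do not reproduce any of the paper's proposed methods.

```python
def model(x1, x2):
    """Hypothetical fitted model; exact on the data, arbitrary off it."""
    return x1 - x2


def mse(rows, targets):
    """Mean squared error of the model over (x1, x2) rows."""
    return sum((model(x1, x2) - y) ** 2
               for (x1, x2), y in zip(rows, targets)) / len(rows)


def permute_second_feature(rows, perm):
    """Unrestricted permutation: reorder x2 while ignoring its link to x1."""
    return [(rows[i][0], rows[perm[i]][1]) for i in range(len(rows))]


# Observed data lie on the diagonal x2 == x1, and the target is constant.
rows = [(float(i), float(i)) for i in range(5)]
targets = [0.0] * 5

permuted = permute_second_feature(rows, perm=[4, 3, 2, 1, 0])
baseline_error = mse(rows, targets)
importance = mse(permuted, targets) - baseline_error
# Rows that left the support of the data after permutation.
off_support = sum(1 for x1, x2 in permuted if x1 != x2)
```

Here four of the five permuted rows lie off the diagonal, and the resulting error increase comes entirely from querying the model where no data exist. Restricting or conditioning the permutation, as the abstract advocates, is meant to avoid exactly this.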

285. ❌ Unified Graph Prompt Learning via Low-Rank Graph Message Prompting

Authors: Beibei Wang, Bo Jiang, Ziyan Zhang, Jin Tang Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11257v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

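Scoring rationale: The paper focuses on prompt learning for graph neural networks (GNNs) and proposes a unified low-rank graph message prompting method (LR-GMP) for fine-tuning on graph data. It is unrelated to most keywords, which mainly target large language models (LLMs) and associated techniques (alignment, reasoning, agents, and so on). It does have moderate relevance to some keywords: 1) "Pre-training OR Continual Pre-training OR Domain Adaptation" (5): the paper involves adapting pre-trained GNNs and domain adaptation; 2) "Post-training OR Supervised Fine-tuning OR SFT" (5): the paper addresses fine-tuning on graph data; 3) "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (5): the low-rank prompting method can be viewed as a parameter-efficient fine-tuning technique. The remaining keywords (LLM, MoE, reasoning, agents, and so on) have no direct connection to graph prompt learning and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of a unified framework among Graph Data Prompt (GDP) methods and proposes Low-Rank Graph Message Prompting (LR-GMP), which prompts all graph components in a unified way and achieves superior generalization and robustness across multiple downstream tasks.

Abstract (Chinese translation)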

图数据提示（Graph Data Prompt，GDP）通过在图数据中引入特定提示以高效适配预训练图神经网络，已成为图微调学习问题的主流方法。然而，现有GDP方法分别针对不同的图组件（如节点特征、边特征、边权重）设计，因而仅在有限的图数据提示空间中操作。据我们所知，目前仍缺乏一种能够同时面向所有图组件的统一提示器。为应对这一挑战，本文首先提出从图消息提示（Graph Message Prompt，GMP）范式的角度重新解读多种现有GDP方法。基于GMP，我们进一步引入一种新颖的图提示学习方法——低秩图消息提示（Low-Rank GMP，LR-GMP），该方法利用低秩提示表示实现高效紧凑的图提示学习。与传统GDP方法分别针对不同图组件进行提示不同，LR-GMP以统一方式同时对所有图组件执行提示操作，从而在多样化的下游任务中展现出显著更优的泛化能力和鲁棒性。在多个图基准数据集上的大量实验验证了所提LR-GMP方法的有效性和优势。

Abstract

Graph Data Prompt (GDP), which introduces specific prompts in graph data for efficiently adapting pre-trained GNNs, has become a mainstream approach to graph fine-tuning learning problem. However, existing GDPs have been respectively designed for distinct graph component (e.g., node features, edge features, edge weights) and thus operate within limited prompt spaces for graph data. To the best of our knowledge, it still lacks a unified prompter suitable for targeting all graph components simultaneously. To address this challenge, in this paper, we first propose to reinterpret a wide range of existing GDPs from an aspect of Graph Message Prompt (GMP) paradigm. Based on GMP, we then introduce a novel graph prompt learning approach, termed Low-Rank GMP (LR-GMP), which leverages low-rank prompt representation to achieve an effective and compact graph prompt learning. Unlike traditional GDPs that target distinct graph components separately, LR-GMP concurrently performs prompting on all graph components in a unified manner, thereby achieving significantly superior generalization and robustness on diverse downstream tasks. Extensive experiments on several graph benchmark datasets demonstrate the effectiveness and advantages of our proposed LR-GMP.

Keywords: Graph Prompt Learning, Low-Rank Prompting, Graph Message Prompt, Unified Prompting, Graph Fine-tuning, GNN Adaptation, Graph Data Prompt, Downstream Tasks

286. ❌ Regional Explanations: Bridging Local and Global Variable Importance

Authors: Salim I. Amoukou, Nicolas J-B. Brunel Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11223v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

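Scoring rationale: This paper belongs to machine learning model interpretability. It studies the limitations of local feature attribution methods (such as Local Shapley Values and LIME) and proposes R-LOCO, a new method that bridges local and global explanations. Its content is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, AI applications, and so on) and is highly related only to “Mechanistic Interpretability OR Explainable AI”, because explainable AI is its core topic.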

!!! tip deepseek-chat TL;DR

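The paper analyzes the limitations of local feature attribution methods (Local Shapley Values and LIME), finds that they can wrongly assign importance to irrelevant features, and proposes R-LOCO, which segments the input space into regions and applies global attribution methods within each region to give more accurate and stable local explanations.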

Abstract (Chinese translation)

我们分析了两种广泛使用的局部归因方法——局部沙普利值（Local Shapley Values）与LIME，它们旨在量化特征值$x_i$对特定预测$f(x_1, \dots, x_p)$的贡献。尽管应用广泛，我们发现这两种方法即使在精确计算与特征独立的理想条件下，仍存在可靠检测局部重要特征的根本性局限。我们认为，一个合理的局部归因方法不应将重要性分配给那些既不影响模型输出（例如线性模型中系数为零的特征），也未与功能相关特征表现出统计依赖性的特征。我们证明局部沙普利值与LIME均违背了这一基本原则。为解决此问题，我们提出R-LOCO（区域协变量排除法），该方法弥合了局部解释与全局解释之间的鸿沟，并提供更准确的归因结果。R-LOCO将输入空间划分为具有相似特征重要性特征的区域，随后在这些区域内应用全局归因方法，根据实例所属区域推导其特征贡献。这一方法在避免局部解释不稳定性、保留全局方法常丢失的实例特异性细节的同时，提供了更可信的局部归因。

Abstract

We analyze two widely used local attribution methods, Local Shapley Values and LIME, which aim to quantify the contribution of a feature value $x_i$ to a specific prediction $f(x_1, \dots, x_p)$. Despite their widespread use, we identify fundamental limitations in their ability to reliably detect locally important features, even under ideal conditions with exact computations and independent features. We argue that a sound local attribution method should not assign importance to features that neither influence the model output (e.g., features with zero coefficients in a linear model) nor exhibit statistical dependence with functionality-relevant features. We demonstrate that both Local SV and LIME violate this fundamental principle. To address this, we propose R-LOCO (Regional Leave Out COvariates), which bridges the gap between local and global explanations and provides more accurate attributions. R-LOCO segments the input space into regions with similar feature importance characteristics. It then applies global attribution methods within these regions, deriving an instance’s feature contributions from its regional membership. This approach delivers more faithful local attributions while avoiding local explanation instability and preserving instance-specific detail often lost in global methods.

Keywords: local attribution methods, feature importance, explainable AI, Shapley values, LIME, model interpretability, R-LOCO, regional explanations
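
The regional idea described in the abstract above can be illustrated with a small numerical sketch. This is not the authors' R-LOCO algorithm: the two regions are hand-specified rather than learned, the model is ordinary least squares, importances are computed in-sample, and every name (`loco_importance`, `in_region_a`, and so on) is hypothetical. It only shows that a leave-one-covariate-out (LOCO) importance computed inside each region gives near-zero weight to a feature that does not influence the output in that region.

```python
import numpy as np


def mse_of_linear_fit(X, y):
    """In-sample mean squared error of a least-squares fit with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))


def loco_importance(X, y):
    """LOCO importance: increase in MSE when one feature is dropped and the model refit."""
    full = mse_of_linear_fit(X, y)
    return np.array([
        mse_of_linear_fit(np.delete(X, j, axis=1), y) - full
        for j in range(X.shape[1])
    ])


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))

# Assumed toy data: the sign of feature 2 selects the region.
# In region A only feature 0 matters; in region B only feature 1 matters.
in_region_a = X[:, 2] > 0
y = np.where(in_region_a, 3.0 * X[:, 0], 3.0 * X[:, 1])

importance_a = loco_importance(X[in_region_a], y[in_region_a])
importance_b = loco_importance(X[~in_region_a], y[~in_region_a])

print("region A:", np.round(importance_a, 3))
print("region B:", np.round(importance_b, 3))
```

A single global LOCO run on the same data would report both feature 0 and feature 1 as important everywhere, which is the instance-specific detail that the abstract says global methods tend to lose.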

287. ❌ CapBench: A Multi-PDK Dataset for Machine-Learning-Based Post-Layout Capacitance Extraction

Authors: Hector R. Rodriguez, Jiechen Huang, Wenjian Yu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11202v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

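Scoring rationale: This paper applies machine learning to electronic design automation (EDA), specifically capacitance extraction. It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, and so on), which mainly target natural language processing and large language models. The only related keyword is “AI for Science OR Bioinformatics OR Cheminformatics”: the work applies AI to scientific computing (chip design), which falls under AI for Science, but it is not core bioinformatics or cheminformatics, hence 5 points (some relevance).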

!!! tip deepseek-chat TL;DR

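The paper presents CapBench, a multi-PDK (process design kit) dataset for machine-learning-based capacitance extraction, and evaluates several neural network architectures; CNNs are the most accurate while GNNs are the fastest, revealing an accuracy-speed trade-off.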

Abstract (Chinese translation)

我们推出CapBench——一个完全可复现、跨多工艺设计套件（PDK）的电容提取数据集。该数据集源自开源设计，包括单核CPU、片上系统及媒体加速器。所有设计均通过14次独立的OpenROAD流程运行完成完整布局布线，覆盖ASAP7、NanGate45和Sky130HD三种工艺节点。从这些版图中，我们提取了跨越三个尺寸层级的61,855个三维窗口，以支持迁移学习和可扩展性研究。高精度电容标签采用最先进的随机行走求解器RWCap生成，并通过行业标准工具Raphael验证，总电容平均绝对误差为0.64%。每个窗口均被预处理为密度图、图表示和点云数据。我们评估了10种机器学习架构以展示数据集用途并建立基准，包括卷积神经网络（CNN）、点云变换器和图神经网络（GNN）。实验表明CNN误差最低（1.75%），而GNN速度最快（提升达41.4倍）但误差较高（10.2%），揭示了精度与速度间的显著权衡。代码与数据集已发布于https://github.com/THU-numbda/CapBench。

Abstract

We present CapBench, a fully reproducible, multi-PDK dataset for capacitance extraction. The dataset is derived from open-source designs, including single-core CPUs, systems-on-chip, and media accelerators. All designs are fully placed and routed using 14 independent OpenROAD flow runs spanning three technology nodes: ASAP7, NanGate45, and Sky130HD. From these layouts, we extract 61,855 3D windows across three size tiers to enable transfer learning and scalability studies. High-fidelity capacitance labels are generated using RWCap, a state-of-the-art random-walk solver, and validated against the industry-standard Raphael, achieving a mean absolute error of 0.64% for total capacitance. Each window is pre-processed into density maps, graph representations, and point clouds. We evaluate 10 machine learning architectures that illustrate dataset usage and serve as baselines, including convolutional neural networks (CNNs), point cloud transformers, and graph neural networks (GNNs). CNNs demonstrate the lowest errors (1.75%), while GNNs are up to 41.4x faster but exhibit larger errors (10.2%), illustrating a clear accuracy-speed trade-off. Code and dataset are available at https://github.com/THU-numbda/CapBench.

Keywords: Capacitance extraction, Machine learning, EDA, Dataset, CNN, GNN, Transfer learning, Post-layout

288. ❌ ShapShift: Explaining Model Prediction Shifts with Subgroup Conditional Shapley Values

Authors: Tom Bewley, Salim I. Amoukou, Emanuele Albini, Saumitra Mishra, Manuela Veloso Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11200v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

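Scoring rationale: The paper “ShapShift: Explaining Model Prediction Shifts with Subgroup Conditional Shapley Values” focuses on explaining prediction shifts of machine learning models. It proposes a Shapley-value-based attribution technique that attributes prediction shifts to changes in the conditional probabilities of interpretable data subgroups. The method applies to decision trees and tree ensembles, and extends to models such as neural networks through surrogate trees. The core contribution lies in model interpretability (Explainable AI), specifically the explanation of prediction shifts. It is therefore highly related only to “Mechanistic Interpretability OR Explainable AI” (10 points), since the paper directly concerns model explanation and explainable-AI techniques. All other keywords (large models, deep learning principles, scientific applications, and so on) are unrelated and scored 0.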

!!! tip deepseek-chat TL;DR

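The paper proposes ShapShift, a Shapley value method for explaining prediction shifts in machine learning models caused by changes in the input distribution. By attributing shifts to changes in the conditional probabilities of interpretable data subgroups, it provides simple, faithful, and near-complete explanations for model monitoring in dynamic environments.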

Abstract (Chinese translation)

输入分布的变化可能引发机器学习模型平均预测的偏移。此类预测偏移会影响下游业务结果（例如银行的贷款批准率），因此理解其成因至关重要。我们提出ShapShift方法：一种基于沙普利值（Shapley value）的归因框架，将预测偏移归因于可解释数据子组的条件概率变化，这些子组通过决策树的结构进行定义。我们首先将该方法应用于单棵决策树，基于分裂节点的条件概率变化提供精确解释。随后，我们通过选择最具解释力的树并考虑残差效应，将其扩展至树集成模型。最后，我们提出一种与模型无关的变体，使用通过新型目标函数训练的代理树，使其能够应用于神经网络等模型。虽然精确计算可能消耗大量资源，但近似技术使其具备实际应用可行性。实验表明，ShapShift能为跨模型类别的预测偏移提供简洁、忠实且近乎完整的解释，有助于在动态环境中进行模型监控。

Abstract

Changes in input distribution can induce shifts in the average predictions of machine learning models. Such prediction shifts may impact downstream business outcomes (e.g. a bank’s loan approval rate), so understanding their causes can be crucial. We propose ShapShift: a Shapley value method for attributing prediction shifts to changes in the conditional probabilities of interpretable subgroups of data, where these subgroups are defined by the structure of decision trees. We initially apply this method to single decision trees, providing exact explanations based on conditional probability changes at split nodes. Next, we extend it to tree ensembles by selecting the most explanatory tree and accounting for residual effects. Finally, we propose a model-agnostic variant using surrogate trees grown with a novel objective function, allowing application to models like neural networks. While exact computation can be intensive, approximation techniques enable practical application. We show that ShapShift provides simple, faithful, and near-complete explanations of prediction shifts across model classes, aiding model monitoring in dynamic environments.

Keywords: prediction shifts, Shapley values, explainable AI, model interpretability, decision trees, tree ensembles, surrogate models, model monitoring
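
To make the single-tree case in the abstract above more concrete, here is a minimal sketch of one possible reading of it, not the authors' implementation. It assumes a hypothetical two-split tree whose mean prediction is determined by the conditional probability of going left at each split node, treats each split node as a player, and computes exact Shapley values by enumerating orderings. The tree shape, leaf values, probabilities, and all function names are made up for illustration.

```python
from itertools import permutations


def mean_prediction(p_root, p_child, leaves=(1.0, 0.0, 0.5)):
    """Mean prediction of a toy tree.

    The root sends a sample left with probability p_root; the left child then
    sends it to leaf a with probability p_child, otherwise to leaf b.
    Samples going right at the root land in leaf c.
    """
    a, b, c = leaves
    return p_root * (p_child * a + (1 - p_child) * b) + (1 - p_root) * c


def shapley(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            phi[p] += (value(frozenset(coalition)) - before) / len(orderings)
    return phi


# Hypothetical conditional probabilities before and after a distribution change.
old = {"root": 0.6, "child": 0.5}
new = {"root": 0.4, "child": 0.8}


def value(shifted_nodes):
    """Mean prediction when only the nodes in `shifted_nodes` use the new probabilities."""
    probs = {k: (new[k] if k in shifted_nodes else old[k]) for k in old}
    return mean_prediction(probs["root"], probs["child"])


phi = shapley(list(old), value)
total_shift = value(frozenset(old)) - value(frozenset())
print(phi, total_shift)
```

By the efficiency property of Shapley values, the per-node attributions sum to the total shift in the mean prediction. Enumerating orderings is exponential in the number of split nodes, which is consistent with the abstract's remark that exact computation can be intensive.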

289. ❌ Probabilistic Prediction of Neural Dynamics via Autoregressive Flow Matching

Authors: Nicole Rogalla, Yuzhen Qin, Mario Senden, Ahmed El-Gazzar, Marcel van Gerven Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11178v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

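Scoring rationale: This paper applies deep learning to neuroscience, proposing a generative forecasting framework based on autoregressive flow matching (AFM) to model neural dynamics. It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on). The only related keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, because the work applies AI to science (neuroscience/bioinformatics). It is not a core match, however (the paper does not mention these terms directly, and its focus is a specific method rather than the broad AI for Science theme), hence 5 points (some relevance).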

!!! tip deepseek-chat TL;DR

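The study proposes a generative framework based on autoregressive flow matching (AFM) for probabilistically predicting short-term neural activity (BOLD signals) from multimodal sensory input. On the Algonauts project fMRI dataset it significantly outperforms the baselines, showing potential for closed-loop neurotechnology.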

Abstract (Chinese translation)

预测自然刺激下的神经活动，仍是理解大脑动态和实现下游神经技术应用的关键挑战。本文提出了一种基于自回归流匹配的生成式预测框架，用于建模神经动态。该方法基于近期基于传输的生成建模进展，能够从多模态感觉输入中概率性地大规模预测神经响应。具体而言，我们学习给定过去神经动态与当前感觉输入条件下未来神经活动的条件分布，将神经活动明确建模为一个时间演化过程，其中未来状态依赖于近期的神经历史。我们在Algonauts项目2025挑战赛的功能磁共振成像数据集上，使用被试特异性模型评估了该框架。在预测短期、基于脑区的血氧水平依赖信号活动时，自回归流匹配方法显著优于非自回归流匹配基线模型和官方挑战赛的通用线性模型基线，显示出更好的泛化能力和广泛的皮层预测性能。消融分析表明，获取过去的BOLD动态是性能的主要驱动因素，而在短期、上下文丰富的条件下，自回归分解能带来持续且适度的性能提升。综上所述，这些发现确立了基于自回归流的生成建模作为一种有效的神经动态短期概率预测方法，在闭环神经技术中具有广阔的应用前景。

Abstract

Forecasting neural activity in response to naturalistic stimuli remains a key challenge for understanding brain dynamics and enabling downstream neurotechnological applications. Here, we introduce a generative forecasting framework for modeling neural dynamics based on autoregressive flow matching (AFM). Building on recent advances in transport-based generative modeling, our approach probabilistically predicts neural responses at scale from multimodal sensory input. Specifically, we learn the conditional distribution of future neural activity given past neural dynamics and concurrent sensory input, explicitly modeling neural activity as a temporally evolving process in which future states depend on recent neural history. We evaluate our framework on the Algonauts project 2025 challenge functional magnetic resonance imaging dataset using subject-specific models. AFM significantly outperforms both a non-autoregressive flow-matching baseline and the official challenge general linear model baseline in predicting short-term parcel-wise blood oxygenation level-dependent (BOLD) activity, demonstrating improved generalization and widespread cortical prediction performance. Ablation analyses show that access to past BOLD dynamics is a dominant driver of performance, while autoregressive factorization yields consistent, modest gains under short-horizon, context-rich conditions. Together, these findings position autoregressive flow-based generative modeling as an effective approach for short-term probabilistic forecasting of neural dynamics with promising applications in closed-loop neurotechnology.

Keywords: autoregressive flow matching, neural dynamics, probabilistic forecasting, fMRI, BOLD activity, generative modeling, brain dynamics, neurotechnology

290. ❌ Cost-optimal Sequential Testing via Doubly Robust Q-learning

Authors: Doudou Zhou, Yiran Zhang, Dian Jin, Yingye Zheng, Lu Tian, Tianxi Cai Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11165v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

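Scoring rationale: This paper studies cost-optimal sequential testing strategies in clinical decision-making, using reinforcement learning (Q-learning) to handle informative missingness in retrospective data. Its core is the application of statistical learning and causal inference to medical decision-making, which is unrelated to most large-model keywords (LLM, MoE, SFT, RAG, and so on). The only related keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, because the paper involves biomedical data analysis and clinical decision support, an application of AI in science; this is not its core contribution, hence 5 points (some relevance).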

!!! tip deepseek-chat TL;DR

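The paper addresses costly sequential testing in clinical decision-making by developing a doubly robust Q-learning framework that learns optimal testing policies from retrospective data; simulations and a prostate cancer cohort study show that it reduces testing cost without compromising predictive accuracy.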

Abstract (Chinese translation)

临床决策常涉及选择昂贵、侵入性或耗时的检测，这推动了个体化、序贯化的测量策略研究，以确定应检测何种指标及何时停止确认。我们研究从回顾性数据中学习成本最优序贯决策策略的问题，其中检测的可获得性取决于先前结果，从而引发信息性缺失。在序贯随机缺失机制下，我们开发了一个双重稳健的Q学习框架来估计最优策略。该方法引入了路径特异性逆概率权重，这些权重能处理异质性的检测轨迹，并在给定观测历史条件下满足归一化特性。通过将这些权重与辅助对比模型相结合，我们构建了正交伪结局变量，使得在采集模型或对比模型任一被正确设定时，能够实现无偏的策略学习。我们建立了阶段对比估计量的oracle不等式，以及学习策略的收敛速率、遗憾界和误分类率。模拟实验表明，与加权及完整病例基线方法相比，本方法在成本调整后的性能表现更优；在前列腺癌队列研究中的应用则表明，该方法能在不降低预测准确性的前提下有效降低检测成本。

Abstract

Clinical decision-making often involves selecting tests that are costly, invasive, or time-consuming, motivating individualized, sequential strategies for what to measure and when to stop ascertaining. We study the problem of learning cost-optimal sequential decision policies from retrospective data, where test availability depends on prior results, inducing informative missingness. Under a sequential missing-at-random mechanism, we develop a doubly robust Q-learning framework for estimating optimal policies. The method introduces path-specific inverse probability weights that account for heterogeneous test trajectories and satisfy a normalization property conditional on the observed history. By combining these weights with auxiliary contrast models, we construct orthogonal pseudo-outcomes that enable unbiased policy learning when either the acquisition model or the contrast model is correctly specified. We establish oracle inequalities for the stage-wise contrast estimators, along with convergence rates, regret bounds, and misclassification rates for the learned policy. Simulations demonstrate improved cost-adjusted performance over weighted and complete-case baselines, and an application to a prostate cancer cohort study illustrates how the method reduces testing cost without compromising predictive accuracy.

Keywords: sequential testing, clinical decision-making, Q-learning, cost-optimal policy, informative missingness, doubly robust estimation, retrospective data, prostate cancer
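
The "doubly robust" property mentioned in the abstract above can be seen in a much simpler setting than the paper's sequential one. The sketch below is an assumption-laden illustration, not the authors' estimator: it uses the standard single-stage augmented inverse-probability-weighted (AIPW) estimator of a mean under missing-at-random data, with simulated data and hypothetical names (`doubly_robust_mean`, `pi_true`, and so on). It shows that the estimate stays close to the truth when either the observation-probability model or the outcome model is correct, while the complete-case mean is biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Simulated data: the outcome depends on x, and so does the chance of observing it.
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)      # true mean of y is 1.0
pi_true = 1.0 / (1.0 + np.exp(-x))          # P(outcome observed | x)
observed = rng.random(n) < pi_true


def doubly_robust_mean(y, observed, pi_hat, m_hat):
    """AIPW estimate of E[y]: outcome model plus inverse-probability-weighted residual."""
    residual = np.where(observed, y - m_hat, 0.0)   # unobserved outcomes are never used
    return float(np.mean(m_hat + residual / pi_hat))


# Naive baseline: average only the observed outcomes.
complete_case = float(y[observed].mean())

# Correct observation model, deliberately wrong outcome model (constant 0).
dr_good_pi = doubly_robust_mean(y, observed, pi_true, np.zeros(n))

# Deliberately wrong observation model (constant 0.5), correct outcome model.
dr_good_m = doubly_robust_mean(y, observed, np.full(n, 0.5), 1.0 + 2.0 * x)

print(complete_case, dr_good_pi, dr_good_m)
```

The paper's setting replaces the single observation probability with path-specific weights over test trajectories and the outcome model with stage-wise contrast models; this sketch does not attempt either.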

291. ❌ Gradient-Variation Regret Bounds for Unconstrained Online Learning

Authors: Yuheng Zhao, Andrew Jacobsen, Nicolò Cesa-Bianchi, Peng Zhao Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11151v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

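Scoring rationale: This paper studies parameter-free algorithms for unconstrained online learning, with a focus on gradient variation and regret bounds; it belongs to optimization theory. All scoring keywords concern large models, deep learning techniques, and their applications, none of which this paper touches, so every keyword receives a relevance of 0.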

!!! tip deepseek-chat TL;DR

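For unconstrained online learning, the paper develops parameter-free algorithms with gradient-variation regret guarantees and extends them to dynamic regret and the stochastically-extended adversarial (SEA) model.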

Abstract (Chinese translation)

我们针对无约束在线学习开发了无需预设参数的算法，其遗憾界可与梯度变化量$V_T(u) = \sum_{t=2}^T \|\nabla f_t(u)-\nabla f_{t-1}(u)\|^2$相关联。对于$L$-光滑凸损失函数，我们提出了完全自适应的算法，能够实现量级为$\widetilde{O}(\|u\|\sqrt{V_T(u)} + L\|u\|^2+G^4)$的遗憾，且无需预先获知比较器范数$\|u\|$、利普希茨常数$G$或光滑度$L$。每一轮的更新可通过闭式表达式高效计算。我们的结果可扩展至动态遗憾，并直接应用于随机扩展对抗（SEA）模型，该结果显著改进了先前的最佳已知成果[Wang et al., 2025]。

Abstract

We develop parameter-free algorithms for unconstrained online learning with regret guarantees that scale with the gradient variation $V_T(u) = \sum_{t=2}^T \|\nabla f_t(u)-\nabla f_{t-1}(u)\|^2$. For $L$-smooth convex loss, we provide fully-adaptive algorithms achieving regret of order $\widetilde{O}(\|u\|\sqrt{V_T(u)} + L\|u\|^2+G^4)$ without requiring prior knowledge of comparator norm $\|u\|$, Lipschitz constant $G$, or smoothness $L$. The update in each round can be computed efficiently via a closed-form expression. Our results extend to dynamic regret and find immediate implications to the stochastically-extended adversarial (SEA) model, which significantly improves upon the previous best-known result [Wang et al., 2025].

Keywords: online learning, parameter-free algorithms, regret bounds, gradient variation, unconstrained optimization, dynamic regret, stochastically-extended adversarial model, convex loss
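
The gradient variation $V_T(u)$ defined in the abstract above is easy to compute for a concrete loss sequence. The sketch below assumes hypothetical quadratic losses $f_t(x) = \tfrac{1}{2}\|x - c_t\|^2$ with randomly drawn centers $c_t$ (neither appears in the paper) and evaluates the sum of squared Euclidean norms of consecutive gradient differences at a fixed comparator `u`. It illustrates the quantity only, not the paper's algorithms.

```python
import numpy as np


def gradient_variation(grads):
    """Sum over t = 2..T of the squared norm of grad f_t(u) - grad f_{t-1}(u).

    `grads` has shape (T, d); row t holds the gradient of the t-th loss at u.
    """
    diffs = np.diff(grads, axis=0)      # consecutive differences, shape (T - 1, d)
    return float(np.sum(diffs ** 2))


rng = np.random.default_rng(0)
centers = rng.normal(size=(50, 3))      # assumed loss centers c_1, ..., c_T
u = np.array([1.0, -2.0, 0.5])          # arbitrary comparator point

# For f_t(x) = 0.5 * ||x - c_t||^2 the gradient at u is u - c_t.
grads = u - centers
v_t = gradient_variation(grads)
print(v_t)
```

For these quadratic losses the gradient difference equals $c_{t-1} - c_t$, so $V_T(u)$ does not depend on `u`; a slowly drifting sequence of centers gives a small $V_T(u)$, which is the regime where a bound scaling with $\sqrt{V_T(u)}$ is most informative.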

292. ❌ A Full Compression Pipeline for Green Federated Learning in Communication-Constrained Environments

Authors: Elouan Colybes, Shririn Salehi, Anke Schmeink Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11146v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

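Scoring rationale: This paper focuses on communication efficiency in federated learning (FL) and proposes a Full Compression Pipeline (FCP) combining pruning, quantization, and Huffman encoding. Its core is model compression (quantization in particular), so it is highly related to “Quantization OR Model Compression OR Low-bit Weights” (10 points). However, the paper does not involve large language models (LLMs), new deep learning principles, or applications of large models in different domains, nor does it mention other keywords (MoE, Scaling Laws, Alignment, RAG, CoT, Agents, and so on). All keywords other than the quantization one are therefore scored 0.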

!!! tip deepseek-chat TL;DR

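For federated learning in communication-constrained environments, the paper proposes a Full Compression Pipeline (FCP) integrating pruning, quantization, and Huffman encoding. On CIFAR-10 it achieves more than 11× reduction in model size with only a 2% drop in accuracy, and makes federated learning training more than 60% faster.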

Abstract (Chinese translation)

联邦学习（Federated Learning, FL）使得分布式客户端能够在无需共享原始数据的情况下进行协作式模型训练，从而保护数据隐私。然而，联邦学习通常面临显著的通信与计算开销，限制了其可扩展性与可持续性。本文针对通信受限环境，提出一种用于联邦学习的全压缩流水线（Full Compression Pipeline, FCP）。该流水线将三种互补的深度压缩技术（剪枝、量化和霍夫曼编码）集成到一个统一的端到端框架中。通过对本地模型与通信负载进行压缩，FCP在保持竞争力精度的同时，显著降低了传输成本与资源消耗。为量化其影响，我们开发了一个评估框架，将通信与计算开销统一建模为综合模型成本，从而实现对效率权衡的整体评估。该流水线在独立同分布（IID）与非独立同分布（non-IID）数据设置下进行了验证。在一个代表性场景中，使用十个客户端在CIFAR-10数据集上训练ResNet-12模型，带宽为2 Mbps，FCP实现了超过11倍的模型尺寸缩减，与未压缩基线相比精度仅下降2%。这使得联邦学习训练速度提升了60%以上。

Abstract

Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, thereby preserving privacy. However, FL often suffers from significant communication and computational overhead, limiting its scalability and sustainability. In this work, we introduce a Full Compression Pipeline (FCP) for FL in communication-constrained environments. FCP integrates three complementary deep compression techniques (pruning, quantization, and Huffman encoding) into a unified end-to-end framework. By compressing local models and communication payloads, FCP substantially reduces transmission costs and resource consumption while maintaining competitive accuracy. To quantify its impact, we develop an evaluation framework that captures both communication and computation overheads as a unified model cost, allowing a holistic assessment of efficiency trade-offs. The pipeline is evaluated in an independent and identically distributed (IID) and non-IID data setting. In one representative scenario, training a ResNet-12 model on the CIFAR-10 dataset with ten clients and a 2 Mbps bandwidth, the FCP achieves more than 11$\times$ reduction in model size, with only a 2% drop in accuracy compared to the uncompressed baseline. This results in an FL training that is more than 60% faster.

Keywords: Federated Learning, Model Compression, Quantization, Communication Efficiency, Pruning, Huffman Encoding, Green AI, Distributed Training
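
The three stages named in the abstract above (pruning, quantization, Huffman encoding) can be chained in a few lines. The sketch below is a generic illustration under stated assumptions, not the authors' FCP: it uses magnitude pruning at an assumed 80% sparsity, uniform 4-bit quantization, and a textbook Huffman code on random weights, and all function names are hypothetical. The resulting ratio ignores the codebook and quantization metadata and is not comparable to the figures reported in the paper.

```python
import heapq
from collections import Counter

import numpy as np


def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)


def quantize_uniform(weights, bits):
    """Map each weight to one of 2**bits evenly spaced integer codes."""
    levels = 2 ** bits - 1
    low, high = weights.min(), weights.max()
    scale = (high - low) / levels
    codes = np.round((weights - low) / scale).astype(np.int64)
    return codes, low, scale


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:                    # degenerate case: a single symbol
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


rng = np.random.default_rng(0)
weights = rng.normal(size=10_000).astype(np.float32)   # stand-in for one layer

pruned = prune_by_magnitude(weights, sparsity=0.8)
codes, low, scale = quantize_uniform(pruned, bits=4)

symbols = codes.tolist()
lengths = huffman_code_lengths(symbols)
counts = Counter(symbols)
compressed_bits = sum(counts[s] * lengths[s] for s in counts)
ratio = 32 * weights.size / compressed_bits             # versus 32-bit floats

print(f"compression ratio (payload only): {ratio:.1f}x")
```

Pruning first matters for the last stage: all pruned weights share one quantization code, so that symbol dominates the histogram and receives a very short Huffman code.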

293. ❌ From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning

Authors: Chen Zhan, Xiaoyu Tan, Gengchen Ma, Yu-Jie Xiong, Xiaoyan Jiang, Xihe Qiu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11137v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

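Scoring rationale: The paper studies the application of LLMs to clinical diagnosis and proposes the CGCL training method to improve the transparency and reliability of reasoning. Highly relevant keywords: LLMs (the core model), Chain of Thought/System 2 Thinking (structured reasoning), Hallucination Mitigation (addressing hallucination), Explainable AI (improving interpretability), and AI for Science (a medical-science application). Moderately relevant: Post-training/SFT (the training method), Instruction Tuning (goal alignment), and Self-Correction (reasoning improvement). The remaining keywords are unrelated to the paper's technical content.

!!! tip deepseek-chat TL;DR

To address the opaque and unreliable reasoning of LLMs in clinical diagnosis, the paper proposes a Toulmin-model-based Curriculum Goal-Conditioned Learning framework that reaches diagnostic accuracy and reasoning quality comparable to resource-intensive RL methods while offering a more stable and efficient training pipeline.

Abstract translation (Chinese)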

将大型语言模型（LLM）整合到临床决策支持中的关键障碍在于其推理过程不透明且往往不可靠。在医疗这一高风险领域，仅提供正确答案是不够的；临床实践要求完全的透明度，以确保患者安全并实现专业问责。当前大型语言模型存在一个普遍且危险的弱点，即倾向于通过有缺陷的推理得出“正确答案”。这一问题远不止是一个微小的学术瑕疵；此类过程性错误表明模型从根本上缺乏稳健的理解能力，使其在面对真实世界的临床复杂性时，容易产生更广泛的幻觉和不可预测的故障。本文通过将图尔敏模型（Toulmin model）应用于诊断过程，建立了一个可信赖的临床论证框架。我们提出了一种新颖的训练流程：课程目标条件学习（Curriculum Goal-Conditioned Learning, CGCL），旨在逐步训练大型语言模型生成明确遵循图尔敏结构的诊断论证。CGCL的渐进式三阶段课程系统地构建了坚实的临床论证：（1）提取事实并生成鉴别诊断；（2）为核心假设提供依据，同时反驳其他可能性；（3）将分析综合成最终的、有条件的结论。我们使用T-Eval（一个衡量诊断推理完整性的量化框架）对CGCL进行了验证。实验表明，我们的方法在诊断准确性和推理质量上达到了与资源密集的强化学习（Reinforcement Learning, RL）方法相当的水平，同时提供了一个更稳定、更高效的训练流程。

Abstract

The integration of Large Language Models (LLMs) into clinical decision support is critically obstructed by their opaque and often unreliable reasoning. In the high-stakes domain of healthcare, correct answers alone are insufficient; clinical practice demands full transparency to ensure patient safety and enable professional accountability. A pervasive and dangerous weakness of current LLMs is their tendency to produce “correct answers through flawed reasoning.” This issue is far more than a minor academic flaw; such process errors signal a fundamental lack of robust understanding, making the model prone to broader hallucinations and unpredictable failures when faced with real-world clinical complexity. In this paper, we establish a framework for trustworthy clinical argumentation by adapting the Toulmin model to the diagnostic process. We propose a novel training pipeline: Curriculum Goal-Conditioned Learning (CGCL), designed to progressively train LLM to generate diagnostic arguments that explicitly follow this Toulmin structure. CGCL’s progressive three-stage curriculum systematically builds a solid clinical argument: (1) extracting facts and generating differential diagnoses; (2) justifying a core hypothesis while rebutting alternatives; and (3) synthesizing the analysis into a final, qualified conclusion. We validate CGCL using T-Eval, a quantitative framework measuring the integrity of the diagnosis reasoning. Experiments show that our method achieves diagnostic accuracy and reasoning quality comparable to resource-intensive Reinforcement Learning (RL) methods, while offering a more stable and efficient training pipeline.

Keywords: Large Language Models, Clinical Diagnostic Reasoning, Toulmin Model, Curriculum Goal-Conditioned Learning, Trustworthy AI, Hallucination Mitigation, Explainable AI, Healthcare AI

294. ❌ MADQRL: Distributed Quantum Reinforcement Learning Framework for Multi-Agent Environments

Authors: Abhishek Sawaika, Samuel Yen-Chi Chen, Udaya Parampalli, Rajkumar Buyya Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11131v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

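Scoring rationale: The paper focuses on a distributed framework for quantum reinforcement learning (QRL) in multi-agent systems and is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, and so on). It is only partly related to "Multi-agent Systems OR Agent Coordination" (5 points), because it studies distributed learning in multi-agent environments; its core, however, is the combination of quantum computing and reinforcement learning rather than conventional large models or AI agent techniques.

!!! tip deepseek-chat TL;DR

To address the hardware limits that quantum reinforcement learning faces in high-dimensional multi-agent environments, the paper proposes a distributed quantum reinforcement learning framework; in a cooperative Pong environment it reports roughly a 10% improvement over other distribution strategies and roughly a 5% improvement over classical policy representations.

Abstract translation (Chinese)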

强化学习（RL）是从现实应用案例中学习的最实用方法之一。其灵感源于人类使用的认知方法，使其在人工智能领域成为一种广泛接受的策略。大多数用于强化学习的环境通常具有高维特性，传统强化学习算法在处理此类系统时计算成本高昂且难以有效学习。量子计算（QC）理论在实际应用中的最新进展，例如紧凑编码、增强表示与学习算法、随机采样，或量子系统固有的随机性，为解决这些挑战开辟了新方向。量子强化学习（QRL）在过去几年中获得了显著关注。然而，当前量子硬件的水平尚不足以应对具有复杂多智能体设置的高维环境。为解决这一问题，我们提出了一种分布式量子强化学习框架，其中多个智能体独立学习，将联合训练的负载从单台机器分散处理。我们的方法在动作空间与观测空间互不相交的环境中表现良好，但也可通过合理近似扩展到其他系统。我们在合作式乒乓球环境中对所提方法进行了分析，结果表明其性能相较于其他分布式策略提升了约10%，相较于经典策略表示模型提升了约5%。

Abstract

Reinforcement learning (RL) is one of the most practical ways to learn from real-life use cases. Its grounding in the cognitive methods used by humans makes it a widely accepted strategy in the field of artificial intelligence. Most of the environments used for RL are high-dimensional, and traditional RL algorithms become computationally expensive and struggle to learn effectively from such systems. Recent advancements in practical demonstration of quantum computing (QC) theories, such as compact encoding, enhanced representation and learning algorithms, random sampling, or the inherent stochastic nature of quantum systems, have opened up new directions to tackle these challenges. Quantum reinforcement learning (QRL) has gained significant traction over the past few years. However, the current state of quantum hardware is not sufficient to handle such high-dimensional environments with complex multi-agent setups. To tackle this issue, we propose a distributed framework for QRL where multiple agents learn independently, distributing the load of joint training from individual machines. Our method works well for environments with disjoint sets of action and observation spaces, but can also be extended to other systems with reasonable approximations. We analyze the proposed method on the cooperative-pong environment, and our results indicate a ~10% improvement over other distribution strategies and a ~5% improvement over classical models of policy representation.

Keywords: Quantum Reinforcement Learning, Distributed Framework, Multi-Agent Environments, Cooperative Pong, Policy Representation, Quantum Computing, Reinforcement Learning, High-dimensional Environments

295. ❌ AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

Authors: Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, Jiayu Chen Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11135v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

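Scoring rationale: The AIM paper targets robot control and proposes an intent-aware unified world action model built on a pretrained video generation model. Its core innovation is bridging visual world modeling and robot control through spatial value maps, using a mixture-of-transformers architecture and self-distillation reinforcement learning. Relevance to the scoring keywords: (1) Highly relevant (10 points): "World Models AND General World Models", because the paper explicitly builds and improves a "unified world action model", which is its core contribution. (2) Moderately relevant (8 points): "Pre-training OR Continual Pre-training OR Domain Adaptation", because the model is built on a pretrained video generation model and involves domain adaptation (from video to robot control). (3) Other keywords (0 points): the paper does not cover large language models (LLMs), MoE, alignment, RAG, reasoning, agents, quantization, or similar topics; it does not clearly fall under the bioinformatics or cheminformatics subfields of "AI for Science"; its "value map" is unrelated to "value alignment", and its "self-distillation" is unrelated to "self-correction".

!!! tip deepseek-chat TL;DR

To address the difficulty existing unified world action models have in decoding reliable actions for robot control, the paper proposes AIM, an intent-aware model that explicitly models interaction intent and structure by predicting aligned spatial value maps; it achieves a 94.0% average success rate on the RoboTwin 2.0 benchmark, significantly outperforming prior baselines.

Abstract translation (Chinese)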

预训练视频生成模型为机器人控制提供了强先验，但现有的统一世界动作模型仍需大量机器人专属训练才能解码可靠动作。我们认为这一局限源于结构错配：视频模型虽能捕捉场景演变规律，动作生成却需显式推理交互位置与底层操作意图。本文提出意图感知统一世界动作模型AIM，通过显式空间接口弥合该鸿沟。AIM不再直接从未来视觉表征解码动作，而是预测编码任务相关交互结构的对齐空间价值图，实现对未来动态的面向控制抽象。基于预训练视频生成模型构建的AIM，在共享混合Transformer架构中联合建模未来观测值与价值图，采用意图因果注意力机制将未来信息通过价值表征定向传输至动作分支。我们进一步提出自蒸馏强化学习阶段：冻结视频与价值分支，仅利用投影价值图响应生成的稠密奖励与稀疏任务级信号优化动作头。为支持训练与评估，我们构建了包含3万条操作轨迹的仿真数据集，同步提供多视角观测、动作及价值图标注。在RoboTwin 2.0基准测试中，AIM以94.0%的平均成功率显著超越现有统一世界动作基线。值得注意的是，该改进在长时程和接触敏感的操作任务中尤为显著，证明了显式空间意图建模作为视觉世界模型与机器人控制桥梁的有效性。

Abstract

Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. We attribute this limitation to a structural mismatch: while video models capture how scenes evolve, action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap via an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. It employs intent-causal attention to route future information to the action branch exclusively through the value representation. We further propose a self-distillation reinforcement learning stage that freezes the video and value branches and optimizes only the action head using dense rewards derived from projected value-map responses together with sparse task-level signals. To support training and evaluation, we construct a simulation dataset of 30K manipulation trajectories with synchronized multi-view observations, actions, and value-map annotations. Experiments on RoboTwin 2.0 benchmark show that AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines. Notably, the improvement is more pronounced in long-horizon and contact-sensitive manipulation tasks, demonstrating the effectiveness of explicit spatial-intent modeling as a bridge between visual world modeling and robot control.

Keywords: unified world action model, spatial value map, intent-aware, robot control, pretrained video generation, mixture-of-transformers, self-distillation reinforcement learning, manipulation tasks

296. ❌ DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO

Authors: Tiantian Zhang, Jierui Zuo, Wenping Wang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11119v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	15.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

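Scoring rationale: The paper's core subject is DDO-RM, proposed as an improvement over Direct Preference Optimization (DPO), so it is highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (15 points). The experiments use an LLM (Pythia-410m), which makes the paper relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RAG, and Quantization are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes DDO-RM, a new method for LLM preference optimization; on a minimal held-out benchmark it raises mean pair accuracy from 0.5238 to 0.5602 relative to DPO.

Abstract translation (Chinese)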

本文围绕DPO与DDO-RM偏好优化项目重组了当前手稿，并聚焦于两个部分：算法视角与初步的保留基准测试。该基准测试提出一个具体问题：即便在最小化的成对选择与被拒设定中，基于奖励模型的决策分布更新能否优于直接的成对优化目标？我们使用HuggingFaceH4/ultrafeedback_binarized数据集，在EleutherAI/pythia-410m模型上对比了直接偏好优化（Direct Preference Optimization, DPO）与DDO-RM方法，在保留的test_prefs分割上进行评估，并报告了种子值42、13和3407下的结果。
从算法角度看，DDO-RM将每个提示视为候选回复的有限决策问题。它并非仅优化二元的选择-被拒关系，而是构建候选回复的策略分布，在该分布下对奖励模型分数进行中心化处理，并将奖励引导的目标分布提炼回策略中。在当前公开基准测试中，相较于DPO，DDO-RM将平均配对准确率从0.5238提升至0.5602，AUC从0.5315提升至0.5382，平均边界值从0.1377提升至0.5353。这些结果令人鼓舞但仍属初步：本研究仅涵盖一个模型系列、一个数据集、一个保留评估分割及三个种子值。

Abstract

This paper reorganizes the current manuscript around the DPO versus DDO-RM preference-optimization project and focuses on two parts: the algorithmic view and the preliminary held-out benchmark. The benchmark asks a narrow question: even in a minimal pairwise chosen-versus-rejected setting, can a reward-guided decision-distribution update outperform a direct pairwise objective? We compare Direct Preference Optimization (DPO) against DDO-RM on EleutherAI/pythia-410m using HuggingFaceH4/ultrafeedback_binarized, evaluate on the held-out test_prefs split, and report results for seeds 42, 13, and 3407. Algorithmically, DDO-RM treats each prompt as a finite decision problem over candidate responses. Instead of optimizing only a binary chosen-rejected relation, it forms a policy distribution over candidates, centers reward-model scores under that distribution, and distills a reward-guided target distribution back into the policy. In the current public benchmark, DDO-RM improves mean pair accuracy from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353 relative to DPO. These are encouraging but still preliminary results: the study covers one model family, one dataset, one held-out evaluation split, and three seeds.

Keywords: Direct Preference Optimization, DPO, DDO-RM, preference optimization, reward model, LLM fine-tuning, pairwise comparison, held-out benchmark
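
The contrast drawn in the abstract above, a pairwise objective versus a distribution-level update, can be sketched numerically. The DPO loss below follows its standard published form. The DDO-RM part is only one plausible reading of the abstract's three steps (form a policy distribution over candidates, center the reward scores under it, distill a reward-guided target back into the policy): the exponential tilting, the temperature `tau`, the KL distillation loss, and all function names are assumptions, not the authors' formulation.

```python
import numpy as np


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return float(-log_sigmoid(beta * margin))


def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def reward_guided_target(candidate_logps, rewards, tau=1.0):
    """Assumed DDO-RM-style target: tilt the policy by centered rewards."""
    policy = softmax(candidate_logps)             # policy distribution over candidates
    centered = rewards - np.dot(policy, rewards)  # rewards centered under the policy
    target = policy * np.exp(centered / tau)      # assumed exponential tilting
    return policy, centered, target / target.sum()


def distillation_loss(policy, target):
    """KL(target || policy), the assumed loss for distilling the target back."""
    return float(np.sum(target * (np.log(target) - np.log(policy))))


# Hypothetical prompt with two candidates (chosen, rejected) and made-up scores.
candidate_logps = np.array([-12.0, -11.5])
rewards = np.array([1.3, 0.2])
policy, centered, target = reward_guided_target(candidate_logps, rewards)
print("policy:", policy, "target:", target)
print("distillation loss:", distillation_loss(policy, target))
print("DPO loss:", dpo_loss(-12.0, -11.5, -12.2, -11.4))
```

Under this reading, the target shifts probability mass toward candidates whose reward exceeds the policy-weighted average, so the update uses reward magnitudes rather than only the binary chosen-versus-rejected relation.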

297. ❌ Distributionally Robust K-Means Clustering

Authors: Vikrant Malik, Taylan Kargin, Babak Hassibi Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11118v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

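Scoring rationale: The paper focuses on K-means clustering in classical machine learning and proposes a distributionally robust variant to cope with outliers and distribution shift. Its content revolves entirely around classical clustering, the Wasserstein distance, optimization algorithms, and other traditional machine learning topics; it does not involve large language models, deep learning, AI for Science, or related techniques (MoE, RLHF, RAG, and so on). All keywords concern the principles or applications of large models and deep learning, whereas this is purely classical machine learning research, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the sensitivity of K-means clustering to outliers and distribution shift, the paper proposes a distributionally robust K-means variant based on the Wasserstein distance; by minimizing the worst-case expected squared distance it yields more robust clustering, and experiments show gains in outlier detection and robustness to noise.

Abstract translation (Chinese)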

K-means聚类是无监督学习中的核心方法，但其对异常值、分布偏移和有限样本量的敏感性众所周知。通过将k-means视为经验分布的Lloyd–Max量化，我们提出了一种分布鲁棒的变体以抵御此类问题。我们假设未知总体分布位于经验分布的Wasserstein-2球内。在此设定下，目标是在该模糊集上寻找能最小化最坏情况期望平方距离的聚类中心，从而形成一个极小极大化问题。一个可处理的对偶形式产生了一种软聚类方案，该方案以平滑加权分配替代了硬分配。我们提出了一种高效的分块坐标下降算法，该算法具有可证明的单调递减性和局部线性收敛性。在标准基准测试和大规模合成数据上的实验表明，该方法在异常值检测和噪声鲁棒性方面取得了显著提升。

Abstract

K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd–Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, one seeks cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise.

Keywords: K-means clustering, distributionally robust, Wasserstein distance, outlier detection, robustness, minimax formulation, soft-clustering, block coordinate descent

298. ❌ Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

Authors: Daniel Nichols, Konstantinos Parasyris, Caetano Melone, Tal Ben-Nun, Giorgis Georgakoudis, Harshitha Menon Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11109v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

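Scoring rationale: The paper mainly studies an automated framework for GPU kernel optimization, which belongs to high-performance computing and AI systems optimization. It is unrelated to most large-model keywords (MoE, RLHF, RAG, and so on), since those concern model architecture, training, alignment, and inference, while this paper does not address large models themselves. The most relevant keyword is "AI for Science" (8 points), because the paper optimizes scientific computing applications (such as AI and HPC workloads), an application of AI in the sciences. "Large Language Models" (5 points) is weakly related: the abstract mentions "LLM-driven evolutionary search", but the LLM is only one component of the search, not the paper's core. The other keywords are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes Record-Remix-Replay, a hierarchical GPU kernel optimization framework that combines LLM-driven evolutionary search with Bayesian optimization to automatically explore the optimization space from source code to compiler settings; it optimizes full scientific applications better than traditional approaches and runs nearly an order of magnitude faster than modern evolutionary search.

Abstract translation (Chinese)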

随着高性能计算与人工智能工作负载日益依赖GPU，在快速迭代的硬件代际间保持高性能已成为主要挑战。开发者常需耗费数月时间调优科学应用程序以充分挖掘新架构潜力，其需要在算法设计、源代码实现、编译器标志与过程序列、内核启动参数等构成的复杂优化空间中进行探索。现有方法虽能有效单独搜索该空间的局部（如启动配置或编译器设置），但对整个空间的跨维度优化仍需大量人工专业知识与迭代式手动调优。
本文提出Record-Remix-Replay（R^3）——一种结合LLM驱动进化搜索、贝叶斯优化及记录-重放编译技术的分层优化框架，能够从源代码级实现选择到编译器过程排序及运行时配置，高效探索GPU内核优化方案。通过实现快速可扩展的候选方案评估，本方法支持对通常被割裂处理的优化维度进行实用的端到端搜索。实验表明，Record-Remix-Replay在优化完整科学应用时，不仅在内核参数与编译器标志方面优于传统方法，其搜索速度更比现代进化搜索方法快近一个数量级。

Abstract

As high-performance computing and AI workloads become increasingly dependent on GPUs, maintaining high performance across rapidly evolving hardware generations has become a major challenge. Developers often spend months tuning scientific applications to fully exploit new architectures, navigating a complex optimization space that spans algorithm design, source implementation, compiler flags and pass sequences, and kernel launch parameters. Existing approaches can effectively search parts of this space in isolation, such as launch configurations or compiler settings, but optimizing across the full space still requires substantial human expertise and iterative manual effort. In this paper, we present Record-Remix-Replay (R^3), a hierarchical optimization framework that combines LLM-driven evolutionary search, Bayesian optimization, and record-replay compilation techniques to efficiently explore GPU kernel optimizations from source-level implementation choices down to compiler pass ordering and runtime configuration. By making candidate evaluation fast and scalable, our approach enables practical end-to-end search over optimization dimensions that are typically treated separately. We show that Record-Remix-Replay can optimize full scientific applications better than traditional approaches over kernel parameters and compiler flags, while also being nearly an order of magnitude faster than modern evolutionary search approaches.

Keywords: GPU kernel optimization, evolutionary search, record-replay compilation, high-performance computing, scientific applications, LLM-driven search, Bayesian optimization, compiler optimization

299. ❌ Generating Hadamard matrices with transformers

Authors: Geordie Williamson, Oded Yacobi, Paul Zinn-Justin Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11101v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

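Scoring rationale: The paper uses a transformer neural network combined with local search to construct Hadamard matrices, an application of AI to mathematics and combinatorial optimization. It is unrelated to the vast majority of keywords (LLM, MoE, RLHF, RAG, and so on), since those concern large-model principles, training methods, and inference optimization, while this paper uses the transformer only as a search tool. The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to a scientific (mathematical) field; since it is not core bioinformatics or cheminformatics, it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes a new method that combines transformer neural networks with local search to construct Hadamard matrices; for orders between 100 and 250 it produces large numbers of inequivalent matrices, the largest example found has order 244, and the transformer is observed to exploit hidden symmetry in the search space.

Abstract translation (Chinese)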

我们提出了一种在PatternBoost框架下结合Transformer神经网络与局部搜索来构造Hadamard矩阵的新方法。该方法专为极度稀疏的组合搜索问题设计，尤其适用于Goethals–Seidel型Hadamard矩阵的构造——在此类问题中，傅里叶方法可实现快速评分与优化。在$100$至$250$阶的范围内，该方法能生成大量不等价的Hadamard矩阵；在更困难的案例中，它能在随机初始化局部搜索失败的情况下取得成功。通过本方法构造的最大矩阵阶数为$244$。除这些新构造外，实验还表明Transformer能够发现并利用搜索空间中隐藏的有效对称性。

Abstract

We present a new method for constructing Hadamard matrices that combines transformer neural networks with local search in the PatternBoost framework. Our approach is designed for extremely sparse combinatorial search problems and is particularly effective for Hadamard matrices of Goethals–Seidel type, where Fourier methods permit fast scoring and optimisation. For orders between $100$ and $250$, it produces large numbers of inequivalent Hadamard matrices, and in harder cases it succeeds where local search from random initialisation fails. The largest example found by our method has order $244$. In addition to these new constructions, our experiments reveal that the transformer can discover and exploit useful hidden symmetry in the search space.

Keywords: Hadamard matrices, transformer neural networks, local search, PatternBoost framework, combinatorial search, Goethals-Seidel type, hidden symmetry, search space
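
For readers unfamiliar with the object being searched for above, the defining property of a Hadamard matrix (entries of ±1 with mutually orthogonal rows, so that H·Hᵀ = n·I) is easy to check in code. The sketch below is a generic verifier plus the classical Sylvester doubling construction; it is not the paper's PatternBoost or Goethals–Seidel method, and the function names are illustrative.

```python
import numpy as np


def is_hadamard(matrix):
    """Check that `matrix` is square, has entries +1/-1, and satisfies H @ H.T == n * I."""
    h = np.asarray(matrix, dtype=int)
    if h.ndim != 2 or h.shape[0] != h.shape[1]:
        return False
    if not np.all(np.abs(h) == 1):
        return False
    n = h.shape[0]
    return bool(np.array_equal(h @ h.T, n * np.eye(n, dtype=int)))


def sylvester(k):
    """Sylvester construction: a Hadamard matrix of order 2**k."""
    h = np.array([[1]])
    for _ in range(k):
        # Doubling step: [[H, H], [H, -H]] is Hadamard whenever H is.
        h = np.block([[h, h], [h, -h]])
    return h


print(is_hadamard(sylvester(3)))       # order 8 -> True
print(is_hadamard(np.ones((2, 2))))    # rows are not orthogonal -> False
```

Sylvester doubling only reaches orders that are powers of two, so an order such as 244 = 4 × 61 needs a different construction, which is where search-based methods become relevant.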

300. ❌ Frugal Knowledge Graph Construction with Local LLMs: A Zero-Shot Pipeline, Self-Consistency and Wisdom of Artificial Crowds

Authors: Pierre Jourlin Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11104v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

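Scoring rationale: The paper studies knowledge graph construction and reasoning with local LLMs. Highly relevant keywords: LLMs (the core method), Self-Correction (the self-consistency mechanism), MoE (an MoE model is used), RAG (the evaluation framework), CoT Reasoning (multi-hop reasoning), and Hallucination Mitigation (hallucination is studied). Other keywords such as SLMs (related to local inference) and AI for Science (a potential application) are weakly related. Most keywords (training methods, alignment, compression, and so on) are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes a zero-shot pipeline for knowledge graph construction and reasoning based on local LLMs; using self-consistency, multi-model ensembling, and related mechanisms, it performs multi-hop reasoning and hallucination mitigation on consumer-grade hardware, reaching a zero-shot relation-extraction F1 of 0.70 against 0.80 for the supervised DREEAM baseline.

Abstract translation (Chinese)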

本文提出一种面向知识图谱构建与利用的多模型零样本流程的实证研究，该流程完全通过在消费级硬件上进行本地推理执行。我们设计了一个可复现的评估框架，将两个外部基准（DocRED、HotpotQA）、WebQuestionsSP风格合成数据以及RAGAS评估框架集成于自动化流程中。在500个文档级关系抽取任务上，我们的系统在零样本设置下取得了0.70 $\pm$ 0.041的F1分数，而监督学习方法DREEAM的分数为0.80。文本到查询任务在200个样本上达到0.80 $\pm$ 0.06的准确率。多跳推理任务在500个HotpotQA问题上取得0.46 $\pm$ 0.04的精确匹配（Exact Match, EM）分数，并在50个样本上获得0.96 $\pm$ 0.04的RAGAS忠实度评分。除基础流程外，我们还研究了针对困难多跳推理的多样性生成机制。在181个零温度设置下无法解决的问题上，自一致性方法（k=5，温度T=0.7）使用单个混合专家（Mixture-of-Experts, MoE）模型恢复了最高23%的EM分数，而跨模型预言机（3种架构×5个样本）可达到46.4%。我们揭示了一个一致性悖论：样本间的高度共识往往指向集体幻觉而非可靠答案，这与Moussa{ï}d等人关于群体智慧的研究相呼应。扩展至完整流程（500个问题）时，自一致性方法（k=3）将EM分数从0.46提升至0.48 $\pm$ 0.04。采用置信度路由级联机制（Phi-4 $\rightarrow$ GPT-OSS，k=5）取得了0.55 $\pm$ 0.04的EM分数（此为最优结果），其中45.4%的问题被重新路由处理。最后，我们证明将V3提示工程技术应用于其他模型时，无法复现在Gemma-4模型上观察到的性能提升，这证实了提示与模型间存在特异性交互。整个系统在单张RTX 3090显卡上运行约5小时，无需任何训练，估算碳足迹为0.09千克二氧化碳当量。

摘要 (Abstract)

This paper presents an empirical study of a multi-model zero-shot pipeline for knowledge graph construction and exploitation, executed entirely through local inference on consumer-grade hardware. We propose a reproducible evaluation framework integrating two external benchmarks (DocRED, HotpotQA), WebQuestionsSP-style synthetic data, and the RAGAS evaluation framework in an automated pipeline. On 500 document-level relations, our system achieves an F1 of 0.70 $\pm$ 0.041 in zero-shot, compared to 0.80 for supervised DREEAM. Text-to-query achieves an accuracy of 0.80 $\pm$ 0.06 on 200 samples. Multi-hop reasoning achieves an Exact Match (EM) of 0.46 $\pm$ 0.04 on 500 HotpotQA questions, with a RAGAS faithfulness of 0.96 $\pm$ 0.04 on 50 samples. Beyond the pipeline, we study diversity mechanisms for difficult multi-hop reasoning. On 181 questions unsolvable at zero temperature, self-consistency (k=5, T=0.7) recovers up to 23% EM with a single Mixture-of-Experts (MoE) model, but the cross-model oracle (3 architectures $\times$ 5 samples) reaches 46.4%. We highlight an agreement paradox: strong consensus among samples signals collective hallucination rather than a reliable answer, echoing the work of Moussaïd et al. on the wisdom of crowds. Extending to the full pipeline (500 questions), self-consistency (k=3) raises EM from 0.46 to 0.48 $\pm$ 0.04. A confidence-routing cascade mechanism (Phi-4 $\rightarrow$ GPT-OSS, k=5) achieves an EM of 0.55 $\pm$ 0.04, the best result obtained, with 45.4% of questions rerouted. Finally, we show that V3 prompt engineering applied to other models does not reproduce the gains observed with Gemma-4, confirming the specific prompt/model interaction. The entire system runs in $\sim$5 h on a single RTX 3090, without any training, for an estimated carbon footprint of 0.09 kg CO2 eq.

Keywords: knowledge graph construction, local LLMs, zero-shot pipeline, self-consistency, multi-hop reasoning, hallucination mitigation, Mixture-of-Experts, RAGAS evaluation
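
The self-consistency vote and the confidence-routing cascade described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how confidence is measured, so the vote share is used here as an assumed confidence signal, and `small_model`, `large_model`, and the `threshold` value are hypothetical stand-ins (for example, callables wrapping Phi-4 and GPT-OSS).

```python
from collections import Counter
from typing import Callable, List, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings vote together."""
    return " ".join(answer.lower().split())


def self_consistency(samples: List[str]) -> Tuple[str, float]:
    """Return the majority answer and its vote share among the samples."""
    votes = Counter(normalize(s) for s in samples)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(samples)


def cascade(question: str,
            small_model: Callable[[str], str],
            large_model: Callable[[str], str],
            k: int = 5,
            threshold: float = 0.6) -> Tuple[str, bool]:
    """Answer with the small model; reroute to the large model when agreement is low.

    Returns (answer, rerouted). The threshold is an assumed value, not one from the paper.
    """
    answer, agreement = self_consistency([small_model(question) for _ in range(k)])
    if agreement >= threshold:
        return answer, False
    answer, _ = self_consistency([large_model(question) for _ in range(k)])
    return answer, True
```

Note that vote share is an imperfect confidence signal: the abstract's agreement paradox reports that strong consensus among samples can indicate collective hallucination, so a production router would likely need a different or additional signal.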

301. ❌ Bottleneck Tokens for Unified Multimodal Retrieval

Authors: Siyu Sun, Jing Ren, Zhaohe Liao, Dongxiao Mao, Xiangyuan Ren, Yiyi Zhang, Haohua Zhao, Weixiong Lin, Jiang Shaohua, Liqing Zhang, Yuchao Zheng Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11095v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

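Scoring rationale: The paper studies multimodal large language models (MLLMs) applied to unified multimodal retrieval, which bears directly on the 'Large Language Models' and 'Retrieval-Augmented Generation' keywords (10 points each). Its new training method, Generative Information Condensation, is a form of fine-tuning and so relates to 'Post-training' (8 points). Other keywords, such as MoE, SLMs, Scaling Laws, PEFT, and Context Window, are either not mentioned in the abstract or unrelated to the paper's topic (0 points).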

!!! tip deepseek-chat TL;DR

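The paper addresses the structural gaps that arise when multimodal large language models are adapted to unified multimodal retrieval. It proposes the Bottleneck Tokens architecture and the Generative Information Condensation training method, reaching state-of-the-art results among 2B-scale methods on the MMEB-V2 benchmark.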

摘要翻译

将仅解码器架构的多模态大语言模型（MLLMs）适配于统一多模态检索任务时，面临两个结构性鸿沟。首先，现有方法依赖隐式池化，即用标准词汇标记（如）的隐藏状态超载作为序列级表示，而该机制从未为信息聚合而设计。其次，对比微调仅规定了嵌入应匹配的目标，却未提供关于信息应如何压缩至其中的标记级指导。我们通过两个互补组件解决这两大问题。在架构层面，我们引入瓶颈标记（BToks），即一小组可学习的标记，作为固定容量的显式池化机制。在训练层面，我们提出生成式信息凝聚：结合下一个标记预测目标与凝聚掩码，该掩码切断目标标记到查询标记的直接注意力路径。所有预测信号因此被强制通过BToks传递，将生成式损失转化为针对语义压缩的稠密标记级监督。在推理阶段，仅需单次前向传播处理输入和BToks，其开销相比传统的末标记池化可忽略不计。在MMEB-V2基准（78个数据集、3种模态、9项元任务）上，我们的方法在可比数据条件下，于20亿参数规模模型中达到最优性能，总体得分59.0（较VLM2Vec-V2提升3.6分），并在语义要求高的任务上取得显著进步（如视频问答任务提升12.6分）。

摘要 (Abstract)

Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., ) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and BToks are processed in a single forward pass with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).

Keywords: Multimodal Large Language Models, Unified Multimodal Retrieval, Bottleneck Tokens, Generative Information Condensation, Contrastive Fine-tuning, Semantic Compression, MMEB-V2 Benchmark, State-of-the-art Performance
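
The Condensation Mask from the abstract can be illustrated with a small attention-mask sketch. The abstract only states that the mask severs the direct attention path from target tokens to query tokens; the sequence layout `[query | BToks | target]`, the causal base mask, and the function name `condensation_mask` are assumptions made for this example.

```python
from typing import List


def condensation_mask(n_query: int, n_btok: int, n_target: int) -> List[List[bool]]:
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Assumed layout: query tokens, then bottleneck tokens, then target tokens.
    """
    n = n_query + n_btok + n_target
    first_target = n_query + n_btok
    # Start from a standard causal mask: each position sees itself and the past.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Block target rows from query columns, so targets see the input only via BToks.
    for i in range(first_target, n):
        for j in range(n_query):
            mask[i][j] = False
    return mask
```

With `condensation_mask(2, 1, 2)`, the bottleneck token at position 2 still attends to both query tokens, while the target tokens at positions 3 and 4 attend only to the bottleneck token and to earlier target tokens.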

302. ❌ CausalGaze: Unveiling Hallucinations via Counterfactual Graph Intervention in Large Language Models

Authors: Linggang Kong, Lei Wu, Yunlong Zhang, Xiaofeng Zhong, Zhen Wang, Yongjie Wang, Yao Pan Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11087v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

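Scoring rationale: The paper's core topic is LLM hallucination detection, which is highly relevant to 'Large Language Models' (10 points) and directly addresses 'Hallucination Mitigation' (10 points). It improves interpretability through a causal-graph approach, which is highly relevant to 'Mechanistic Interpretability' (10 points). Other keywords, such as MoE, SLMs, training methods, inference acceleration, and agents, are not addressed (0 points).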

!!! tip deepseek-chat TL;DR

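The paper tackles hallucination in large language models with CausalGaze, a detection framework based on structural causal models and counterfactual intervention. In experiments across four datasets and three LLMs it improves detection, with an AUROC gain of over 5.2% on TruthfulQA compared to state-of-the-art baselines.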

摘要翻译

尽管大语言模型（LLM）取得了突破性进展，幻觉问题仍是其在高风险领域部署的关键瓶颈。现有的基于分类的方法主要依赖于来自内部状态的静态被动信号，这些信号往往捕捉到噪声和虚假相关性，而忽视了底层的因果机制。为应对这一局限，我们通过引入CausalGaze——一种基于结构因果模型（SCM）的新型幻觉检测框架——将研究范式从被动观察转向主动干预。CausalGaze将大语言模型的内部状态建模为动态因果图，并利用反事实干预来分离因果推理路径与偶然噪声，从而提升模型的可解释性。在四个数据集和三种广泛使用的大语言模型上进行的大量实验验证了CausalGaze的有效性，尤其在TruthfulQA数据集上，其AUROC指标相较于最先进的基线模型提升了超过5.2%。

摘要 (Abstract)

Despite the groundbreaking advancements made by large language models (LLMs), hallucination remains a critical bottleneck for their deployment in high-stakes domains. Existing classification-based methods mainly rely on static and passive signals from internal states, which often capture noise and spurious correlations while overlooking the underlying causal mechanisms. To address this limitation, we shift the paradigm from passive observation to active intervention by introducing CausalGaze, a novel hallucination detection framework based on structural causal models (SCMs). CausalGaze models LLMs’ internal states as dynamic causal graphs and employs counterfactual interventions to disentangle causal reasoning paths from incidental noise, thereby enhancing model interpretability. Extensive experiments across four datasets and three widely used LLMs demonstrate the effectiveness of CausalGaze, especially achieving over 5.2% improvement in AUROC on the TruthfulQA dataset compared to state-of-the-art baselines.

Keywords: Hallucination Detection, Large Language Models, Structural Causal Models, Counterfactual Intervention, Causal Reasoning, Model Interpretability, TruthfulQA

303. ❌ Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

Authors: Shimon Murai, Teppei Kurita, Ryuta Satoh, Yusuke Moriuchi Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11071v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

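Scoring rationale: The paper focuses on low-light image enhancement (LLIE) in computer vision and proposes a lightweight two-stage framework that combines algorithm-based preprocessing with a depthwise-separable convolutional U-Net. The scoring keywords all concern large language models (LLMs), their training and inference techniques, or AI for Science applications, whereas this paper addresses an image-enhancement task and touches none of them, so every keyword has a relevance of 0.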

!!! tip deepseek-chat TL;DR

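The paper proposes a lightweight two-stage framework for low-light image enhancement. Using distribution-normalizing preprocessing and a depthwise-separable convolutional U-Net, it achieves competitive perceptual quality with fewer parameters, and it placed 4th in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge.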

摘要翻译

本文提出一种轻量级双阶段弱光图像增强框架，其参数量显著少于现有方法，同时实现了具有竞争力的感知质量。该框架将基于冻结算法的预处理模块与完全由深度可分离卷积构建的紧凑U-Net相结合。预处理阶段通过提供亮度校正后的互补视图来归一化输入分布，使可训练网络能够专注于残差色彩校正。本方法在CVPR 2026 NTIRE高效弱光图像增强挑战赛中荣获第四名。我们进一步提供了扩展的基准测试与消融实验，以验证所提方法的普遍有效性。

摘要 (Abstract)

We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 4th place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.

Keywords: low-light image enhancement, lightweight framework, distribution-normalizing preprocessing, depthwise-separable convolutions, U-Net, parameter-efficient, CVPR NTIRE challenge, perceptual quality

304. ❌ A Faster Path to Continual Learning

Authors: Wei Li, Hangjie Yuan, Zixiang Zhao, Borui Kang, Ziwei Liu, Tao Feng Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11064v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

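Scoring rationale: 'A Faster Path to Continual Learning' focuses on improving optimization for continual learning and proposes C-Flat Turbo to accelerate the C-Flat optimizer. Continual learning is a subfield of deep learning, but the paper deals entirely with optimization algorithms for conventional neural networks (not large models specifically) and does not touch large-model principles, training methods (pre-training, fine-tuning, alignment), inference optimization, agents, or AI for Science applications. Every keyword targets large models or a specific AI application, whereas this is general deep learning optimization research, so all keywords have a relevance of 0.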

!!! tip deepseek-chat TL;DR

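The paper addresses the high computational overhead of the C-Flat optimizer in continual learning. It proposes C-Flat Turbo, which skips redundant gradient computations and uses an adaptive scheduling strategy, running 1.0× to 1.25× faster than C-Flat with comparable or better accuracy.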

摘要翻译

持续学习（Continual Learning, CL）旨在动态任务流上训练神经网络，同时避免遗忘先前习得的知识。在基于优化的方法中，C-Flat因其即插即用的特性以及能够为新旧任务同时促进均匀低损失区域的能力，已成为一种前景广阔的解决方案。然而，C-Flat在每次迭代中需要三次额外的梯度计算，为优化过程带来了显著开销。本文提出C-Flat Turbo，一种更快且更强的优化器，可显著降低训练成本。我们证明，与一阶平坦性相关的梯度包含相对于代理模型梯度的方向不变分量，这使得我们能够在扰动上升步骤中跳过冗余的梯度计算。此外，我们观察到这些促进平坦性的梯度在不同任务间逐渐稳定，这启发我们采用一种带有自适应触发机制的线性调度策略，为后续任务分配更大的加速步长。实验表明，在广泛的CL方法中，C-Flat Turbo比C-Flat快1.0倍至1.25倍，同时达到相当甚至更高的准确率。

摘要 (Abstract)

Continual Learning (CL) aims to train neural networks on a dynamic stream of tasks without forgetting previously learned knowledge. Among optimization-based approaches, C-Flat has emerged as a promising solution due to its plug-and-play nature and its ability to encourage uniformly low-loss regions for both new and old tasks. However, C-Flat requires three additional gradient computations per iteration, imposing substantial overhead on the optimization process. In this work, we propose C-Flat Turbo, a faster yet stronger optimizer that significantly reduces the training cost. We show that the gradients associated with first-order flatness contain direction-invariant components relative to the proxy-model gradients, enabling us to skip redundant gradient computations in the perturbed ascent steps. Moreover, we observe that these flatness-promoting gradients progressively stabilize across tasks, which motivates a linear scheduling strategy with an adaptive trigger to allocate larger turbo steps for later tasks. Experiments show that C-Flat Turbo is 1.0$\times$ to 1.25$\times$ faster than C-Flat across a wide range of CL methods, while achieving comparable or even improved accuracy.

Keywords: Continual Learning, C-Flat, C-Flat Turbo, optimizer, gradient computation, training acceleration, flatness, adaptive scheduling

305. ❌ Pando: Do Interpretability Methods Work When Models Won’t Explain Themselves?

Authors: Ziqian Zhong, Aashiq Muhamed, Mona T. Diab, Virginia Smith, Aditi Raghunathan Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11061v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

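Scoring rationale: The paper evaluates whether mechanistic interpretability methods remain effective when models cannot explain themselves, which is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10 points). The study is built on 720 fine-tuned models, which involves supervised fine-tuning (SFT) (8 points). It discusses the interpretability of large models and so has some relation to 'Large Language Models OR LLMs OR Foundation Models' (5 points). Other keywords, such as MoE, quantization, inference acceleration, and AI for Science, are not addressed in the paper (0 points).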

!!! tip deepseek-chat TL;DR

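The paper studies how well mechanistic interpretability methods predict model decisions when a model cannot provide reliable self-explanations. When explanations are absent or misleading, gradient-based attribution improves accuracy by 3-5 percentage points and relevance patching (RelP) gives the largest gains, while logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit.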

摘要翻译

机制可解释性研究常以对齐审计为动机，因为模型的口头解释可能缺失、不完整或具有误导性。然而，许多评估并未控制仅通过黑盒提示是否能复现目标行为，因此白盒工具带来的表面增益可能反映的是提示引导效应而非内部信号；我们称此为引导混杂因子。我们引入Pando这一模型生物基准，它通过解释轴打破这种混杂：我们训练模型生成对真实规则的可信解释、不生成解释，或对无关干扰规则生成自信但不可信的解释。
在720个基于隐藏决策树规则进行微调的模型中，智能体通过10个标注的查询-响应对来预测保留的模型决策，并可选择性地加入一种可解释性工具的输出。当解释可信时，黑盒引导的表现与所有白盒方法相当或更优；当解释缺失或具有误导性时，基于梯度的归因方法将准确率提高了3-5个百分点，而相关性修补（RelP）带来的增益最大，同时对数透镜、稀疏自编码器和电路追踪方法未提供可靠收益。方差分解表明梯度追踪了决策计算过程，即哪些字段因果驱动输出，而其他解读方法则主要受任务表征、对字段身份和值的偏好所主导。
我们公开了所有模型、代码及评估基础设施。

摘要 (Abstract)

Mechanistic interpretability is often motivated for alignment auditing, where a model’s verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder. We introduce Pando, a model-organism benchmark that breaks this confound via an explanation axis: models are trained to produce either faithful explanations of the true rule, no explanation, or confident but unfaithful explanations of a disjoint distractor rule. Across 720 finetuned models implementing hidden decision-tree rules, agents predict held-out model decisions from $10$ labeled query-response pairs, optionally augmented with one interpretability tool output. When explanations are faithful, black-box elicitation matches or exceeds all white-box methods; when explanations are absent or misleading, gradient-based attribution improves accuracy by 3-5 percentage points, and relevance patching, RelP, gives the largest gains, while logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit. Variance decomposition suggests gradients track decision computation, which fields causally drive the output, whereas other readouts are dominated by task representation, biases toward field identity and value. We release all models, code, and evaluation infrastructure.

Keywords: Mechanistic interpretability, Explainable AI, Model interpretability, Gradient attribution, Relevance patching, Fine-tuned models, Decision prediction, Elicitation confounder

306. ❌ Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

Authors: Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11056v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

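Scoring rationale: The paper studies RLVR (Reinforcement Learning with Verifiable Rewards) applied to large language models (LLMs), which bears directly on the LLMs keyword (10 points). It focuses on improving LLM reasoning, so the reasoning keywords (Chain of Thought, System 2 Thinking) are somewhat related (5 points each), though they are not the core method. Other keywords, such as MoE, SFT, RAG, and quantization, are not mentioned in the abstract or are unrelated (0 points).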

!!! tip deepseek-chat TL;DR

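The paper studies the credit assignment problem caused by sparse outcome-based rewards in Reinforcement Learning with Verifiable Rewards (RLVR). A polarity-entropy analysis shows that reasoning improvements concentrate on high-entropy tokens, and the proposed Entropy-Aware Policy Optimization (EAPO) outperforms strong baselines across two model families.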

摘要翻译

基于可验证奖励的强化学习（RLVR）显著提升了大型语言模型（LLMs）的推理能力。然而，其稀疏的基于结果的奖励带来了根本性的信用分配问题。我们通过奖励极性与标记熵的双重视角分析了这一问题。我们的诊断工具——四象限分解法——依据极性与熵对标记更新进行隔离分析，受控消融实验表明推理能力的提升集中体现在高熵象限。为从理论上解释这一现象，我们将条件互信息适配至自回归RLVR框架，并证明一个标记所能承载的信用上限受其熵值约束。这一视角产生了可检验的预测：推理增益主要源于高熵标记，且正向与负向更新在其中扮演独特角色。对GRPO的梯度分析进一步揭示了均匀奖励广播如何在高熵位置稀释信号，同时过度赋予确定性标记信用。基于这些发现，我们提出了熵感知策略优化（EAPO），该方法能相应调节标记级学习信号。大量实验表明，EAPO在两个模型家族中均优于现有强基线方法。

摘要 (Abstract)

Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning ability of Large Language Models (LLMs). However, its sparse outcome-based rewards pose a fundamental credit assignment problem. We analyze this problem through the joint lens of reward polarity and token entropy. Our diagnostic tool, the Four Quadrant Decomposition, isolates token updates by polarity and entropy, and controlled ablations show that reasoning improvements concentrate in the high-entropy quadrants. To justify this observation theoretically, we adapt Conditional Mutual Information to the autoregressive RLVR setting and prove that the credit a token can carry is upper-bounded by its entropy. This view yields testable predictions that reasoning gains arise primarily from high-entropy tokens, with unique roles for positive and negative updates. A gradient analysis of GRPO further reveals how uniform reward broadcast dilutes signal at high-entropy positions while over-crediting deterministic tokens. Grounded in these insights, we propose Entropy-Aware Policy Optimization (EAPO) that modulates token-level learning signals accordingly. Extensive experiments demonstrate that EAPO outperforms strong baselines across two model families.

Keywords: Reinforcement Learning, Verifiable Rewards, Credit Assignment, Token Entropy, Large Language Models, Reasoning, Policy Optimization, GRPO
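
The Four Quadrant Decomposition sorts token updates by reward polarity and token entropy. The sketch below shows one way such a split could be computed. It is an illustration under assumptions rather than the paper's procedure: the entropy threshold `tau`, the use of the advantage sign as polarity, and the grouping of zero advantages with the negative side are all choices made for this example.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def quadrant(advantage: float, entropy: float, tau: float) -> str:
    """Assign a token update to one of four polarity/entropy quadrants.

    Non-positive advantages are treated as negative polarity (an assumption).
    """
    polarity = "positive" if advantage > 0 else "negative"
    level = "high" if entropy >= tau else "low"
    return f"{polarity}/{level}-entropy"
```

For instance, a token sampled from a uniform two-way distribution has entropy `math.log(2)`, so with `tau = 0.5` and a positive advantage it falls in the `positive/high-entropy` quadrant, which is where the abstract reports reasoning gains concentrate.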

307. ❌ RTMC: Step-Level Credit Assignment via Rollout Trees

Authors: Tao Wang, Suhang Zheng, Xiaoxiao Xu Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11037v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

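Scoring rationale: 'RTMC: Step-Level Credit Assignment via Rollout Trees' focuses on credit assignment in reinforcement learning and proposes an advantage estimation method based on Monte Carlo rollout trees. The paper involves agents and multi-step decision making, but its core contribution is an improvement to a reinforcement learning algorithm rather than an LLM-specific technique. It does not mention LLMs, pre-training, fine-tuning, alignment, inference acceleration, hallucination mitigation, or similar large-model keywords, nor scientific AI applications such as bioinformatics. All keywords are therefore judged unrelated to the paper and score 0.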

!!! tip deepseek-chat TL;DR

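The paper targets fine-grained credit assignment in multi-step agentic reinforcement learning. It proposes Rollout-Tree Monte Carlo (RTMC) advantage estimation, which needs no learned critic, and improves pass@1 on SWE-bench Verified by 3.2 percentage points over GRPO.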

摘要翻译

多步智能体强化学习得益于细粒度信用分配，然而现有方法提供的选择有限：诸如GRPO等无评论家方法对轨迹中的每个动作赋予均匀优势值，而基于学习的价值网络则引入显著开销且在稀疏奖励下可能表现脆弱。我们观察到，针对同一问题的群体 rollout 常会经过重叠的中间状态，隐式形成一棵在连续决策点处分支的树状结构。基于这一洞见，我们提出 Rollout-Tree Monte Carlo（RTMC）优势估计方法，该方法通过聚合共享同一状态的多个 rollout 的回报统计量，生成逐步 Q 值与优势值——无需任何学习型评论家。一个状态-动作签名系统将原始交互历史压缩为紧凑、可比较的表示形式，使得跨 rollout 的状态匹配易于处理。在 SWE-bench Verified 基准测试中，RTMC 将 pass@1 指标较 GRPO 提升了 3.2 个百分点。

摘要 (Abstract)

Multi-step agentic reinforcement learning benefits from fine-grained credit assignment, yet existing approaches offer limited options: critic-free methods like GRPO assign a uniform advantage to every action in a trajectory, while learned value networks introduce notable overhead and can be fragile under sparse rewards. We observe that group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points. Building on this insight, we introduce Rollout-Tree Monte Carlo (RTMC) advantage estimation, which aggregates return statistics across rollouts sharing a common state to produce per-step Q-values and advantages, without any learned critic. A state-action signature system compresses raw interaction histories into compact, comparable representations, making cross-rollout state matching tractable. On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO.

Keywords: credit assignment, rollout trees, Monte Carlo, advantage estimation, multi-step reinforcement learning, agentic RL, step-level Q-values, state-action signature
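
The idea of aggregating returns across rollouts that share a state can be sketched as below. This is a simplified reading of the abstract, not the paper's estimator: it assumes each step already carries a hashable state signature and action signature, takes Q(s, a) as the mean return of rollouts passing through (s, a), takes V(s) as the mean return of rollouts passing through s, and defines the advantage as Q(s, a) minus V(s). The function name and data layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Step = Tuple[Hashable, Hashable]  # (state signature, action signature)


def _mean(values: List[float]) -> float:
    return sum(values) / len(values)


def rollout_tree_advantages(
    rollouts: List[Tuple[List[Step], float]],
) -> Dict[Step, float]:
    """Estimate per-step advantages from group rollouts without a learned critic.

    Each rollout is (steps, final_return). A state visited twice in one rollout
    contributes its return twice in this simplified version.
    """
    state_returns: Dict[Hashable, List[float]] = defaultdict(list)
    pair_returns: Dict[Step, List[float]] = defaultdict(list)
    for steps, ret in rollouts:
        for state, action in steps:
            state_returns[state].append(ret)
            pair_returns[(state, action)].append(ret)
    # Advantage = mean return after (s, a) minus mean return through s.
    return {
        (s, a): _mean(rets) - _mean(state_returns[s])
        for (s, a), rets in pair_returns.items()
    }
```

In a group where two rollouts share the prefix `("root", "edit")` and then diverge at state `"s1"`, the diverging actions receive different advantages, whereas a trajectory-level method such as GRPO would assign every action in a rollout the same value.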

308. ❌ Optimal Stability of KL Divergence under Gaussian Perturbations

Authors: Jialu Pan, Yufeng Zhang, Nan Hu, Keqin Li Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11026v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

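Scoring rationale: The paper develops stability theory for KL divergence under Gaussian perturbations and extends it to non-Gaussian distributions; it is foundational work in probability and information theory. The abstract mentions potential applications in deep learning and reinforcement learning, but the paper itself does not address large models, deep learning mechanisms, or any specific AI method or application. Every keyword targets a concrete area such as large-model techniques, training methods, inference optimization, or AI applications, so none is relevant to this purely theoretical work.

!!! tip deepseek-chat TL;DR

The paper studies the stability of KL divergence between an arbitrary distribution and Gaussian distributions, establishes a sharp (optimal) stability bound, and provides a theoretical foundation for applications such as KL-based out-of-distribution detection.

Abstract (Chinese translation)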

本文研究了库尔贝克-莱布勒（KL）散度在高斯扰动下的稳定性特征问题，并突破了传统方法局限于高斯分布族的限制。现有关于KL散度的松弛三角不等式严格依赖于所有相关分布均为高斯的假设，这限制了其在现代应用（如基于流的生成模型中的分布外检测）中的适用性。本文通过建立任意分布与高斯分布族之间在温和矩条件下的尖锐稳定性界，消除了这一限制。具体而言，设$P$为具有有限二阶矩的分布，$\mathcal{N}_1$和$\mathcal{N}_2$为多元高斯分布。我们证明：若$KL(P||\mathcal{N}_1)$较大且$KL(\mathcal{N}_1||\mathcal{N}_2)$至多为$ε$，则$KL(P||\mathcal{N}_2) \ge KL(P||\mathcal{N}_1) - O(\sqrt{ε})$。进一步，我们证明即使在高斯分布族内部，该$\sqrt{ε}$速率在一般情况下也是最优的。这一结果揭示了KL散度在高斯扰动下固有的稳定性特征，将经典仅适用于高斯的松弛三角不等式推广至一般分布。由于KL散度的非对称性及一般概率空间中三角不等式的缺失，该结论具有非平凡性。作为应用，我们为基于流的模型中基于KL散度的分布外检测分析提供了严格的理论基础，消除了先前工作中使用的强高斯假设。更广泛而言，我们的研究使得在深度学习和强化学习中出现的非高斯场景下进行基于KL散度的推理成为可能。

摘要 (Abstract)

We study the problem of characterizing the stability of Kullback-Leibler (KL) divergence under Gaussian perturbations beyond Gaussian families. Existing relaxed triangle inequalities for KL divergence critically rely on the assumption that all involved distributions are Gaussian, which limits their applicability in modern applications such as out-of-distribution (OOD) detection with flow-based generative models. In this paper, we remove this restriction by establishing a sharp stability bound between an arbitrary distribution and Gaussian families under mild moment conditions. Specifically, let $P$ be a distribution with finite second moment, and let $\mathcal{N}_1$ and $\mathcal{N}_2$ be multivariate Gaussian distributions. We show that if $KL(P||\mathcal{N}_1)$ is large and $KL(\mathcal{N}_1||\mathcal{N}_2)$ is at most $ε$, then $KL(P||\mathcal{N}_2) \ge KL(P||\mathcal{N}_1) - O(\sqrt{ε})$. Moreover, we prove that this $\sqrt{ε}$ rate is optimal in general, even within the Gaussian family. This result reveals an intrinsic stability property of KL divergence under Gaussian perturbations, extending classical Gaussian-only relaxed triangle inequalities to general distributions. The result is non-trivial due to the asymmetry of KL divergence and the absence of a triangle inequality in general probability spaces. As an application, we provide a rigorous foundation for KL-based OOD analysis in flow-based models, removing strong Gaussian assumptions used in prior work. More broadly, our result enables KL-based reasoning in non-Gaussian settings arising in deep learning and reinforcement learning.

Keywords: KL divergence, Gaussian perturbations, stability bound, optimal rate, out-of-distribution detection, flow-based models, probability theory, information theory
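A small numerical sketch may help make the $\sqrt{ε}$ scale concrete. It is restricted to one dimension and is not the paper's proof or its bound. It relies on a standard identity: when $KL(P||\mathcal{N}_1)$ is finite, $KL(P||\mathcal{N}_2) - KL(P||\mathcal{N}_1) = E_P[\log n_1(X) - \log n_2(X)]$, which is quadratic in $X$ and therefore depends on $P$ only through its mean and variance. The function names `kl_gauss_1d` and `kl_shift` are hypothetical.

```python
import math


def kl_gauss_1d(mu_p, sd_p, mu_q, sd_q):
    """Closed-form KL(N(mu_p, sd_p^2) || N(mu_q, sd_q^2)) in one dimension."""
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sd_q ** 2)
            - 0.5)


def kl_shift(mean_p, var_p, mu1, sd1, mu2, sd2):
    """KL(P||N2) - KL(P||N1) for any P with the given mean and variance.

    Computed as E_P[log n1(X) - log n2(X)]; assumes KL(P||N1) is finite.
    """
    e1 = (var_p + (mean_p - mu1) ** 2) / (2.0 * sd1 ** 2)
    e2 = (var_p + (mean_p - mu2) ** 2) / (2.0 * sd2 ** 2)
    return math.log(sd2 / sd1) - e1 + e2


# P has mean 3 and variance 4; N1 = N(0, 1); N2 = N(delta, 1).
for delta in (0.1, 0.01, 0.001):
    eps = kl_gauss_1d(0.0, 1.0, delta, 1.0)      # equals delta**2 / 2
    shift = kl_shift(3.0, 4.0, 0.0, 1.0, delta, 1.0)
    print(delta, eps, shift, shift / math.sqrt(eps))
```

For this choice the shift works out algebraically to $-3δ + δ^2/2 = -3\sqrt{2ε} + ε$, so the drop in KL is of order $\sqrt{ε}$ rather than $ε$, which is the scale the abstract's bound allows.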

309. ❌ Panoptic Pairwise Distortion Graph

Authors: Muhammad Kamran Janjua, Abdul Wahab, Bahador Rashidi Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11004v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

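Scoring rationale: The paper addresses image quality assessment. It proposes a region-based Distortion Graph representation for image pairs and contributes a dataset and a benchmark. It reports that multimodal large language models (MLLMs) fail at region-level distortion understanding, but this serves as a point of comparison rather than the core technical contribution. The core is a computer-vision image quality assessment task, not large-model mechanisms or applications. Only the first keyword (Large Language Models / LLMs / Foundation Models) receives a relevance of 5/10 (some connection) because MLLMs are mentioned; all other keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new approach to image quality assessment that represents an image pair as a region-based Distortion Graph, and contributes a matching dataset and benchmark to address the weakness of existing methods at region-level distortion understanding.

Abstract (Chinese translation)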

本研究为图像对比评估引入了一种新视角，将图像对表征为其区域的结构化组合。相比之下，现有方法侧重于整体图像分析，同时隐式依赖区域层面的理解。我们将场景图的概念从图像内部扩展到图像之间，并提出了一项新颖的“失真图”任务。失真图将成对图像视为基于区域的结构化拓扑，并以紧凑且可解释的图结构来表征密集的退化信息，如失真类型、严重程度、比较结果和质量分数。为实现学习失真图的任务，我们贡献了：（i）一个区域级数据集 PandaSet，（ii）一套具有不同区域级难度的基准测试 PandaBench，以及（iii）一个用于生成失真图的高效架构 Panda。我们证明，PandaBench 对当前最先进的多模态大语言模型构成了重大挑战，因为即使提供明确的区域线索，它们也无法理解区域级的退化现象。我们表明，在 PandaSet 上进行训练或使用失真图进行提示，能够激发对失真区域化的理解，从而为细粒度、结构化的成对图像评估开辟了新方向。

摘要 (Abstract)

In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.

Keywords: Distortion Graph, image quality assessment, region-level analysis, multimodal large language models, PandaSet dataset, PandaBench benchmark, pairwise image comparison, structured degradation representation

310. ❌ Sanity Checks for Agentic Data Science

Authors: Zachary T. Rewolinski, Austin V. Zane, Hao Huang, Chandan Singh, Chenglong Wang, Jianfeng Gao, Bin Yu Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11003v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

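Scoring rationale: The paper studies reliability validation for agentic data science (ADS) systems, centering on how stable the conclusions of LLM-driven data science agents (such as OpenAI Codex) are. Highly relevant (10/10): LLM Agents, the core object of study. Moderately relevant (5/10 each): Large Language Models (Codex is used), Tool Use (the agent carries out analysis tasks), Hallucination Mitigation (addresses spurious conclusions), Explainable AI (the validation framework), and AI for Science (a data science application). The remaining keywords concern specific techniques (MoE, RLHF, quantization, and so on) or other settings that the paper does not cover, and receive 0.

!!! tip deepseek-chat TL;DR

To address the risk that agentic data science systems reach falsely optimistic conclusions, the paper proposes lightweight sanity checks grounded in the PCS framework. Using perturbation tests, it assesses the conclusions OpenAI Codex reaches on 11 real-world datasets and finds that an affirmative conclusion is not well supported in 6 of them.

Abstract (Chinese translation)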

代理数据科学（Agentic Data Science，简称ADS）流程在能力和应用上均快速发展，诸如OpenAI Codex等系统现已能够直接分析数据集并生成统计问题的答案。然而，这些系统可能得出虚假的乐观结论，且用户难以察觉。为解决此问题，我们基于面向可信数据科学（veridical data science）的可预测性-可计算性-稳定性（Predictability-Computability-Stability，PCS）框架，提出了一对轻量级的合理性检验方法。这些检验通过合理的扰动来筛查代理是否能可靠地区分信号与噪声，作为一种可证伪性约束，能够揭示肯定性结论缺乏依据。两项检验共同刻画了ADS输出的可信度，例如其是否发现了稳定信号、是否对噪声作出响应，或是否对输入中的偶然因素敏感。我们在具有可控信噪比的合成数据上验证了该方法，确认合理性检验能够追踪真实信号强度。随后，我们在11个真实世界数据集上使用OpenAI Codex进行检验，评估了每个结论的可信度，发现在其中6个数据集中，肯定性结论缺乏充分支持，尽管单次ADS运行可能得出该结论。我们进一步分析了ADS系统的故障模式，发现ADS自我报告的信心度与其结论的经验稳定性之间校准不佳。

摘要 (Abstract)

Agentic data science (ADS) pipelines have grown rapidly in both capability and adoption, with systems such as OpenAI Codex now able to directly analyze datasets and produce answers to statistical questions. However, these systems can reach falsely optimistic conclusions that are difficult for users to detect. To address this, we propose a pair of lightweight sanity checks grounded in the Predictability-Computability-Stability (PCS) framework for veridical data science. These checks use reasonable perturbations to screen whether an agent can reliably distinguish signal from noise, acting as a falsifiability constraint that can expose affirmative conclusions as unsupported. Together, the two checks characterize the trustworthiness of an ADS output, e.g. whether it has found stable signal, is responding to noise, or is sensitive to incidental aspects of the input. We validate the approach on synthetic data with controlled signal-to-noise ratios, confirming that the sanity checks track ground-truth signal strength. We then demonstrate the checks on 11 real-world datasets using OpenAI Codex, characterizing the trustworthiness of each conclusion and finding that in 6 of the datasets an affirmative conclusion is not well-supported, even though a single ADS run may support one. We further analyze failure modes of ADS systems and find that ADS self-reported confidence is poorly calibrated to the empirical stability of its conclusions.

Keywords: Agentic Data Science, Sanity Checks, Predictability-Computability-Stability, Veridical Data Science, OpenAI Codex, Trustworthiness, Stability, Signal-to-Noise
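The general idea of screening a conclusion with perturbations can be sketched as below. The abstract does not spell out the two checks, so everything here is an assumption chosen for illustration: `toy_agent` is a hypothetical stand-in for an ADS run (it answers "yes" when the absolute Pearson correlation exceeds a threshold), bootstrap resampling stands in for a stability perturbation, and permuting the outcome stands in for a pure-noise perturbation.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def toy_agent(xs, ys, threshold=0.8):
    """Hypothetical stand-in for one ADS run: affirmative if |r| > threshold."""
    return abs(pearson(xs, ys)) > threshold


def bootstrap(xs, ys, rng):
    """Stability perturbation: resample rows with replacement."""
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]


def permute_outcome(xs, ys, rng):
    """Noise perturbation: shuffle the outcome to destroy any real signal."""
    shuffled = ys[:]
    rng.shuffle(shuffled)
    return xs, shuffled


def affirmative_rate(agent, xs, ys, perturb, runs, rng):
    """Fraction of perturbed reruns in which the agent still says 'yes'."""
    hits = 0
    for _ in range(runs):
        px, py = perturb(xs, ys, rng)
        hits += agent(px, py)
    return hits / runs


rng = random.Random(0)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]          # a noiseless linear signal
stable = affirmative_rate(toy_agent, xs, ys, bootstrap, 20, rng)
on_noise = affirmative_rate(toy_agent, xs, ys, permute_outcome, 20, rng)
```

A trustworthy affirmative conclusion would show a high rate under the stability perturbation and a low rate under the noise perturbation; a high rate on permuted data would suggest the agent is responding to noise.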

311. ❌ Self-supervised Pretraining of Cell Segmentation Models

Authors: Kaden Stillwagon, Alexandra Dunnum VandeLoo, Benjamin Magondu, Craig R. Forest Source: arXiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10609v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

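Scoring rationale: The paper focuses on cell instance segmentation, a computer-vision task in the AI for Science (bioinformatics) domain, and is unrelated to most of the large-language-model keywords. Its core contribution is the DINOCell framework, which improves cell segmentation through self-supervised pre-training and domain adaptation (Pre-training / Domain Adaptation) followed by supervised fine-tuning (Supervised Fine-tuning), so these two keywords are highly relevant (10/10). AI for Science / Bioinformatics is also highly relevant (10/10). Other keywords such as LLMs, MoE, and RLHF are not covered and receive 0.

!!! tip deepseek-chat TL;DR

Cell instance segmentation in microscopy images suffers from scarce labeled data and from the poor domain fit of models pre-trained on natural images. The paper proposes DINOCell, a self-supervised pre-training framework that adapts to the microscopy domain by continuing self-supervised training on unlabeled cell images before supervised fine-tuning. It improves on leading SAM-based models by 10.42% on the LIVECell benchmark and shows strong zero-shot performance on out-of-distribution datasets.

Abstract (Chinese translation)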

实例分割通过识别属于每个细胞的像素，实现了对显微镜图像中细胞空间与时间特性的分析。然而，该领域的发展受限于高质量标注显微镜数据集的稀缺性。近期许多方法通过使用大规模自然图像模型（如Segment Anything Model，简称SAM）中经过分割预训练的权重来初始化模型，以应对这一挑战。然而，从自然图像中学习到的表征通常编码了与显微镜数据对齐度较差的物体性和纹理先验，导致在领域偏移下性能下降。我们提出了DINOCell，一种用于细胞实例分割的自监督框架，该框架利用DINOv2的表征，并通过在监督微调前对未标注的细胞图像进行持续的自监督训练，使其适应显微镜数据领域。在LIVECell基准测试中，DINOCell取得了0.784的SEG分数，比领先的基于SAM的模型提升了10.42%，并在三个分布外显微镜数据集上展现出强大的零样本性能。这些结果凸显了领域自适应自监督预训练对于实现鲁棒细胞分割的益处。

摘要 (Abstract)

Instance segmentation enables the analysis of spatial and temporal properties of cells in microscopy images by identifying the pixels belonging to each cell. However, progress is constrained by the scarcity of high-quality labeled microscopy datasets. Many recent approaches address this challenge by initializing models with segmentation-pretrained weights from large-scale natural-image models such as Segment Anything Model (SAM). However, representations learned from natural images often encode objectness and texture priors that are poorly aligned with microscopy data, leading to degraded performance under domain shift. We propose DINOCell, a self-supervised framework for cell instance segmentation that leverages representations from DINOv2 and adapts them to microscopy through continued self-supervised training on unlabeled cell images prior to supervised fine-tuning. On the LIVECell benchmark, DINOCell achieves a SEG score of 0.784, improving by 10.42% over leading SAM-based models, and demonstrates strong zero-shot performance on three out-of-distribution microscopy datasets. These results highlight the benefits of domain-adapted self-supervised pretraining for robust cell segmentation.

Keywords: cell instance segmentation, self-supervised pretraining, domain adaptation, microscopy images, DINOv2, supervised fine-tuning, zero-shot performance, LIVECell benchmark

312. ❌ The Dynamic Origin of Kleiber’s Law

Authors: Riccardo Marchesi Source: arXiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10476v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

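Scoring rationale: The paper studies a metabolic scaling law in biophysics (Kleiber's law) and belongs to theoretical biology and biophysics. All keywords relate directly to large models, deep learning, or AI mechanisms and applications, none of which the paper touches. The only keyword with a possible connection is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper is a scientific (biology) work; it uses no AI methods, so it receives only 5/10 (some connection). All other keywords receive 0 (unrelated).

!!! tip deepseek-chat TL;DR

The paper challenges the conventional view that Kleiber's law (metabolic rate scaling as the 3/4 power of body mass) arises from minimizing viscous dissipation in fractal transport networks. It argues that the law is instead a dynamic signature of pulsatile wave physics, derives a general allometric equation, and explains the shift to steeper scaling ($β\approx 0.9$) observed in small mammals and invertebrates.

Abstract (Chinese translation)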

普遍存在的3/4代谢标度指数——即克莱伯定律——长期以来被归因于分形输运网络内粘性耗散的最小化。本文颠覆了这一传统解释，证明克莱伯定律本质上是脉动波物理学的特征，而非稳态几何结构的结果。通过将局部分支优化与全局异速生长相耦合，我们推导出精确的广义代谢指数公式 $β= dα/(2d+α)$，该公式严格地将局部输运微观物理映射到生物体的全局标度关系。我们证明，近端脉管系统中的动态波阻抗匹配在三维空间中唯一地强制实现了 $β= 3/4$。这一界限受到动态保护：任何对粘性网络的静态优化都无法复现该结果。由此，我们解析地预测了从波动主导到粘性主导转变的临界体重，并成功解释了在小型哺乳动物和无脊椎动物中观察到的向更陡峭异速标度（$β\approx 0.9$）的经验性转变，且无需引入任何自由参数。此外，我们指出经典的West-Brown-Enquist (WBE) 模型在其自身的几何假设下存在结构发散，在所需的近端主导极限下失效。我们的框架在跨越五个门类的九个生物系统中得到了验证——包括脊椎动物脉管系统、昆虫气管、植物木质部和海绵水管系统——能够基于独立的生物物理测量数据准确预测经验分支指数。最终，我们建立了一个普适的异速生长状态方程，将多样的生物网络组织成离散的普适性类别，并对从鼩鼱到扁形动物等不同进化支系提出了可证伪的预测。

摘要 (Abstract)

The ubiquitous $3/4$ metabolic scaling exponent, known as Kleiber’s law, has long been attributed to the minimization of viscous dissipation within fractal transport networks. In this paper, we invert this standard narrative, demonstrating that Kleiber’s law is fundamentally a signature of pulsatile wave physics rather than steady-state geometry. By coupling local branching optimization to global allometry, we derive the exact generalized metabolic exponent $β= dα/(2d+α)$, which strictly maps local transport microphysics to global organismal scaling. We show that dynamic wave-impedance matching in the proximal vasculature uniquely enforces $β= 3/4$ in three dimensions. This bound is dynamically protected: no static optimization of a viscous network can reproduce it. Consequently, we analytically predict the critical body mass for the wave-to-viscous transition, successfully explaining the empirical shift to steeper allometric scaling ($β\approx 0.9$) in small mammals and invertebrates with no free parameters. Furthermore, we demonstrate that the classical West–Brown–Enquist (WBE) derivation is structurally divergent under its own geometric assumptions, failing at the required proximal-dominance limit. Our framework is validated across nine biological systems spanning five phyla – including vertebrate vasculature, insect tracheae, plant xylem, and sponge canals – accurately predicting empirical branching exponents from independent biophysical measurements. Ultimately, we establish a general allometric equation of state that organizes diverse biological networks into discrete universality classes, generating falsifiable predictions across clades from shrews to flatworms.

Keywords: Kleiber’s law, metabolic scaling, allometric scaling, pulsatile wave physics, vascular networks, biological transport, West-Brown-Enquist model, universality classes
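As a quick consistency check on the formula quoted in the abstract, one can invert $β= dα/(2d+α)$ for $α$. The abstract does not define $α$; reading it as the local branching exponent is an assumption made here. The algebra itself follows directly from the stated formula.

```latex
\beta = \frac{d\,\alpha}{2d+\alpha}
\;\Longrightarrow\;
2d\,\beta = \alpha\,(d-\beta)
\;\Longrightarrow\;
\alpha = \frac{2d\,\beta}{d-\beta}.

% Three dimensions, Kleiber exponent:
d = 3,\ \beta = \tfrac{3}{4}:\qquad
\alpha = \frac{2\cdot 3\cdot \tfrac{3}{4}}{3-\tfrac{3}{4}}
       = \frac{9/2}{9/4} = 2 .

% Three dimensions, steeper scaling quoted for small organisms:
d = 3,\ \beta = 0.9:\qquad
\alpha = \frac{5.4}{2.1} = \frac{18}{7} \approx 2.57 .
```

So within this formula, $β= 3/4$ in three dimensions corresponds to $α= 2$, and the steeper $β\approx 0.9$ regime corresponds to a larger $α$ of roughly 2.57.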

313. ❌ Tackling instabilities of quantum Krylov subspace methods: an analysis of the numerical and statistical errors

Authors: Maria Gabriela Jordão Oliveira, Karl Michael Ziems, Nina Glaser Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11532v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

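Scoring rationale: The paper studies Krylov subspace methods in quantum computing, focusing on numerical stability and statistical error in ground-state energy estimation for quantum systems. Every scoring keyword concerns large models, deep learning, or AI techniques and applications, whereas this paper belongs to quantum computing and quantum algorithms and has no direct connection to any of them. It mentions no large models, deep learning, or AI techniques, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper analyzes the instability of quantum Krylov subspace methods for estimating ground-state energies. It finds that in realistic noisy settings statistical fluctuations, rather than ill-conditioning, are the main obstacle, and introduces two new filters that assess the reliability of a solution without knowledge of the true eigenspectrum.

Abstract (Chinese translation)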

Krylov子空间方法是用于估算量子系统基态能量的早期容错量子算法中研究最为广泛的方法之一。然而，快速出现的病态问题可能使得精确能量难以甚至无法获取。在本研究中，我们通过存在与不存在采样噪声的数值模拟，分析了这些方法的数值稳定性与统计问题。虽然在理想的数值模拟中，广义特征值问题确实会随着Krylov子空间维度的增加而变得不稳定，但我们发现，在实际的噪声环境中，这些方法主要并非受困于病态问题。相反，统计波动占据主导地位，并可能阻碍可靠解的提取，除非采用适当的正则化或滤波技术。为此，我们引入了两种新的度量指标——虚部滤波器和酉滤波器，它们能够在无需已知真实特征谱的情况下，成功评估所获解的可靠性。

摘要 (Abstract)

Krylov subspace methods are among the most extensively studied early fault-tolerant quantum algorithms for estimating ground-state energies of quantum systems. However, the rapid onset of ill-conditioning might make accurate energies difficult or even impossible to retrieve. In this communication, we analyse the numerical stability and statistical problems of these methods using numerical simulations both in the presence and absence of sampling noise. While in ideal numerical simulations the generalized eigenvalue problem indeed becomes unstable with increased Krylov subspace size, we find that, in realistic noisy settings, these methods do not primarily suffer from ill-conditioning. Instead, statistical fluctuations dominate and can prevent reliable solution extraction unless appropriate regularization or filtering techniques are employed. We consequently introduce two new metrics, the imaginary and unitary filters, that successfully assess the reliability of the obtained solutions without any knowledge of the true eigenspectrum.

Keywords: Krylov subspace methods, quantum algorithms, ground-state energies, numerical stability, statistical errors, ill-conditioning, regularization, filters
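To make the generalized eigenvalue problem and its conditioning concrete, here is a deliberately tiny sketch: a two-vector real-time Krylov basis, $|ψ_0\rangle$ and $e^{-iH\,dt}|ψ_0\rangle$, for a toy Hamiltonian that is already diagonal. The setup, the function name `krylov2_energies`, and the closed-form 2×2 solution are illustrative assumptions; the paper's simulations, noise model, and proposed filters are not reproduced.

```python
import cmath
import math


def krylov2_energies(energies, amplitudes, dt):
    """Solve det(H - E S) = 0 in the basis {|psi0>, exp(-i H dt)|psi0>}.

    energies:   eigenvalues of a toy diagonal Hamiltonian H.
    amplitudes: components of the normalized start state in the eigenbasis.
    Returns ((E_low, E_high), det_S), where det_S = 1 - |<psi0|psi1>|^2.
    Requires det_S > 0, i.e. the two basis vectors must not be parallel.
    """
    probs = [abs(c) ** 2 for c in amplitudes]
    phases = [cmath.exp(-1j * e * dt) for e in energies]

    # <psi0|H|psi0> equals <psi1|H|psi1> because the evolution commutes with H.
    h00 = sum(p * e for p, e in zip(probs, energies))
    s = sum(p * z for p, z in zip(probs, phases))               # <psi0|psi1>
    h01 = sum(p * e * z for p, e, z in zip(probs, energies, phases))

    # det(H - E S) = a E^2 + b E + c for the Hermitian 2x2 pencil.
    a = 1.0 - abs(s) ** 2
    b = -(2.0 * h00 - 2.0 * (h01 * s.conjugate()).real)
    c = h00 ** 2 - abs(h01) ** 2
    disc = math.sqrt(max(b * b - 4.0 * a * c, 0.0))
    return ((-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)), a


# Two-level system: the 2D Krylov space is the whole space, so roots are exact.
roots2, det2 = krylov2_energies([-1.0, 1.0], [0.5 ** 0.5, 0.5 ** 0.5], 0.7)

# Three-level system: the lowest root is a variational upper bound on -1.0.
roots3, det3 = krylov2_energies([-1.0, 0.0, 2.0], [3 ** -0.5] * 3, 0.5)

# A short time step makes the basis vectors nearly parallel: det S shrinks.
_, det_small = krylov2_energies([-1.0, 1.0], [0.5 ** 0.5, 0.5 ** 0.5], 0.01)
```

Because the roots are obtained by dividing by `det_S`, any error in the measured matrix elements is amplified as `det_S` approaches zero. This is the ill-conditioning mechanism the abstract contrasts with statistical fluctuations.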

314. ❌ Shape-dependence of electrophoretic mobility

Authors: Arkava Ganguly, Ankur Gupta Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11771v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

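Scoring rationale: The paper studies how electrophoretic mobility depends on particle shape, within classical fluid mechanics and colloid science. Of the 27 keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has a weak connection (5/10): the abstract states that the results were generated with Claude Code (an AI model) under the authors' supervision and that the manuscript discusses the use of AI in theoretical research, but this is not the paper's core scientific content. The other 26 keywords concern large models, deep learning mechanisms, or specific applications and are unrelated to this physics theory work.

!!! tip deepseek-chat TL;DR

The paper studies the electrophoretic mobility of nearly spherical particles at arbitrary Debye length. Using perturbation theory it derives a universal shape correction coefficient, finds that only the quadrupolar ($P_2$) shape component affects the mobility at leading order, and confirms the expected behavior in the thick- and thin-double-layer limits.

Abstract (Chinese translation)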

球形粒子的电泳迁移率已有充分研究，但粒子形状如何在任意德拜长度下修正该迁移率仍是一个悬而未决的问题。本文计算了近球形粒子在任意粒子尺寸与德拜长度比 $κa$ 下的电泳迁移率，其表面由 $r_s(θ) = a[1 + \varepsilon f(θ)]$ 描述，其中 $\varepsilon \ll 1$。通过结合体积分公式与微扰域技术，我们推导出一个普适的形状修正系数 $σ_2(κa)$，使得迁移率可简洁表示为 $C_\parallel = f_H(κa)\,[1 + \varepsilon\,c_2\,σ_2(κa)]$，其中 $f_H$ 为亨利函数。研究表明，$σ_2$ 在厚双电层（Hückel）极限下趋近于 $+1/5$（此时仅由斯托克斯阻力修正主导），在薄双电层（Smoluchowski）极限下趋近于零，从而重现了经典的形状无关定理。该微扰理论与长椭球及扁椭球在两种取向下精确解定量吻合。一个关键发现是：仅粒子形状的 $P_2$（四极矩）分量会对主导阶迁移率产生影响；由于偶极外场与形状微扰耦合的角向选择规则，更高阶谐波分量对电泳无贡献。本文结果由 Claude Code（Anthropic，Opus 4.6 模型）在作者监督下生成。关于人工智能在理论研究中应用的思考，以及开发过程中的代表性指令示例，已在正文与附录中提供。

摘要 (Abstract)

The electrophoretic mobility of a spherical particle is well understood, yet how particle shape modifies this mobility at arbitrary Debye length remains an open question. Here, we compute the electrophoretic mobility of a nearly spherical particle whose surface is described by $r_s(θ) = a[1 + \varepsilon f(θ)]$, with $\varepsilon \ll 1$, at arbitrary ratio of particle size to Debye length $κa$. Using a volume-integral formulation combined with domain perturbation techniques, we derive a universal shape correction coefficient $σ_2(κa)$ such that the mobility takes the compact form $C_\parallel = f_H(κa)\,[1 + \varepsilon\,c_2\,σ_2(κa)]$, where $f_H$ is Henry’s function. We show that $σ_2$ interpolates between $+1/5$ in the thick-double-layer (Hückel) limit, governed solely by the Stokes drag correction, and zero in the thin-double-layer (Smoluchowski) limit, recovering the classical shape-independence theorem. The perturbation theory agrees quantitatively with exact spheroid solutions for both prolate and oblate orientations. A key finding is that only the $P_2$ (quadrupolar) component of the particle shape affects the mobility at leading order; higher harmonics are electrophoretically silent due to angular selection rules governing the coupling between the dipolar applied field and the shape perturbation. The results in this paper were generated using Claude Code (Anthropic, Opus 4.6 model) with supervision from the authors. Our thoughts on the usage of AI for theoretical research, along with representative prompts from the development process, are provided in the manuscript and Appendix.

Keywords: electrophoretic mobility, particle shape, Debye length, perturbation theory, shape correction coefficient, spheroid solutions, quadrupolar component, angular selection rules
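The size of the shape correction in the two limits quoted in the abstract can be read off the compact mobility formula. The abstract does not define $c_2$; treating it as the $P_2$ coefficient of the shape function $f(θ)$ is an assumption made here, and the values $\varepsilon = 0.1$, $c_2 = 1$ are purely illustrative.

```latex
\frac{C_\parallel}{f_H(\kappa a)} = 1 + \varepsilon\, c_2\, \sigma_2(\kappa a)

% Thick double layer (H\"uckel limit), \sigma_2 \to +1/5:
\frac{C_\parallel}{f_H} \to 1 + \frac{\varepsilon\, c_2}{5},
\qquad
\varepsilon = 0.1,\ c_2 = 1:\quad 1 + \frac{0.1}{5} = 1.02 .

% Thin double layer (Smoluchowski limit), \sigma_2 \to 0:
\frac{C_\parallel}{f_H} \to 1 .
```

Under these illustrative values the shape changes the mobility by 2% in the Hückel limit and not at all in the Smoluchowski limit, consistent with the shape-independence theorem the abstract mentions.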

315. ❌ Ensemble density functional theory of excited states: Exact N-centered formalism and practical opportunities

Authors: Lucien Dupuy, Toni Chiti, Jérémy Morere, Emmanuel Fromager Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11191v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

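Scoring rationale: The paper deals with extensions of density functional theory (DFT) in quantum chemistry, specifically the exact formulation of, and computational strategies for, N-centered ensemble DFT (Nc-eDFT) for excited states. It is unrelated to nearly all of the keywords, which concern large models, deep learning, and the principles of AI techniques and therefore belong to machine learning/artificial intelligence, whereas this is purely theoretical quantum chemistry and physics. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper falls within computational chemistry, a potential application area for AI for Science, but it neither uses nor mentions any AI or machine-learning method, so it receives only 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper presents the exact formulation of N-centered ensemble density functional theory (Nc-eDFT), which describes neutral and charged electronic excitations within one unified framework, and proposes three practical computational strategies (the design of ensemble density-functional approximations, quasi-degenerate perturbation theory, and quantum embedding theory) to advance excited-state electronic structure calculations.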

Abstract

Ground-state electronic structure calculations using Kohn-Sham density functional theory (KS-DFT) offer an unprecedented balance between efficiency and accuracy, now paradigmatic to the fields of quantum chemistry and condensed matter physics. KS-DFT can be extended to model electronic excitations through a density mapping onto a non-interacting ensemble state in which, unlike in thermal theories, the weights assigned to the excited states vary independently. Thanks to its numerous appeals, like the adequate treatment of multiple excitations for which the widely-used time-dependent extension of DFT struggles, ensemble DFT (eDFT) has lately become a vibrant area of research. Recently, an enlarged type of ensemble, referred to as N-centered (Nc) ensemble, has been introduced to describe within the same unified formalism both neutral and charged electronic excitations. This perspective paper provides a detailed exposition of exact Nc-eDFT, with a comprehensive review of its formal developments. To cut practical computational tools out of the exact theory, three original strategies are presented, complementing existing approaches. The first one, related to the design of ensemble density-functional approximations, consists in recycling regular ground-state functionals by dressing them with a weight-dependent scaling function deduced from exact properties of eDFT. We then explore quasi-degenerate formulations of ensemble density-functional perturbation theory, suggesting alternative definitions for the ensemble Hartree, exchange, and correlation energies, individually, and paving the way toward robust orbital-dependent eDFAs. Finally, we revisit and generalize the concept of quantum bath for an ensemble of non-interacting states, laying the foundations of an in-principle exact (in the sense of lattice eDFT) quantum embedding theory of excited states.

Keywords: ensemble density functional theory, excited states, N-centered ensemble, density-functional approximations, quantum embedding theory, electronic excitations, Kohn-Sham DFT, density-functional perturbation theory

316. ❌ Computational Generation of Substrate-Specific Molecular Cages

Authors: Noé Demange, Yann Strozecki, Sandrine Vial Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11060v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
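
As a rough check of the verdict above, the sketch below recomputes the pass mark from the formula given in the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0) and compares it with the sum of the last column of the table. How each per-keyword score is derived from weight and relevance is not stated in the report, so the sketch takes the per-row scores as given; the function names are illustrative.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report's configuration: base + factor * total weight."""
    return base + factor * total_weight


def verdict(row_scores: list[float], total_weight: float) -> tuple[float, float, bool]:
    """Sum the per-keyword scores and compare the total with the pass mark."""
    total = sum(row_scores)
    threshold = pass_mark(total_weight)
    return total, threshold, total >= threshold


# Last column of the table above: all 27 keyword scores are 0.0.
total, threshold, passed = verdict([0.0] * 27, total_weight=27.0)
print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
# prints "0.0 / 26.6 -> fail"
```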

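Scoring rationale: The paper "Computational Generation of Substrate-Specific Molecular Cages" belongs to computational chemistry and molecular design, and proposes an algorithmic method for building molecular cages for a specific substrate. It involves molecular modeling, graph algorithms, and computational chemistry, but not large language models (LLMs), deep learning, neural networks, or any other AI model technology. Every keyword except the last points explicitly at large models, deep learning, and related techniques (training methods, inference optimization, alignment, agents, and so on), whereas this is purely algorithmic and computational-chemistry research unrelated to them. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the paper falls within computational chemistry (which can be viewed as cheminformatics or as a broad application of AI for science); however, it uses only classical algorithms rather than AI or machine learning, so it receives 5 points (some relevance). All other keywords receive 0 points (completely unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes an algorithm for computationally generating molecular cages for a specific substrate: it models a cage as a graph of atoms with spatial coordinates, places binding patterns, and connects them with the smallest possible molecular paths. The most efficient variant can build cages of more than a hundred atoms.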

Abstract

In this paper, we propose a method to build molecular cages designed to capture a specific substrate. We model a cage as a graph of atoms with coordinates in space, and several constraints on their edges (degree, length and angle). We use a simple method to place binding patterns which are able to interact with certain parts of the substrate. We then propose an algorithm which considers all possible ways of connecting these binding patterns and tries to construct the smallest possible molecular paths realizing these connections. We investigate many variants of our method in order to obtain the most efficient algorithm, able to build cages of more than a hundred atoms.

Keywords: molecular cages, substrate-specific, computational generation, graph of atoms, binding patterns, algorithm, smallest molecular paths, computational chemistry
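
The graph model described in the abstract (atoms with spatial coordinates, and constraints on degree, bond length, and bond angle) could be represented along the following lines. This is only a sketch: the class name, the numeric bounds, and the three-atom example are assumptions made for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass, field

Coord = tuple[float, float, float]


@dataclass
class CageGraph:
    """Atoms with 3D coordinates plus bonds, checked against simple geometric bounds."""
    coords: list[Coord]
    bonds: set[frozenset[int]] = field(default_factory=set)
    # Illustrative bounds only (degree, length in angstroms, angle in degrees).
    max_degree: int = 4
    min_len: float = 0.9
    max_len: float = 1.7
    min_angle: float = 90.0

    def add_bond(self, i: int, j: int) -> None:
        if i == j:
            raise ValueError("a bond needs two distinct atoms")
        self.bonds.add(frozenset((i, j)))

    def neighbours(self, i: int) -> list[int]:
        return sorted(j for b in self.bonds if i in b for j in b if j != i)

    def _angle(self, a: int, centre: int, c: int) -> float:
        u = [p - q for p, q in zip(self.coords[a], self.coords[centre])]
        v = [p - q for p, q in zip(self.coords[c], self.coords[centre])]
        cos = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

    def is_valid(self) -> bool:
        # Length constraint on every bond.
        for b in self.bonds:
            i, j = tuple(b)
            if not self.min_len <= math.dist(self.coords[i], self.coords[j]) <= self.max_len:
                return False
        for i in range(len(self.coords)):
            nbrs = self.neighbours(i)
            # Degree constraint on every atom.
            if len(nbrs) > self.max_degree:
                return False
            # Every pair of bonds meeting at atom i must open at least min_angle.
            for a in range(len(nbrs)):
                for c in range(a + 1, len(nbrs)):
                    if self._angle(nbrs[a], i, nbrs[c]) < self.min_angle:
                        return False
        return True


chain = CageGraph(coords=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)])
chain.add_bond(0, 1)
chain.add_bond(1, 2)
print(chain.is_valid())
```

A path-construction algorithm of the kind the abstract describes would call a check like `is_valid` each time it extends a candidate path, discarding extensions that violate a constraint.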

317. ❌ opt-DDAP: Optimisable density-derived atomic point charges via automatic differentiation

Authors: Mohith H., Sudarshan Vijay Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.10984v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

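Scoring rationale: The paper belongs to computational chemistry and proposes opt-DDAP, a method that optimizes density-derived atomic point charges via automatic differentiation, to improve the description of long-range electrostatics in interatomic potentials. It is unrelated to nearly all of the keywords, which concern large language models, the principles of deep learning techniques, or their direct application in science. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper is computational chemistry (related to cheminformatics), but it does not itself use AI or machine-learning methods (it only notes that the optimized charges can serve as inputs to machine-learning potentials, which is not its core contribution), so it receives 5 points (some relevance). The other keywords are completely unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the limitations of the density-derived atomic point (DDAP) charge method, which relies on fixed heuristic parameters and on a solver that becomes numerically unstable for complex or covalent systems. It proposes opt-DDAP, which reformulates the algorithm as a differentiable computational graph and uses automatic differentiation to optimize the Gaussian basis parameters and the reciprocal-space cutoff. Validation on NaCl vacancy supercells and MoS$_2$ shows faithful reconstruction of both absolute and difference charge densities.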

Abstract

Interatomic potentials which accurately describe long-range electrostatics require atom-centred charges. One such method to determine these atom-centred charges from density functional theory (DFT) calculations is the density-derived atomic point (DDAP) charge method. DDAP fits atom-centred Gaussians to the ground-state DFT charge density and preserves the multipole moments that govern long-range electrostatics. While these charges accurately predict long-range behaviour, in practice, they are limited by their reliance on fixed, heuristic parameters and a constrained solver that becomes numerically unstable for complex or covalent systems. In this work, we present opt-DDAP, which solves this limitation by reformulating the algorithm as a differentiable computational graph. This reformulation allows for the optimisation of Gaussian basis parameters and the reciprocal-space cutoff using automatic differentiation. To ensure numerical robustness through this automatic differentiation process, we replace the conventional Lagrange-multiplier approach with a pseudo-inverse solution followed by charge renormalisation, maintaining stability even in the presence of ill-conditioned matrices. We validate the framework on NaCl vacancy supercells and on MoS$_2$, demonstrating faithful reconstruction of both absolute and difference charge densities. The optimised charges are intended to serve as inputs to effective electrostatic models in machine-learning and empirical interatomic potentials that incorporate long-range interactions.

Keywords: opt-DDAP, density-derived atomic point charges, automatic differentiation, interatomic potentials, electrostatics, charge density, Gaussian basis optimization, computational chemistry

318. ❌ Comparing and Contrasting Vibrational Wavepacket Dynamics and Impulsive Stimulating Raman Scattering Descriptions of Pump-Probe Spectroscopy: A Theoretical Study

Authors: Subho Mitra, Arijit K. De Journal/Source: arxiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10725v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

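Scoring rationale: This is purely theoretical physics/chemistry research on the simulation and comparison of vibrational wavepacket dynamics and impulsive stimulated Raman scattering in pump-probe spectroscopy; it does not involve large models, deep learning, artificial intelligence, or related techniques. All of the keywords concern large-model techniques, AI applications, or related methodology and are unrelated to the paper's field (theoretical spectroscopy), so every keyword scores 0.

!!! tip deepseek-chat TL;DR

Using theoretical simulations, the study compares the vibrational wavepacket dynamics approach with the impulsive stimulated Raman scattering (ISRS) framework for pump-probe spectroscopy. It finds that coherences between non-adjacent vibrational levels must be included for the ISRS description to agree well with the wavepacket approach, and that for the specific pump/probe bandwidths considered the coherent anti-Stokes pathway dominates the observed signal.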

Abstract

We simulate a third-order nonlinear signal in pump-probe spectroscopy from the interference between first- and second-order wavepackets (WPs), as well as from a state-to-state transition for Stokes and coherent anti-Stokes pathways in the context of impulsive stimulated Raman scattering (ISRS) excitation. We present a detailed step-by-step description of both methods. The results obtained from these two methods are compared and contrasted through simulations of the excited-state absorption signal. While within the ISRS framework, vibrational dynamics is often attributed primarily to coherences between adjacent vibrational levels in the excited electronic states, our results show that coherences involving non-adjacent vibrational levels need to be calculated for a better agreement with the WP approach. We also show that for the specific choice of pump/probe spectral bandwidths, the coherent anti-Stokes pathway contributes most to the observed signal.

Keywords: pump-probe spectroscopy, vibrational wavepacket dynamics, impulsive stimulated Raman scattering, nonlinear signal simulation, coherence between vibrational levels, excited-state absorption, coherent anti-Stokes pathway, theoretical study

319. ❌ Symplectic Constraints in Quantum Reaction Dynamics: Squeezed-State Suppression and Candidate Width Scales

Authors: Stephen Wiggins Journal/Source: arxiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10625v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

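Scoring rationale: The paper studies symplectic-geometric constraints in quantum reaction dynamics, focusing on the suppression of transmission of squeezed-state wavepackets through a quantum saddle-point bottleneck, and uses theoretical tools such as the Weyl symbol, the quantum normal form, and classical symplectic widths. All of the scoring keywords concern large models, deep learning techniques, and their applications (LLM, MoE, SFT, RAG, quantization, inference acceleration, AI for Science, and so on), whereas this paper is theoretical physics/quantum chemistry and involves no artificial intelligence, machine learning, or large-model content, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies a geometric suppression effect on the transmission of highly squeezed Gaussian wavepackets through a quantum saddle-point bottleneck. It finds that squeezing strongly depletes the effective reactive energy and therefore strongly suppresses transmission, consistent with the classical symplectic-width picture.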

Abstract

Classical reaction dynamics suggests transport through an index-1 saddle is organized not just by flux, but by local symplectic width scales of bounded proxy neighborhoods near the bottleneck. We investigate if a related geometric effect appears in the quantum regime for highly squeezed Gaussian wavepackets. Building on de Gosson’s symplectic approach, we analyze how transverse bath-mode squeezing modifies transmission across a quantum normal-form (QNF) bottleneck. To avoid the instability of propagating states with extreme phase-space eccentricity, we use the Weyl-symbol formulation of the QNF. For the quadratic saddle-center model, we derive an exact baseline transmission formula by convolving the bath’s squeezed-state number distribution with the 1D Kemble transmission factor. For anharmonic truncated QNF models, we enforce strict algebraic energy conservation and evaluate exact Gaussian expectation-value diagnostics of the Weyl symbol via Wick-Isserlis moment formulas. Results reveal a pronounced squeeze-induced suppression of transmission. As the squeezed state’s bath-plane geometric scale grows relative to the classical candidate width, the expected bath action grows rapidly. Consequently, effective reactive energy is strongly depleted, driving transmission into a severely suppressed regime. We interpret this as evidence of a quantum geometric suppression mechanism consistent with the classical candidate symplectic-width picture. While not yet a rigorous quantum non-squeezing theorem, this work provides a concrete framework linking squeezed-state covariance geometry, normal-form action scales, and mode-specific quantum reactivity near an index-1 saddle.

Keywords: quantum reaction dynamics, squeezed states, symplectic constraints, transmission suppression, quantum normal form, Weyl symbol, index-1 saddle, geometric suppression
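
The baseline formula described in the abstract, a convolution of the bath's squeezed-state number distribution with the 1D Kemble transmission factor, can be sketched numerically. The sketch makes assumptions that the abstract does not state: a single bath mode prepared in a squeezed vacuum with squeeze parameter r, units with ħ = 1, the saddle energy taken as zero, and illustrative values for the bath and barrier frequencies. It uses the standard squeezed-vacuum number distribution and the standard Kemble factor for a parabolic barrier; the paper's own expressions may differ.

```python
import math


def squeezed_vacuum_distribution(r: float, n_max: int) -> list[float]:
    """P(n) for a squeezed vacuum; only even n are populated.

    P(0) = 1/cosh(r), and P(2m) = P(2m-2) * tanh(r)**2 * (2m-1)/(2m).
    """
    probs = [0.0] * (n_max + 1)
    t2 = math.tanh(r) ** 2
    p = 1.0 / math.cosh(r)
    m = 0
    while 2 * m <= n_max:
        probs[2 * m] = p
        m += 1
        p *= t2 * (2 * m - 1) / (2 * m)
    return probs


def kemble(e_reactive: float, barrier_freq: float) -> float:
    """Kemble transmission through a parabolic barrier whose top is at zero energy."""
    x = -2.0 * math.pi * e_reactive / barrier_freq
    if x > 700.0:  # exp would overflow; transmission is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


def transmission(e_total: float, r: float, bath_freq: float = 1.0,
                 barrier_freq: float = 1.0, n_max: int = 400) -> float:
    """Average the Kemble factor over the bath number distribution.

    The energy left for the reactive mode is e_total - bath_freq * (n + 1/2).
    """
    probs = squeezed_vacuum_distribution(r, n_max)
    return sum(p * kemble(e_total - bath_freq * (n + 0.5), barrier_freq)
               for n, p in enumerate(probs))


for r in (0.0, 1.0, 2.0):
    print(f"r = {r:.1f}: T = {transmission(2.0, r):.4f}")
```

For these illustrative values the transmission falls as r grows, because squeezing moves bath population to higher n and leaves less energy in the reactive mode, which is the qualitative trend the abstract reports.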

320. ❌ Location of the liquid-vapor critical point in aluminum

Authors: Xuyang Long, Kai Luo Journal/Source: arxiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10561v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

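Scoring rationale: The paper uses deep potential molecular dynamics and large-scale simulations to predict the liquid-vapor critical point of aluminum, and belongs to computational materials science. Although it uses a deep-learning technique (deep potential), the keywords focus on large language models (LLMs) and related techniques (fine-tuning, inference optimization, agents, and so on), whereas this paper is about molecular dynamics simulation and the prediction of material properties and has no direct connection to LLM technology. The only marginally relevant keyword is "AI for Science", because the paper applies AI to a scientific problem (materials science), but it does not use large models, so it receives 5 points (some relevance). All other keywords are completely unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

By combining deep potential molecular dynamics with large-scale simulations, the study locates the liquid-vapor critical point of aluminum at a temperature of 6531-6576 K, a density of 0.637 g/cm$^{3}$, and a pressure of 1.6 kbar, resolving a long-standing uncertainty.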

Abstract

The precise location of the liquid-vapor critical point (CP) in aluminum has remained elusive for decades, with reported critical temperatures spanning nearly 4000 K. Here we resolve this long-standing uncertainty by combining deep potential molecular dynamics with large-scale simulations trained on high-fidelity electronic-structure data. We benchmark multiple exchange-correlation functionals against experimental liquid densities and identify PBEsol as providing the most consistent description. Using complementary approaches – spinodal analysis of the equation of state and direct coexistence simulations with Gaussian mixture phase identification – we converge on a critical temperature of 6531-6576 K, a critical density of 0.637 g/cm$^{3}$, and a critical pressure of 1.6 kbar. The precision of these values, with uncertainties of $\sim$50 K in temperature, represents a step change over previous estimates. Our framework establishes a transferable strategy for predicting critical phenomena in metals, with implications for laser ablation, shock compression, and planetary modeling under extreme conditions.

Keywords: liquid-vapor critical point, aluminum, deep potential molecular dynamics, large-scale simulations, spinodal analysis, equation of state, critical temperature, critical density
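
Spinodal analysis of an equation of state, one of the two approaches named in the abstract, locates the states at which the isothermal derivative of pressure with respect to volume vanishes; the two spinodal branches merge at the critical point. The sketch below illustrates the idea on the reduced van der Waals equation of state, which is a textbook stand-in chosen here for illustration and not the equation of state obtained in the paper. The scan range and step are likewise arbitrary.

```python
def dp_dv(v: float, t: float) -> float:
    """dP/dv for the reduced van der Waals pressure P = 8T/(3v - 1) - 3/v**2."""
    return -24.0 * t / (3.0 * v - 1.0) ** 2 + 6.0 / v ** 3


def spinodal_volumes(t: float, v_min: float = 0.4, v_max: float = 5.0,
                     step: float = 1e-3) -> list[float]:
    """Scan v for sign changes of dP/dv and refine each root by bisection."""
    roots = []
    n = int(round((v_max - v_min) / step))
    for k in range(n):
        a, b = v_min + k * step, v_min + (k + 1) * step
        fa, fb = dp_dv(a, t), dp_dv(b, t)
        if fa * fb < 0.0:
            for _ in range(60):
                mid = 0.5 * (a + b)
                f_mid = dp_dv(mid, t)
                if fa * f_mid <= 0.0:
                    b = mid
                else:
                    a, fa = mid, f_mid
            roots.append(0.5 * (a + b))
    return roots


# Reduced temperatures below and above the critical value T = 1.
for t in (0.90, 0.95, 0.99, 1.05):
    print(t, [round(v, 3) for v in spinodal_volumes(t)])
```

Below the reduced critical temperature the scan finds two spinodal volumes that approach each other as T rises towards 1, and above it none remain. Tracking where the two branches meet is what pins down the critical temperature and density in an analysis of this kind.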

321. ❌ CovAngelo: A hybrid quantum-classical computing platform for accurate and scalable drug discovery

Authors: Linn Evenseth, Kamil Galewski, Witold Jarnicki, Piero Lafiosca, Vyom N. Patel, Grzegorz Rajchel-Mieldzioć, Martin Šimka, Michał Szczepanik, Emil Żak Journal/Source: arxiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10487v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

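Scoring rationale: The paper focuses on a hybrid quantum-classical computing platform for drug discovery, in particular the simulation of chemical reactions with a QM/QM/MM multiscale embedding model. It is unrelated to most of the keywords (large models, training methods, inference optimization, agent systems, and so on), which target deep learning and large language model technology. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper belongs to the application of AI in science (specifically drug discovery and cheminformatics), so it receives 10 points (highly relevant).

!!! tip deepseek-chat TL;DR

The paper presents CovAngelo, a hybrid quantum-classical computing platform for high-accuracy simulation of ligand-protein binding reactions in drug discovery, and demonstrates it by modeling the covalent docking of zanubrutinib to Bruton's tyrosine kinase at a reduced computational cost relative to existing methods.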

Abstract

We present a computational platform for modeling chemical reactions in complex molecular environments, focused on ligand-protein binding in drug discovery. The platform implements our new quantum-in-quantum-in-classical (QM/QM/MM) multiscale embedding model that integrates molecular dynamics with a quantum-information-enhanced density matrix embedding theory and quantum chemistry solvers, including explicit solvent. Quantum-information metrics are utilized to generate entanglement-consistent orbitals, enabling a high-accuracy description of strongly correlated regions. The framework supports multiple computational backends, including multi-CPU, NVIDIA multi-GPU architectures, and quantum hardware (IQM, IonQ, IBM) integrated under CUDA-Q, and is designed for compatibility with future fault-tolerant quantum systems. The new platform’s capabilities are demonstrated by modeling covalent docking of zanubrutinib to Bruton’s tyrosine kinase via a Michael addition mechanism, computing the full reaction energy profiles and energy barriers at a reduced computational cost relative to existing methods. As a 2nd-generation anticancer agent, zanubrutinib serves as a proof of concept for covalent inhibitor discovery. Accurate first-principles reaction barrier estimations provided by our method can contribute to reducing false positive and negative rates in drug discovery pipelines. Scalability is validated through benchmarks on GPU clusters and cloud-based CPU infrastructures. We demonstrate integration with quantum devices (up to 20 qubits), alongside resource estimates for fault-tolerant quantum computing, indicating potential speedups of up to 20x. Beyond single reactions, the platform supports the construction of reaction networks in chemical metric space, facilitating ligand screening and systematic exploration of reactive pathways.

Keywords: quantum-classical computing, drug discovery, QM/QM/MM multiscale embedding, ligand-protein binding, covalent docking, reaction energy profiles, quantum hardware integration, scalability benchmarks

322. ❌ Symplectic Constraints in Classical Reaction Dynamics: From Gromov’s Camel to Reaction Rates

Authors: Stephen Wiggins Journal/Source: arxiv Published: 2026-04-12 arXiv link: http://arxiv.org/abs/2604.10408v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

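Scoring rationale: The paper studies symplectic-geometric constraints in classical reaction dynamics, using mathematical tools such as symplectic topology, Gromov's non-squeezing theorem, and Poincaré-Birkhoff normal form theory to analyze phase-space structures and reaction bottlenecks; it belongs to theoretical and mathematical physics. All of the scoring keywords concern large models, deep learning, and related techniques (training methods, inference optimization, alignment, applications, and so on), whereas the paper involves no artificial intelligence or machine learning, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper investigates whether geometric concepts from symplectic topology, in particular Gromov's non-squeezing theorem and symplectic capacity, can offer a new geometric perspective on reaction bottlenecks and mode selectivity near an index-1 saddle in classical reaction dynamics. Its numerical illustrations are consistent with the idea that biasing the initial ensemble toward the high-action boundaries of the bath modes can induce a severe finite-time dynamical delay.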

Abstract

We investigate whether ideas from symplectic topology, in particular Gromov’s non-squeezing theorem and symplectic capacity, can provide useful geometric insight into classical reaction dynamics near an index-1 saddle. Using Poincaré-Birkhoff normal form theory, we describe the phase-space structures that organize transport through the transition-state region, including dividing surfaces, normally hyperbolic invariant manifolds (NHIMs), and the associated bath-action geometry. For quadratic saddle-center and saddle-center-center models, the normal-form geometry identifies natural bath-action area scales associated with the reactive bottleneck. For anharmonic systems (Eckart-Morse and Eckart-Morse-Morse), we formulate corresponding candidate symplectic width scales – based on transverse bath actions – using high-order normal forms for bounded local neighborhoods associated with the reaction bottleneck near the saddle. We then present two numerical illustrations: the backward propagation of a locally coupled phase-space ball to examine linear non-squeezing behavior, and a bath-localized ensemble calculation in an anharmonic normal-form model. These computations are consistent with the idea that heavily biasing the initial phase-space distribution of an ensemble toward the high-action boundaries of the bath modes can induce a severe finite-time dynamical delay, influencing reactivity in ways not captured by total phase-space volume or flux alone. The results suggest a new geometric perspective on mode selectivity and reaction bottlenecks, while highlighting open mathematical questions concerning the precise relation between these candidate width scales and genuine symplectic capacities of suitably defined reactive neighborhoods.

Keywords: symplectic topology, Gromov’s non-squeezing theorem, classical reaction dynamics, phase-space structures, reaction bottlenecks, normal-form theory, mode selectivity, bath-action geometry

Token usage statistics

Total: 1,044,681 tokens (input 720,342 / output 324,339)

Model	Input	Output	Total
deepseek-chat	579,621	324,339	903,960
glm-4.7	140,721	0	140,721

📋 All papers

1. ✅ OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

2. ✅ UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

3. ✅ Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval

4. ✅ Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language

5. ✅ Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

6. ✅ METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

7. ✅ CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

8. ✅ CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation

9. ✅ Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks

10. ✅ ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

11. ✅ DreamKG: A KG-Augmented Conversational System for People Experiencing Homelessness

12. ✅ Efficient Training for Cross-lingual Speech Language Models

13. ❌ Dynamic Summary Generation for Interpretable Multimodal Depression Detection

14. ❌ Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

15. ❌ Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems

16. ❌ Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

17. ❌ When Meaning Isn’t Literal: Exploring Idiomatic Meaning Across Languages and Modalities

18. ❌ Generating Multiple-Choice Knowledge Questions with Interpretable Difficulty Estimation using Knowledge Graphs and Large Language Models

19. ❌ BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

20. ❌ Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

21. ❌ S$^3$: Structured Sparsity Specification

22. ❌ CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

23. ❌ Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics

24. ❌ Hierarchical Textual Knowledge for Enhanced Image Clustering

25. ❌ Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation

26. ❌ Towards Situation-aware State Modeling for Air Traffic Flow Prediction

27. ❌ Fairness is Not Flat: Geometric Phase Transitions Against Shortcut Learning

28. ❌ A Deep Equilibrium Network for Hyperspectral Unmixing

29. ❌ Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems

30. ❌ Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

31. ❌ Detecting Safety Violations Across Many Agent Traces

32. ❌ A Mechanistic Analysis of Looped Reasoning Language Models

33. ❌ C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

34. ❌ Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net

35. ❌ ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

36. ❌ GenTac: Generative Modeling and Forecasting of Soccer Tactics

37. ❌ ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

38. ❌ General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

39. ❌ Efficient KernelSHAP Explanations for Patch-based 3D Medical Image Segmentation

40. ❌ Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

41. ❌ StarVLA-$α$: Reducing Complexity in Vision-Language-Action Systems

42. ❌ Grounded World Model for Semantically Generalizable Planning

43. ❌ Discourse Diversity in Multi-Turn Empathic Dialogue

44. ❌ Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

45. ❌ Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

46. ❌ Endogenous Information in Routing Games: Memory-Constrained Equilibria, Recall Braess Paradoxes, and Memory Design

47. ❌ Evaluating Cooperation in LLM Social Groups through Elected Leadership

48. ❌ On the Robustness of Watermarking for Autoregressive Image Generation

49. ❌ SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

50. ❌ A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment

51. ❌ Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning

52. ❌ AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

53. ❌ NetworkNet: A Deep Neural Network Approach for Random Networks with Sparse Nodal Attributes and Complex Nodal Heterogeneity

54. ❌ Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

55. ❌ Beyond LLMs, Sparse Distributed Memory, and Neuromorphics <A Hyper-Dimensional SRAM-CAM “VaCoAl” for Ultra-High Speed, Ultra-Low Power, and Low Cost>

56. ❌ Why Do Large Language Models Generate Harmful Content?

57. ❌ Towards Autonomous Mechanistic Reasoning in Virtual Cells

58. ❌ RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

59. ❌ CodeTracer: Towards Traceable Agent States

60. ❌ RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

61. ❌ SCNO: Spiking Compositional Neural Operator – Towards a Neuromorphic Foundation Model for Nuclear PDE Solving