📊 ArXiv Research Report (2026-04-08)

Generated: 2026-04-08 09:51:50 Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	primary
“Scaling Laws” AND “Data Quality”	1.0	primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	primary
“Context Window Extension” OR “Long Context LLMs”	1.0	primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	primary
“World Models” AND “General World Models”	1.0	primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	primary
“In-context Learning” OR “Many-shot Learning”	1.0	primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
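
As a quick check of the passing-score formula above, the following minimal sketch recomputes the threshold from the 27 keyword weights. The function and parameter names (`passing_score`, `base`, `factor`) are illustrative assumptions and are not taken from the tool that produced this report.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing threshold as stated in the report: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


# 27 keywords, each with weight 1.0, as listed in the configuration table.
weights = [1.0] * 27
threshold = passing_score(sum(weights))
print(round(threshold, 1))
```

With a total weight of 27.0 this reproduces the reported threshold of 26.6.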

📈 Paper statistics

Total fetched: 307 papers
Passing papers: 14 (4.6%)

⭐ Detailed analysis of passing papers

1. Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents

Authors: Yizhou Liu, Qi Sun, Yulin Chen, Siyue Zhang, Chen Zhao Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04651v1

Score: 74.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
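
The total score appears to be the sum of the per-keyword scores in the table above. The sketch below recomputes it for this paper under the assumption that each keyword score is weight × relevance, capped at the configured per-keyword maximum of 15; the report does not spell out this rule, and `total_score` is a hypothetical helper.

```python
MAX_PER_KEYWORD = 15.0  # "Maximum score per keyword" from the scoring settings
PASSING_SCORE = 26.6    # current passing score from the scoring settings


def total_score(rows):
    """Sum weight * relevance over (weight, relevance) pairs, capped per keyword (assumed rule)."""
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)


# Non-zero rows of the table above: LLMs, SLMs, SFT, RAG, CoT, Agents, Tool Use, Hallucination.
paper1_rows = [
    (1.0, 10.0), (1.0, 10.0), (1.0, 8.0), (1.0, 10.0),
    (1.0, 8.0), (1.0, 10.0), (1.0, 10.0), (1.0, 8.0),
]
score = total_score(paper1_rows)
passed = score >= PASSING_SCORE
print(score, passed)
```

Rows with a relevance of 0 contribute nothing, so they are omitted from `paper1_rows`.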

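Scoring rationale: The paper's core contribution is a lightweight fine-tuning method (\policy) that trains small language models (SLMs) to act as effective search agents, addressing their under-use of search tools and their tendency to hallucinate on complex multi-hop reasoning tasks. It is therefore highly relevant to “Small Language Models (SLMs)”, “LLM Agents”, “Tool Use”, and “Retrieval-Augmented Generation (RAG)” (10 points each). The paper also covers distilling agentic behavior from large language models (LLMs), so it is highly relevant to “Large Language Models (LLMs)” (10 points). The proposed fine-tuning method falls under supervised fine-tuning (SFT), making it relevant to “Post-training/SFT” (8 points). The task involves multi-hop reasoning, which relates to “Chain of Thought/CoT Reasoning” (8 points). One goal is to reduce hallucination, which relates to “Hallucination Mitigation” (8 points). Other keywords such as MoE, Scaling Laws, and RLHF are not mentioned in the paper or are only marginally related, and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the problem that small language models (SLMs), when used as search agents, invoke search tools too rarely and are prone to hallucination. It proposes a lightweight fine-tuning method (\policy) that improves on agent distillation from LLMs by 17.3 points on Bamboogle and 15.3 points on HotpotQA, reaching performance comparable to large language models (LLMs) on complex multi-hop reasoning tasks.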

Abstract (Chinese translation)

配备搜索工具的智能体已成为解决知识密集型任务的有效方案。尽管大语言模型展现出强大的推理能力，但其高昂的计算成本限制了搜索智能体的实际部署。因此，近期研究聚焦于将大语言模型的智能体行为蒸馏至小语言模型中。通过对复杂多跳推理任务的综合评估，我们发现尽管小语言模型具备的参数知识较少，但其调用搜索工具的频次更低，且更容易产生幻觉。为解决这一问题，我们提出\policy方法——一种轻量级微调策略，通过显式训练小语言模型，使其能够基于检索证据进行可靠检索并生成答案。相较于从大语言模型进行智能体蒸馏的方法，我们的方案在Bamboogle数据集上提升了17.3分，在HotpotQA数据集上提升了15.3分，在各基准测试中均达到大语言模型级别的性能。进一步分析表明，小语言模型中的自适应搜索策略往往会导致性能下降，这凸显了稳定搜索行为对可靠推理的必要性。

Abstract

Agents equipped with search tools have emerged as effective solutions for knowledge-intensive tasks. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their high computational cost limits practical deployment for search agents. Consequently, recent work has focused on distilling agentic behaviors from LLMs into Small Language Models (SLMs). Through comprehensive evaluation on complex multi-hop reasoning tasks, we find that despite possessing less parametric knowledge, SLMs invoke search tools less frequently and are more prone to hallucinations. To address this issue, we propose \policy, a lightweight fine-tuning approach that explicitly trains SLMs to reliably retrieve and generate answers grounded in retrieved evidence. Compared to agent distillation from LLMs, our approach improves performance by 17.3 scores on Bamboogle and 15.3 scores on HotpotQA, achieving LLM-level results across benchmarks. Our further analysis reveals that adaptive search strategies in SLMs often degrade performance, highlighting the necessity of consistent search behavior for reliable reasoning.

Keywords: Small Language Models (SLMs), Search Agents, Tool Use, Retrieval-Augmented Generation, Multi-hop Reasoning, Hallucination Mitigation, Fine-tuning, Agent Distillation

2. PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised F

Authors: Madhav S Baidya Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04565v1

Score: 70.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

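Scoring rationale: The paper studies epistemic calibration of LLMs in question answering, using supervised fine-tuning (SFT) to implement a three-action framework (Answer, Ask, Abstain). It directly involves the keywords LLMs, SFT, RAG, and hallucination mitigation, and relates indirectly to reasoning, alignment, self-reflection, and agents. Other keywords such as MoE, quantization, and AI for Science are not mentioned in the paper.

!!! tip deepseek-chat TL;DR

The paper addresses LLMs' tendency to give overconfident, hallucinated answers to queries with incomplete information. It proposes PassiveQA, a three-action framework that achieves epistemic calibration through supervised fine-tuning, significantly improving macro F1 and abstention recall on question-answering tasks while reducing the hallucination rate.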

Abstract (Chinese translation)

大语言模型（LLMs）在问答和检索增强生成（RAG）任务中表现出色，但它们通常隐含地假设用户查询是完整且可回答的。在实际场景中，查询往往不完整、模糊或缺失关键变量，导致模型产生过度自信或虚构的回应。
本研究探讨了信息不完整情况下的决策感知查询解析问题，即模型必须决定是直接回答、请求澄清还是选择弃答。我们发现，标准及增强的RAG系统均未能可靠地展现这种认知意识，即使在信息不足时也倾向于生成答案。
为解决这一问题，我们提出了PassiveQA框架，该框架通过监督微调将模型行为与信息充分性对齐，包含回答、询问和弃答三种决策。我们的方法整合了结构化信息状态表示、基于知识图谱的上下文，以及一个经过微调的规划器，该规划器能显式建模缺失变量和决策推理过程。
在多个问答数据集上的实验表明，在计算受限的训练条件下，经过微调的规划器在宏观F1分数和弃答召回率上均取得显著提升，同时降低了幻觉生成率。
这些结果为以下观点提供了有力的实证依据：认知决策能力必须在训练过程中习得，而非仅在推理阶段强行施加。

Abstract

Large Language Models (LLMs) have achieved strong performance in question answering and retrieval-augmented generation (RAG), yet they implicitly assume that user queries are fully specified and answerable. In real-world settings, queries are often incomplete, ambiguous, or missing critical variables, leading models to produce overconfident or hallucinated responses. In this work, we study decision-aware query resolution under incomplete information, where a model must determine whether to Answer, Ask for clarification, or Abstain. We show that standard and enhanced RAG systems do not reliably exhibit such epistemic awareness, defaulting to answer generation even when information is insufficient. To address this, we propose PassiveQA, a three-action framework that aligns model behaviour with information sufficiency through supervised finetuning. Our approach integrates structured information-state representations, knowledge graph-grounded context, and a finetuned planner that explicitly models missing variables and decision reasoning. Experiments across multiple QA datasets show that the finetuned planner achieves significant improvements in macro F1 and abstention recall while reducing hallucination rates, under a compute-constrained training regime. These results provide strong empirical evidence that epistemic decision-making must be learned during training rather than imposed at inference time.

Keywords: Large Language Models, Supervised Fine-tuning, Retrieval-Augmented Generation, Hallucination Mitigation, Epistemic Calibration, Question Answering, Three-action Framework, Information Sufficiency
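
To make the three-action idea in the abstract concrete, here is a toy rule-based sketch of choosing between Answer, Ask, and Abstain based on information sufficiency. It is not the paper's fine-tuned planner: the `decide` function, its arguments, and the rule that a gap is "askable" are all assumptions introduced for illustration.

```python
from __future__ import annotations

from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ASK = "ask"
    ABSTAIN = "abstain"


def decide(required: set[str], known: set[str], askable: set[str]) -> Action:
    """Pick an action from which required variables are still missing.

    required: variables the question needs (hypothetical information state)
    known:    variables supplied by the query or retrieved context
    askable:  missing variables the user could plausibly provide
    """
    missing = required - known
    if not missing:
        return Action.ANSWER   # information is sufficient
    if missing <= askable:
        return Action.ASK      # every gap can be closed by a clarifying question
    return Action.ABSTAIN      # some gap cannot be resolved


print(decide({"city", "date"}, {"city"}, {"date"}))
```

The abstract argues that this kind of decision has to be learned during training rather than imposed at inference time, so a hand-written rule like this serves only to show the decision interface.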

3. GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

Authors: Xinyu Geng, Yanjing Xiao, Yuyang Zhang, Hanwen Wang, Xinyan Liu, Rui Min, Tianqing Fang, Yi R. Fung Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04017v1

Score: 66.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	15.0/10	15.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	15.0/10	15.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

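Scoring rationale: The paper's core topic is multi-step tool use and reasoning by agents, applied to geolocation tasks that combine visual cues with open-web evidence. It is therefore highly relevant to “LLM Agents/Autonomous Agents/Agentic Workflow” and “Tool Use/Function Calling/API Tool Use” (15 points each), which are central to the paper. It involves multi-step reasoning and verification, so it is highly relevant to “Chain of Thought/CoT Reasoning/Multi-step Reasoning” and “System 2 Thinking/Slow Thinking/In-depth Reasoning” (10 points each). The agentic workflow involves knowledge retrieval and integration, which gives some relevance to “Retrieval-Augmented Generation/RAG/Retrieval-Generation” (8 points). The paper's setting involves agents driven by large models, which gives some relevance to “Large Language Models/LLMs/Foundation Models” (8 points). Other keywords (such as MoE, SFT, quantization, and AI for Science) are not directly addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper introduces GeoBrowse, a geolocation benchmark for evaluating how agents combine ambiguous visual cues with open-web knowledge through multi-step tool use and reasoning. It also develops an agentic workflow called GATE, which experiments show outperforms direct inference and search-only or image-only setups.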

Abstract (Chinese translation)

深度研究智能体通过多步骤工具使用整合碎片化证据。BrowseComp为此类智能体提供了纯文本测试平台，但现有多模态基准很少同时要求弱视觉线索组合与BrowseComp风格的多跳验证。地理定位是一个天然测试场，因为其答案依赖于组合多个模糊视觉线索并通过开放网络证据进行验证。为此，我们推出GeoBrowse——一个将视觉推理与知识密集型多跳查询相结合的地理定位基准。第一层级测试碎片化视觉线索的提取与组合，第二层级通过注入长尾知识和混淆关键实体来提升查询难度。为支持评估，我们提供了智能体工作流程GATE，包含五个图像思维工具和四个知识密集型工具，并发布了基于可验证证据的专家标注逐步轨迹，用于轨迹级分析。实验表明GATE优于直接推理和开源智能体，这表明无工具、仅搜索或仅图像的设置均不充分。性能提升源于连贯的、针对特定层级的工具使用规划而非更多工具调用，因为这些规划能更可靠地抵达标注的关键证据步骤，并在整合至最终决策时产生更少错误。GeoBrowse基准与代码发布于https://github.com/ornamentt/GeoBrowse。

Abstract

Deep research agents integrate fragmented evidence through multi-step tool use. BrowseComp offers a text-only testbed for such agents, but existing multimodal benchmarks rarely require both weak visual cues composition and BrowseComp-style multi-hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them with open-web evidence. Thus, we introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries. Level 1 tests extracting and composing fragmented visual cues, and Level 2 increases query difficulty by injecting long-tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow GATE with five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable evidence for trajectory-level analysis. Experiments show that GATE outperforms direct inference and open-source agents, indicating that no-tool, search-only or image-only setups are insufficient. Gains come from coherent, level-specific tool-use plans rather than more tool calls, as they more reliably reach annotated key evidence steps and make fewer errors when integrating into the final decision. The GeoBrowse benchmark and codes are provided in https://github.com/ornamentt/GeoBrowse

Keywords: Geolocation Benchmark, Agentic Tool Use, Multi-step Reasoning, Visual Reasoning, Knowledge-intensive Queries, Agentic Workflow, Expert-annotated Traces, Multi-hop Verification

4. How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Authors: Gregory N. Frank Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04385v1

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	15.0/10	15.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

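Scoring rationale: The paper studies a sparse routing mechanism in alignment-trained language models and focuses directly on alignment and mechanistic interpretability, so these two keywords receive the highest scores (15 and 10). It explicitly studies alignment-trained (Post-training/SFT) models and analyzes the internal mechanisms of large language models (LLMs), so these two keywords receive 10 points each; RLHF is involved only as one alignment method and receives 5 points. The sparse routing mechanism has some connection to sparse models (5 points). The paper discusses the difference between pretraining and post-training, giving the pre-training keyword 5 points. It studies refusal of harmful content, which has some connection to factuality/truthfulness (5 points). Other keywords such as small models, quantization, inference acceleration, and AI for Science are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

By identifying a sparse routing mechanism in alignment-trained language models, the paper shows how models detect harmful content and trigger refusal responses. Its experiments across 9 models from 6 labs indicate that the mechanism recurs across models, remains detectable at scale, and can be controlled.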

Abstract (Chinese translation)

我们发现，在对齐训练的语言模型中存在一种反复出现的稀疏路由机制：一个门控注意力头读取检测到的内容，并触发下游的放大器头，从而增强拒绝信号。通过政治审查和安全拒绝作为自然实验，我们在来自6个实验室的9个模型中追踪了这一机制，所有模型均在120组提示对语料库上进行了验证。该门控头通过了必要性和充分性互换测试（p < 0.001，置换零假设），核心放大器头在自助重采样下保持稳定（Jaccard指数0.92-1.0）。三个同代缩放模型对显示，路由机制在规模扩展时分布更广（消融后效果减弱高达17倍），同时仍可通过互换检测识别。通过调节检测层信号，我们能够连续控制策略强度，从强硬拒绝到引导转向事实遵从，其路由阈值随主题而变化。该电路还揭示了意图识别与策略路由之间的结构分离：在密文编码下，门控头的路由贡献崩溃（Phi-4模型中下降78%，n=120），而模型则以解谜而非拒绝的方式响应。即使深层探针分数表明模型已开始表征有害内容，该路由机制也从未激活。这种不对称性与预训练和后训练的不同鲁棒性特征一致：预训练形成广泛的语义理解，而后训练产生的策略绑定则更狭窄，在输入变换下泛化能力较弱。

Abstract

We identify a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads detected content and triggers downstream amplifier heads that boost the signal toward refusal. Using political censorship and safety refusal as natural experiments, we trace this mechanism across 9 models from 6 labs, all validated on corpora of 120 prompt pairs. The gate head passes necessity and sufficiency interchange tests (p < 0.001, permutation null), and core amplifier heads are stable under bootstrap resampling (Jaccard 0.92-1.0). Three same-generation scaling pairs show that routing distributes at scale (ablation up to 17x weaker) while remaining detectable by interchange. By modulating the detection-layer signal, we continuously control policy strength from hard refusal through steering to factual compliance, with routing thresholds that vary by topic. The circuit also reveals a structural separation between intent recognition and policy routing: under cipher encoding, the gate head’s routing contribution collapses (78% in Phi-4 at n=120) while the model responds with puzzle-solving rather than refusal. The routing mechanism never fires, even though probe scores at deeper layers indicate the model begins to represent the harmful content. This asymmetry is consistent with different robustness properties of pretraining and post-training: broad semantic understanding versus narrower policy binding that generalizes less well under input transformation.

Keywords: alignment, language models, sparse routing, policy circuits, interpretability, refusal mechanism, gate attention head, control
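
The abstract reports that the core amplifier heads are stable under bootstrap resampling, with a Jaccard index of 0.92-1.0. As a reminder of what that statistic measures, here is a minimal sketch of the Jaccard index between two sets of attention heads. The (layer, head) tuples are made-up placeholders, not heads reported in the paper.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical (layer, head) identifiers from a full run and from one resample.
reference = {(12, 3), (14, 7), (15, 1)}
resample = {(12, 3), (14, 7)}
print(jaccard(reference, resample))
```

A value near 1.0 means that nearly the same heads are identified on each resample.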

5. Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Fra

Authors: Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04852v1

Score: 61.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

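Scoring rationale: The paper studies the integrity and reliability of Chain-of-Thought reasoning in LLMs. It proposes a structured prompt framework that strengthens reasoning control, reduces hallucination, and improves interpretability, and applies it to cybersecurity analysis. It is therefore highly relevant to LLMs, CoT reasoning, hallucination mitigation, and explainable AI (8-10 points); somewhat relevant to System 2 thinking, self-correction, and in-context learning (5-8 points); and partially relevant to small models, since local deployment is mentioned (5 points). Other keywords such as MoE, scaling laws, training methods, agents, and quantization are not covered (0 points).

!!! tip deepseek-chat TL;DR

The study proposes a structured prompt-engineering framework that strengthens the integrity and control of Chain-of-Thought reasoning in order to reduce LLM hallucination in security-sensitive tasks. It improves the reliability and interpretability of cybersecurity threat detection, with reasoning improvements of up to 40% in smaller models.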

Abstract (Chinese translation)

思维链提示已被用于增强大语言模型的推理能力。然而，其在安全敏感分析任务中的可靠性仍未得到充分检验，尤其是在结构化人工评估下。模型缩放和微调等替代方法可用于帮助提升性能，但这些方法通常成本高昂、计算密集或难以审计。相比之下，提示工程为引导大语言模型推理提供了一种轻量、透明且可控的机制。本研究提出了一个结构化提示工程框架，旨在增强思维链推理的完整性，同时提升本地大语言模型部署中安全威胁与攻击检测的可靠性。该框架包含16个要素，归入四个核心维度：(1) 上下文与范围控制，(2) 证据锚定与可追溯性，(3) 推理结构与认知控制，以及(4) 安全特异性分析约束。该框架并非启发式地优化提示措辞，而是引入了显式的推理控制机制，以减轻幻觉、防止推理漂移，并增强安全敏感场景下的可解释性。以软件定义网络流量中的DDoS攻击检测为案例研究，在结构化与非结构化提示条件下对多个模型家族进行了评估。帕累托前沿分析与消融实验表明，该方法带来了持续的推理改进（在较小模型中提升高达40%）以及跨模型规模的稳定准确率增益。具有高度评分者间一致性（Cohen’s k > 0.80）的人工评估证实了其鲁棒性。研究结果确立了结构化提示作为一种有效且实用的方法，可用于实现可靠且可解释的人工智能驱动网络安全分析。

Abstract

Chain-of-Thought (CoT) prompting has been used to enhance the reasoning capability of LLMs. However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation. Alternative approaches, such as model scaling and fine-tuning can be used to help improve performance. These methods are also often costly, computationally intensive, or difficult to audit. In contrast, prompt engineering provides a lightweight, transparent, and controllable mechanism for guiding LLM reasoning. This study proposes a structured prompt engineering framework designed to strengthen CoT reasoning integrity while improving security threat and attack detection reliability in local LLM deployments. The framework includes 16 factors grouped into four core dimensions: (1) Context and Scope Control, (2) Evidence Grounding and Traceability, (3) Reasoning Structure and Cognitive Control, and (4) Security-Specific Analytical Constraints. Rather than optimizing the wording of the prompt heuristically, the framework introduces explicit reasoning controls to mitigate hallucination and prevent reasoning drift, as well as strengthening interpretability in security-sensitive contexts. Using DDoS attack detection in SDN traffic as a case study, multiple model families were evaluated under structured and unstructured prompting conditions. Pareto frontier analysis and ablation experiments demonstrate consistent reasoning improvements (up to 40% in smaller models) and stable accuracy gains across scales. Human evaluation with strong inter-rater agreement (Cohen’s k > 0.80) confirms robustness. The results establish structured prompting as an effective and practical approach for reliable and explainable AI-driven cybersecurity analysis.

Keywords: Chain-of-Thought, LLMs, structured prompting, reasoning integrity, hallucination mitigation, cybersecurity analysis, explainable AI, local deployment
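
The abstract cites inter-rater agreement of Cohen's k > 0.80 for the human evaluation. For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa for two raters from scratch. The labels are invented sample data, not ratings from the study.

```python
from collections import Counter


def cohens_kappa(rater1, rater2) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    labels = set(counts1) | set(counts2)
    expected = sum(counts1[label] * counts2[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two raters label ten reasoning traces; they disagree on the last one.
r1 = ["attack", "attack", "benign", "benign", "attack",
      "benign", "benign", "attack", "attack", "benign"]
r2 = r1[:-1] + ["attack"]
kappa = cohens_kappa(r1, r2)
print(round(kappa, 2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually lower than the simple percentage of matching labels.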

6. MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

Authors: Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong, Steve Scargall, Charles Fan Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04853v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

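Scoring rationale: The paper's core topic is a memory system for LLM agents. It is highly relevant to “Large Language Models”, “LLM Agents”, and “Retrieval-Augmented Generation” (10 points each), since it explicitly studies LLM agents, uses RAG, and improves its retrieval. It has some relevance to “Context Window Extension”, “Chain of Thought”, “System 2 Thinking”, and “Hallucination Mitigation” (5 points each), since it involves long-context reasoning, multi-step query strategies, and fact preservation (reducing hallucination). Other keywords such as MoE, quantization, and alignment are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses memory degradation in LLM agents over multi-session interactions. It proposes MemMachine, a memory system that preserves the ground truth of conversations and optimizes retrieval strategies, achieving a balance of high accuracy and efficiency across multiple benchmarks.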

Abstract (Chinese translation)

大型语言模型（LLM）智能体需要持久性记忆来维持个性化、事实连续性及长程推理能力，然而标准上下文窗口与检索增强生成（RAG）流程在多轮会话交互中会出现性能衰退。本文提出MemMachine，一个开源记忆系统，其在保持真实对话原貌的架构中整合了短期记忆、长期情景记忆与用户画像记忆，通过存储完整对话片段并减少基于LLM的有损信息提取来实现优化。MemMachine采用情境化检索技术，通过扩展核心匹配项及其周边上下文，有效提升了相关证据跨越多轮对话时的召回率。在多项基准测试中，MemMachine实现了优异的精度-效率平衡：在LoCoMo基准上使用gpt4.1-mini达到0.9169；在LongMemEvalS（ICLR 2025）的六维度消融实验中取得93.0%的准确率，其中检索阶段优化——包括检索深度调优（+4.2%）、上下文格式化（+2.0%）、搜索提示设计（+1.8%）和查询偏差校正（+1.4%）——显著优于摄入阶段改进如句子分块（+0.8%）。当配合优化提示时，GPT-5-mini较GPT-5提升2.6%，成为最具成本效益的配置方案。在与Mem0的对比中，MemMachine在同等条件下减少了约80%的输入令牌消耗。配套的检索智能体能够自适应地在直接检索、并行分解与迭代查询链策略间路由查询，在随机噪声环境下于HotpotQA-hard和WikiMultiHop数据集上分别达到93.2%和92.6%的准确率。这些结果表明，在保持情景真实性的基础上结合自适应检索层，能为个性化LLM智能体构建鲁棒且高效的长时记忆系统。

Abstract

Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over multi-session interactions. We present MemMachine, an open-source memory system that integrates short-term, long-term episodic, and profile memory within a ground-truth-preserving architecture that stores entire conversational episodes and reduces lossy LLM-based extraction. MemMachine uses contextualized retrieval that expands nucleus matches with surrounding context, improving recall when relevant evidence spans multiple dialogue turns. Across benchmarks, MemMachine achieves strong accuracy-efficiency tradeoffs: on LoCoMo it reaches 0.9169 using gpt4.1-mini; on LongMemEvalS (ICLR 2025), a six-dimension ablation yields 93.0 percent accuracy, with retrieval-stage optimizations – retrieval depth tuning (+4.2 percent), context formatting (+2.0 percent), search prompt design (+1.8 percent), and query bias correction (+1.4 percent) – outperforming ingestion-stage gains such as sentence chunking (+0.8 percent). GPT-5-mini exceeds GPT-5 by 2.6 percent when paired with optimized prompts, making it the most cost-efficient setup. Compared to Mem0, MemMachine uses roughly 80 percent fewer input tokens under matched conditions. A companion Retrieval Agent adaptively routes queries among direct retrieval, parallel decomposition, or iterative chain-of-query strategies, achieving 93.2 percent on HotpotQA-hard and 92.6 percent on WikiMultiHop under randomized-noise conditions. These results show that preserving episodic ground truth while layering adaptive retrieval yields robust, efficient long-term memory for personalized LLM agents.

Keywords: LLM agents, memory system, retrieval-augmented generation, personalized AI, ground-truth preservation, contextualized retrieval, long-term memory, adaptive retrieval
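
The abstract describes contextualized retrieval as expanding nucleus matches with surrounding context. One plausible reading is sketched below: each matched dialogue turn is widened by a fixed window of neighbouring turns, and overlapping windows are merged. This is an illustrative assumption, not MemMachine's actual implementation; `expand_with_context` and `window` are hypothetical names.

```python
def expand_with_context(turns, match_indices, window=1):
    """Return matched turns plus `window` neighbours on each side, in original order."""
    keep = set()
    for i in match_indices:
        start = max(0, i - window)
        end = min(len(turns) - 1, i + window)
        keep.update(range(start, end + 1))
    return [turns[i] for i in sorted(keep)]


# Toy episode of six dialogue turns; suppose the retriever matched only turn 2.
turns = ["t0", "t1", "t2", "t3", "t4", "t5"]
print(expand_with_context(turns, [2], window=1))
```

Returning the neighbouring turns along with the match is what would allow evidence that spans several dialogue turns to be recalled together.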

7. DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

Authors: Jingyi Yang, Yuxian Jiang, Xuhao Hu, Shuang Cheng, Biqing Qi, Jing Shao Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04215v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
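
The total for this paper can be reproduced from the table. The sketch below assumes that each row's score is weight × relevance and that the passing threshold follows the formula stated in the report configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The function names are illustrative and are not the report generator's actual code.

```python
def total_score(rows):
    """Sum weight * relevance over (weight, relevance) rows."""
    return sum(weight * relevance for weight, relevance in rows)


def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold as stated in the report configuration."""
    return base + factor * total_weight


# DARE: five keywords at relevance 10.0, the remaining 22 at 0.0, all with weight 1.0.
dare_rows = [(1.0, 10.0)] * 5 + [(1.0, 0.0)] * 22

score = total_score(dare_rows)
threshold = passing_score(total_weight=27.0)
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```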

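Scoring rationale: The paper proposes DARE, a framework dedicated to post-training and evaluating diffusion large language models (dLLMs). Its core content matches several keywords closely: 1) it explicitly studies dLLMs, an emerging variant of LLMs, so it is highly relevant to "Large Language Models" (10 points); 2) it focuses on the post-training pipeline, including supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), and preference optimization (e.g., RLHF), so "Post-training/SFT", "PEFT/LoRA", "RLHF/DPO", and "Instruction Tuning/Alignment" are all highly relevant (10 points each). Other keywords such as MoE, SLMs, RAG, inference acceleration, and AI for science are either not covered or mentioned only as background, so they score 0.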

!!! tip deepseek-chat TL;DR

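To address the fragmented post-training ecosystem for diffusion large language models (dLLMs), the paper proposes DARE, a unified open framework that integrates supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and other post-training methods, and provides reproducible benchmark evaluation, speeding up dLLM research iteration and enabling fair comparison.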

Abstract

Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present DARE (dLLMs Alignment and Reinforcement Executor), an open framework for post-training and evaluating dLLMs. Built on top of verl and OpenCompass, DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results position DARE as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.

Keywords: Diffusion Large Language Models, dLLMs, Post-training, Alignment, Reinforcement Learning, Parameter-efficient Fine-tuning, Supervised Fine-tuning, Evaluation Framework

8. MERIT: Multilingual Expert-Reward Informed Tuning for Chinese-Centric Low-Resource Machine Translati

Authors: Zhixiang Lu, Chong Zhang, Chenyu Xue, Angelos Stefanidis, Chong Li, Jionglong Su, Zhengyong Jiang Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04839v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

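Scoring rationale: The paper focuses on low-resource machine translation and proposes the MERIT framework, which centers on supervised fine-tuning (SFT) and reward-based optimization (GRPO, akin to RLHF/DPO); it is therefore highly relevant to "Supervised Fine-tuning" (10 points) and relevant to "RLHF/DPO" (8 points). It uses large language models for translation, so it is relevant to "Large Language Models" (8 points). Its emphasis on data quality and targeted data curation gives it some connection to "Scaling Laws AND Data Quality" (5 points). The framework involves domain adaptation and semantic alignment, giving it some connection to "Pre-training/Domain Adaptation" and "Instruction Tuning/Alignment" (5 points each). Other keywords such as MoE, SLMs, RAG, and CoT are not mentioned in the abstract and score 0.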

!!! tip deepseek-chat TL;DR

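To address data scarcity and noise in machine translation from Chinese to low-resource Southeast Asian languages, the paper proposes the MERIT framework, which combines supervised fine-tuning with reward optimization and substantially improves translation quality, outperforming model scaling alone.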

Abstract

Neural machine translation (NMT) from Chinese to low-resource Southeast Asian languages remains severely constrained by the extreme scarcity of clean parallel corpora and the pervasive noise in existing mined data. This chronic shortage not only impedes effective model training but also sustains a large performance gap with high-resource directions, leaving millions of speakers of languages such as Lao, Burmese, and Tagalog with persistently low-quality translation systems despite recent advances in large multilingual models. We introduce Multilingual Expert-Reward Informed Tuning (MERIT), a unified translation framework that transforms the traditional English-centric ALT benchmark into a Chinese-centric evaluation suite for five Southeast Asian low-resource languages (LRLs). Our framework combines language-specific token prefixing (LTP) with supervised fine-tuning (SFT) and a novel group relative policy optimization (GRPO) guided by the semantic alignment reward (SAR). These results confirm that, in LRL→Chinese translation, targeted data curation and reward-guided optimization dramatically outperform mere model scaling.

Keywords: Machine Translation, Low-resource Languages, Supervised Fine-tuning, Reward Optimization, Chinese-centric, Southeast Asian Languages, Data Curation, Semantic Alignment
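
The MERIT abstract above names two mechanisms: language-specific token prefixing and GRPO guided by a reward. The sketch below illustrates both in their generic form and is not the authors' implementation. The `<2xx>` tag format, the function names, and the reward values are hypothetical, and the advantage formula is the commonly used group-normalized form (reward minus group mean, divided by group standard deviation), assumed here rather than confirmed by the abstract.

```python
from statistics import mean, pstdev


def add_language_prefix(source_text, target_lang):
    """Prepend a target-language tag; the tag format is hypothetical."""
    return f"<2{target_lang}> {source_text}"


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder source text; "lo" stands for Lao as an illustrative target language.
prompt = add_language_prefix("source sentence in Chinese", "lo")

# Hypothetical reward scores for four sampled translations of the same prompt.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
print(prompt)
print([round(a, 3) for a in advantages])
```

Translations scoring above the group mean get positive advantages and those below get negative ones, so no separate value model is needed to rank samples within a group.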

9. SODA: Semi On-Policy Black-Box Distillation for Large Language Models

Authors: Xiwen Chen, Jingjing Wang, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hejian Sang, Zhipeng Wang, Alborz Geramifard, Feng Luo Journal/Source: arXiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03873v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

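Scoring rationale: The paper's core topic is knowledge distillation (KD) for large language models (LLMs). It proposes SODA (Semi On-policy Distillation with Alignment), a framework for efficiently transferring knowledge from a large teacher model to a small student model (SLM). It is therefore highly relevant to "Large Language Models", "Small Language Models", "Post-training" (distillation is a post-training technique), and "Alignment" (the paper emphasizes distribution alignment), at 10 points each. Other keywords such as MoE, Scaling Laws, RAG, and inference acceleration are not covered and score 0.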

!!! tip deepseek-chat TL;DR

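The paper proposes SODA, a semi on-policy black-box knowledge distillation method that addresses the trade-off between efficiency and effectiveness when transferring knowledge from large language models to small ones. It matches or outperforms existing methods on multiple benchmarks while training 10 times faster, using 27% less peak GPU memory, and eliminating the instability of adversarial training.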

Abstract

Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student’s inherent errors. Fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model’s natural, zero-shot responses are almost strictly inferior to the powerful teacher’s targets, we can construct a highly effective contrastive signal simply by pairing the teacher’s optimal response with a one-time static snapshot of the student’s outputs. This demonstrates that exposing the small student to its own static inferior behaviors is sufficient for high-quality distribution alignment, eliminating the need for costly dynamic rollouts and fragile adversarial balancing. Extensive evaluations across four compact Qwen2.5 and Llama-3 models validate this semi on-policy paradigm. SODA matches or outperforms the state-of-the-art methods on 15 out of 16 benchmark results. More importantly, it achieves this superior distillation quality while training 10 times faster, consuming 27% less peak GPU memory, and completely eliminating adversarial instability.

Keywords: Knowledge Distillation, Large Language Models, Small Language Models, Semi On-policy, Alignment, Efficient Training, Black-box, Model Compression
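
The SODA abstract above describes building a contrastive signal by pairing each teacher response with a one-time static snapshot of the student's output. The following is a minimal sketch of that data-construction step only, assuming the three lists are aligned by prompt. The function name, the dictionary keys, and the rule that drops identical pairs are assumptions for illustration. The abstract does not specify the training objective applied to these pairs, so none is shown.

```python
def build_contrastive_pairs(prompts, teacher_responses, student_snapshot):
    """Pair each teacher response (preferred) with the student's one-time output (dispreferred)."""
    pairs = []
    for prompt, chosen, rejected in zip(prompts, teacher_responses, student_snapshot):
        if chosen.strip() == rejected.strip():
            continue  # assumption: identical outputs carry no contrast, so skip them
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


pairs = build_contrastive_pairs(
    prompts=["q1", "q2"],
    teacher_responses=["good a1", "same"],
    student_snapshot=["weak a1", "same"],
)
print(pairs)
```

Because the student outputs are generated once and reused, this step has none of the repeated rollouts that a fully on-policy method would need.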

10. Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic

Authors: Wenhui Zhu, Xuanzhao Dong, Xiwen Chen, Rui Cai, Peijie Qiu, Zhipeng Wang, Oana Frunza, Shao Tang, Jindong Gu, Yalin Wang Journal/Source: arXiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03870v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

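Scoring rationale: The paper's core topic is security vulnerabilities (indirect prompt injection attacks) and defense strategies for LLM-driven agents in multi-step tool-calling environments. It is highly relevant to "LLM Agents/Autonomous Agents/Agentic Workflow", "Tool Use/Function Calling/API Tool Use", and "Multi-agent Systems/Agent Coordination" (10 points each), since these are the central objects and settings of the study. It also evaluates multiple LLM backbones, so it is highly relevant to "Large Language Models/LLMs/Foundation Models" (10 points). It does not address the specific techniques behind the other keywords (such as MoE, quantization, inference acceleration, or AI for science), so those score 0.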

!!! tip deepseek-chat TL;DR

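The study shows that LLM-based agent systems have serious indirect prompt injection vulnerabilities in dynamic multi-step tool-calling environments. Evaluating several defense strategies, it finds existing methods broadly fragile, and it proposes a representation-engineering-based detection method that can intercept malicious actions before the agent executes them.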

Abstract

The rapid deployment of open-source frameworks has significantly advanced the development of modern multi-agent systems. However, expanded action spaces, including uncontrolled privilege exposure and hidden inter-system interactions, pose severe security challenges. Specifically, Indirect Prompt Injections (IPI), which conceal malicious instructions within third-party content, can trigger unauthorized actions such as data exfiltration during normal operations. While current security evaluations predominantly rely on isolated single-turn benchmarks, the systemic vulnerabilities of these agents within complex dynamic environments remain critically underexplored. To bridge this gap, we systematically evaluate six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones. Crucially, we conduct our evaluation entirely within dynamic multi-step tool-calling environments to capture the true attack surface of modern autonomous agents. Moving beyond binary success rates, our multidimensional analysis reveals a pronounced fragility. Advanced injections successfully bypass nearly all baseline defenses, and some surface-level mitigations even produce counterproductive side effects. Furthermore, while agents execute malicious instructions almost instantaneously, their internal states exhibit abnormally high decision entropy. Motivated by this latent hesitation, we investigate Representation Engineering (RepE) as a robust detection strategy. By extracting hidden states at the tool-input position, we revealed that the RepE-based circuit breaker successfully identifies and intercepts unauthorized actions before the agent commits to them, achieving high detection accuracy across diverse LLM backbones. This study exposes the limitations of current IPI defenses and provides a highly practical paradigm for building resilient multi-agent architectures.

Keywords: LLM Agents, Multi-agent Systems, Tool Calling, Indirect Prompt Injection, Security Vulnerabilities, Representation Engineering, Dynamic Environments, Defense Strategies
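
The abstract above describes a circuit breaker that reads hidden states at the tool-input position and blocks unauthorized actions before execution. The sketch below shows one simple way such a detector could work, using a difference-of-means direction with a midpoint threshold. This probe is a swapped-in illustrative technique, not the paper's confirmed method, and the 2-D vectors stand in for real hidden states, which would come from the model. All names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_direction(benign, malicious):
    """Return a direction pointing from the benign mean to the malicious mean, plus a midpoint threshold."""
    mean_benign, mean_malicious = mean_vector(benign), mean_vector(malicious)
    direction = [m - b for m, b in zip(mean_malicious, mean_benign)]
    midpoint = [(m + b) / 2 for m, b in zip(mean_malicious, mean_benign)]
    threshold = sum(d * c for d, c in zip(direction, midpoint))
    return direction, threshold


def should_block(hidden_state, direction, threshold):
    """Block the tool call when the projection onto the direction exceeds the threshold."""
    return sum(d * h for d, h in zip(direction, hidden_state)) > threshold


# Toy stand-ins for hidden states captured at the tool-input position.
benign = [[0.0, 1.0], [0.2, 0.8]]
malicious = [[1.0, 0.0], [0.8, 0.2]]
direction, threshold = fit_direction(benign, malicious)

print(should_block([0.9, 0.1], direction, threshold))
print(should_block([0.1, 0.9], direction, threshold))
```

The check runs on a single forward pass before the tool call is dispatched, which is what lets it intercept an action rather than only flag it afterwards.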

11. Temporal Inversion for Learning Interval Change in Chest X-Rays

Authors: Hanbin Ko, Kyeongmin Jeon, Doowoong Choi, Chang Min Park Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04563v1

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

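Scoring rationale: The paper focuses on temporal analysis of medical images (chest X-rays), an AI for Science (bioinformatics) application, scoring 10. It involves vision-language pretraining (medical foundation models), which relates to LLMs/Foundation Models without being the core topic, scoring 8. It explicitly includes pretraining and fine-tuning (SFT), scoring 10 each. Other keywords such as MoE, SLMs, Scaling Laws, Alignment, RLHF, RAG, Reasoning, Agents, and Compression are not covered and score 0.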

!!! tip deepseek-chat TL;DR

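To address the insensitivity of existing models to interval change in temporal chest X-ray analysis, the study proposes the TILA framework, which uses temporal inversion as a supervisory signal to strengthen sensitivity to directional change. Experiments show consistent improvements in progression classification and temporal embedding alignment.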

Abstract

Recent advances in vision–language pretraining have enabled strong medical foundation models, yet most analyze radiographs in isolation, overlooking the key clinical task of comparing prior and current images to assess interval change. For chest radiographs (CXRs), capturing interval change is essential, as radiologists must evaluate not only the static appearance of findings but also how they evolve over time. We introduce TILA (Temporal Inversion-aware Learning and Alignment), a simple yet effective framework that uses temporal inversion, reversing image pairs, as a supervisory signal to enhance the sensitivity of existing temporal vision-language models to directional change. TILA integrates inversion-aware objectives across pretraining, fine-tuning, and inference, complementing conventional appearance modeling with explicit learning of temporal order. We also propose a unified evaluation protocol to assess order sensitivity and consistency under temporal inversion, and introduce MS-CXR-Tretrieval, a retrieval evaluation set constructed through a general protocol that can be applied to any temporal CXR dataset. Experiments on public datasets and real-world hospital cohorts demonstrate that TILA consistently improves progression classification and temporal embedding alignment when applied to multiple existing architectures.

Keywords: Temporal Inversion, Chest X-Rays, Vision-Language Pretraining, Medical Foundation Models, Interval Change, Temporal Order Learning, Progression Classification, Temporal Embedding Alignment
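
The TILA abstract above uses temporal inversion, that is, reversing an image pair, as a supervisory signal. The sketch below shows the idea as a plain data augmentation, assuming each sample is a (prior, current, label) triple. The three-way label set and the `INVERTED_LABEL` mapping are assumptions for illustration. TILA's actual objectives across pretraining, fine-tuning, and inference are not reproduced here.

```python
# Assumed label set: reversing the pair flips the direction of change; "stable" is unchanged.
INVERTED_LABEL = {"improved": "worsened", "worsened": "improved", "stable": "stable"}


def with_temporal_inversion(samples):
    """Return each (prior, current, label) sample followed by its time-reversed copy."""
    augmented = []
    for prior, current, label in samples:
        augmented.append((prior, current, label))
        augmented.append((current, prior, INVERTED_LABEL[label]))
    return augmented


samples = [("2024_scan.png", "2025_scan.png", "improved")]
augmented = with_temporal_inversion(samples)
print(augmented)
```

A model that ignores image order would give the same prediction for both copies, so the reversed pairs penalize exactly the order-insensitive behaviour the abstract targets.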

12. A Family of Open Time-Series Foundation Models for the Radio Access Network

Authors: Ioannis Panitsas, Leandros Tassiulas Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04271v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

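Scoring rationale: The paper proposes TimeRAN, a unified multi-task learning framework for time-series modeling in the Radio Access Network (RAN). Its core is a lightweight time-series foundation model, pretrained at scale and efficiently adapted under limited supervision. It is therefore closely related to "Foundation Models" (8 points), since the paper explicitly proposes a foundation model; highly relevant to "Pre-training" (10 points), since it performs large-scale pretraining and open-sources the dataset; relevant to "Supervised Fine-tuning" (8 points), since it mentions task-specific fine-tuning; and somewhat related to "AI for Science" (5 points), since RAN optimization applies AI to a specific domain (communication networks). Other keywords (such as LLM, MoE, and RLHF) mainly target natural language processing or specific techniques unrelated to the paper's time-series model and RAN application, so they score 0.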

!!! tip deepseek-chat TL;DR

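To address the fragmentation and poor generalization caused by task-specific models in the Radio Access Network (RAN), the paper proposes TimeRAN, a unified multi-task time-series foundation model framework. Through large-scale pretraining and efficient adaptation, it achieves state-of-the-art performance on a range of RAN analytics tasks and is shown to run efficiently in a proof-of-concept 5G testbed.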

Abstract

The Radio Access Network (RAN) is evolving into a programmable and disaggregated infrastructure that increasingly relies on AI-native algorithms for optimization and closed-loop control. However, current RAN intelligence is still largely built from task-specific models tailored to individual functions, resulting in model fragmentation, limited knowledge sharing across tasks, poor generalization, and increased system complexity. To address these limitations, we introduce TimeRAN, a unified multi-task learning framework for time-series modeling in the RAN. TimeRAN leverages a lightweight time-series foundation model with few task-specific heads to learn transferable representations that can be efficiently adapted across diverse tasks with limited supervision. To enable large-scale pretraining, we further curate and open-source TimeRAN DataPile, the largest time-series corpus for RAN analytics to date, comprising over 355K time series and 0.56B measurements across diverse telemetry sources, protocol layers, and deployment scenarios. We evaluate TimeRAN across a comprehensive set of RAN analytics tasks, including anomaly detection, classification, forecasting, and imputation, and show that it achieves state-of-the-art performance with minimal or no task-specific fine-tuning. Finally, we integrate TimeRAN into a proof-of-concept 5G testbed and demonstrate that it operates efficiently with limited resource requirements in real-world scenarios.

Keywords: Time-series foundation model, Radio Access Network (RAN), Multi-task learning, Pre-training, Transferable representations, Anomaly detection, 5G testbed, Lightweight model

13. AI Trust OS – A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Comp

Authors: Eranga Bandara, Asanga Gunaratna, Ross Gore, Abdul Rahman, Ravi Mukkamala, Sachin Shetty, Sachini Rajapakse, Isurunima Kularathna, Peter Foytik, Safdar H. Bouk, Xueping Liang, Amin Hass, Ng Wee Keong, Kasun De Zoysa Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04749v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

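Scoring rationale: The paper proposes the AI Trust OS governance framework, which focuses on governance, observability, and compliance for LLMs, RAG, and multi-agent workflows in enterprise settings. The abstract explicitly mentions "large language models, retrieval-augmented generation pipelines, and multi-agent AI workflows", so these three keywords are highly relevant (10 points each). The paper belongs to the LLM application area (enterprise governance) but does not address the technical principles or innovations behind the other keywords, so they score 0.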

!!! tip deepseek-chat TL;DR

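To address the governance crisis caused by large-scale enterprise adoption of LLMs, RAG, and multi-agent workflows, the paper proposes the AI Trust OS framework, which uses continuous, autonomous observability and zero-trust compliance mechanisms to make AI systems discoverable and verifiable for governance.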

Abstract

The accelerating adoption of large language models, retrieval-augmented generation pipelines, and multi-agent AI workflows has created a structural governance crisis. Organizations cannot govern what they cannot see, and existing compliance methodologies built for deterministic web applications provide no mechanism for discovering or continuously validating AI systems that emerge across engineering teams without formal oversight. The result is a widening trust gap between what regulators demand as proof of AI governance maturity and what organizations can demonstrate. This paper proposes AI Trust OS, a governance architecture for continuous, autonomous AI observability and zero-trust compliance. AI Trust OS reconceptualizes compliance as an always-on, telemetry-driven operating layer in which AI systems are discovered through observability signals, control assertions are collected by automated probes, and trust artifacts are synthesized continuously. The framework rests on four principles: proactive discovery, telemetry evidence over manual attestation, continuous posture over point-in-time audit, and architecture-backed proof over policy-document trust. The framework operates through a zero-trust telemetry boundary in which ephemeral read-only probes validate structural metadata without ingressing source code or payload-level PII. An AI Observability Extractor Agent scans LangSmith and Datadog LLM telemetry, automatically registering undocumented AI systems and shifting governance from organizational self-report to empirical machine observation. Evaluated across ISO 42001, the EU AI Act, SOC 2, GDPR, and HIPAA, the paper argues that telemetry-first AI governance represents a categorical architectural shift in how enterprise trust is produced and demonstrated.

Keywords: AI governance, autonomous observability, zero-trust compliance, large language models, retrieval-augmented generation, multi-agent AI workflows, telemetry-driven compliance, enterprise AI trust

14. Optimizing Service Operations via LLM-Powered Multi-Agent Simulation

Authors: Yanyuan Wang, Xiaowei Zhang Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04383v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

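Scoring rationale: The paper's core contribution is an LLM-powered multi-agent simulation framework (LLM-MAS) for optimizing service operations, so it is directly and highly relevant to "Large Language Models" (LLMs are the core tool), "LLM Agents" (LLMs act as the agents), and "Multi-agent Systems" (a multi-agent simulation framework). The other keywords, such as MoE, SFT, RAG, and inference acceleration, concern specific techniques or application areas that the paper does not address, so they score 0.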
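
The arithmetic behind this score table can be reproduced in a few lines. The sketch below assumes each keyword's score is weight × relevance, capped at the configured per-keyword maximum of 15, and that a paper passes when its total is at least the pass mark (5.0 + 0.8 × total weight). The function names, the cap rule, and the `>=` comparison are illustrative assumptions, not the report generator's actual code.

```python
# Minimal sketch of the report's scoring arithmetic (all names are hypothetical).
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark formula from the report header: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def paper_score(relevance: dict[str, float], weight: float = 1.0) -> float:
    """Sum weight * relevance per keyword, capped at MAX_PER_KEYWORD (assumed rule)."""
    return sum(min(weight * r, MAX_PER_KEYWORD) for r in relevance.values())


# Non-zero rows of the table for paper 14; every other keyword scored 0.0.
paper_14 = {
    "Large Language Models": 10.0,
    "LLM Agents": 10.0,
    "Multi-agent Systems": 10.0,
}

threshold = pass_mark(total_weight=27 * 1.0)  # 27 keywords, weight 1.0 each
score = paper_score(paper_14)
passed = score >= threshold  # assumed comparison; the report only shows the check mark
print(f"{score:.1f} / {threshold:.1f} passed={passed}")
```

With all weights equal to 1.0, the total reduces to the sum of the relevance column, which is why three keywords at 10.0 are enough to clear the pass mark.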

!!! tip deepseek-chat TL;DR

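This study proposes an LLM-powered multi-agent simulation framework (LLM-MAS) for optimizing design choices in service operations. It embeds the design in prompts, uses interactions among LLM agents to simulate uncertainty, and develops an on-trajectory (online) learning algorithm. The method outperforms benchmark approaches in a sustainable supply chain application and uncovers strong designs in a contest design case study.

Abstract translation (Chinese)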

服务系统性能取决于参与者对设计选择的响应方式，但由于人类行为的复杂性，对这些响应进行建模十分困难。本文提出一种基于大语言模型的多智能体仿真框架，用于优化服务运营。我们将该问题构建为具有决策依赖不确定性的随机优化问题：设计选择被嵌入提示词中，并塑造了基于大语言模型的智能体交互所产生结果的分布。通过将关键数值信息嵌入提示词并从大语言模型生成的文本中提取该信息，我们将这种不确定性建模为一个受控马尔可夫链。我们开发了一种轨迹上学习算法，该算法在单次仿真运行中，同时构建零阶梯度估计并更新设计参数，以优化稳态性能。我们还引入了方差缩减技术。在一个可持续供应链的应用中，我们的方法在性能上超越了多种基准方案，包括黑箱优化、使用大语言模型作为数值求解器或作为角色扮演系统设计者的方法。一项基于真实行为数据的最优竞赛设计案例研究表明，LLM-MAS既可作为已知设计方案的性价比评估工具，也可作为一种探索性工具，能够发现被传统方法忽视的优秀设计方案。

Abstract

Service system performance depends on how participants respond to design choices, but modeling these responses is hard due to the complexity of human behavior. We introduce an LLM-powered multi-agent simulation (LLM-MAS) framework for optimizing service operations. We pose the problem as stochastic optimization with decision-dependent uncertainty: design choices are embedded in prompts and shape the distribution of outcomes from interacting LLM-powered agents. By embedding key numerical information in prompts and extracting it from LLM-generated text, we model this uncertainty as a controlled Markov chain. We develop an on-trajectory learning algorithm that, on a single simulation run, simultaneously constructs zeroth-order gradient estimates and updates design parameters to optimize steady-state performance. We also incorporate variance reduction techniques. In a sustainable supply chain application, our method outperforms benchmarks, including black-box optimization and using LLMs as numerical solvers or as role-playing system designers. A case study on optimal contest design with real behavioral data shows that LLM-MAS serves both as a cost-effective evaluator of known designs and as an exploratory tool that can uncover strong designs overlooked by traditional approaches.

Keywords: LLM-powered multi-agent simulation, service operations optimization, stochastic optimization, decision-dependent uncertainty, on-trajectory learning, sustainable supply chain, optimal contest design, behavioral data
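
To make "zeroth-order gradient estimates" and "variance reduction" concrete, here is a generic two-point estimator that uses common random numbers. It is not the paper's on-trajectory algorithm: `simulate`, its quadratic objective, and the step sizes are hypothetical stand-ins for an expensive LLM agent simulation whose output can only be sampled, not differentiated.

```python
import random


def simulate(theta: float, rng: random.Random) -> float:
    """Hypothetical noisy performance measure; its mean peaks at theta = 2.0."""
    return -(theta - 2.0) ** 2 + rng.gauss(0.0, 0.1)


def zeroth_order_gradient(theta: float, delta: float, seed: int) -> float:
    """Two-point finite-difference estimate built from function values only.

    Both evaluations reuse the same seed (common random numbers), so the
    additive noise cancels in the difference and the variance drops.
    """
    plus = simulate(theta + delta, random.Random(seed))
    minus = simulate(theta - delta, random.Random(seed))
    return (plus - minus) / (2.0 * delta)


def optimize(theta0: float, steps: int = 200, lr: float = 0.05, delta: float = 0.1) -> float:
    """Gradient ascent driven by the zeroth-order estimates."""
    theta = theta0
    for k in range(steps):
        theta += lr * zeroth_order_gradient(theta, delta, seed=k)
    return theta


best = optimize(theta0=0.0)
print(f"estimated optimum: {best:.4f}")
```

In a real LLM-driven simulation the noise would not be purely additive, so common random numbers would reduce the variance rather than remove it.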

📋 All papers

1. ✅ Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents

Authors: Yizhou Liu, Qi Sun, Yulin Chen, Siyue Zhang, Chen Zhao Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04651v1

Score: 74.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

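This paper addresses two problems that small language models (SLMs) show as search agents: they under-invoke search tools and are prone to hallucination. It proposes a lightweight fine-tuning method (\policy) that substantially improves SLM performance on complex multi-hop reasoning tasks, reaching a level comparable to large language models (LLMs).

Abstract translation (Chinese)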

配备搜索工具的智能体已成为解决知识密集型任务的有效方案。尽管大语言模型展现出强大的推理能力，但其高昂的计算成本限制了搜索智能体的实际部署。因此，近期研究聚焦于将大语言模型的智能体行为蒸馏至小语言模型中。通过对复杂多跳推理任务的综合评估，我们发现尽管小语言模型具备的参数知识较少，但其调用搜索工具的频次更低，且更容易产生幻觉。为解决这一问题，我们提出\policy方法——一种轻量级微调策略，通过显式训练小语言模型，使其能够基于检索证据进行可靠检索并生成答案。相较于从大语言模型进行智能体蒸馏的方法，我们的方案在Bamboogle数据集上提升了17.3分，在HotpotQA数据集上提升了15.3分，在各基准测试中均达到大语言模型级别的性能。进一步分析表明，小语言模型中的自适应搜索策略往往会导致性能下降，这凸显了稳定搜索行为对可靠推理的必要性。

Abstract

Agents equipped with search tools have emerged as effective solutions for knowledge-intensive tasks. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their high computational cost limits practical deployment for search agents. Consequently, recent work has focused on distilling agentic behaviors from LLMs into Small Language Models (SLMs). Through comprehensive evaluation on complex multi-hop reasoning tasks, we find that despite possessing less parametric knowledge, SLMs invoke search tools less frequently and are more prone to hallucinations. To address this issue, we propose \policy, a lightweight fine-tuning approach that explicitly trains SLMs to reliably retrieve and generate answers grounded in retrieved evidence. Compared to agent distillation from LLMs, our approach improves performance by 17.3 points on Bamboogle and 15.3 points on HotpotQA, achieving LLM-level results across benchmarks. Our further analysis reveals that adaptive search strategies in SLMs often degrade performance, highlighting the necessity of consistent search behavior for reliable reasoning.

Keywords: Small Language Models (SLMs), Search Agents, Tool Use, Retrieval-Augmented Generation, Multi-hop Reasoning, Hallucination Mitigation, Fine-tuning, Agent Distillation

2. ✅ PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning

Authors: Madhav S Baidya Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04565v1

Score: 70.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

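This paper addresses the tendency of LLMs to produce overconfident, hallucinated answers to incompletely specified queries. It proposes PassiveQA, a three-action framework that achieves epistemic calibration through supervised finetuning, significantly improving macro F1 and abstention recall on question answering tasks while reducing the hallucination rate.

Abstract translation (Chinese)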

大语言模型（LLMs）在问答和检索增强生成（RAG）任务中表现出色，但它们通常隐含地假设用户查询是完整且可回答的。在实际场景中，查询往往不完整、模糊或缺失关键变量，导致模型产生过度自信或虚构的回应。
本研究探讨了信息不完整情况下的决策感知查询解析问题，即模型必须决定是直接回答、请求澄清还是选择弃答。我们发现，标准及增强的RAG系统均未能可靠地展现这种认知意识，即使在信息不足时也倾向于生成答案。
为解决这一问题，我们提出了PassiveQA框架，该框架通过监督微调将模型行为与信息充分性对齐，包含回答、询问和弃答三种决策。我们的方法整合了结构化信息状态表示、基于知识图谱的上下文，以及一个经过微调的规划器，该规划器能显式建模缺失变量和决策推理过程。
在多个问答数据集上的实验表明，在计算受限的训练条件下，经过微调的规划器在宏观F1分数和弃答召回率上均取得显著提升，同时降低了幻觉生成率。
这些结果为以下观点提供了有力的实证依据：认知决策能力必须在训练过程中习得，而非仅在推理阶段强行施加。

Abstract

Large Language Models (LLMs) have achieved strong performance in question answering and retrieval-augmented generation (RAG), yet they implicitly assume that user queries are fully specified and answerable. In real-world settings, queries are often incomplete, ambiguous, or missing critical variables, leading models to produce overconfident or hallucinated responses. In this work, we study decision-aware query resolution under incomplete information, where a model must determine whether to Answer, Ask for clarification, or Abstain. We show that standard and enhanced RAG systems do not reliably exhibit such epistemic awareness, defaulting to answer generation even when information is insufficient. To address this, we propose PassiveQA, a three-action framework that aligns model behaviour with information sufficiency through supervised finetuning. Our approach integrates structured information-state representations, knowledge graph-grounded context, and a finetuned planner that explicitly models missing variables and decision reasoning. Experiments across multiple QA datasets show that the finetuned planner achieves significant improvements in macro F1 and abstention recall while reducing hallucination rates, under a compute-constrained training regime. These results provide strong empirical evidence that epistemic decision-making must be learned during training rather than imposed at inference time.
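
The three-action space (Answer, Ask, Abstain) can be illustrated with a small rule-based sketch. PassiveQA learns this decision with a finetuned planner; the hand-written rules below only show what "aligning behaviour with information sufficiency" means. The `InfoState` fields and the `decide` rule are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ASK = "ask"
    ABSTAIN = "abstain"


@dataclass(frozen=True)
class InfoState:
    """Hypothetical structured information state for one query."""
    required: frozenset  # variables needed to answer the query
    known: frozenset     # variables supplied by the query or retrieved context
    askable: frozenset   # missing variables the user could plausibly provide


def decide(state: InfoState) -> Action:
    """Pick an action from information sufficiency (illustrative rule only)."""
    missing = state.required - state.known
    if not missing:
        return Action.ANSWER   # everything needed is available
    if missing <= state.askable:
        return Action.ASK      # a clarification question could close the gap
    return Action.ABSTAIN      # the gap cannot be closed by asking


# A dosage question that names the drug but not the patient's weight.
query_state = InfoState(
    required=frozenset({"drug", "patient_weight"}),
    known=frozenset({"drug"}),
    askable=frozenset({"patient_weight"}),
)
print(decide(query_state).value)
```

The abstract's claim is that a rule like this cannot simply be bolted on at inference time; the sketch fixes the target behaviour that training is meant to produce.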

3. ✅ GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

Score: 66.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	15.0/10	15.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	15.0/10	15.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

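This paper introduces GeoBrowse, a geolocation benchmark for evaluating how agents use multi-step tool use and reasoning to combine ambiguous visual cues with open-web knowledge for verification. It also develops an agentic workflow, GATE, which experiments show outperforms direct inference as well as search-only and image-only setups.

Abstract translation (Chinese)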

深度研究智能体通过多步骤工具使用整合碎片化证据。BrowseComp为此类智能体提供了纯文本测试平台，但现有多模态基准很少同时要求弱视觉线索组合与BrowseComp风格的多跳验证。地理定位是一个天然测试场，因为其答案依赖于组合多个模糊视觉线索并通过开放网络证据进行验证。为此，我们推出GeoBrowse——一个将视觉推理与知识密集型多跳查询相结合的地理定位基准。第一层级测试碎片化视觉线索的提取与组合，第二层级通过注入长尾知识和混淆关键实体来提升查询难度。为支持评估，我们提供了智能体工作流程GATE，包含五个图像思维工具和四个知识密集型工具，并发布了基于可验证证据的专家标注逐步轨迹，用于轨迹级分析。实验表明GATE优于直接推理和开源智能体，这表明无工具、仅搜索或仅图像的设置均不充分。性能提升源于连贯的、针对特定层级的工具使用规划而非更多工具调用，因为这些规划能更可靠地抵达标注的关键证据步骤，并在整合至最终决策时产生更少错误。GeoBrowse基准与代码发布于https://github.com/ornamentt/GeoBrowse。

Abstract

Deep research agents integrate fragmented evidence through multi-step tool use. BrowseComp offers a text-only testbed for such agents, but existing multimodal benchmarks rarely require both the composition of weak visual cues and BrowseComp-style multi-hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them with open-web evidence. Thus, we introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries. Level 1 tests extracting and composing fragmented visual cues, and Level 2 increases query difficulty by injecting long-tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow, GATE, with five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable evidence for trajectory-level analysis. Experiments show that GATE outperforms direct inference and open-source agents, indicating that no-tool, search-only, or image-only setups are insufficient. Gains come from coherent, level-specific tool-use plans rather than more tool calls, as they more reliably reach annotated key evidence steps and make fewer errors when integrating into the final decision. The GeoBrowse benchmark and code are available at https://github.com/ornamentt/GeoBrowse

Keywords: Geolocation Benchmark, Agentic Tool Use, Multi-step Reasoning, Visual Reasoning, Knowledge-intensive Queries, Agentic Workflow, Expert-annotated Traces, Multi-hop Verification

4. ✅ How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Authors: Gregory N. Frank Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04385v1

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	15.0/10	15.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

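By identifying a sparse routing mechanism in alignment-trained language models, this paper shows how models detect harmful content and trigger refusal responses, and demonstrates experimentally that the mechanism is general across models, scalable, and controllable.

Abstract translation (Chinese)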

我们发现，在对齐训练的语言模型中存在一种反复出现的稀疏路由机制：一个门控注意力头读取检测到的内容，并触发下游的放大器头，从而增强拒绝信号。通过政治审查和安全拒绝作为自然实验，我们在来自6个实验室的9个模型中追踪了这一机制，所有模型均在120组提示对语料库上进行了验证。该门控头通过了必要性和充分性互换测试（p < 0.001，置换零假设），核心放大器头在自助重采样下保持稳定（Jaccard指数0.92-1.0）。三个同代缩放模型对显示，路由机制在规模扩展时分布更广（消融后效果减弱高达17倍），同时仍可通过互换检测识别。通过调节检测层信号，我们能够连续控制策略强度，从强硬拒绝到引导转向事实遵从，其路由阈值随主题而变化。该电路还揭示了意图识别与策略路由之间的结构分离：在密文编码下，门控头的路由贡献崩溃（Phi-4模型中下降78%，n=120），而模型则以解谜而非拒绝的方式响应。即使深层探针分数表明模型已开始表征有害内容，该路由机制也从未激活。这种不对称性与预训练和后训练的不同鲁棒性特征一致：预训练形成广泛的语义理解，而后训练产生的策略绑定则更狭窄，在输入变换下泛化能力较弱。

Abstract

We identify a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads detected content and triggers downstream amplifier heads that boost the signal toward refusal. Using political censorship and safety refusal as natural experiments, we trace this mechanism across 9 models from 6 labs, all validated on corpora of 120 prompt pairs. The gate head passes necessity and sufficiency interchange tests (p < 0.001, permutation null), and core amplifier heads are stable under bootstrap resampling (Jaccard 0.92-1.0). Three same-generation scaling pairs show that routing distributes at scale (ablation up to 17x weaker) while remaining detectable by interchange. By modulating the detection-layer signal, we continuously control policy strength from hard refusal through steering to factual compliance, with routing thresholds that vary by topic. The circuit also reveals a structural separation between intent recognition and policy routing: under cipher encoding, the gate head’s routing contribution collapses (78% in Phi-4 at n=120) while the model responds with puzzle-solving rather than refusal. The routing mechanism never fires, even though probe scores at deeper layers indicate the model begins to represent the harmful content. This asymmetry is consistent with different robustness properties of pretraining and post-training: broad semantic understanding versus narrower policy binding that generalizes less well under input transformation.

Keywords: alignment, language models, sparse routing, policy circuits, interpretability, refusal mechanism, gate attention head, control
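
The stability figure quoted in the abstract (Jaccard 0.92-1.0 under bootstrap resampling) is a set-overlap measure. The sketch below shows how a Jaccard index between two sets of attention heads is computed. The `(layer, head)` identifiers are made up for illustration and are not the heads reported in the paper.

```python
def jaccard(a, b) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical amplifier heads as (layer, head) pairs.
full_run = {(20, h) for h in range(12)}
# A resampled run that recovers all but one of those heads.
resampled_run = full_run - {(20, 11)}

print(round(jaccard(full_run, full_run), 3))
print(round(jaccard(full_run, resampled_run), 3))
```

A value of 1.0 means a resample recovered exactly the same heads; losing one head out of twelve already lowers the index to about 0.92.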

5. ✅ Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework

Authors: Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04852v1

Score: 61.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

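This study proposes a structured prompt engineering framework that strengthens the integrity and control of Chain-of-Thought reasoning to reduce LLM hallucination in security-sensitive tasks. It improves the reliability and interpretability of cybersecurity threat detection, with reasoning improvements of up to 40% in smaller models.

Abstract translation (Chinese)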

思维链提示已被用于增强大语言模型的推理能力。然而，其在安全敏感分析任务中的可靠性仍未得到充分检验，尤其是在结构化人工评估下。模型缩放和微调等替代方法可用于帮助提升性能，但这些方法通常成本高昂、计算密集或难以审计。相比之下，提示工程为引导大语言模型推理提供了一种轻量、透明且可控的机制。本研究提出了一个结构化提示工程框架，旨在增强思维链推理的完整性，同时提升本地大语言模型部署中安全威胁与攻击检测的可靠性。该框架包含16个要素，归入四个核心维度：(1) 上下文与范围控制，(2) 证据锚定与可追溯性，(3) 推理结构与认知控制，以及(4) 安全特异性分析约束。该框架并非启发式地优化提示措辞，而是引入了显式的推理控制机制，以减轻幻觉、防止推理漂移，并增强安全敏感场景下的可解释性。以软件定义网络流量中的DDoS攻击检测为案例研究，在结构化与非结构化提示条件下对多个模型家族进行了评估。帕累托前沿分析与消融实验表明，该方法带来了持续的推理改进（在较小模型中提升高达40%）以及跨模型规模的稳定准确率增益。具有高度评分者间一致性（Cohen’s k > 0.80）的人工评估证实了其鲁棒性。研究结果确立了结构化提示作为一种有效且实用的方法，可用于实现可靠且可解释的人工智能驱动网络安全分析。

Abstract

Chain-of-Thought (CoT) prompting has been used to enhance the reasoning capability of LLMs. However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation. Alternative approaches, such as model scaling and fine-tuning, can be used to help improve performance. However, these methods are often costly, computationally intensive, or difficult to audit. In contrast, prompt engineering provides a lightweight, transparent, and controllable mechanism for guiding LLM reasoning. This study proposes a structured prompt engineering framework designed to strengthen CoT reasoning integrity while improving security threat and attack detection reliability in local LLM deployments. The framework includes 16 factors grouped into four core dimensions: (1) Context and Scope Control, (2) Evidence Grounding and Traceability, (3) Reasoning Structure and Cognitive Control, and (4) Security-Specific Analytical Constraints. Rather than optimizing the wording of the prompt heuristically, the framework introduces explicit reasoning controls to mitigate hallucination, prevent reasoning drift, and strengthen interpretability in security-sensitive contexts. Using DDoS attack detection in SDN traffic as a case study, multiple model families were evaluated under structured and unstructured prompting conditions. Pareto frontier analysis and ablation experiments demonstrate consistent reasoning improvements (up to 40% in smaller models) and stable accuracy gains across scales. Human evaluation with strong inter-rater agreement (Cohen's kappa > 0.80) confirms robustness. The results establish structured prompting as an effective and practical approach for reliable and explainable AI-driven cybersecurity analysis.

Keywords: Chain-of-Thought, LLMs, structured prompting, reasoning integrity, hallucination mitigation, cybersecurity analysis, explainable AI, local deployment
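
The inter-rater agreement statistic reported in the abstract is Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. Below is a minimal sketch of the standard two-rater formula; the labels are invented for illustration and are not the study's evaluation data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Two hypothetical reviewers judging ten model rationales.
reviewer_1 = ["attack", "attack", "benign", "benign", "attack",
              "benign", "attack", "attack", "benign", "benign"]
reviewer_2 = ["attack", "attack", "benign", "benign", "attack",
              "benign", "attack", "benign", "benign", "benign"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

Here the reviewers agree on 9 of 10 items (0.9 observed) against 0.5 expected by chance, so kappa is lower than the raw agreement rate.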

6. ✅ MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

Authors: Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong, Steve Scargall, Charles Fan Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04853v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

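This paper addresses memory degradation in LLM agents over multi-session interactions. It proposes MemMachine, a memory system that preserves conversational ground truth and optimizes retrieval strategy, balancing high accuracy with efficiency across multiple benchmarks.

Abstract translation (Chinese)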

大型语言模型（LLM）智能体需要持久性记忆来维持个性化、事实连续性及长程推理能力，然而标准上下文窗口与检索增强生成（RAG）流程在多轮会话交互中会出现性能衰退。本文提出MemMachine，一个开源记忆系统，其在保持真实对话原貌的架构中整合了短期记忆、长期情景记忆与用户画像记忆，通过存储完整对话片段并减少基于LLM的有损信息提取来实现优化。MemMachine采用情境化检索技术，通过扩展核心匹配项及其周边上下文，有效提升了相关证据跨越多轮对话时的召回率。在多项基准测试中，MemMachine实现了优异的精度-效率平衡：在LoCoMo基准上使用gpt4.1-mini达到0.9169；在LongMemEvalS（ICLR 2025）的六维度消融实验中取得93.0%的准确率，其中检索阶段优化——包括检索深度调优（+4.2%）、上下文格式化（+2.0%）、搜索提示设计（+1.8%）和查询偏差校正（+1.4%）——显著优于摄入阶段改进如句子分块（+0.8%）。当配合优化提示时，GPT-5-mini较GPT-5提升2.6%，成为最具成本效益的配置方案。在与Mem0的对比中，MemMachine在同等条件下减少了约80%的输入令牌消耗。配套的检索智能体能够自适应地在直接检索、并行分解与迭代查询链策略间路由查询，在随机噪声环境下于HotpotQA-hard和WikiMultiHop数据集上分别达到93.2%和92.6%的准确率。这些结果表明，在保持情景真实性的基础上结合自适应检索层，能为个性化LLM智能体构建鲁棒且高效的长时记忆系统。

Abstract

Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over multi-session interactions. We present MemMachine, an open-source memory system that integrates short-term, long-term episodic, and profile memory within a ground-truth-preserving architecture that stores entire conversational episodes and reduces lossy LLM-based extraction. MemMachine uses contextualized retrieval that expands nucleus matches with surrounding context, improving recall when relevant evidence spans multiple dialogue turns. Across benchmarks, MemMachine achieves strong accuracy-efficiency tradeoffs: on LoCoMo it reaches 0.9169 using gpt4.1-mini; on LongMemEvalS (ICLR 2025), a six-dimension ablation yields 93.0 percent accuracy, with retrieval-stage optimizations – retrieval depth tuning (+4.2 percent), context formatting (+2.0 percent), search prompt design (+1.8 percent), and query bias correction (+1.4 percent) – outperforming ingestion-stage gains such as sentence chunking (+0.8 percent). GPT-5-mini exceeds GPT-5 by 2.6 percent when paired with optimized prompts, making it the most cost-efficient setup. Compared to Mem0, MemMachine uses roughly 80 percent fewer input tokens under matched conditions. A companion Retrieval Agent adaptively routes queries among direct retrieval, parallel decomposition, or iterative chain-of-query strategies, achieving 93.2 percent on HotpotQA-hard and 92.6 percent on WikiMultiHop under randomized-noise conditions. These results show that preserving episodic ground truth while layering adaptive retrieval yields robust, efficient long-term memory for personalized LLM agents.

Keywords: LLM agents, memory system, retrieval-augmented generation, personalized AI, ground-truth preservation, contextualized retrieval, long-term memory, adaptive retrieval
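
The abstract describes contextualized retrieval as expanding nucleus matches with surrounding context. One plausible way to do that is to widen each matched turn index by a fixed window and merge the spans that touch. The sketch below is an assumption about the general idea, not MemMachine's implementation; `expand_matches` and its `window` parameter are hypothetical.

```python
def expand_matches(num_turns: int, hits: list[int], window: int = 1) -> list[tuple[int, int]]:
    """Expand each matched turn index by `window` turns on both sides.

    Returns inclusive (start, end) spans, merging spans that overlap or touch
    so that no dialogue turn is returned twice.
    """
    spans: list[tuple[int, int]] = []
    for i in sorted(set(hits)):
        lo = max(0, i - window)
        hi = min(num_turns - 1, i + window)
        if spans and lo <= spans[-1][1] + 1:
            # Adjacent to or overlapping the previous span: extend it.
            spans[-1] = (spans[-1][0], max(spans[-1][1], hi))
        else:
            spans.append((lo, hi))
    return spans


# A 10-turn episode where retrieval matched turns 2, 3, and 8.
spans = expand_matches(num_turns=10, hits=[2, 3, 8], window=1)
print(spans)
```

Returning merged spans rather than isolated turns keeps evidence that stretches over several dialogue turns together, which is the recall benefit the abstract attributes to this step.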

7. ✅ DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

Authors: Jingyi Yang, Yuxian Jiang, Xuhao Hu, Shuang Cheng, Biqing Qi, Jing Shao Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04215v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
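
The arithmetic behind the table can be checked with a short sketch. It assumes each row's score is weight × relevance and that the configured per-keyword maximum (15) acts as a cap; neither rule is stated explicitly in the report, and the function names are hypothetical.

```python
from typing import Sequence


def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark from the report's configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def paper_score(relevance: Sequence[float], weights: Sequence[float], cap: float = 15.0) -> float:
    """Sum weight * relevance per keyword; the cap is an assumed reading of the per-keyword maximum."""
    if len(relevance) != len(weights):
        raise ValueError("relevance and weights must have the same length")
    return sum(min(w * r, cap) for r, w in zip(relevance, weights))


# Relevance column of the table above: five keywords at 10.0, the other 22 at 0.0.
relevance = [10.0, 0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0] + [0.0] * 18
weights = [1.0] * 27

score = paper_score(relevance, weights)
threshold = pass_mark(sum(weights))
print(score, round(threshold, 1), score >= threshold)
```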

!!! tip deepseek-chat TL;DR

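This paper addresses the fragmented post-training ecosystem of diffusion large language models (dLLMs). It proposes DARE, a unified open framework that integrates supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and other post-training methods, and provides reproducible benchmark evaluation to speed up dLLM research iteration and enable fair comparison.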

摘要翻译

扩散大语言模型（dLLMs）正逐渐成为主流自回归模型的有力替代方案，它通过迭代去噪和并行生成机制取代了严格的序列化标记生成方式。然而，其开源生态系统在模型家族之间，尤其是在训练后流程方面仍处于碎片化状态——强化学习目标、推演实现和评估脚本通常以论文专用代码库的形式发布。这种碎片化现象延缓了研究迭代速度，增加了复现的工程负担，并导致算法间的公平比较难以实现。本文提出DARE（dLLMs Alignment and Reinforcement Executor），一个用于训练后调优与评估dLLMs的开放框架。基于verl框架和OpenCompass评估体系构建的DARE，将监督微调、参数高效微调、偏好优化以及dLLM专属强化学习统一整合到适用于掩码扩散与块扩散语言模型的共享执行栈中。在涵盖LLaDA、Dream、SDAR和LLaDA2.x等代表性模型家族的实验中，DARE提供了广泛的算法覆盖、可复现的基准评估及实际加速方案。大量实证结果表明，DARE可作为可复用的研究基础平台，用于当前及新兴dLLMs的训练后方法开发、比较与部署。

摘要 (Abstract)

Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present DARE (dLLMs Alignment and Reinforcement Executor), an open framework for post-training and evaluating dLLMs. Built on top of verl and OpenCompass, DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results position DARE as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.

关键词: Diffusion Large Language Models, dLLMs, Post-training, Alignment, Reinforcement Learning, Parameter-efficient Fine-tuning, Supervised Fine-tuning, Evaluation Framework

8. ✅ MERIT: Multilingual Expert-Reward Informed Tuning for Chinese-Centric Low-Resource Machine Translation

评分: 41.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

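This paper targets the scarcity and noise of parallel data for machine translation from Chinese into low-resource Southeast Asian languages. It proposes the MERIT framework, which combines supervised fine-tuning with reward-guided optimization and substantially improves translation quality, beyond what model scaling alone achieves.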

摘要翻译

从中文到低资源东南亚语言的神经机器翻译（NMT）仍受到两大关键制约：洁净平行语料的极度匮乏，以及现有挖掘数据中普遍存在的噪声。这一长期存在的短缺不仅阻碍了有效的模型训练，还导致其性能与高资源语言方向存在巨大差距，使得老挝语、缅甸语、他加禄语等数百万使用者所依赖的翻译系统质量持续低下，尽管大规模多语言模型近期已取得进展。我们提出了多语言专家奖励引导调优框架（MERIT），这是一个统一的翻译框架，它将传统的以英语为中心的ALT基准转化为一个以中文为中心、涵盖五种东南亚低资源语言（LRLs）的评估体系。我们的框架将语言特定标记前缀（LTP）与监督微调（SFT）相结合，并引入了一种新颖的、由语义对齐奖励（SAR）引导的组相对策略优化（GRPO）。这些结果证实，在低资源语言到中文的翻译任务中，有针对性的数据策展和奖励引导的优化策略，其效果远超单纯的模型规模扩展。

摘要 (Abstract)

Neural machine translation (NMT) from Chinese to low-resource Southeast Asian languages remains severely constrained by the extreme scarcity of clean parallel corpora and the pervasive noise in existing mined data. This chronic shortage not only impedes effective model training but also sustains a large performance gap with high-resource directions, leaving millions of speakers of languages such as Lao, Burmese, and Tagalog with persistently low-quality translation systems despite recent advances in large multilingual models. We introduce Multilingual Expert-Reward Informed Tuning (MERIT), a unified translation framework that transforms the traditional English-centric ALT benchmark into a Chinese-centric evaluation suite for five Southeast Asian low-resource languages (LRLs). Our framework combines language-specific token prefixing (LTP) with supervised fine-tuning (SFT) and a novel group relative policy optimization (GRPO) guided by the semantic alignment reward (SAR). These results confirm that, in LRL→Chinese translation, targeted data curation and reward-guided optimization dramatically outperform mere model scaling.

关键词: Machine Translation, Low-resource Languages, Supervised Fine-tuning, Reward Optimization, Chinese-centric, Southeast Asian Languages, Data Curation, Semantic Alignment
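
The abstract names group relative policy optimization (GRPO) guided by a semantic alignment reward but gives no formulas. As a rough illustration of the group-relative step only, the sketch below standardizes rewards within one group of sampled translations, which is the usual GRPO advantage. The reward values are placeholders rather than real SAR scores, and the use of the population standard deviation with a small `eps` is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group: (r - mean) / (std + eps).

    Candidates better than the group average get a positive advantage and
    worse ones a negative advantage, without a learned value function.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against identical rewards
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder reward scores for four candidate translations of one source sentence.
candidate_rewards = [0.62, 0.80, 0.55, 0.71]
advantages = group_relative_advantages(candidate_rewards)
print([round(a, 3) for a in advantages])
```

Because the advantages are relative to the group, the absolute scale of the reward matters less than how well it ranks candidates for the same source sentence.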

9. ✅ SODA: Semi On-Policy Black-Box Distillation for Large Language Models

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

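This paper proposes SODA, a semi on-policy black-box knowledge distillation method that addresses the trade-off between efficiency and effectiveness when transferring knowledge from large language models to smaller ones. It matches or outperforms state-of-the-art methods on 15 of 16 benchmark results while training 10 times faster, using 27% less peak GPU memory, and eliminating the instability of adversarial training.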

摘要翻译

大型语言模型的黑箱知识蒸馏面临严格的权衡困境。简单的离策略方法（如序列级知识蒸馏）难以纠正学生模型固有的错误。完全同策略方法（如生成对抗蒸馏）通过对抗训练解决了这一问题，但引入了众所周知的训练不稳定性和极高的计算开销。为应对这一困境，我们提出SODA（基于对齐的半同策略蒸馏），这是一种高效替代方案，其动机源于前沿教师模型与更小规模基础模型之间固有的能力差距。由于紧凑型学生模型的自然零样本响应几乎严格劣于强大教师模型的目标输出，我们可以通过将教师模型的最优响应与学生模型一次性静态输出快照进行配对，构建出高效的对比信号。这表明，让小型学生模型接触其自身静态的次优行为已足以实现高质量的分布对齐，从而无需昂贵的动态展开过程和脆弱的对抗平衡。在四种紧凑型Qwen2.5和Llama-3模型上的广泛评估验证了这种半同策略范式。SODA在16项基准测试结果中的15项上达到或超越了现有最优方法。更重要的是，它在实现更优蒸馏质量的同时，训练速度提升10倍，峰值GPU内存消耗降低27%，并完全消除了对抗训练的不稳定性。

摘要 (Abstract)

Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student’s inherent errors. Fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model’s natural, zero-shot responses are almost strictly inferior to the powerful teacher’s targets, we can construct a highly effective contrastive signal simply by pairing the teacher’s optimal response with a one-time static snapshot of the student’s outputs. This demonstrates that exposing the small student to its own static inferior behaviors is sufficient for high-quality distribution alignment, eliminating the need for costly dynamic rollouts and fragile adversarial balancing. Extensive evaluations across four compact Qwen2.5 and Llama-3 models validate this semi on-policy paradigm. SODA matches or outperforms the state-of-the-art methods on 15 out of 16 benchmark results. More importantly, it achieves this superior distillation quality while training 10 times faster, consuming 27% less peak GPU memory, and completely eliminating adversarial instability.

关键词: Knowledge Distillation, Large Language Models, Small Language Models, Semi On-policy, Alignment, Efficient Training, Black-box, Model Compression
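
The pairing step described in the abstract (teacher response versus a one-time static snapshot of the student's own output) can be sketched as a small data-construction routine. The names `PreferencePair` and `build_pairs` are hypothetical, and skipping prompts where the two responses are identical is an added assumption. How the pairs are then used for alignment is not specified in the abstract, so the sketch does not cover it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # teacher response, treated as the preferred output
    rejected: str  # student's own zero-shot response from the static snapshot


def build_pairs(
    prompts: List[str],
    teacher_responses: Dict[str, str],
    student_snapshot: Dict[str, str],
) -> List[PreferencePair]:
    """Pair each teacher response with the student's snapshot response.

    The snapshot is generated once before training, so no rollouts are
    needed during optimization.
    """
    pairs = []
    for prompt in prompts:
        chosen = teacher_responses[prompt]
        rejected = student_snapshot[prompt]
        if chosen.strip() == rejected.strip():
            continue  # assumption: an identical pair carries no contrastive signal
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


# Toy data (hypothetical) standing in for teacher API outputs and student generations.
prompts = ["2+2?", "Capital of France?"]
teacher = {"2+2?": "4", "Capital of France?": "Paris"}
student = {"2+2?": "5", "Capital of France?": "Paris"}
print(build_pairs(prompts, teacher, student))
```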

10. ✅ Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

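This study shows that LLM-based agent systems have serious indirect prompt injection vulnerabilities in dynamic, multi-step tool-calling environments. An evaluation of several defense strategies finds existing methods broadly fragile, and the authors propose a detection method based on representation engineering that intercepts malicious actions before the agent executes them.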

摘要翻译

开源框架的快速部署显著推动了现代多智能体系统的发展。然而，扩展的行动空间——包括不受控的权限暴露和隐藏的系统间交互——带来了严峻的安全挑战。具体而言，间接提示注入攻击通过将恶意指令隐藏在第三方内容中，可在智能体正常操作期间触发数据窃取等未授权行为。当前的安全评估主要依赖孤立的单轮基准测试，而这些智能体在复杂动态环境中的系统性脆弱性仍未得到充分探索。为弥补这一空白，我们系统评估了针对四种复杂间接提示注入攻击向量的六种防御策略，覆盖九种大语言模型基座。关键的是，我们的评估完全在动态多步骤工具调用环境中进行，以捕捉现代自主智能体的真实攻击面。超越二元的成功率指标，我们的多维分析揭示出显著的脆弱性：高级注入攻击成功绕过几乎所有基线防御，部分表层缓解措施甚至会产生适得其反的副作用。此外，尽管智能体几乎瞬时执行恶意指令，其内部状态却表现出异常高的决策熵。基于这种潜在犹豫现象，我们探索将表征工程作为鲁棒的检测策略。通过提取工具输入位置对应的隐藏状态，我们发现基于表征工程的断路器能在智能体执行前成功识别并拦截未授权操作，在不同大语言模型基座上均实现了高检测准确率。本研究揭示了当前间接提示注入防御的局限性，并为构建具有韧性的多智能体架构提供了高度实用的范式。

摘要 (Abstract)

The rapid deployment of open-source frameworks has significantly advanced the development of modern multi-agent systems. However, expanded action spaces, including uncontrolled privilege exposure and hidden inter-system interactions, pose severe security challenges. Specifically, Indirect Prompt Injections (IPI), which conceal malicious instructions within third-party content, can trigger unauthorized actions such as data exfiltration during normal operations. While current security evaluations predominantly rely on isolated single-turn benchmarks, the systemic vulnerabilities of these agents within complex dynamic environments remain critically underexplored. To bridge this gap, we systematically evaluate six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones. Crucially, we conduct our evaluation entirely within dynamic multi-step tool-calling environments to capture the true attack surface of modern autonomous agents. Moving beyond binary success rates, our multidimensional analysis reveals a pronounced fragility. Advanced injections successfully bypass nearly all baseline defenses, and some surface-level mitigations even produce counterproductive side effects. Furthermore, while agents execute malicious instructions almost instantaneously, their internal states exhibit abnormally high decision entropy. Motivated by this latent hesitation, we investigate Representation Engineering (RepE) as a robust detection strategy. By extracting hidden states at the tool-input position, we find that a RepE-based circuit breaker successfully identifies and intercepts unauthorized actions before the agent commits to them, achieving high detection accuracy across diverse LLM backbones. This study exposes the limitations of current IPI defenses and provides a highly practical paradigm for building resilient multi-agent architectures.

关键词: LLM Agents, Multi-agent Systems, Tool Calling, Indirect Prompt Injection, Security Vulnerabilities, Representation Engineering, Dynamic Environments, Defense Strategies
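
The "decision entropy" mentioned in the abstract is not defined there. A plausible reading is the Shannon entropy of the agent's distribution over candidate actions, which the sketch below computes. The threshold rule is purely illustrative and is not the paper's detector, which operates on hidden states through representation engineering; the function names, probabilities, and threshold are hypothetical.

```python
import math
from typing import Sequence


def decision_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a distribution over candidate actions."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    normalized = [p / total for p in probs]
    return -sum(p * math.log(p) for p in normalized if p > 0)


def looks_hesitant(probs: Sequence[float], threshold: float) -> bool:
    """Illustrative flag only: entropy above a hand-picked threshold."""
    return decision_entropy(probs) > threshold


confident = [0.97, 0.01, 0.01, 0.01]  # hypothetical benign tool call
hesitant = [0.30, 0.25, 0.25, 0.20]   # hypothetical call made under injection
print(round(decision_entropy(confident), 3), round(decision_entropy(hesitant), 3))
print(looks_hesitant(hesitant, threshold=1.0))
```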

11. ✅ Temporal Inversion for Learning Interval Change in Chest X-Rays

作者: Hanbin Ko, Kyeongmin Jeon, Doowoong Choi, Chang Min Park 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04563v1

评分: 38.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

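This study addresses the insensitivity of existing models to interval change in temporal chest X-ray analysis. It proposes the TILA framework, which uses temporal inversion as a supervisory signal to sharpen sensitivity to directional change; experiments show consistent improvements in progression classification and temporal embedding alignment.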

摘要翻译

视觉-语言预训练的最新进展催生了强大的医学基础模型，但多数模型仍孤立分析放射影像，忽视了对比既往与当前影像以评估时序变化这一关键临床任务。对于胸部X光片（CXRs）而言，捕捉时序变化至关重要，因为放射科医生不仅需要评估影像表现的静态特征，还必须追踪其随时间的演变过程。我们提出TILA（时序反转感知学习与对齐框架），这是一种简洁而有效的框架，其使用时序反转（即反转图像对的顺序）作为监督信号，以增强现有时序视觉-语言模型对方向性变化的敏感性。TILA在预训练、微调和推理阶段整合了反转感知目标，通过显式学习时序顺序来补充传统的外观建模方法。我们还提出了一套统一的评估协议，用于衡量模型在时序反转下的顺序敏感性与一致性，并构建了MS-CXR-Tretrieval检索评估集——该评估集基于通用构建协议，可应用于任何时序CXR数据集。在公开数据集和真实医院队列上的实验表明，当应用于多种现有架构时，TILA能持续提升疾病进展分类的准确性和时序嵌入的对齐效果。

摘要 (Abstract)

Recent advances in vision–language pretraining have enabled strong medical foundation models, yet most analyze radiographs in isolation, overlooking the key clinical task of comparing prior and current images to assess interval change. For chest radiographs (CXRs), capturing interval change is essential, as radiologists must evaluate not only the static appearance of findings but also how they evolve over time. We introduce TILA (Temporal Inversion-aware Learning and Alignment), a simple yet effective framework that uses temporal inversion, reversing image pairs, as a supervisory signal to enhance the sensitivity of existing temporal vision-language models to directional change. TILA integrates inversion-aware objectives across pretraining, fine-tuning, and inference, complementing conventional appearance modeling with explicit learning of temporal order. We also propose a unified evaluation protocol to assess order sensitivity and consistency under temporal inversion, and introduce MS-CXR-Tretrieval, a retrieval evaluation set constructed through a general protocol that can be applied to any temporal CXR dataset. Experiments on public datasets and real-world hospital cohorts demonstrate that TILA consistently improves progression classification and temporal embedding alignment when applied to multiple existing architectures.
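
Temporal inversion as a supervisory signal can be pictured as a data-level operation: swap the prior and current image and flip the direction of the label. The sketch below assumes a three-class progression label set (`improved`, `stable`, `worsened`), which the abstract does not specify, and uses a toy severity-based predictor in place of a real vision-language model. All names are hypothetical.

```python
from typing import Any, Callable, Tuple

# Assumed label set: inversion swaps the two directions and leaves "stable" unchanged.
INVERTED_LABEL = {"improved": "worsened", "worsened": "improved", "stable": "stable"}


def invert_pair(prior: Any, current: Any, label: str) -> Tuple[Any, Any, str]:
    """Create the temporally inverted training example for one image pair."""
    return current, prior, INVERTED_LABEL[label]


def is_order_consistent(predict: Callable[[Any, Any], str], prior: Any, current: Any) -> bool:
    """Check that swapping the pair flips the predicted direction."""
    forward = predict(prior, current)
    backward = predict(current, prior)
    return backward == INVERTED_LABEL[forward]


def toy_predict(prior: float, current: float) -> str:
    """Stand-in model: inputs are scalar severity scores, not images."""
    if current > prior:
        return "worsened"
    if current < prior:
        return "improved"
    return "stable"


print(invert_pair("prior.png", "current.png", "improved"))
print(is_order_consistent(toy_predict, 0.2, 0.7))
```

A model that ignores image order (for example, one that always predicts `worsened`) fails the consistency check, which is the kind of order sensitivity the proposed evaluation protocol is meant to expose.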

12. ✅ A Family of Open Time-Series Foundation Models for the Radio Access Network

作者: Ioannis Panitsas, Leandros Tassiulas 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.04271v1

评分: 31.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

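This paper addresses the fragmentation and poor generalization caused by task-specific models in the Radio Access Network (RAN). It proposes TimeRAN, a unified multi-task time-series foundation model framework that, through large-scale pretraining and efficient adaptation, achieves state-of-the-art performance across a range of RAN analytics tasks and runs efficiently on a proof-of-concept 5G testbed.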

摘要翻译

无线接入网（RAN）正演变为一种可编程、解耦的基础设施，日益依赖原生人工智能算法进行优化与闭环控制。然而，当前的RAN智能系统仍主要由针对单一功能设计的任务专用模型构成，导致模型碎片化、任务间知识共享有限、泛化能力差以及系统复杂性增加。为应对这些局限，本文提出TimeRAN——一个面向RAN时序建模的统一多任务学习框架。TimeRAN采用轻量级时序基础模型配合少量任务专用头部，通过学习可迁移的表征，能够在有限监督下高效适配多种任务。为实现大规模预训练，我们进一步整理并开源了TimeRAN DataPile，这是迄今为止最大的RAN分析时序数据集，涵盖多样化遥测源、协议层和部署场景，包含超过35.5万条时间序列和5.6亿个测量点。我们在涵盖异常检测、分类、预测与填补的完整RAN分析任务集上评估TimeRAN，结果表明其仅需极少或无需任务特定微调即可达到先进性能。最后，我们将TimeRAN集成至概念验证型5G测试平台，证明其在实际场景中能以有限资源需求高效运行。

摘要 (Abstract)

The Radio Access Network (RAN) is evolving into a programmable and disaggregated infrastructure that increasingly relies on AI-native algorithms for optimization and closed-loop control. However, current RAN intelligence is still largely built from task-specific models tailored to individual functions, resulting in model fragmentation, limited knowledge sharing across tasks, poor generalization, and increased system complexity. To address these limitations, we introduce TimeRAN, a unified multi-task learning framework for time-series modeling in the RAN. TimeRAN leverages a lightweight time-series foundation model with few task-specific heads to learn transferable representations that can be efficiently adapted across diverse tasks with limited supervision. To enable large-scale pretraining, we further curate and open-source TimeRAN DataPile, the largest time-series corpus for RAN analytics to date, comprising over 355K time series and 0.56B measurements across diverse telemetry sources, protocol layers, and deployment scenarios. We evaluate TimeRAN across a comprehensive set of RAN analytics tasks, including anomaly detection, classification, forecasting, and imputation, and show that it achieves state-of-the-art performance with minimal or no task-specific fine-tuning. Finally, we integrate TimeRAN into a proof-of-concept 5G testbed and demonstrate that it operates efficiently with limited resource requirements in real-world scenarios.

关键词: Time-series foundation model, Radio Access Network (RAN), Multi-task learning, Pre-training, Transferable representations, Anomaly detection, 5G testbed, Lightweight model

13. ✅ AI Trust OS – A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Compliance in Enterprise Environments

评分: 30.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

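This paper addresses the governance crisis created by large-scale enterprise adoption of LLMs, RAG, and multi-agent workflows. It proposes the AI Trust OS framework, which uses continuous, autonomous observability and zero-trust compliance mechanisms to make AI systems visible and verifiable for governance.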

摘要翻译

大型语言模型、检索增强生成流程以及多智能体人工智能工作流的加速普及已引发结构性治理危机。组织无法监管其不可见之物，而为确定性网络应用构建的现有合规方法论缺乏相应机制，用以发现或持续验证那些在工程团队间自发产生、缺乏正式监管的人工智能系统。这导致监管机构要求的人工智能治理成熟度证明与组织实际可提供的证据之间出现日益扩大的信任鸿沟。本文提出“AI信任操作系统”——一种面向持续自主人工智能可观测性与零信任合规的治理架构。该系统将合规性重新构想为一个全天候运行、由遥测数据驱动的操作层：通过可观测性信号发现人工智能系统，借助自动化探针收集控制断言，并持续合成信任凭证。该框架基于四大原则：主动发现、基于遥测证据而非人工证明、持续状态监测而非时点审计、架构支撑的实证而非政策文件的信任。其通过零信任遥测边界运作，瞬时只读探针在此验证结构性元数据，无需接入源代码或载荷级个人身份信息。人工智能可观测性提取代理会扫描LangSmith与Datadog的LLM遥测数据，自动注册未记录的人工智能系统，从而将治理模式从组织自我报告转向实证化的机器观测。通过对ISO 42001、欧盟《人工智能法案》、SOC 2、GDPR及HIPAA等标准的评估，本文论证了遥测优先的人工智能治理模式代表了企业信任生成与验证方式的根本性架构变革。

摘要 (Abstract)

The accelerating adoption of large language models, retrieval-augmented generation pipelines, and multi-agent AI workflows has created a structural governance crisis. Organizations cannot govern what they cannot see, and existing compliance methodologies built for deterministic web applications provide no mechanism for discovering or continuously validating AI systems that emerge across engineering teams without formal oversight. The result is a widening trust gap between what regulators demand as proof of AI governance maturity and what organizations can demonstrate. This paper proposes AI Trust OS, a governance architecture for continuous, autonomous AI observability and zero-trust compliance. AI Trust OS reconceptualizes compliance as an always-on, telemetry-driven operating layer in which AI systems are discovered through observability signals, control assertions are collected by automated probes, and trust artifacts are synthesized continuously. The framework rests on four principles: proactive discovery, telemetry evidence over manual attestation, continuous posture over point-in-time audit, and architecture-backed proof over policy-document trust. The framework operates through a zero-trust telemetry boundary in which ephemeral read-only probes validate structural metadata without ingressing source code or payload-level PII. An AI Observability Extractor Agent scans LangSmith and Datadog LLM telemetry, automatically registering undocumented AI systems and shifting governance from organizational self-report to empirical machine observation. Evaluated across ISO 42001, the EU AI Act, SOC 2, GDPR, and HIPAA, the paper argues that telemetry-first AI governance represents a categorical architectural shift in how enterprise trust is produced and demonstrated.

14. ✅ Optimizing Service Operations via LLM-Powered Multi-Agent Simulation

作者: Yanyuan Wang, Xiaowei Zhang 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04383v1

评分: 30.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

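This study proposes an LLM-powered multi-agent simulation framework (LLM-MAS) for optimizing design choices in service operations. Design choices are embedded in prompts, uncertainty is simulated through interacting LLM agents, and an online learning algorithm optimizes the design. The method outperforms benchmark approaches in a sustainable supply chain application, and a contest design case study shows it can uncover strong designs that traditional approaches overlook.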

摘要翻译

服务系统性能取决于参与者对设计选择的响应方式，但由于人类行为的复杂性，对这些响应进行建模十分困难。本文提出一种基于大语言模型的多智能体仿真框架，用于优化服务运营。我们将该问题构建为具有决策依赖不确定性的随机优化问题：设计选择被嵌入提示词中，并塑造了基于大语言模型的智能体交互所产生结果的分布。通过将关键数值信息嵌入提示词并从大语言模型生成的文本中提取该信息，我们将这种不确定性建模为一个受控马尔可夫链。我们开发了一种轨迹上学习算法，该算法在单次仿真运行中，同时构建零阶梯度估计并更新设计参数，以优化稳态性能。我们还引入了方差缩减技术。在一个可持续供应链的应用中，我们的方法在性能上超越了多种基准方案，包括黑箱优化、使用大语言模型作为数值求解器或作为角色扮演系统设计者的方法。一项基于真实行为数据的最优竞赛设计案例研究表明，LLM-MAS既可作为已知设计方案的性价比评估工具，也可作为一种探索性工具，能够发现被传统方法忽视的优秀设计方案。

摘要 (Abstract)

Service system performance depends on how participants respond to design choices, but modeling these responses is hard due to the complexity of human behavior. We introduce an LLM-powered multi-agent simulation (LLM-MAS) framework for optimizing service operations. We pose the problem as stochastic optimization with decision-dependent uncertainty: design choices are embedded in prompts and shape the distribution of outcomes from interacting LLM-powered agents. By embedding key numerical information in prompts and extracting it from LLM-generated text, we model this uncertainty as a controlled Markov chain. We develop an on-trajectory learning algorithm that, on a single simulation run, simultaneously constructs zeroth-order gradient estimates and updates design parameters to optimize steady-state performance. We also incorporate variance reduction techniques. In a sustainable supply chain application, our method outperforms benchmarks, including black-box optimization and using LLMs as numerical solvers or as role-playing system designers. A case study on optimal contest design with real behavioral data shows that LLM-MAS is both a cost-effective evaluator of known designs and an exploratory tool that can uncover strong designs overlooked by traditional approaches.
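
To make the phrase "zeroth-order gradient estimates" concrete, here is a minimal Python sketch of a two-point finite-difference estimator driving a gradient-ascent loop. It is a generic illustration, not the paper's on-trajectory algorithm: the objective `noisy_performance`, its optimum at 2.0, the noise level, and the step sizes `delta` and `lr` are all hypothetical stand-ins for a simulated outcome.

```python
import random


def noisy_performance(theta, rng):
    # Hypothetical stand-in for one simulated outcome under design theta.
    # The noise-free optimum is at theta = 2.0 (an assumption for this sketch).
    return -(theta - 2.0) ** 2 + rng.gauss(0.0, 0.1)


def zeroth_order_step(theta, rng, delta=0.1, lr=0.05):
    # Two-point estimate: only function evaluations are used, no analytic gradient.
    f_plus = noisy_performance(theta + delta, rng)
    f_minus = noisy_performance(theta - delta, rng)
    grad_est = (f_plus - f_minus) / (2.0 * delta)
    # Gradient ascent, because performance is being maximized.
    return theta + lr * grad_est


rng = random.Random(0)
theta = 0.0
for _ in range(500):
    theta = zeroth_order_step(theta, rng)
```

After the loop, `theta` should sit near the assumed optimum; the abstract's method differs in that it builds such estimates and updates parameters within a single run of a controlled Markov chain.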

15. ❌ Uncertainty as a Planning Signal: Multi-Turn Decision Making for Goal-Oriented Conversation

Authors: Xinyi Ling, Ye Liu, Reza Averly, Xia Ning Journal/Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.03924v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

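Scoring rationale: The paper proposes the CUP framework, which combines language models with structured planning for goal-oriented conversation; its core involves LLMs (used to propose feasible actions) and multi-step decision planning (which reflects a reasoning process). It is therefore highly relevant (8 points) to "Large Language Models" (the paper explicitly uses LLMs) and "LLM Agents" (the framework is essentially an LLM-driven decision-making agent). It is moderately relevant (5 points) to "Chain of Thought" and "System 2 Thinking", because the paper emphasizes multi-turn, long-horizon decision planning that involves step-by-step reasoning under uncertainty, although it does not use these terms directly. The other keywords (such as MoE, quantization, and AI for Science) are not covered in the paper and score 0.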
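
The arithmetic behind the verdict can be checked with a short Python sketch. It uses the pass-mark formula from the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0) and the four non-zero rows of the table above. Treating each row score as weight × relevance and the pass test as "greater than or equal" are assumptions; the function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    # Pass mark as stated in the report's configuration section.
    return base + factor * total_weight


def total_score(rows):
    # rows: (weight, relevance) pairs; row score assumed to be weight * relevance.
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero rows for paper 15: LLMs, Chain of Thought, System 2 Thinking, LLM Agents.
rows = [(1.0, 8.0), (1.0, 5.0), (1.0, 5.0), (1.0, 8.0)]
score = total_score(rows)
threshold = passing_score(27.0)
passed = score >= threshold  # comparison direction is an assumption
```

With these inputs the total is 26.0 against a pass mark of about 26.6, which matches the ❌ shown for this paper.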

!!! tip deepseek-chat TL;DR

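This study addresses the lack of long-horizon decision planning in LLMs for goal-oriented conversation by proposing CUP, an uncertainty-aware planning framework; experiments show that it improves success rates while reducing the number of interaction turns.

Abstract translation

Goal-oriented conversational systems must make sequential decisions under uncertainty about the user's intent, and the algorithm must balance information acquisition and target commitment over multiple turns. Existing methods approach this challenge from different angles: structured methods support multi-step planning but depend on predefined schemas, while methods based on large language models (LLMs) support flexible interaction but lack long-horizon decision making, which leads to poor coordination between information acquisition and target commitment. To overcome this limitation, we model goal-oriented conversation as an uncertainty-aware sequential decision problem in which uncertainty serves as the guiding signal for multi-turn decisions. We propose the Conversation Uncertainty-aware Planning (CUP) framework, which combines language models with structured planning: the language model proposes feasible actions, and the planner evaluates their long-term effect on reducing uncertainty. Experiments on multiple conversational benchmarks show that CUP consistently improves task success rates while reducing the number of interaction turns. Further analysis shows that uncertainty-aware planning contributes to more efficient information acquisition and earlier confident target commitment.

Abstract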

Goal-oriented conversational systems require making sequential decisions under uncertainty about the user’s intent, where the algorithm must balance information acquisition and target commitment over multiple turns. Existing approaches address this challenge from different perspectives: structured methods enable multi-step planning but rely on predefined schemas, while LLM-based approaches support flexible interactions but lack long-horizon decision making, resulting in poor coordination between information acquisition and target commitment. To address this limitation, we formulate goal-oriented conversation as an uncertainty-aware sequential decision problem, where uncertainty serves as a guiding signal for multi-turn decision making. We propose a Conversation Uncertainty-aware Planning framework (CUP) that integrates language models with structured planning: a language model proposes feasible actions, and a planner evaluates their long-term impact on uncertainty reduction. Experiments on multiple conversational benchmarks show that CUP consistently improves success rates while requiring fewer interaction turns. Further analysis demonstrates that uncertainty-aware planning contributes to more efficient information acquisition and earlier confident commitment.

Keywords: goal-oriented conversation, sequential decision making, uncertainty-aware planning, language models, multi-turn interaction, information acquisition, target commitment, CUP framework
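
The trade-off between information acquisition and target commitment described in this abstract can be illustrated with a small Python sketch. It is not the CUP planner: it looks only one step ahead, assumes a noise-free yes/no user, and uses a hypothetical `commit_threshold`; the belief table and question sets are invented for the example.

```python
import math


def entropy(belief):
    # Shannon entropy (bits) of a belief over candidate targets.
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def expected_entropy_after(belief, yes_set):
    # yes_set: targets for which the user would answer "yes" (noise-free assumption).
    p_yes = sum(p for t, p in belief.items() if t in yes_set)
    result = 0.0
    for mass, keep in ((p_yes, True), (1.0 - p_yes, False)):
        if mass <= 0:
            continue
        posterior = {t: p / mass for t, p in belief.items() if (t in yes_set) == keep}
        result += mass * entropy(posterior)
    return result


def choose_action(belief, questions, commit_threshold=0.8):
    # Commit once one target is likely enough; otherwise ask the question
    # that is expected to leave the least uncertainty.
    target, p = max(belief.items(), key=lambda kv: kv[1])
    if p >= commit_threshold:
        return ("commit", target)
    best = min(questions, key=lambda q: expected_entropy_after(belief, questions[q]))
    return ("ask", best)


belief = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
questions = {"q_half": {"A", "B"}, "q_single": {"A"}}
action = choose_action(belief, questions)
```

Here the question that splits the candidates evenly is preferred over the one that singles out a single target, because it is expected to remove more uncertainty; a long-horizon planner would extend the same idea over several turns.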

16. ❌ CPT: Controllable and Editable Design Variations with Language Models

Authors: Karthik Suresh, Amine Ben Khalifa, Li Zhang, Wei-ting Hsu, Fangzheng Wu, Vinay More, Asim Kadav Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04380v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

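Scoring rationale: The paper's core is the use of a language model (the Creative Pre-trained Transformer, CPT) for creative design generation, an application of large models to a specific domain. The highly relevant keywords are: 1) “Large Language Models” (the paper explicitly uses the decoder-only language model CPT; weight 1.0, score 10.0); 2) “Pre-training” (the model is named Creative Pre-trained Transformer and involves pre-training; score 8.0); 3) “Post-training” (the paper mentions fine-tuning CPT on a large corpus, which counts as post-training/fine-tuning; score 8.0). Other keywords such as MoE, SLMs, Scaling Laws, Alignment, and RAG are not mentioned in the abstract or are unrelated to the paper, so they score 0. The paper does not involve any of the designated expert authors.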

!!! tip deepseek-chat TL;DR

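This paper presents a system that uses a pre-trained language model (CPT) and the Creative Markup Language (CML) to generate editable design variations. It addresses the problem that manual design is time-consuming and hard to scale, and it produces semantically structured, stylistically consistent outputs.

Abstract translation

Producing visually diverse, high-quality designs still relies on manual work, a time-consuming process that limits the scalability and personalization of creative workflows. This study presents a system based on a decoder-only language model for generating editable design variations. The model, called the Creative Pre-trained Transformer (CPT), is trained to predict visual style attributes in design templates. At the core of our approach is a new representation called Creative Markup Language (CML), a compact, machine-learning-friendly format that captures canvas-level structure, page layout, and element-level details (text, images, and vector graphics), covering both content and style. We fine-tune CPT on a large corpus of design templates created by professional designers so that it can make semantically meaningful, context-aware predictions for attributes such as color schemes and font choices. The model's outputs are semantically structured and stylistically coherent, and they preserve internal consistency across elements. Unlike generative image models, the system produces fully editable design documents rather than pixel-only images, so users can iterate and personalize directly in a design editor. Experiments show that our approach generates context-appropriate color and font variations for existing templates and shows promise in adjusting layouts while following design principles.

Abstract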

Designing visually diverse and high-quality designs remains a manual, time-consuming process, limiting scalability and personalization in creative workflows. We present a system for generating editable design variations using a decoder-only language model, the Creative Pre-trained Transformer (CPT), trained to predict visual style attributes in design templates. At the core of our approach is a new representation called Creative Markup Language (CML), a compact, machine-learning-friendly format that captures canvas-level structure, page layout, and element-level details (text, images, and vector graphics), including both content and style. We fine-tune CPT on a large corpus of design templates authored by professional designers, enabling it to learn meaningful, context-aware predictions for attributes such as color schemes and font choices. The model produces semantically structured and stylistically coherent outputs, preserving internal consistency across elements. Unlike generative image models, our system yields fully editable design documents rather than pixel-only images, allowing users to iterate and personalize within a design editor. In experiments, our approach generates contextual color and font variations for existing templates and shows promise in adjusting layouts while maintaining design principles.

Keywords: Creative Pre-trained Transformer, Creative Markup Language, design variations, language model, fine-tuning, editable design, visual style attributes, design templates

17. ❌ An AI Teaching Assistant for Motion Picture Engineering

Authors: Deirdre O’Regan, Anil C. Kokaram Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04670v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

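Scoring rationale: The paper's core is building an AI teaching assistant with LLMs and RAG and evaluating its effect in an engineering course. It is therefore highly relevant (10 points) to "Large Language Models" and "Retrieval-Augmented Generation", because LLMs are the underlying technology and RAG is the concrete implementation method. Other keywords such as MoE, SFT, RLHF, and quantization are not mentioned in the abstract and are entirely unrelated (0 points). The paper is an application of large models to education, which fits the "research applications of large models in different domains" part of the research background, but it does not involve specific scientific domains such as biomedicine.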

!!! tip deepseek-chat TL;DR

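This study used retrieval-augmented generation to build an AI teaching assistant for the Master's Motion Picture Engineering course at Trinity College Dublin. Students rated the assistant as helpful, and access to it made no significant difference to exam results, so the academic validity of the exams was preserved.

Abstract translation

The rapid rise of large language models over the past few years has driven widespread experimentation with LLM-based AI tutoring tools. However, the details of implementation and the benefits in a teaching environment are still at an early stage of exploration. This article examines these questions in the context of implementing an AI teaching assistant with retrieval-augmented generation for the Master's Motion Picture Engineering course at Trinity College Dublin. We describe our implementation in detail (including the prompt given to the LLM, and code) and highlight how we designed and tuned the retrieval-augmented generation pipeline to meet the course's needs. We describe the design of the survey instrument and report the impact of the AI teaching assistant through a number of quantitative metrics. The scale of the experiment (43 students, 296 sessions, and 1,889 queries over 7 weeks) was sufficient to support confidence in the findings. Unlike previous studies, we experimented with allowing the AI teaching assistant in open-book examinations. Statistical analysis across three exams showed no significant difference in student performance with or without access to the AI teaching assistant (p > 0.05), which indicates that carefully designed assessments can maintain academic validity. Student feedback indicated that the AI teaching assistant was helpful (mean = 4.22/5), while opinions were mixed on whether students preferred it over human tutoring (mean = 2.78/5).

Abstract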

The rapid rise of LLMs over the last few years has promoted growing experimentation with LLM-driven AI tutors. However, the details of implementation, as well as the benefit in a teaching environment, are still in the early days of exploration. This article addresses these issues in the context of implementation of an AI Teaching Assistant (AI-TA) using Retrieval Augmented Generation (RAG) for Trinity College Dublin’s Master’s Motion Picture Engineering (MPE) course. We provide details of our implementation (including the prompt to the LLM, and code), and highlight how we designed and tuned our RAG pipeline to meet course needs. We describe our survey instrument and report on the impact of the AI-TA through a number of quantitative metrics. The scale of our experiment (43 students, 296 sessions, 1,889 queries over 7 weeks) was sufficient to have confidence in our findings. Unlike previous studies, we experimented with allowing the use of the AI-TA in open-book examinations. Statistical analysis across three exams showed no performance differences regardless of AI-TA access (p > 0.05), demonstrating that thoughtfully designed assessments can maintain academic validity. Student feedback revealed that the AI-TA was beneficial (mean = 4.22/5), while students had mixed feelings about preferring it over human tutoring (mean = 2.78/5).

Keywords: AI Teaching Assistant, LLMs, Retrieval Augmented Generation, RAG, Motion Picture Engineering, educational technology, academic assessment, student feedback

18. ❌ MolDA: Molecular Understanding and Generation via Large Language Diffusion Model

Authors: Seohyeon Shin, HanJun Choi, Jun-Hyung Park, Hongkook Kim, Mansu Kim Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04403v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

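Scoring rationale: MolDA proposes a new multimodal framework for molecular understanding and generation whose core innovation is replacing the conventional autoregressive backbone with a discrete large language diffusion model. The highly relevant keywords are: 1) “Large Language Models”: the paper explicitly builds on LLMs and proposes an LLM-based diffusion model variant, which is core content (10 points). 2) “AI for Science”: the paper focuses on molecular discovery, a scientific domain, and is a concrete application of LLMs in bioinformatics/cheminformatics, which is core content (10 points). Other keywords such as MoE, SLMs, Scaling Laws, training methods (Pre-training, SFT, RLHF, and so on), inference optimization techniques, agent systems, and model compression are not mentioned or discussed in the abstract, so they score 0.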

!!! tip deepseek-chat TL;DR

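This paper targets the difficulty that existing autoregressive-LLM-based molecular generation methods have with global chemical constraints and accumulated structural errors. It proposes the MolDA framework, which uses a discrete large language diffusion model with bidirectional iterative denoising to achieve better global structural coherence, chemical validity, and robust reasoning across molecule generation, captioning, and property prediction.

Abstract translation

Large language models (LLMs) have significantly advanced molecular discovery, but existing multimodal molecular architectures fundamentally rely on autoregressive (AR) backbones. This strict left-to-right inductive bias is suboptimal for generating chemically valid molecules, because it struggles with non-local global constraints (such as ring closures) and often accumulates structural errors during sequential generation. To address these limitations, we propose MolDA (a molecular language model based on masked diffusion), a novel multimodal framework that replaces the conventional autoregressive backbone with a discrete large language diffusion model. MolDA extracts comprehensive structural representations with a hybrid graph encoder that captures both local and global topology, and aligns them to the language token space through a Q-Former. In addition, we mathematically reformulate Molecular Structure Preference Optimization specifically for the masked diffusion process. Through bidirectional iterative denoising, MolDA ensures global structural coherence, chemical validity, and robust reasoning across molecule generation, captioning, and property prediction.

Abstract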

Large Language Models (LLMs) have significantly advanced molecular discovery, but existing multimodal molecular architectures fundamentally rely on autoregressive (AR) backbones. This strict left-to-right inductive bias is sub-optimal for generating chemically valid molecules, as it struggles to account for non-local global constraints (e.g., ring closures) and often accumulates structural errors during sequential generation. To address these limitations, we propose MolDA (Molecular language model with masked Diffusion with mAsking), a novel multimodal framework that replaces the conventional AR backbone with a discrete Large Language Diffusion Model. MolDA extracts comprehensive structural representations using a hybrid graph encoder, which captures both local and global topologies, and aligns them into the language token space via a Q-Former. Furthermore, we mathematically reformulate Molecular Structure Preference Optimization specifically for the masked diffusion. Through bidirectional iterative denoising, MolDA ensures global structural coherence, chemical validity, and robust reasoning across molecule generation, captioning, and property prediction.

Keywords: Large Language Models, Molecular Discovery, Diffusion Model, Multimodal Framework, Molecular Generation, Chemical Validity, Global Structural Coherence, Molecular Structure Preference Optimization

19. ❌ Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Validation

Authors: Yongmin Yoo, Qiongkai Xu, Longbing Cao Journal/Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04295v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

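Scoring rationale: The paper proposes the ACE framework, whose core is using LLMs to validate patent claims, and introduces the Chain of Patent Thought (CoPT) protocol, an application of Chain of Thought reasoning to the patent domain. It is therefore highly relevant (10 points) to "Large Language Models" and "Chain of Thought". The paper does not involve the specific techniques or applications of the other keywords, such as MoE, SLMs, Scaling Laws, training methods, alignment, RAG, inference acceleration, model compression, or AI for Science, so those keywords score 0.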

!!! tip deepseek-chat TL;DR

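This study addresses the difficulty of balancing cost and accuracy in automated patent claim validation. It proposes ACE, an adaptive cost-efficient evaluation framework based on predictive entropy, which routes high-uncertainty claims to an expert LLM that executes the Chain of Patent Thought protocol. ACE reaches an F1 score of 94.95%, the best among the evaluated methods, while cutting operational costs by 78% compared with standalone LLM deployment.

Abstract translation

Automated validation of patent claims demands zero-defect tolerance, because even a single structural flaw can render a claim legally invalid. Existing evaluation paradigms face a rigidity-resource dilemma: lightweight encoders struggle with subtle legal dependencies, while exhaustive verification with large language models (LLMs) is prohibitively expensive. To bridge this gap, we propose ACE (Adaptive Cost-efficient Evaluation), a hybrid framework that uses predictive entropy to route only high-uncertainty claims to an expert LLM. The expert model then executes a Chain of Patent Thought (CoPT) protocol built on the statutory standards of Title 35 of the United States Code (35 U.S.C.). This design lets ACE handle long-range legal dependencies more effectively while remaining efficient. Among the evaluated methods, ACE achieves the best performance with an F1 score of 94.95%, while reducing operational costs by 78% compared with standalone LLM deployment. We also construct ACE-40k, a benchmark of 40,000 claims with error annotations grounded in the Manual of Patent Examining Procedure (MPEP), to support further research.

Abstract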

Automated validation of patent claims demands zero-defect tolerance, as even a single structural flaw can render a claim legally defective. Existing evaluation paradigms suffer from a rigidity-resource dilemma: lightweight encoders struggle with nuanced legal dependencies, while exhaustive verification via Large Language Models (LLMs) is prohibitively costly. To bridge this gap, we propose ACE (Adaptive Cost-efficient Evaluation), a hybrid framework that uses predictive entropy to route only high-uncertainty claims to an expert LLM. The expert then executes a Chain of Patent Thought (CoPT) protocol grounded in 35 U.S.C. statutory standards. This design enables ACE to handle long-range legal dependencies more effectively while preserving efficiency. ACE achieves the best F1 among the evaluated methods at 94.95%, while reducing operational costs by 78% compared to standalone LLM deployments. We also construct ACE-40k, a 40,000-claim benchmark with MPEP-grounded error annotations, to facilitate further research.

Keywords: Patent Claim Validation, Large Language Models, Adaptive Cost-efficient Evaluation, Chain of Patent Thought, Predictive Entropy, Legal Dependencies, Benchmark ACE-40k, Operational Cost Reduction
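
The routing idea in this abstract, sending only high-uncertainty claims to the expert LLM, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the ACE implementation: the class probabilities are invented, the route labels are hypothetical, and the entropy `threshold` of 0.5 nats is an arbitrary choice for the example.

```python
import math


def predictive_entropy(probs):
    # Shannon entropy (nats) of the lightweight model's predicted class distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)


def route(probs, threshold=0.5):
    # Hypothetical rule: escalate to the expert LLM only when uncertainty is high.
    if predictive_entropy(probs) > threshold:
        return "expert_llm"
    return "lightweight"


confident = [0.98, 0.02]   # lightweight model is nearly certain
uncertain = [0.55, 0.45]   # lightweight model is close to a coin flip

confident_route = route(confident)
uncertain_route = route(uncertain)
```

The confident prediction stays with the lightweight model and the near coin flip is escalated; in practice the threshold controls the cost/accuracy trade-off, since a lower value sends more claims to the expensive expert.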

20. ❌ Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

Authors: Alhasan Mahmood, Samir Abdaljalil, Hasan Kurban Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04532v1

Score: 18.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

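Scoring rationale: The paper studies how multilingual prompt localization affects Agent-as-a-Judge evaluation, and centers on how the performance of an LLM acting as an evaluation agent changes across languages. It is therefore highly relevant (10 points) to "LLM Agents", because the paper directly studies the behavior of LLMs as agents in evaluation tasks. It is relevant (8 points) to "Large Language Models", because the study uses large models such as GPT-4o and Gemini as judge backbones. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.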

!!! tip deepseek-chat TL;DR

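This paper finds that, in Agent-as-a-Judge evaluation, changing the evaluation language (for example English, Arabic, or Chinese) significantly affects the performance ranking of different judge backbones (such as GPT-4o and Gemini), which indicates that language should be an explicit variable in agentic benchmarks.

Abstract translation

In agentic code benchmarks, the evaluation language is usually treated as a fixed English default, yet we find that changing the judge's language can invert the ranking of backbone models. This study localizes the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluates 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, for a total of 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, $p<0.001$ vs. GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and agreement between backbones on individual requirement judgments is only modest (Fleiss' $\kappa \leq 0.231$). A controlled ablation further shows that localizing judge-side instructions, and not only the benchmark content, can be decisive: under partial localization, Hindi satisfaction drops from 42.8% to 23.2%. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. We publicly release the full requirement-level judgments and runtime statistics for reproducibility.

Abstract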

Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge’s language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, $p<0.001$ vs. GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments is modest (Fleiss’ $\kappa \leq 0.231$). A controlled ablation further shows that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8% to 23.2% under partial localization. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. Full requirement-level judgments and runtime statistics are released for reproducibility.

Keywords: Agent-as-a-Judge, Multilingual Prompt Localization, Language Sensitivity, Backbone Sensitivity, Evaluation Language, LLM Agents, Agentic Benchmarks, Requirement-Level Evaluation
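
Since the abstract reports inter-backbone agreement as Fleiss' kappa, a compact Python sketch of the standard statistic may help in reading that number. The count matrices below are invented toy data, not the paper's judgments; treating the six judge backbones as six raters and "satisfied / not satisfied" as the two categories is an assumption about how such a table would be laid out.

```python
def fleiss_kappa(counts):
    # counts[i][j]: number of raters who assigned item i to category j.
    # Every item must be rated by the same number of raters.
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Overall proportion of assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]

    # Per-item agreement: share of rater pairs that agree on that item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Toy data: 2 requirements, 6 raters, categories = [satisfied, not satisfied].
perfect = [[6, 0], [0, 6]]
split = [[3, 3], [3, 3]]
kappa_perfect = fleiss_kappa(perfect)
kappa_split = fleiss_kappa(split)
```

A value of 1 means complete agreement and values near or below 0 mean agreement no better than chance, so a reported upper bound of 0.231 indicates that the backbones often disagree on individual requirements.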

21. ❌ Lighting Up or Dimming Down? Exploring Dark Patterns of LLMs in Co-Creativity

Authors: Zhu Li, Jiaming Qu, Yuan Chang Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04735v1

Score: 18.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

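Scoring rationale: The paper's core is the "dark pattern" behaviors that appear when LLMs act as collaborative writing partners, so it is highly relevant (10 points) to "Large Language Models"; it also discusses how these behaviors relate to safety alignment (Alignment) (8 points). The paper does not involve keywords on other technical principles or specific application domains, so the remaining keywords score 0.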

!!! tip deepseek-chat TL;DR

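This study explores five "dark pattern" behaviors (such as sycophancy and moralizing) that appear when LLMs act as collaborative writing partners. It finds that sycophancy is nearly ubiquitous (91.7% of cases), particularly on sensitive topics, and that these behaviors may stem from safety alignment and constrain creative exploration.

Abstract translation

Large language models (LLMs) are increasingly becoming collaborative writing partners, which raises questions about their impact on human agency. In this exploratory study, we investigate five “dark patterns” in human-AI co-creativity, subtle model behaviors that can suppress or distort the creative process: Sycophancy, Tone Policing, Moralizing, Loop of Death, and Anchoring. Through a series of controlled sessions, we prompt LLMs as writing assistants across a variety of literary forms and themes and analyze how prevalent these behaviors are in the generated responses. Preliminary results suggest that Sycophancy is nearly ubiquitous (91.7% of cases), particularly on sensitive topics, while Anchoring appears to depend on literary form and surfaces most often in folktales. The study indicates that these dark patterns, often byproducts of safety alignment, may inadvertently narrow the breadth of creative exploration, and it proposes design considerations for AI systems that effectively support creative writing.

Abstract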

Large language models (LLMs) are increasingly acting as collaborative writing partners, raising questions about their impact on human agency. In this exploratory work, we investigate five “dark patterns” in human-AI co-creativity – subtle model behaviors that can suppress or distort the creative process: Sycophancy, Tone Policing, Moralizing, Loop of Death, and Anchoring. Through a series of controlled sessions where LLMs are prompted as writing assistants across diverse literary forms and themes, we analyze the prevalence of these behaviors in generated responses. Our preliminary results suggest that Sycophancy is nearly ubiquitous (91.7% of cases), particularly in sensitive topics, while Anchoring appears to be dependent on literary forms, surfacing most frequently in folktales. This study indicates that these dark patterns, often byproducts of safety alignment, may inadvertently narrow creative exploration and proposes design considerations for AI systems that effectively support creative writing.

Keywords: Large language models, LLMs, co-creativity, dark patterns, sycophancy, alignment, creative writing, human-AI collaboration

22. ❌ High-Stakes Personalization: Rethinking LLM Customization for Individual Investor Decision-Making

Authors: Yash Ganpat Sawant Journal/Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04300v1

Score: 18.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

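Scoring rationale: The paper studies the challenges of personalizing LLMs for individual investment decisions, which is LLM application research. It is highly relevant to “Large Language Models” (10 points) because the whole paper revolves around LLM systems. It has some relevance to “Instruction Tuning” OR “Alignment” OR “Value Alignment” (8 points) because it discusses “alignment without ground truth”, the problem of evaluating personalization quality when no fixed label set exists. Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, CoT, Agents, and Quantization are neither mentioned in the abstract nor relevant to it, so they are scored 0.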
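The arithmetic behind the score table above can be reproduced with a minimal sketch. It assumes that each keyword contributes `weight × relevance`, capped at the configured per-keyword maximum of 15, and that a paper passes when its total reaches the pass mark. The product rule and the `>=` comparison are inferred from the tables, not documented by the report, and all function names are hypothetical.

```python
# Non-zero rows of the table above as (weight, relevance) pairs;
# zero-relevance rows contribute nothing and are omitted.
rows = [
    (1.0, 10.0),  # "Large Language Models" OR "LLMs" OR "Foundation Models"
    (1.0, 8.0),   # "Instruction Tuning" OR "Alignment" OR "Value Alignment"
]

MAX_PER_KEYWORD = 15.0  # per-keyword cap from the report configuration
TOTAL_WEIGHT = 27.0     # 27 keywords, each with weight 1.0


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight * relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def pass_mark(total_weight: float) -> float:
    """Pass-mark formula from the configuration: 5.0 + 0.8 * total weight."""
    return round(5.0 + 0.8 * total_weight, 1)


total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_mark(TOTAL_WEIGHT)
passed = total >= threshold  # assumed comparison; the report only shows pass/fail marks
print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Under these assumptions the sketch reproduces the 18.0 / 26.6 figure shown for this entry.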

!!! tip deepseek-chat TL;DR

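The paper examines four fundamental challenges that arise when personalizing LLMs for individual investment decision-making, a high-stakes and temporally extended domain. Drawing on a deployed system, it describes the architectural responses and proposes open research directions.

Abstract translation (Chinese)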

个性化大语言模型系统发展迅速，但多数应用于用户偏好稳定、且缺乏客观事实依据或标准主观的领域。我们认为，个人投资者决策为LLM个性化提供了一个极具挑战性的独特领域——它暴露了当前定制范式的根本性局限。基于我们为人工智能增强型投资组合管理所构建并部署的系统，我们识别出个人投资领域揭示标准LLM定制根本局限的四个维度：（1）行为记忆复杂性，即投资者的行为模式随时间演变、自相矛盾且具有财务影响；（2）漂移情境下的逻辑一致性，即在数周或数月内保持连贯的投资逻辑，这对无状态和会话受限的架构构成压力；（3）风格与信号的张力，即系统必须同时尊重个人投资理念，并呈现可能与之矛盾的客观证据；（4）无事实依据的对齐，即由于投资结果具有随机性和延迟性，个性化质量无法对照固定标签集进行评估。我们阐述了构建该系统过程中产生的架构应对方案，并针对高风险、长周期决策领域的个性化自然语言处理提出了开放的研究方向。

Abstract

Personalized LLM systems have advanced rapidly, yet most operate in domains where user preferences are stable and ground truth is either absent or subjective. We argue that individual investor decision-making presents a uniquely challenging domain for LLM personalization - one that exposes fundamental limitations in current customization paradigms. Drawing on our system, built and deployed for AI-augmented portfolio management, we identify four axes along which individual investing exposes fundamental limitations in standard LLM customization: (1) behavioral memory complexity, where investor patterns are temporally evolving, self-contradictory, and financially consequential; (2) thesis consistency under drift, where maintaining coherent investment rationale over weeks or months strains stateless and session-bounded architectures; (3) style-signal tension, where the system must simultaneously respect personal investment philosophy and surface objective evidence that may contradict it; and (4) alignment without ground truth, where personalization quality cannot be evaluated against a fixed label set because outcomes are stochastic and delayed. We describe the architectural responses that emerged from building the system and propose open research directions for personalized NLP in high-stakes, temporally extended decision domains.

Keywords: LLM personalization, individual investor decision-making, high-stakes domains, behavioral memory complexity, thesis consistency under drift, style-signal tension, alignment without ground truth, temporally extended decision domains

23. ❌ SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection

Authors: Fenghao Song, Shaojing Yang, Xi Zhou Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04127v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

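Scoring rationale: The paper proposes the SARES-DEIM framework, whose core innovation is the SARESMoE module. The module explicitly uses a sparse gating mechanism and a Mixture-of-Experts architecture, so it is highly relevant to “Mixture of Experts” OR “MoE” OR “Sparse Models” (10 points). The paper applies computer vision to remote sensing, which can be viewed as an application of AI to science, so it has some relevance to “AI for Science” (5 points). It does not involve large language models, training methods, inference optimization, agents, or other such techniques, so all remaining keywords are scored 0.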

!!! tip deepseek-chat TL;DR

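To address speckle noise, complex backgrounds, and other challenges in ship detection from synthetic aperture radar (SAR) imagery, the paper proposes SARES-DEIM, a framework that combines a sparse Mixture-of-Experts module with a DETR-style detector. It reports better detection performance than existing methods on benchmark datasets.

Abstract translation (Chinese)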

合成孔径雷达（SAR）图像中的船舶检测面临着固有的相干斑点噪声、复杂的海岸杂波以及小尺度目标普遍存在等根本性挑战。传统检测器主要为光学图像设计，通常对SAR特有的图像退化鲁棒性有限，并且在空间下采样过程中会丢失细粒度的船舶特征。为解决这些局限性，我们提出了SARES-DEIM，这是一个基于DEtection TRansformer（DETR）范式的领域感知检测框架。我们方法的核心是SARESMoE（SAR感知专家选择混合专家）模块，该模块利用稀疏门控机制，将特征选择性地路由至专门的频率和小波专家。这种稀疏激活架构能有效滤除斑点噪声和语义杂波，同时保持较高的计算效率。此外，我们引入了空间到深度增强金字塔（SDEP）颈部结构，以保留来自浅层阶段的高分辨率空间线索，从而显著提升小目标的定位能力。在两个基准数据集上的大量实验证明了SARES-DEIM的优越性。值得注意的是，在具有挑战性的HRSID数据集上，我们的模型实现了76.4%的mAP50:95和93.8%的mAP50，性能超越了最先进的YOLO系列及专用SAR检测器。

Abstract

Ship detection in Synthetic Aperture Radar (SAR) imagery is fundamentally challenged by inherent coherent speckle noise, complex coastal clutter, and the prevalence of small-scale targets. Conventional detectors, primarily designed for optical imagery, often exhibit limited robustness against SAR-specific degradation and suffer from the loss of fine-grained ship signatures during spatial downsampling. To address these limitations, we propose SARES-DEIM, a domain-aware detection framework grounded in the DEtection TRansformer (DETR) paradigm. Central to our approach is SARESMoE (SAR-aware Expert Selection Mixture-of-Experts), a module leveraging a sparse gating mechanism to selectively route features toward specialized frequency and wavelet experts. This sparsely-activated architecture effectively filters speckle noise and semantic clutter while maintaining high computational efficiency. Furthermore, we introduce the Space-to-Depth Enhancement Pyramid (SDEP) neck to preserve high-resolution spatial cues from shallow stages, significantly improving the localization of small targets. Extensive experiments on two benchmark datasets demonstrate the superiority of SARES-DEIM. Notably, on the challenging HRSID dataset, our model achieves a mAP50:95 of 76.4% and a mAP50 of 93.8%, outperforming state-of-the-art YOLO-series and specialized SAR detectors.

Keywords: SAR ship detection, Mixture-of-Experts, Sparse gating, DETR, Speckle noise, Small target detection, SAR imagery, Computer vision

24. ❌ HI-MoE: Hierarchical Instance-Conditioned Mixture-of-Experts for Object Detection

Authors: Vadim Vashkelis, Natalia Trukhina Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04908v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	15.0/10	15.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

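Scoring rationale: The paper focuses on object detection in computer vision and proposes HI-MoE, a hierarchical instance-conditioned Mixture-of-Experts architecture. Its core innovation is applying MoE to a vision task, specifically an instance-level routing design for object detection. It is therefore highly relevant only to “Mixture of Experts” OR “MoE” OR “Sparse Models” (15 points), since MoE is the paper's core method and involves sparse routing and conditional computation. The other keywords mainly concern language models, training techniques, reasoning methods, agent systems, model compression, and similar topics, which have no direct connection to this vision paper, so they are all scored 0.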

!!! tip deepseek-chat TL;DR

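The paper addresses the mismatch between the image-level or patch-level routing of existing Mixture-of-Experts methods and the instance-level reasoning of object detection. It proposes HI-MoE, a hierarchical instance-conditioned Mixture-of-Experts architecture with two-stage routing (a scene router followed by an instance router). On COCO it improves over a dense DINO baseline, with especially strong gains on small objects.

Abstract translation (Chinese)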

专家混合（Mixture-of-Experts, MoE）架构通过仅为每个输入激活模型参数的一个子集来实现条件计算。尽管稀疏路由在语言模型中极为有效，并在视觉领域也展现出潜力，但大多数视觉MoE方法在图像或图像块级别进行操作。这种粒度与目标检测任务并不匹配，因为该任务的基本推理单元是对应候选实例的目标查询。我们提出分层实例条件化专家混合（Hierarchical Instance-Conditioned Mixture-of-Experts, HI-MoE），这是一种DETR风格的检测架构，其路由过程分为两个阶段：轻量级场景路由器首先选择一组场景一致的专家子集，随后实例路由器将每个目标查询分配给该子集中的少数专家。此设计旨在保持计算稀疏性的同时，更好地匹配检测任务中异构的、以实例为中心的结构。在当前草案中，实验主要集中于COCO数据集，并在LVIS上进行了初步的专家专业化分析。在此设置下，HI-MoE相较于稠密的DINO基线模型以及更简单的令牌级别或仅实例级别的路由变体均有所提升，在小目标检测上增益尤为显著。我们还提供了专家专业化模式的初步可视化。我们以旨在支持进一步实验验证的形式，介绍了该方法、消融研究以及当前存在的局限性。

Abstract

Mixture-of-Experts (MoE) architectures enable conditional computation by activating only a subset of model parameters for each input. Although sparse routing has been highly effective in language models and has also shown promise in vision, most vision MoE methods operate at the image or patch level. This granularity is poorly aligned with object detection, where the fundamental unit of reasoning is an object query corresponding to a candidate instance. We propose Hierarchical Instance-Conditioned Mixture-of-Experts (HI-MoE), a DETR-style detection architecture that performs routing in two stages: a lightweight scene router first selects a scene-consistent expert subset, and an instance router then assigns each object query to a small number of experts within that subset. This design aims to preserve sparse computation while better matching the heterogeneous, instance-centric structure of detection. In the current draft, experiments are concentrated on COCO with preliminary specialization analysis on LVIS. Under these settings, HI-MoE improves over a dense DINO baseline and over simpler token-level or instance-only routing variants, with especially strong gains on small objects. We also provide an initial visualization of expert specialization patterns. We present the method, ablations, and current limitations in a form intended to support further experimental validation.

Keywords: Mixture-of-Experts, Object Detection, Sparse Routing, Instance-Conditioned, Hierarchical Routing, DETR-style, COCO, Small Objects
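The two-stage routing described in the HI-MoE abstract above (a scene router picks an expert subset, then an instance router assigns each object query to a few experts inside that subset) can be illustrated with a small sketch. The abstract does not specify the routers, so everything here is an assumption for illustration: the top-k selection, the softmax gate weights, the logits, and the names `route`, `scene_k`, and `instance_k` are all hypothetical.

```python
import math


def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route(scene_logits, query_logits, scene_k, instance_k):
    """Stage 1: choose a scene-level expert subset. Stage 2: per query, choose experts within it."""
    subset = top_k(scene_logits, scene_k)
    assignments = []
    for logits in query_logits:
        local = [logits[e] for e in subset]       # experts outside the subset are never considered
        chosen = top_k(local, instance_k)
        gates = softmax([local[i] for i in chosen])
        assignments.append([(subset[i], g) for i, g in zip(chosen, gates)])
    return subset, assignments


# Six hypothetical experts, two object queries.
scene_logits = [0.1, 2.0, -1.0, 1.5, 0.3, 0.9]
query_logits = [
    [5.0, 0.2, 0.1, 1.0, 0.0, 0.4],  # expert 0 scores highest but is outside the scene subset
    [0.0, 0.1, 3.0, 0.2, 0.0, 2.0],
]
subset, assignments = route(scene_logits, query_logits, scene_k=3, instance_k=2)
print(subset)
print([[expert for expert, _ in a] for a in assignments])
```

The first query shows the effect of the hierarchy: its highest-scoring expert is excluded because the scene router did not select it, so per-query computation stays within the scene-consistent subset.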

25. ❌ Training-Free Image Editing with Visual Context Integration and Concept Alignment

Authors: Rui Song, Guo-Hua Wang, Qing-Guo Chen, Weihua Luo, Tongda Xu, Zhening Liu, Yan Wang, Zehong Lin, Jun Zhang Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04487v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

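Scoring rationale: The paper studies VicoEdit, an image editing method. It belongs to computer vision rather than to large language models (LLMs) or direct innovation in deep learning principles. Relevance to the keywords is as follows. (1) It has moderate relevance (5 points each) only to “Pre-training” (it uses a pre-trained model) and “Alignment” (it mentions concept alignment), because it builds on a pretrained text-prompted editing model and designs a posterior sampling approach guided by concept alignment. (2) All other keywords are unrelated (0 points), because the paper focuses on visual context integration for image editing and does not involve LLMs, reasoning, value alignment, compression, AI for science, or similar topics.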

!!! tip deepseek-chat TL;DR

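The paper proposes VicoEdit, a training-free and inversion-free image editing method. Through visual context integration and concept alignment, it maintains editing consistency while outperforming existing training-based methods.

Abstract translation (Chinese)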

在图像编辑中，引入上下文图像以传达用户的精确需求（如主体外观或图像风格）至关重要。现有的基于训练的视觉上下文感知编辑方法需要数据收集工作和训练成本。另一方面，免训练方案通常基于扩散反演技术建立，其在一致性和灵活性方面存在不足。本文提出VicoEdit，一种免训练且无需反演的方法，能够将视觉上下文注入预训练的文本提示编辑模型中。具体而言，VicoEdit基于视觉上下文直接将源图像转换为目标图像，从而避免了可能导致轨迹偏差的反演过程。此外，我们设计了一种基于概念对齐引导的后验采样方法，以增强编辑一致性。实验结果表明，我们的免训练方法甚至取得了优于当前最先进基于训练模型的编辑性能。

Abstract

In image editing, it is essential to incorporate a context image to convey the user’s precise requirements, such as subject appearance or image style. Existing training-based visual context-aware editing methods incur data collection effort and training cost. On the other hand, the training-free alternatives are typically established on diffusion inversion, which struggles with consistency and flexibility. In this work, we propose VicoEdit, a training-free and inversion-free method to inject the visual context into the pretrained text-prompted editing model. More specifically, VicoEdit directly transforms the source image into the target one based on the visual context, thereby eliminating the need for inversion that can lead to deviated trajectories. Moreover, we design a posterior sampling approach guided by concept alignment to enhance the editing consistency. Empirical results demonstrate that our training-free method achieves even better editing performance than the state-of-the-art training-based models.

Keywords: training-free image editing, visual context integration, concept alignment, diffusion inversion, posterior sampling, pretrained text-prompted editing model, VicoEdit

26. ❌ ECG Biometrics with ArcFace-Inception: External Validation on MIMIC and HEEDB

Author: Arjuna Scagnetto Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04485v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

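Scoring rationale: The paper studies electrocardiogram (ECG) biometrics, using a 1D Inception-v1 model with the ArcFace loss for identification and validating externally on the MIMIC and HEEDB datasets. Its topic is biomedical signal processing, using an architecture that originated in computer vision, applied to the medical domain, which counts as an application of AI in science (bioinformatics in particular). Among all keywords, only “AI for Science” OR “Bioinformatics” OR “Cheminformatics” has an indirect connection, because ECG biometrics is a biomedical AI application. The paper does not involve large models, innovation in deep learning principles, or the specific techniques of any other keyword. The last keyword is therefore scored 5 and all others 0.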

!!! tip deepseek-chat TL;DR

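The paper studies how an ECG-based biometric system behaves under external validation, large galleries, and multi-year temporal gaps. Identity information remains measurable on the MIMIC and HEEDB cohorts (Rank@1 of 0.8291 and 0.6884), but performance is strongly affected by domain heterogeneity, temporal drift, and gallery size.

Abstract translation (Chinese)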

心电生物识别技术的研究主要基于小规模队列和短时间隔数据，其在大规模底库、外部域偏移及多年时间跨度下的识别性能尚不明确。本研究采用经ArcFace损失函数训练的Inception-v1一维卷积神经网络，在包含53,079名患者的164,440份12导联心电图的内部临床数据集上进行训练，并在源自MIMIC-IV-ECG和HEEDB的大规模外部队列中评估性能。研究采用统一的闭集留一验证协议，结合Rank@K与TAR@FAR评价指标，并进行了规模分析、时间压力测试、重排序及置信度分析。在一般可比条件下，系统在ASUGI-DB上达到Rank@1为0.9506，在MIMIC-GC上为0.8291，在HEEDB-GC上为0.6884。在固定底库规模的时间压力测试中，MIMIC数据集的Rank@1随年份跨度（1至5年）从0.7853降至0.6433，HEEDB数据集从0.6864降至0.5560。HEEDB的规模分析显示识别性能随底库规模扩大单调下降，而随患者个体检查次数增加有所恢复。在HEEDB-RR数据集上，后验重排序进一步提升了检索效果：AS-norm方法将Rank@1从基线0.7765提升至0.8005。研究表明，心电身份信息在外部验证的大规模闭集条件下仍具有可测量性，但其实际应用质量显著受域异质性、纵向漂移、底库规模及二级分数处理的影响。

Abstract

ECG biometrics has been studied mainly on small cohorts and short inter-session intervals, leaving open how identification behaves under large galleries, external domain shift, and multi-year temporal gaps. We evaluated a 1D Inception-v1 model trained with ArcFace on an internal clinical corpus of 164,440 12-lead ECGs from 53,079 patients and tested it on larger cohorts derived from MIMIC-IV-ECG and HEEDB. The study used a unified closed-set leave-one-out protocol with Rank@K and TAR@FAR metrics, together with scale, temporal-stress, reranking, and confidence analyses. Under general comparability, the system achieved Rank@1 of 0.9506 on ASUGI-DB, 0.8291 on MIMIC-GC, and 0.6884 on HEEDB-GC. In the temporal stress test at constant gallery size, Rank@1 declined from 0.7853 to 0.6433 on MIMIC and from 0.6864 to 0.5560 on HEEDB from 1 to 5 years. Scale analysis on HEEDB showed monotonic degradation as gallery size increased and recovery as more examinations per patient became available. On HEEDB-RR, post-hoc reranking further improved retrieval, with AS-norm reaching Rank@1 = 0.8005 from a 0.7765 baseline. ECG identity information therefore remains measurable under externally validated large-scale closed-set conditions, but its operational quality is strongly affected by domain heterogeneity, longitudinal drift, gallery size, and second-stage score processing.

Keywords: ECG biometrics, ArcFace, Inception-v1, external validation, MIMIC, HEEDB, temporal stress test, rank retrieval
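The Rank@K metric reported in the abstract above can be illustrated with a minimal closed-set sketch: a probe counts as a hit when its true identity appears among the K most similar gallery entries. This is a simplification and not the study's leave-one-out protocol. The cosine similarity, the toy 2-D embeddings, and the function names are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_at_k(probes, gallery, k):
    """Fraction of probes whose true identity is among the k most similar gallery entries."""
    hits = 0
    for true_id, probe_emb in probes:
        ranked = sorted(gallery, key=lambda entry: cosine(probe_emb, entry[1]), reverse=True)
        top_ids = [identity for identity, _ in ranked[:k]]
        hits += true_id in top_ids
    return hits / len(probes)


# Toy gallery and probes as (identity, embedding) pairs; the values are made up.
gallery = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [0.7, 0.7])]
probes = [("A", [0.9, 0.1]), ("B", [0.6, 0.8]), ("C", [0.5, 0.5])]

print(rank_at_k(probes, gallery, 1))
print(rank_at_k(probes, gallery, 2))
```

In this toy data the probe for identity B is closer to C than to B, so it misses at K = 1 and is recovered at K = 2. The same mechanism makes Rank@1 harder as a gallery grows and more look-alike identities compete for the top position.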

27. ❌ Multimodal Backdoor Attack on VLMs for Autonomous Driving via Graffiti and Cross-Lingual Triggers

Authors: Jiancheng Wang, Lidan Liang, Yong Wang, Zengzhen Su, Haifeng Xia, Yuanting Yan, Wei Wang Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04630v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

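Scoring rationale: The paper studies backdoor attacks on vision-language models (VLMs) in autonomous driving, which is research on large models (multimodal ones in particular) in a safety-critical domain. It therefore has some relevance only to the first keyword (“Large Language Models” OR “LLMs” OR “Foundation Models”), because a VLM can be regarded as a subclass or extension of large models, although the paper focuses on vision-language models rather than text-only LLMs. Its core contribution is a security attack method (GLA). It does not involve the technical principles or application areas described by the other keywords, such as large-model training, optimization, reasoning, alignment, agents, or scientific applications.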

!!! tip deepseek-chat TL;DR

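The paper proposes GLA, a multimodal backdoor attack on vision-language models for autonomous driving that uses graffiti-based visual triggers and cross-lingual text triggers. With a 10% poisoning ratio it reaches a 90% attack success rate, and because clean-task metrics do not degrade, traditional performance-degradation-based detection struggles to identify it.

Abstract translation (Chinese)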

视觉语言模型正快速融入自动驾驶等安全关键系统，使其成为潜在后门攻击的重要攻击面。现有后门攻击主要依赖单模态、显性且易被检测的触发器，难以在自动驾驶场景中构建隐蔽且稳定的攻击通道。GLA引入了两种自然主义触发器：通过稳定扩散修复技术生成的涂鸦式视觉图案（可无缝融入城市场景），以及跨语言文本触发器（在保持语义一致性的同时引入分布偏移，以构建鲁棒的语言侧触发信号）。在DriveVLM上的实验表明，GLA仅需10%的中毒比例即可实现90%的攻击成功率与0%的误报率。更具隐蔽性的是，该后门不会削弱模型在干净任务上的性能，反而提升了BLEU-1等指标，使得传统基于性能下降的检测方法难以识别攻击。本研究揭示了自动驾驶视觉语言模型中未被充分认识的安全威胁，并为安全关键多模态系统的后门评估提供了新的攻击范式。

Abstract

Visual language model (VLM) is rapidly being integrated into safety-critical systems such as autonomous driving, making it an important attack surface for potential backdoor attacks. Existing backdoor attacks mainly rely on unimodal, explicit, and easily detectable triggers, making it difficult to construct both covert and stable attack channels in autonomous driving scenarios. GLA introduces two naturalistic triggers: graffiti-based visual patterns generated via stable diffusion inpainting, which seamlessly blend into urban scenes, and cross-language text triggers, which introduce distributional shifts while maintaining semantic consistency to build robust language-side trigger signals. Experiments on DriveVLM show that GLA requires only a 10% poisoning ratio to achieve a 90% Attack Success Rate (ASR) and a 0% False Positive Rate (FPR). More insidiously, the backdoor does not weaken the model on clean tasks, but instead improves metrics such as BLEU-1, making it difficult for traditional performance-degradation-based detection methods to identify the attack. This study reveals underestimated security threats in self-driving VLMs and provides a new attack paradigm for backdoor evaluation in safety-critical multimodal systems.

Keywords: Visual Language Model, Backdoor Attack, Autonomous Driving, Multimodal, Graffiti Trigger, Cross-Lingual Trigger, Attack Success Rate, Security Threat

28. ❌ Your Pre-trained Diffusion Model Secretly Knows Restoration

Authors: Sudarshan Rajagopalan, Vishal M. Patel Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04924v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

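Scoring rationale: The paper studies diffusion models for image and video restoration, which is computer vision rather than large language models (LLMs). It is entirely unrelated to most keywords (LLM, MoE, reasoning, alignment, and so on). It has an indirect connection to only a few: (1) “Pre-training” (it uses pre-trained diffusion models), relevance 5; (2) “Post-training” (it contrasts its approach with fine-tuning), relevance 5; (3) “PEFT” (lightweight prompt learning resembles parameter-efficient fine-tuning), relevance 5. No other keyword applies. Every weight in the table above is listed as 0.0, so these ratings still yield a total of 0.0.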

!!! tip deepseek-chat TL;DR

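The paper finds that pre-trained diffusion models inherently possess restoration behavior, which can be unlocked by learning prompt embeddings within a diffusion bridge formulation, without fine-tuning or control modules. It reports competitive performance on image and video restoration tasks.

Abstract translation (Chinese)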

预训练扩散模型在统一修复任务中实现了显著进展，提升了感知质量与泛化能力。然而，基于扩散的修复方法主要依赖微调或Control-Net风格模块，以利用预训练扩散模型的先验知识。本研究表明，这些预训练扩散模型本身具备修复能力，该能力可通过直接在文本编码器输出端学习提示嵌入来解锁。有趣的是，这种能力难以通过文本提示或文本标记嵌入优化直接获取。此外，我们发现朴素的提示学习存在不稳定性，因为使用退化图像的前向加噪过程与反向采样轨迹存在错位。为解决此问题，我们在扩散桥框架内训练提示嵌入，该框架对齐了训练与推理动态，强制构建了从含噪退化状态到清晰图像的一致性去噪路径。基于这些发现，我们在预训练的WAN视频模型和FLUX图像模型中引入了轻量级可学习提示，将其转化为高性能修复模型。大量实验表明，我们的方法在多种退化类型上实现了具有竞争力的性能与泛化能力，同时避免了模型微调和修复专用控制模块的使用。

Abstract

Pre-trained diffusion models have enabled significant advancements in All-in-One Restoration (AiOR), offering improved perceptual quality and generalization. However, diffusion-based restoration methods primarily rely on fine-tuning or Control-Net style modules to leverage the pre-trained diffusion model’s priors for AiOR. In this work, we show that these pre-trained diffusion models inherently possess restoration behavior, which can be unlocked by directly learning prompt embeddings at the output of the text encoder. Interestingly, this behavior is largely inaccessible through text prompts and text-token embedding optimization. Furthermore, we observe that naive prompt learning is unstable because the forward noising process using degraded images is misaligned with the reverse sampling trajectory. To resolve this, we train prompts within a diffusion bridge formulation that aligns training and inference dynamics, enforcing a coherent denoising path from noisy degraded states to clean images. Building on these insights, we introduce our lightweight learned prompts on the pre-trained WAN video model and FLUX image models, converting them into high-performing restoration models. Extensive experiments demonstrate that our approach achieves competitive performance and generalization across diverse degradations, while avoiding fine-tuning and restoration-specific control modules.

Keywords: diffusion models, image restoration, video restoration, prompt learning, diffusion bridge, pre-trained models, All-in-One Restoration, denoising path

29. ❌ Vero: An Open RL Recipe for General Visual Reasoning

Authors: Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04917v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
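
As a rough check on the table above, the sketch below recomputes the pass mark from the report's stated settings (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) and totals this paper's non-zero rows. The per-keyword formula is an assumption, since the report does not say how relevance is converted into a score, and all function names are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap taken from the report's scoring settings


def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED formula: relevance (0-10) scaled to the per-keyword cap, times weight."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def total_score(rows):
    """rows: iterable of (weight, relevance) pairs, one per keyword."""
    return sum(keyword_score(w, r) for w, r in rows)


# Non-zero relevance rows from the table above; every weight is listed as 0.0.
vero_rows = [(0.0, 8.0), (0.0, 5.0), (0.0, 10.0), (0.0, 8.0), (0.0, 8.0), (0.0, 5.0)]
print(round(pass_mark(27.0), 1))
print(total_score(vero_rows))
```

Under any formula that multiplies by the weight, a weight of 0.0 on every row gives a total of 0.0 regardless of relevance, which is consistent with the 0.0 / 26.6 shown for this paper.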

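Scoring rationale: The paper's core contribution is a reinforcement learning training recipe for vision-language models (VLMs), which is highly relevant to RLHF (10). It addresses visual reasoning tasks, relating it to Chain of Thought and System 2 Thinking (8 each). It uses the Qwen3-VL-8B base model, which falls under large models (8). The dataset it builds includes science tasks, giving some connection to AI for Science (5). It notes the importance of data coverage for RL scaling, giving some connection to Scaling Laws AND Data Quality (5). Other keywords such as MoE, SLMs, PEFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper presents Vero, an open reinforcement learning recipe for building general visual reasoning models. By scaling RL data and rewards across six broad task categories, it achieves state-of-the-art performance across a suite of 30 benchmarks.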

Abstract

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.7-5.5 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When trained from the same base model, Vero-600K exceeds existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.
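
To make the idea of task-routed rewards concrete, here is a minimal sketch of a router that picks a scoring function by task category, so that numeric, multiple-choice, and short-answer outputs are each checked in a suitable way. The category names, scorers, and tolerance are hypothetical; the abstract does not describe Vero's actual reward functions.

```python
from typing import Callable, Dict


def exact_match(pred: str, gold: str) -> float:
    """Case- and whitespace-insensitive string match."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def numeric_close(pred: str, gold: str, rel_tol: float = 0.01) -> float:
    """Numeric match within a relative tolerance; unparsable input scores 0."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return 0.0
    return 1.0 if abs(p - g) <= rel_tol * max(1.0, abs(g)) else 0.0


def choice_letter(pred: str, gold: str) -> float:
    """Compare only the first alphabetic character, e.g. "(B) 42" vs "b"."""
    first = next((c for c in pred if c.isalpha()), "")
    return 1.0 if first.upper() == gold.strip().upper() else 0.0


# Hypothetical category names; the paper's six categories are not listed here.
REWARD_ROUTER: Dict[str, Callable[[str, str], float]] = {
    "chart_numeric": numeric_close,
    "multiple_choice": choice_letter,
    "short_answer": exact_match,
}


def routed_reward(task: str, pred: str, gold: str) -> float:
    """Dispatch to the scorer for this task; fall back to exact match."""
    scorer = REWARD_ROUTER.get(task, exact_match)
    return scorer(pred, gold)
```

For example, `routed_reward("multiple_choice", "(B) 42", "b")` is scored by the letter comparison rather than by string equality, which would reject it.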

Keywords: visual reasoning, reinforcement learning, vision-language models, open-source, Vero, RL data scaling, task-routed rewards, broad visual reasoning

30. ❌ Early Stopping for Large Reasoning Models via Confidence Dynamics

Authors: Parsa Hosseini, Sumit Nawathe, Mahdi Salmani, Meisam Razaviyayn, Soheil Feizi Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04930v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

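Scoring rationale: The paper studies early stopping for large reasoning models and centers on the chain-of-thought reasoning process, so it is highly relevant to 'Chain of Thought' (10). The work analyzes the in-depth reasoning process, relating it to 'System 2 Thinking' (8). The proposed CoDE-Stop method aims to reduce computational cost, relating it to 'Inference Acceleration' (8). The paper evaluates on science benchmarks, relating it to 'AI for Science' (8). The method analyzes the confidence of intermediate answers, giving some connection to 'Self-Correction' and 'Explainable AI' (5 each). The paper explicitly targets large reasoning models, so it is highly relevant to 'Large Language Models' (10). Other keywords such as MoE, SLMs, training methods, alignment, RAG, and quantization have no direct relation to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses overthinking in chain-of-thought generation by large reasoning models, which raises computational cost and can degrade performance. It proposes CoDE-Stop, an early stopping method based on the confidence dynamics of intermediate answers, which achieves a better accuracy-compute tradeoff on diverse reasoning and science benchmarks and reduces total token usage by 25-50%.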

Abstract

Large reasoning models rely on long chain-of-thought generation to solve complex problems, but extended reasoning often incurs substantial computational cost and can even degrade performance due to overthinking. A key challenge is determining when the model should stop reasoning and produce the final answer. In this work, we study the confidence of intermediate answers during reasoning and observe two characteristic behaviors: correct reasoning trajectories often reach high-confidence answers early, while incorrect rollouts tend to produce long, unproductive reasoning traces and exhibit less reliable confidence dynamics. Motivated by these observations, we propose CoDE-Stop (Confidence Dynamics Early Stop), an early stopping method that leverages the dynamics of intermediate answer confidence to decide when to terminate reasoning, requiring no additional training and easily integrating into existing models. We evaluate CoDE-Stop on diverse reasoning and science benchmarks across multiple models. Compared to prior early stopping methods, it achieves a more favorable accuracy-compute tradeoff and reduces total token usage by 25-50% compared to standard full-length reasoning. In addition, we provide analyses of confidence dynamics during reasoning, offering insights into how confidence changes in both correct and incorrect trajectories.
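
The general shape of confidence-based early stopping can be sketched as follows. The rule used here (a confidence threshold held for a few consecutive checkpoints on an unchanged answer) is an illustrative assumption and not the CoDE-Stop criterion, which the abstract does not specify. The function name and parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def early_stop_index(
    trace: List[Tuple[str, float]],
    threshold: float = 0.9,
    patience: int = 2,
) -> Optional[int]:
    """Return the first checkpoint index at which reasoning could stop.

    trace holds (intermediate_answer, confidence) pairs, one per checkpoint.
    ASSUMED rule: stop once the same answer has held confidence >= threshold
    for `patience` consecutive checkpoints. Returns None if that never happens.
    """
    streak = 0
    last_answer = None
    for i, (answer, conf) in enumerate(trace):
        if conf >= threshold and (streak == 0 or answer == last_answer):
            streak += 1
        elif conf >= threshold:
            streak = 1  # confident, but the answer changed: restart the streak
        else:
            streak = 0
        last_answer = answer
        if streak >= patience:
            return i
    return None
```

A trace that reaches a stable, confident answer early returns a small index, while a trace whose answer keeps changing or whose confidence stays low returns `None` and would run to its full length.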

Keywords: Large reasoning models, Chain-of-thought, Early stopping, Confidence dynamics, Computational efficiency, Reasoning benchmarks, Science benchmarks, Accuracy-compute tradeoff

31. ❌ How AI Aggregation Affects Knowledge

Authors: Daron Acemoglu, Tianyi Lin, Asuman Ozdaglar, James Siderius Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04906v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

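Scoring rationale: The paper studies how AI aggregation affects social learning, extending the DeGroot model to analyze how an AI aggregator's training and feedback mechanisms shape population beliefs. All scoring keywords concern specific large-model techniques, training methods, reasoning mechanisms, application domains, or optimization techniques, whereas this paper focuses on theoretical modeling of how AI aggregation affects social learning at a macro level. It does not involve any specific large-model architecture, training technique, reasoning method, or scientific application, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how AI aggregation affects social learning. It finds that an AI aggregator that updates too quickly cannot robustly improve learning, that local aggregators robustly improve learning in all environments, and that replacing specialized local aggregators with a single global aggregator worsens learning in at least one dimension.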

Abstract

Artificial intelligence (AI) changes social learning when aggregated outputs become training data for future predictions. To study this, we extend the DeGroot model by introducing an AI aggregator that trains on population beliefs and feeds synthesized signals back to agents. We define the learning gap as the deviation of long-run beliefs from the efficient benchmark, allowing us to capture how AI aggregation affects learning. Our main result identifies a threshold in the speed of updating: when the aggregator updates too quickly, there is no positive-measure set of training weights that robustly improves learning across a broad class of environments, whereas such weights exist when updating is sufficiently slow. We then compare global and local architectures. Local aggregators trained on proximate or topic-specific data robustly improve learning in all environments. Consequently, replacing specialized local aggregators with a single global aggregator worsens learning in at least one dimension of the state.
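
For intuition, the sketch below simulates DeGroot-style belief averaging with an aggregator in the loop: the aggregator is updated from a weighted average of population beliefs, and agents mix their neighbors' beliefs with the aggregator's signal. The specific update equations and the parameter names `theta` (training weights), `rho` (aggregator update speed), and `beta` (agents' reliance on the aggregator) are assumptions made for illustration and are not taken from the paper.

```python
def step(beliefs, aggregator, W, theta, rho, beta):
    """One round of DeGroot averaging with an aggregator in the loop.

    ASSUMED dynamics (not taken from the paper):
      aggregator' = (1 - rho) * aggregator + rho * sum_i theta[i] * beliefs[i]
      belief_i'   = (1 - beta) * sum_j W[i][j] * beliefs[j] + beta * aggregator
    """
    trained = sum(t * b for t, b in zip(theta, beliefs))
    new_aggregator = (1 - rho) * aggregator + rho * trained
    new_beliefs = [
        (1 - beta) * sum(w * b for w, b in zip(row, beliefs)) + beta * aggregator
        for row in W
    ]
    return new_beliefs, new_aggregator


def simulate(beliefs, W, theta, rho, beta, rounds=200):
    """Iterate the assumed dynamics; W must be row-stochastic, theta sums to 1."""
    aggregator = sum(t * b for t, b in zip(theta, beliefs))
    for _ in range(rounds):
        beliefs, aggregator = step(beliefs, aggregator, W, theta, rho, beta)
    return beliefs, aggregator


W = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
final_beliefs, final_aggregator = simulate(
    [0.0, 0.5, 1.0], W, theta=[1 / 3] * 3, rho=0.1, beta=0.2
)
```

In this toy setting the long-run beliefs can be compared against a chosen benchmark to obtain a learning gap in the sense the abstract describes, and `rho` can be varied to see how update speed changes the outcome.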

Keywords: AI aggregation, social learning, DeGroot model, learning gap, global aggregator, local aggregator, training weights, belief updating

32. ❌ Analyzing Symbolic Properties for DRL Agents in Systems and Networking

Authors: Mohammad Zangooei, Jannis Weil, Amr Rizk, Mina Tahmasbi Arashloo, Raouf Boutaba Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04914v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

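Scoring rationale: The paper focuses on verification methods for deep reinforcement learning (DRL) in systems and networking control problems, in particular the analysis of symbolic properties. Its content is unrelated to the vast majority of the keywords (large models, training techniques, inference optimization, alignment, agent systems, and so on). The only point of contact is 'Mechanistic Interpretability OR Explainable AI', because the paper involves verifying the behavior of DRL agents in an interpretable way (analyzing properties such as monotonicity and robustness). This is part of the verification method rather than the core focus, hence a score of 5 (some relevance).

!!! tip deepseek-chat TL;DR

The paper studies how to analyze symbolic properties (such as monotonicity and robustness) for DRL agents in systems and networking. It proposes a generic framework, diffRL, and shows empirically that symbolic properties provide broader coverage than point properties and can uncover non-obvious counterexamples.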

Abstract

Deep reinforcement learning (DRL) has shown remarkable performance on complex control problems in systems and networking, including adaptive video streaming, wireless resource management, and congestion control. For safe deployment, however, it is critical to reason about how agents behave across the range of system states they encounter in practice. Existing verification-based methods in this domain primarily focus on point properties, defined around fixed input states, which offer limited coverage and require substantial manual effort to identify relevant input-output pairs for analysis. In this paper, we study symbolic properties, that specify expected behavior over ranges of input states, for DRL agents in systems and networking. We present a generic formulation for symbolic properties, with monotonicity and robustness as concrete examples, and show how they can be analyzed using existing DNN verification engines. Our approach encodes symbolic properties as comparisons between related executions of the same policy and decomposes them into practically tractable sub-properties. These techniques serve as practical enablers for applying existing verification tools to symbolic analysis. Using our framework, diffRL, we conduct an extensive empirical study across three DRL-based control systems, adaptive video streaming, wireless resource management, and congestion control. Through these case studies, we analyze symbolic properties over broad input ranges, examine how property satisfaction evolves during training, study the impact of model size on verifiability, and compare multiple verification backends. Our results show that symbolic properties provide substantially broader coverage than point properties and can uncover non-obvious, operationally meaningful counterexamples, while also revealing practical solver trade-offs and limitations.
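
The idea of encoding a symbolic property as a comparison between two related executions of the same policy can be illustrated with monotonicity. The sketch below is a random-sampling falsifier, not formal DNN verification and not the diffRL framework: it can find counterexamples but cannot prove the property holds. The toy policy and all names are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Policy = Callable[[Sequence[float]], float]


def find_monotonicity_counterexample(
    policy: Policy,
    bounds: List[Tuple[float, float]],
    dim: int,
    trials: int = 10_000,
    seed: int = 0,
) -> Optional[Tuple[List[float], List[float]]]:
    """Search for x, x_up that differ only in input `dim` (x_up[dim] >= x[dim])
    but where the policy output decreases. Sampling can only falsify the
    property; finding nothing is NOT a proof that it holds."""
    rng = random.Random(seed)
    hi = bounds[dim][1]
    for _ in range(trials):
        x = [rng.uniform(a, b) for a, b in bounds]
        x_up = list(x)
        x_up[dim] = rng.uniform(x[dim], hi)  # raise one input, keep the rest fixed
        if policy(x_up) < policy(x):
            return x, x_up
    return None


def toy_policy(state: Sequence[float]) -> float:
    """Hypothetical bitrate policy over (throughput, buffer level)."""
    throughput, buffer_level = state
    dip = 2.0 if 4.0 < throughput < 5.0 else 0.0  # deliberately planted non-monotone region
    return throughput + 0.5 * buffer_level - dip


counterexample = find_monotonicity_counterexample(
    toy_policy, bounds=[(0.0, 10.0), (0.0, 30.0)], dim=0
)
```

A verification engine would replace the sampling loop with a solver query over the whole input range, which is what allows a property to be established rather than merely not refuted.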

Keywords: Deep Reinforcement Learning, DRL agents, Symbolic properties, Verification, Systems and networking, Monotonicity, Robustness, diffRL

33. ❌ FileGram: Grounding Agent Personalization in File-System Behavioral Traces

Authors: Shuai Liu, Shulin Tian, Kairui Hu, Yuhao Dong, Zhe Yang, Bo Li, Jingkang Yang, Chen Change Loy, Ziwei Liu Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04901v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

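Scoring rationale: The paper mainly studies personalization for file-system AI agents. It is highly relevant to 'LLM Agents' (10), since it explicitly mentions 'coworking AI agents' and 'personalized memory-centric file-system agents'. It has some connection to 'Tool Use' (5), since file-system operations can be viewed as tool use, although the paper does not explicitly discuss APIs or function calling. It has an indirect connection to 'Large Language Models' (5), since AI agents are typically built on LLMs, although the paper does not explicitly mention LLM technical details. No other keyword is mentioned in the title or abstract, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the limited personalization of file-system AI agents caused by privacy and data constraints. It proposes the FileGram framework, which generates behavioral data by simulating workflows, establishes a diagnostic benchmark, and builds a memory architecture based on atomic actions, improving agents' personalization.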

Abstract

Coworking AI agents operating within local file systems are rapidly emerging as a paradigm in human-AI interaction; however, effective personalization remains limited by severe data constraints, as strict privacy barriers and the difficulty of jointly collecting multimodal real-world traces prevent scalable training and evaluation, and existing methods remain interaction-centric while overlooking dense behavioral traces in file-system operations; to address this gap, we propose FileGram, a comprehensive framework that grounds agent memory and personalization in file-system behavioral traces, comprising three core components: (1) FileGramEngine, a scalable persona-driven data engine that simulates realistic workflows and generates fine-grained multimodal action sequences at scale; (2) FileGramBench, a diagnostic benchmark grounded in file-system behavioral traces for evaluating memory systems on profile reconstruction, trace disentanglement, persona drift detection, and multimodal grounding; and (3) FileGramOS, a bottom-up memory architecture that builds user profiles directly from atomic actions and content deltas rather than dialogue summaries, encoding these traces into procedural, semantic, and episodic channels with query-time abstraction; extensive experiments show that FileGramBench remains challenging for state-of-the-art memory systems and that FileGramEngine and FileGramOS are effective, and by open-sourcing the framework, we hope to support future research on personalized memory-centric file-system agents.
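
As a loose illustration of building a profile bottom-up from atomic file-system actions, the sketch below routes each action into three stores named after the procedural, semantic, and episodic channels and summarizes them only at query time. The data model, the choice of what each channel holds, and every name here are assumptions; the abstract does not describe FileGramOS at this level of detail.

```python
import os
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class AtomicAction:
    timestamp: int        # e.g. seconds since session start
    op: str               # "create", "edit", "move", "delete", ...
    path: str
    delta_chars: int = 0  # size of the content change, if any


@dataclass
class TraceMemory:
    procedural: Counter = field(default_factory=Counter)  # (op, next_op) transitions
    semantic: Counter = field(default_factory=Counter)    # file-type usage
    episodic: List[AtomicAction] = field(default_factory=list)  # raw ordered log

    def ingest(self, action: AtomicAction) -> None:
        """Update all three channels from a single atomic action."""
        if self.episodic:
            self.procedural[(self.episodic[-1].op, action.op)] += 1
        ext = os.path.splitext(action.path)[1].lstrip(".").lower()
        self.semantic[ext] += 1
        self.episodic.append(action)

    def profile(self, top_k: int = 3) -> dict:
        """Query-time abstraction: summarize the channels only when asked."""
        return {
            "habits": self.procedural.most_common(top_k),
            "file_types": self.semantic.most_common(top_k),
            "n_events": len(self.episodic),
        }


memory = TraceMemory()
for t, op, path in [(0, "create", "notes.md"), (5, "edit", "notes.md"),
                    (9, "edit", "plan.md"), (12, "move", "plan.md")]:
    memory.ingest(AtomicAction(t, op, path))
```

Keeping the raw log alongside the aggregated counters means a later query can be answered from whichever channel fits, rather than from a dialogue summary fixed in advance.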

Keywords: AI agents, file-system, personalization, behavioral traces, memory architecture, user profiles, multimodal actions, privacy constraints

34. ❌ Agentic Federated Learning: The Future of Distributed Training Orchestration

Authors: Rafael O. Jarczewski, Gabriel U. Talasso, Leandro Villas, Allan M. de Souza Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04895v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

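Scoring rationale: The paper proposes the Agentic-FL framework, whose core idea is to use language model-based agents (LMagents) for autonomous orchestration of federated learning. This is highly relevant to 'LLM Agents/Autonomous Agents' (10), and the coordination of server-side and client-side agents is highly relevant to 'Multi-agent Systems/Agent Coordination' (10). The paper discusses the use of large models in federated learning, which is highly relevant to 'Large Language Models/Foundation Models' (10). The abstract mentions reliability challenges including hallucinations, giving some connection to 'Hallucination Mitigation' (5). Other keywords such as MoE, SLMs, training techniques, inference optimization, and scientific AI applications are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the inefficiency and bias in federated learning caused by client heterogeneity and system dynamics. It proposes the Agentic-FL framework, in which language model-based agents perform autonomous orchestration to improve resource utilization and reduce systemic bias.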

Abstract

Although Federated Learning (FL) promises privacy and distributed collaboration, its effectiveness in real-world scenarios is often hampered by the stochastic heterogeneity of clients and unpredictable system dynamics. Existing static optimization approaches fail to adapt to these fluctuations, resulting in resource underutilization and systemic bias. In this work, we propose a paradigm shift towards Agentic-FL, a framework where Language Model-based Agents (LMagents) assume autonomous orchestration roles. Unlike rigid protocols, we demonstrate how server-side agents can mitigate selection bias through contextual reasoning, while client-side agents act as local guardians, dynamically managing privacy budgets and adapting model complexity to hardware constraints. More than just resolving technical inefficiencies, this integration signals the evolution of FL towards decentralized ecosystems, where collaboration is negotiated autonomously, paving the way for future markets of incentive-based models and algorithmic justice. We discuss the reliability (hallucinations) and security challenges of this approach, outlining a roadmap for resilient multi-agent systems in federated environments.

Keywords: Agentic Federated Learning, Language Model-based Agents, Autonomous Orchestration, Multi-agent Systems, Federated Learning, Decentralized Ecosystems, Selection Bias Mitigation, Privacy Budget Management

35. ❌ QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

Authors: LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching, Jia Li, Ian Wu, Lewis Tunstall, Aviral Kumar Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04898v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

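Scoring rationale: QED-Nano focuses on training a small language model (4B parameters) for Olympiad-level mathematical proofs; its core involves small-model training, supervised fine-tuning (SFT), reinforcement learning (RL), and reasoning techniques. Highly relevant keywords: Small Language Models (core topic), Post-training/SFT (training method), Chain of Thought/System 2 Thinking (reasoning process), and Self-Correction (iterative refinement). Moderately relevant keywords: Large Language Models (as the distillation source), RLHF/DPO (use of RL for alignment), and AI for Science (mathematical reasoning application). Other keywords such as MoE, Scaling Laws, and PEFT are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper studies how to train a small language model (QED-Nano, 4B parameters) with supervised fine-tuning and reinforcement learning to solve Olympiad-level math proof problems. The model surpasses many larger open models in proof generation and approaches the level of proprietary models, at a much lower inference cost.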

Abstract

Proprietary AI systems have recently demonstrated impressive capabilities on complex proof-based problems, with gold-level performance reported at the 2025 International Mathematical Olympiad (IMO). However, the training pipelines behind these systems remain largely undisclosed, and their reliance on large “internal” models and scaffolds makes them expensive to run, difficult to reproduce, and hard to study or improve upon. This raises a central question: can small, open models also be trained to achieve competitive reasoning performance on difficult Olympiad-level math? In this paper, we answer this question by building QED-Nano, a 4B model post-trained for Olympiad-level proofs. Our training recipe has three stages: (1) supervised fine-tuning to imbue good proof-writing styles by distilling from DeepSeek-Math-V2, (2) reinforcement learning (RL) with rubric-based rewards, and (3) expanding RL with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles and enables stronger test-time reasoning. QED-Nano surpasses the proof-generation performance of much larger open models, including Nomos-1 and GPT-OSS-120B, and approaches the performance of proprietary models like Gemini 3 Pro, at a fraction of the inference cost. To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
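
The iterative summarize-and-refine cycle described for the reasoning cache can be sketched as a simple loop in which each attempt sees only the problem and a bounded summary of earlier attempts. The `generate` and `summarize` callables are hypothetical stand-ins for model calls, and the character-based cache limit is an assumption; the abstract does not give the actual interface.

```python
from typing import Callable, List

Generate = Callable[[str, str], str]   # (problem, cache) -> proof attempt
Summarize = Callable[[str, str], str]  # (cache, attempt) -> updated cache


def summarize_and_refine(
    problem: str,
    generate: Generate,
    summarize: Summarize,
    rounds: int = 3,
    max_cache_chars: int = 2000,
) -> List[str]:
    """Run `rounds` attempts; each sees only the problem plus a bounded cache
    summarizing earlier attempts, rather than the full earlier transcripts."""
    cache = ""
    attempts: List[str] = []
    for _ in range(rounds):
        attempt = generate(problem, cache)
        attempts.append(attempt)
        # Keep only the most recent part of the summary so the cache stays bounded.
        cache = summarize(cache, attempt)[-max_cache_chars:]
    return attempts
```

Because the cache is bounded, the context passed to each round stays roughly constant in size even as the number of refinement rounds grows.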

Keywords: small language models, mathematical reasoning, supervised fine-tuning, reinforcement learning, proof generation, Olympiad-level math, reasoning cache, model distillation

Authors: Hengrui Gu, Xiaotian Han, Yujing Bian, Kaixiong Zhou Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04894v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

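Scoring rationale: The paper's core topic is reinforcement learning with verifiable rewards (RLVR) applied to large language models (LLMs), which directly concerns LLM technology, so "Large Language Models OR LLMs OR Foundation Models" scores 10. The paper focuses on optimizing the exploration strategy within the RLVR framework and does not cover the other keywords, such as MoE, SLMs, Scaling Laws, the various training methods (pre-training, fine-tuning, alignment), inference acceleration, model compression, or AI for Science, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the restricted exploration that large language models face in reinforcement learning with verifiable rewards (RLVR). It proposes an asymmetric group-relative policy optimization framework (AsymGRPO) that decouples the entropy modulation of positive and negative rollouts to sustain informative entropy and suppress spurious noise, improving performance.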

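Abstract translation

Reinforcement learning with verifiable rewards (RLVR) has markedly improved the reasoning ability of large language models (LLMs). It nevertheless faces a fundamental limitation known as "restricted exploration": the policy quickly converges to a narrow set of solutions. Entropy regularization is a common way to sustain exploration, but for LLMs it is often unreliable, being highly sensitive to hyperparameters and delivering only marginal gains. Motivated by these inefficiencies, we propose rethinking the relationship between policy entropy and exploration. By deriving a parametric form of group-relative advantage estimation and analyzing entropy dynamics, we conceptually decompose policy entropy into "informative entropy", which preserves diverse solution paths, and "spurious entropy", which erodes reasoning patterns. Our analysis shows that, unlike blind maximization, effective exploration requires "entropy refinement", a mechanism implicit in group-relative advantage estimation that sustains informative entropy on positive rollouts while suppressing spurious entropy on negative ones. Building on this insight, we propose AsymGRPO, an exploratory framework that explicitly decouples the modulation of positive and negative rollouts, allowing independent control over the preservation of informative entropy and the suppression of spurious noise. Extensive experiments show that AsymGRPO outperforms strong baselines and has the potential to work synergistically with existing entropy regularization methods.

Abstract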

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models (LLMs). However, it faces a fundamental limitation termed restricted exploration, where the policy rapidly converges to a narrow set of solutions. While entropy regularization is a popular approach used to sustain exploration, it often proves unreliable for LLMs, suffering from high hyperparameter sensitivity and yielding only marginal performance gains. Motivated by these inefficiencies, we propose to rethink the relationship between policy entropy and exploration. By deriving a parametric formulation of group-relative advantage estimation and analyzing entropy dynamics, we conceptually decompose policy entropy into informative entropy, which preserves diverse solution paths, and spurious entropy, which erodes reasoning patterns. Our analysis reveals that, in contrast to blind maximization, effective exploration requires entropy refinement, a mechanism implicitly embedded in group-relative advantage estimation that sustains informative entropy on positive rollouts while suppressing spurious entropy on negative ones. Guided by this insight, we propose AsymGRPO, an exploratory framework that explicitly decouples the modulation of positive and negative rollouts. This allows for independent control over the preservation of informative entropy and the suppression of spurious noise. Extensive experiments demonstrate that AsymGRPO achieves superior performance compared to strong baselines and exhibits the potential to synergize with existing entropy regularization methods.

Keywords: Reinforcement Learning with Verifiable Rewards, Large Language Models, Exploration, Entropy Regularization, Group-relative Advantage Estimation, AsymGRPO, Policy Entropy, Informative Entropy
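
The abstract does not spell out AsymGRPO's update rule, so the following is only a rough sketch of the general idea. It standardizes rewards within a group of rollouts (the usual group-relative advantage) and then scales positive and negative rollouts separately. The names `pos_scale` and `neg_scale` are hypothetical knobs introduced for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def asymmetric_advantages(rewards, pos_scale=1.0, neg_scale=1.0):
    """Hypothetical asymmetric variant: scale positive and negative rollouts separately."""
    advantages = group_relative_advantages(rewards)
    return [a * (pos_scale if a > 0 else neg_scale) for a in advantages]


# Four rollouts for one prompt; reward 1.0 means the verifier accepted the answer.
rewards = [1.0, 0.0, 0.0, 1.0]
symmetric = group_relative_advantages(rewards)
asymmetric = asymmetric_advantages(rewards, pos_scale=1.0, neg_scale=0.5)
print(symmetric)
print(asymmetric)
```

With `neg_scale` below `pos_scale`, failed rollouts push the policy less strongly than successful ones pull it. Whether this matches the paper's modulation is an assumption.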

37. ❌ Learning, Potential, and Retention: An Approach for Evaluating Adaptive AI-Enabled Medical Devices

Authors: Alexis Burgon, Berkman Sahiner, Nicholas A Petrick, Gene Pennello, Ravi K Samala Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04878v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

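Scoring rationale: The paper focuses on methodology for evaluating adaptive AI models in medical devices and proposes three measurements: learning, potential, and retention. Although it is an application of AI to science (medicine), its content revolves entirely around the evaluation framework, performance measurement, and regulatory science, and it involves no large-model technical principles, architectural innovations, or specific AI techniques (such as LLMs, MoE, or fine-tuning methods). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper concerns AI applied to medicine (a field adjacent to bioinformatics); this is the application setting rather than the core contribution, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and receive 0 points.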
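
To make the arithmetic behind the table concrete, here is a minimal sketch. The threshold formula (5.0 + 0.8 × total weight) comes from the report's configuration. The per-paper aggregation as a sum of weight × relevance is an assumption that happens to match the rows shown, and the function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Threshold from the report's configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Assumed aggregation: sum of weight * relevance over (weight, relevance) rows."""
    return sum(weight * relevance for weight, relevance in rows)


# 26 keywords at relevance 0 plus "AI for Science" at relevance 5,
# all listed with weight 0.0, mirroring the table for this paper.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
threshold = passing_score(total_weight=27.0)
score = paper_score(rows)
print(f"score={score:.1f} threshold={threshold:.1f} passed={score >= threshold}")
```

Under this assumption a relevance of 5 contributes nothing when the listed weight is 0.0, which is consistent with the total of 0.0 reported for this paper.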

!!! tip deepseek-chat TL;DR

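The paper addresses the challenge of evaluating the performance of adaptive AI models in medical devices as they are iteratively updated. It proposes a new approach built on three measurements (learning, potential, and retention) and uses simulated case studies to show that the approach helps separate the effects of model adaptation from those of a changing environment.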

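Abstract translation

This work addresses a challenge in evaluating adaptive artificial intelligence models for medical devices: iterative updates to both the model and the evaluation dataset complicate performance assessment. We propose a new approach comprising three complementary measurements, namely learning (the model's improvement on current data), potential (dataset-driven performance change), and retention (knowledge preserved across modification steps), to distinguish performance changes caused by model adaptation from those caused by a dynamic environment. Case studies with simulated population shifts demonstrate the approach's utility: gradual data transitions allow stable learning and retention, whereas rapid shifts expose a trade-off between model plasticity and stability. These measurements offer practical insight for regulatory science and enable rigorous assessment of the safety and effectiveness of adaptive AI systems over successive modifications.

Abstract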

This work addresses challenges in evaluating adaptive artificial intelligence (AI) models for medical devices, where iterative updates to both models and evaluation datasets complicate performance assessment. We introduce a novel approach with three complementary measurements: learning (model improvement on current data), potential (dataset-driven performance shifts), and retention (knowledge preservation across modification steps), to disentangle performance changes caused by model adaptations versus dynamic environments. Case studies using simulated population shifts demonstrate the approach’s utility: gradual transitions enable stable learning and retention, while rapid shifts reveal trade-offs between plasticity and stability. These measurements provide practical insights for regulatory science, enabling rigorous assessment of the safety and effectiveness of adaptive AI systems over sequential modifications.

Keywords: adaptive AI, medical devices, evaluation framework, performance assessment, regulatory science, model adaptation, population shifts, safety and effectiveness
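
The abstract names the three measurements but gives no formulas. One possible reading, offered purely as an assumption, is to compare a grid of scores `perf[m][d]` (model version `m` evaluated on dataset version `d`). The function name and the difference-based definitions below are hypothetical and may differ from the paper's.

```python
def lpr_measurements(perf, step):
    """perf[m][d] is the score of model version m on dataset version d.

    Assumed definitions (not taken from the paper), for step >= 1:
      learning  = perf[step][step]   - perf[step-1][step]
      potential = perf[step-1][step] - perf[step-1][step-1]
      retention = perf[step][step-1] - perf[step-1][step-1]
    """
    prev = step - 1
    learning = perf[step][step] - perf[prev][step]
    potential = perf[prev][step] - perf[prev][prev]
    retention = perf[step][prev] - perf[prev][prev]
    return learning, potential, retention


# Made-up AUC values for two model versions evaluated on two dataset versions.
perf = [
    [0.90, 0.80],  # model 0 on dataset 0, dataset 1
    [0.88, 0.86],  # model 1 on dataset 0, dataset 1
]
learning, potential, retention = lpr_measurements(perf, step=1)
print(round(learning, 2), round(potential, 2), round(retention, 2))
```

Under these assumed definitions, learning plus potential equals the overall change from `perf[0][0]` to `perf[1][1]`, which is one way to separate model-driven from dataset-driven change.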

38. ❌ Incompleteness of AI Safety Verification via Kolmogorov Complexity

Authors: Munawar Hasan Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04876v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

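Scoring rationale: The paper studies the theoretical limits of AI safety verification using Kolmogorov complexity, which places it in foundational theory for AI safety and formal verification. All of the keywords target specific large-model or deep-learning techniques, applications, or optimization methods (training, inference, alignment, applications, and so on), whereas this paper touches none of them and discusses only general theoretical limits on verifying AI systems, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proves a fundamental information-theoretic limit on AI safety verification: for any fixed, sound, computably enumerable verifier there is a complexity threshold beyond which it cannot certify all policy-compliant instances, so no finite formal verifier can guarantee the safety of all high-complexity instances.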

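Abstract translation

Ensuring that artificial intelligence (AI) systems satisfy formal safety and policy constraints is a core challenge in safety-critical domains. Although the limits of verification are often attributed to combinatorial complexity and model expressiveness, we show that they stem from intrinsic information-theoretic limits. We formalize policy compliance as a verification problem over encoded system behaviors and analyze it with Kolmogorov complexity. We prove an incompleteness result: for any fixed, sound, computably enumerable verifier, there is a threshold such that truly policy-compliant instances whose complexity exceeds it cannot be certified. Consequently, no finite formal verifier can certify all policy-compliant instances of arbitrarily high complexity. This reveals a fundamental limitation of AI safety verification that is independent of computational resources, and it motivates proof-carrying approaches that provide instance-level correctness guarantees.

Abstract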

Ensuring that artificial intelligence (AI) systems satisfy formal safety and policy constraints is a central challenge in safety-critical domains. While limitations of verification are often attributed to combinatorial complexity and model expressiveness, we show that they arise from intrinsic information-theoretic limits. We formalize policy compliance as a verification problem over encoded system behaviors and analyze it using Kolmogorov complexity. We prove an incompleteness result: for any fixed sound computably enumerable verifier, there exists a threshold beyond which true policy-compliant instances cannot be certified once their complexity exceeds that threshold. Consequently, no finite formal verifier can certify all policy-compliant instances of arbitrarily high complexity. This reveals a fundamental limitation of AI safety verification independent of computational resources, and motivates proof-carrying approaches that provide instance-level correctness guarantees.

Keywords: AI safety verification, Kolmogorov complexity, policy compliance, formal verification, information-theoretic limits, incompleteness theorem, proof-carrying approaches

39. ❌ Muon Dynamics as a Spectral Wasserstein Flow

Authors: Gabriel Peyré Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04891v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

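Scoring rationale: The paper "Muon Dynamics as a Spectral Wasserstein Flow" studies the mathematical theory of gradient normalization in deep learning, in particular the geometric and optimization properties of spectral normalization in a probability-measure framework, and belongs to pure mathematics and theoretical machine learning. The scoring keywords all relate directly to large models, deep-learning technical principles, or scientific applications of AI, and the paper covers none of these topics: it mentions no language models, training methods, inference techniques, alignment, compression, agent systems, or specific scientific applications. All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the mathematical theory of spectral normalization rules in deep learning, introduces a family of Spectral Wasserstein distances, proves that for monotone norms they are metrics equivalent to the classical Wasserstein distance W2 in fixed dimension, and interprets the associated normalized continuity equation as a Spectral Wasserstein gradient flow.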

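Abstract translation

Gradient normalization is central to deep-learning optimization because it stabilizes training and reduces sensitivity to changes of scale. In deep architectures, parameters naturally group into matrices or blocks, so spectral normalization is usually more faithful than coordinatewise Euclidean normalization; Muon is the main motivating example of this paper. More broadly, we study a family of spectral normalization rules, ranging from ordinary gradient descent to Muon and intermediate Schatten-type schemes, in a mean-field regime where parameters are modeled as probability measures. We introduce a family of Spectral Wasserstein distances indexed by a norm γ on positive semidefinite matrices: the trace norm recovers the classical quadratic Wasserstein distance, the operator norm recovers the Muon geometry, and intermediate Schatten norms interpolate between the two. We develop the static Kantorovich formulation, prove comparison bounds with W2, derive a max-min representation, and obtain a conditional Brenier theorem. For Gaussian marginals, the problem reduces to constrained optimization over covariance matrices, extending the Bures formula and giving a closed-form solution for commuting covariances in the Schatten family. For monotone norms (including all Schatten cases), we prove the equivalence of the static and dynamic Benamou-Brenier formulations, deduce that the resulting transport cost is a genuine metric equivalent to W2 in fixed dimension, and show that the induced Gaussian covariance cost is also a metric. We then interpret the associated normalized continuity equation as a Spectral Wasserstein gradient flow, identify its exact finite-particle counterpart as a normalized matrix flow, obtain first geodesic-convexity results, and show how positively homogeneous mean-field models induce a spectral unbalanced transport on the sphere.

Abstract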

Gradient normalization is central in deep-learning optimization because it stabilizes training and reduces sensitivity to scale. For deep architectures, parameters are naturally grouped into matrices or blocks, so spectral normalizations are often more faithful than coordinatewise Euclidean ones; Muon is the main motivating example of this paper. More broadly, we study a family of spectral normalization rules, ranging from ordinary gradient descent to Muon and intermediate Schatten-type schemes, in a mean-field regime where parameters are modeled by probability measures. We introduce a family of Spectral Wasserstein distances indexed by a norm gamma on positive semidefinite matrices. The trace norm recovers the classical quadratic Wasserstein distance, the operator norm recovers the Muon geometry, and intermediate Schatten norms interpolate between them. We develop the static Kantorovich formulation, prove comparison bounds with W2, derive a max-min representation, and obtain a conditional Brenier theorem. For Gaussian marginals, the problem reduces to a constrained optimization on covariance matrices, extending the Bures formula and yielding a closed form for commuting covariances in the Schatten family. For monotone norms, including all Schatten cases, we prove the equivalence between the static and dynamic Benamou-Brenier formulations, deduce that the resulting transport cost is a genuine metric equivalent to W2 in fixed dimension, and show that the induced Gaussian covariance cost is also a metric. We then interpret the associated normalized continuity equation as a Spectral Wasserstein gradient flow, identify its exact finite-particle counterpart as a normalized matrix flow, obtain first geodesic-convexity results, and show how positively homogeneous mean-field models induce a spectral unbalanced transport on the sphere.

Keywords: gradient normalization, spectral normalization, Wasserstein distance, mean-field regime, probability measures, gradient flow, covariance matrices, Schatten norms
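
For readers unfamiliar with what spectral normalization of a gradient matrix means at the finite-dimensional level, here is a minimal sketch. It replaces a gradient with its orthogonal polar factor (all singular values set to 1) using the classical cubic Newton-Schulz iteration. This is a textbook iteration swapped in for illustration, not the paper's mean-field construction and not necessarily the exact polynomial used in practical Muon implementations. The helper names are hypothetical, and the input is assumed to be a nonzero, full-rank matrix.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(grad, steps=30):
    """Approximate the polar factor U V^T of grad with the cubic Newton-Schulz iteration."""
    fro = math.sqrt(sum(x * x for row in grad for x in row))
    x = [[v / fro for v in row] for row in grad]  # singular values are now at most 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which drives it toward 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


grad = [[3.0, 0.0], [0.0, 1.0]]
direction = orthogonalize(grad)
print(direction)
```

The example gradient has singular values 3 and 1; after orthogonalization both directions receive the same step size, which is the qualitative difference from dividing by a Euclidean (Frobenius) norm.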

40. ❌ DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

Authors: Ke Li, Maoliang Li, Jialiang Chen, Jiayu Chen, Zihao Zheng, Shaoqi Wang, Xiang Chen Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04875v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

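Scoring rationale: DIRECT targets automated video mashup creation and proposes a hierarchical multi-agent planning framework (Screenwriter, Director, Editor) to solve a multimodal coherence problem. The work belongs to computer vision and multimedia processing and is unrelated to the great majority of the keywords (large-model technical principles, training methods, inference optimization, alignment techniques, and so on). The only relevant keyword is "Multi-agent Systems OR Agent Coordination", because the paper's core contribution is planning and coordination through a hierarchical multi-agent system; it therefore receives 10 points (highly relevant, core content). The other keywords are neither mentioned in the title or abstract nor reflected in the techniques used, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses disjointed sequences in automated video mashup creation that result from insufficient cross-level multimodal coordination. It proposes DIRECT, a hierarchical multi-agent planning framework that simulates a professional production pipeline and significantly improves visual continuity and auditory alignment.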

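Abstract translation

Video mashup creation is a complex video-editing paradigm that recombines existing footage into engaging audio-visual experiences, which requires careful orchestration across semantic, visual, and auditory dimensions and across multiple levels. Existing automated editing frameworks, however, tend to neglect the cross-level multimodal orchestration needed for professional-grade fluidity, producing disjointed sequences with abrupt visual transitions and misaligned music. To address this, we formalize video mashup creation as a Multimodal Coherency Satisfaction Problem and propose the DIRECT framework. Modeled on a professional production pipeline, it uses a hierarchical multi-agent architecture that breaks the challenge into three cascaded levels: the Screenwriter handles source-aware global structural anchoring, the Director instantiates adaptive editing intent and guidance, and the Editor performs intent-guided shot-sequence editing with fine-grained optimization. We also introduce Mashup-Bench, a comprehensive benchmark with dedicated metrics for visual continuity and auditory alignment. Extensive experiments show that DIRECT significantly outperforms state-of-the-art baselines on both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT

Abstract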

Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascade levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT

Keywords: Video Mashup Creation, Hierarchical Multi-Agent Planning, Multimodal Coherency, Intent-Guided Editing, Visual Continuity, Auditory Alignment, Automated Video Editing, Mashup-Bench

41. ❌ Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN’s Attention Mechanisms

Authors: James Hu, Mahdi Ghelichi Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04868v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

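Scoring rationale: The paper studies the robustness of TabPFN, a tabular foundation model, under noisy conditions. Its core involves in-context learning (ICL) and the foundation-model concept, so it is highly relevant to "In-context Learning OR Many-shot Learning" (10 points) and relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points, since TabPFN is a tabular foundation model and falls within the foundation-model category). It analyzes data-quality problems such as irrelevant features and label noise, giving some relevance to "Scaling Laws AND Data Quality" (5 points). It performs interpretability analysis with attention mechanisms and SHAP, which relates to "Mechanistic Interpretability OR Explainable AI" (5 points). Its industrial application domains (such as finance and healthcare) count as AI applied to science and industry and relate to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points). The other keywords (MoE, SFT, RAG, and so on) are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how robust the tabular foundation model TabPFN is to data noise (irrelevant features, correlated feature groups, and label noise). Its empirical analysis finds that TabPFN is highly resilient in both predictive performance and internal attention behavior.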

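Abstract translation

Tabular foundation models (TFMs) such as TabPFN (Tabular Prior-Data Fitted Network) aim to generalize across heterogeneous tabular datasets through in-context learning (ICL). They predict in a single forward pass conditioned on labeled examples, with no dataset-specific parameter updates. This paradigm is especially attractive in industrial domains where tabular prediction is pervasive, such as finance and healthcare. In these settings, retraining a bespoke model for every new table can be costly or infeasible, and data-quality problems such as irrelevant predictors, correlated feature groups, and label noise are common. This paper provides strong empirical evidence that TabPFN is highly robust under these sub-optimal conditions. We study TabPFN and its attention mechanisms on binary classification problems, using controlled synthetic perturbations that vary (i) dataset width, by injecting random uncorrelated features and introducing nonlinearly correlated features; (ii) dataset size, by increasing the number of training rows; and (iii) label quality, by raising the fraction of mislabeled targets. Beyond predictive performance, we analyze internal signals, including attention concentration and attention-based feature-ranking metrics. Across these parametric tests TabPFN shows remarkable resilience: ROC-AUC stays high, attention remains structured and sharp, and informative features rank highly under attention-based metrics. Qualitative visualizations with attention heatmaps, feature-token embeddings, and SHAP plots further support a consistent pattern across layers, in which TabPFN increasingly concentrates on useful features while separating their signal from noise. Together, these findings indicate that TabPFN is a robust tabular foundation model that maintains both predictive performance and coherent internal behavior across a range of data-imperfection scenarios.

Abstract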

Tabular foundation models (TFMs) such as TabPFN (Tabular Prior-Data Fitted Network) are designed to generalize across heterogeneous tabular datasets through in-context learning (ICL). They perform prediction in a single forward pass conditioned on labeled examples without dataset-specific parameter updates. This paradigm is particularly attractive in industrial domains (e.g., finance and healthcare) where tabular prediction is pervasive. Retraining a bespoke model for each new table can be costly or infeasible in these settings, while data quality issues such as irrelevant predictors, correlated feature groups, and label noise are common. In this paper, we provide strong empirical evidence that TabPFN is highly robust under these sub-optimal conditions. We study TabPFN and its attention mechanisms for binary classification problems with controlled synthetic perturbations that vary: (i) dataset width by injecting random uncorrelated features and by introducing nonlinearly correlated features, (ii) dataset size by increasing the number of training rows, and (iii) label quality by increasing the fraction of mislabeled targets. Beyond predictive performance, we analyze internal signals including attention concentration and attention-based feature ranking metrics. Across these parametric tests, TabPFN is remarkably resilient: ROC-AUC remains high, attention stays structured and sharp, and informative features are highly ranked by attention-based metrics. Qualitative visualizations with attention heatmaps, feature-token embeddings, and SHAP plots further support a consistent pattern across layers in which TabPFN increasingly concentrates on useful features while separating their signals from noise. Together, these findings suggest that TabPFN is a robust TFM capable of maintaining both predictive performance and coherent internal behavior under various scenarios of data imperfections.

Keywords: tabular foundation models, TabPFN, in-context learning, robustness analysis, attention mechanisms, data quality, noise immunity, tabular prediction
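
Two of the perturbations described in the abstract, injecting uncorrelated features and mislabeling a fraction of targets, can be sketched with the standard library alone. This illustrates the perturbation step only; it does not call TabPFN, and the function names, noise distribution, and toy data are assumptions made for the example.

```python
import random


def add_noise_features(rows, n_noise, rng):
    """Append n_noise standard-normal columns that carry no signal."""
    return [row + [rng.gauss(0.0, 1.0) for _ in range(n_noise)] for row in rows]


def flip_labels(labels, fraction, rng):
    """Flip a fixed fraction of binary labels, chosen uniformly at random."""
    n_flip = int(round(fraction * len(labels)))
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        flipped[i] = 1 - flipped[i]
    return flipped


rng = random.Random(0)
# Toy data: one informative feature; the label is 1 when the feature is positive.
features = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
labels = [1 if row[0] > 0 else 0 for row in features]

wide = add_noise_features(features, n_noise=20, rng=rng)
noisy = flip_labels(labels, fraction=0.1, rng=rng)
n_changed = sum(a != b for a, b in zip(labels, noisy))
print(len(wide[0]), n_changed)
```

Sweeping `n_noise` and `fraction` and re-evaluating a classifier at each setting gives the kind of parametric robustness curve the abstract describes.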

Authors: Yude Zou, Junji Gong, Xing Gao, Zixuan Li, Tianxing Chen, Guanjie Zheng Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04843v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

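Scoring rationale: The paper focuses on human-object-scene interaction generation, a computer vision and graphics task, and uses techniques such as a consistency model and a dynamic perception strategy. Although the research background notes that applications of large models in different domains may be scored at discretion, the paper's content involves none of the keywords on large language models, deep-learning technical principles, or AI for Science. All scoring keywords are unrelated to the paper's topic, so every keyword receives 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes InfBaGel, a coarse-to-fine, instruction-conditioned interaction generation framework that uses a dynamic perception strategy and bump-aware guidance to tackle data scarcity and physical artifacts in human-object-scene interaction generation, and it achieves state-of-the-art performance on HOSI and HOI generation tasks.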

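Abstract translation

Human-object-scene interaction (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI), HOSI generation requires reasoning about dynamic object-scene changes, yet it suffers from limited annotated data. To address these challenges, we propose a coarse-to-fine, instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. Specifically, we adopt a dynamic perception strategy that uses the trajectories produced by the preceding refinement to update the scene context and to condition the subsequent refinement at each denoising step of the consistency model, yielding coherent interactions. To further reduce physical artifacts, we introduce a bump-aware guidance mechanism that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and trains jointly with high-fidelity HSI data, so that interaction patterns are learned while realistic scene awareness is preserved. Extensive experiments show that our method achieves state-of-the-art performance on both HOSI and HOI generation and generalizes strongly to unseen scenes. Project page: https://yudezou.github.io/InfBaGel-page/

Abstract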

Human-object-scene interaction (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI), HOSI generation requires reasoning over dynamic object-scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context and condition subsequent refinement at each denoising step of the consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce a bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes. Project page: https://yudezou.github.io/InfBaGel-page/

Keywords: Human-object-scene interaction generation, Consistency model, Dynamic perception, Iterative refinement, Bump-aware guidance, Hybrid training strategy, Embodied AI, Real-time generation

43. ❌ Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not

Authors: Sercan Karakaş Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04825v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

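Scoring rationale: The paper's core topic is how large language models perform on a language-understanding task, specifically whether LLMs integrate world knowledge with syntactic structure to resolve ambiguity the way humans do, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not involve the specific methods, application domains, or techniques represented by the other keywords, such as MoE, quantization, inference acceleration, or scientific applications of AI, so those keywords receive 0 points.

!!! tip deepseek-chat TL;DR

Using attachment ambiguities in Turkish prenominal relative clauses, the study finds that humans make effective use of event plausibility when resolving ambiguity, whereas the plausibility-driven preferences of large language models (LLMs) are weak, unstable, or reversed. This indicates that LLMs differ from humans in how they integrate world knowledge with syntactic structure.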

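Abstract translation

Large language models perform strongly on many language tasks, but it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution. This study tests the question with attachment ambiguities in Turkish prenominal relative clauses, where the same surface string permits either a high-attachment or a low-attachment parse. We construct ambiguous items that hold the syntactic configuration fixed and keep both parses pragmatically possible, while graded event plausibility selectively favors high or low attachment. The contrasts were validated with independent norming ratings. In a speeded forced-choice comprehension experiment, human participants show a large plausibility effect in the correct direction. We then evaluate Turkish and multilingual large language models in a parallel preference-based setup that compares matched high-attachment and low-attachment continuations by mean per-token log-probability. Across all models, plausibility-driven preference shifts are weak, unstable, or reversed. The results suggest that, in the models tested, plausibility information does not guide attachment preferences as reliably as it does in human judgments; they also highlight Turkish relative-clause attachment ambiguity as a useful cross-linguistic diagnostic beyond broad benchmarks.

Abstract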

Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution. We test this question in Turkish prenominal relative-clause attachment ambiguities, where the same surface string permits high attachment (HA) or low attachment (LA). We construct ambiguous items that keep the syntactic configuration fixed and ensure both parses remain pragmatically possible, while graded event plausibility selectively favors High Attachment vs. Low Attachment. The contrasts are validated with independent norming ratings. In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect. We then evaluate Turkish and multilingual LLMs in a parallel preference-based setup that compares matched HA/LA continuations via mean per-token log-probability. Across models, plausibility-driven shifts are weak, unstable, or reversed. The results suggest that, in the tested models, plausibility information does not guide attachment preferences as reliably as it does in human judgments, and they highlight Turkish RC attachment as a useful cross-linguistic diagnostic beyond broad benchmarks.

关键词: Large Language Models, ambiguity resolution, syntactic structure, world knowledge, Turkish relative-clause attachment, plausibility effect, human vs. LLM comparison, cross-linguistic diagnostic
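
The abstract above compares matched HA/LA continuations via mean per-token log-probability. A minimal sketch of that comparison follows. The function names, the `plausibility_shift` measure, and the numeric log-probabilities are all hypothetical illustrations and are not taken from the paper.

```python
from statistics import mean


def mean_logprob(token_logprobs):
    """Average log-probability per token of one continuation."""
    if not token_logprobs:
        raise ValueError("continuation must contain at least one token")
    return mean(token_logprobs)


def attachment_preference(ha_logprobs, la_logprobs):
    """Positive values favor high attachment (HA), negative values favor LA."""
    return mean_logprob(ha_logprobs) - mean_logprob(la_logprobs)


def plausibility_shift(ha_biased_item, la_biased_item):
    """Change in HA preference between an HA-plausible and an LA-plausible item.

    Each item is a (ha_logprobs, la_logprobs) pair. This measure is an
    assumption of the sketch: a model that tracks plausibility should give
    a clearly positive shift.
    """
    return attachment_preference(*ha_biased_item) - attachment_preference(*la_biased_item)


# Hypothetical per-token log-probabilities, invented for illustration only.
ha_biased = ([-1.0, -2.0, -3.0], [-2.0, -3.0, -4.0, -3.0])
la_biased = ([-2.0, -2.0, -2.0], [-1.0, -2.0, -1.0, -2.0])
shift = plausibility_shift(ha_biased, la_biased)
```

Averaging per token rather than summing keeps continuations of different lengths comparable, which matters here because the HA and LA continuations need not contain the same number of tokens.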

44. ❌ LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection

Authors: Cheng Xu, Changhong Jin, Yingjie Niu, Nan Yan, Yuke Mei, Shuhao Guan, Liming Chen, M-Tahar Kechadi Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04815v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

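Scoring rationale: The paper's core subject is evaluating the complex reasoning ability of LLMs in fake news detection, so it is highly relevant to 'Large Language Models' (10 points). The paper tests Mixture-of-Experts models (such as Qwen3-235B-A22B), which makes it relevant to 'Mixture of Experts' (8 points). It focuses on evaluating LLM reasoning, in particular the complex reasoning represented by 'Chain of Thought' and 'System 2 Thinking' (10 points each). Fake news detection directly involves factuality verification, so it is highly relevant to 'Hallucination Mitigation' (10 points). The other keywords, covering model training methods, optimization techniques, and specific application areas, are unrelated to the paper's content (0 points).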

!!! tip deepseek-chat TL;DR

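To address the static nature of existing fake news detection benchmarks, the paper proposes LiveFact, a dynamic, time-aware benchmark. Evaluating 22 LLMs, it finds that open-source MoE models now match or outperform state-of-the-art proprietary systems, and it reveals a 'reasoning gap' that traditional benchmarks overlook.

Abstract (Chinese translation)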

大型语言模型（LLMs）的快速发展已将虚假新闻检测与事实核查任务从简单分类转变为复杂推理。然而，评估框架尚未同步跟进。现有基准测试多为静态设置，易受基准数据污染（BDC）影响，且难以有效评估时间不确定性下的推理能力。为此，我们提出了LiveFact——一个持续更新的基准测试，旨在模拟虚假信息检测中真实世界的“战争迷雾”。LiveFact采用动态的时序证据集，评估模型基于演化中、不完整信息进行推理的能力，而非依赖记忆的知识。我们提出双模式评估方案：用于最终验证的“分类模式”和基于证据推理的“推断模式”，并包含专门监测BDC的组件。对22个LLM的测试表明，开源混合专家模型（如Qwen3-235B-A22B）已达到或超越专有前沿系统的性能。更重要的是，我们的分析发现显著的“推理鸿沟”：优秀模型在早期数据片段中能通过认知谦逊识别不可验证的主张——这一维度被传统静态基准所忽视。LiveFact为评估稳健、具有时间感知能力的人工智能验证系统设立了可持续的新标准。

Abstract

The rapid development of Large Language Models (LLMs) has transformed fake news detection and fact-checking tasks from simple classification to complex reasoning. However, evaluation frameworks have not kept pace. Current benchmarks are static, making them vulnerable to benchmark data contamination (BDC) and ineffective at assessing reasoning under temporal uncertainty. To address this, we introduce LiveFact, a continuously updated benchmark that simulates the real-world “fog of war” in misinformation detection. LiveFact uses dynamic, temporal evidence sets to evaluate models on their ability to reason with evolving, incomplete information rather than on memorized knowledge. We propose a dual-mode evaluation: Classification Mode for final verification and Inference Mode for evidence-based reasoning, along with a component to monitor BDC explicitly. Tests with 22 LLMs show that open-source Mixture-of-Experts models, such as Qwen3-235B-A22B, now match or outperform proprietary state-of-the-art systems. More importantly, our analysis finds a significant “reasoning gap.” Capable models exhibit epistemic humility by recognizing unverifiable claims in early data slices, an aspect traditional static benchmarks overlook. LiveFact sets a sustainable standard for evaluating robust, temporally aware AI verification.

Keywords: Large Language Models, fake news detection, benchmark evaluation, temporal reasoning, Mixture-of-Experts, fact-checking, reasoning gap, benchmark data contamination

45. ❌ Selecting Decision-Relevant Concepts in Reinforcement Learning

Authors: Naveen Raman, Stephanie Milani, Fei Fang Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04808v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

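Scoring rationale: The paper studies algorithms for selecting decision-relevant concepts in reinforcement learning. It belongs to the reinforcement learning field and is unrelated to most of the large-model keywords (such as LLM, MoE, SFT, and RAG). It has some relevance only to 'Mechanistic Interpretability OR Explainable AI' (5 points), because it concerns interpretable concept-based policies, and to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because the paper mentions applications in healthcare environments. No other keyword is directly relevant, so each scores 0.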

!!! tip deepseek-chat TL;DR

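The paper proposes an algorithm (DRS) that automatically selects decision-relevant concepts in reinforcement learning, addressing the problem that manual concept selection is time-consuming and comes with no performance guarantees, and validates its effectiveness on benchmarks and in real-world healthcare environments.

Abstract (Chinese translation)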

训练基于可解释概念的策略时，研究者通常需要手动选择智能体在序列决策过程中应使用哪些人类可理解的概念。这种选择过程不仅依赖领域专业知识、耗时且成本高昂，还难以随候选概念数量扩展，且无法提供性能保证。为克服这一局限性，我们提出了首个用于序列决策中原则性自动概念选择的算法。我们的核心洞见在于：概念选择可通过状态抽象的视角来理解——直观而言，若移除某个概念会导致智能体混淆需要采取不同行动的状态，则该概念具有决策相关性。因此，智能体应依赖决策相关概念；具有相同概念表征的状态应共享相同的最优行动，从而保留原始状态空间的最优决策结构。这一视角催生了决策相关选择（Decision-Relevant Selection, DRS）算法，该算法能够从候选概念集中选择概念子集，并提供将所选概念与最终策略性能相关联的理论性能边界。实验表明，DRS算法能自动还原人工筛选的概念集，在保持或超越其性能的同时，显著提升了强化学习基准测试和真实世界医疗环境中测试阶段概念干预的有效性。

Abstract

Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions. This selection demands domain expertise, is time-consuming and costly, scales poorly with the number of candidates, and provides no performance guarantees. To overcome this limitation, we propose the first algorithms for principled automatic concept selection in sequential decision-making. Our key insight is that concept selection can be viewed through the lens of state abstraction: intuitively, a concept is decision-relevant if removing it would cause the agent to confuse states that require different actions. As a result, agents should rely on decision-relevant concepts; states with the same concept representation should share the same optimal action, which preserves the optimal decision structure of the original state space. This perspective leads to the Decision-Relevant Selection (DRS) algorithm, which selects a subset of concepts from a candidate set, along with performance bounds relating the selected concepts to the performance of the resulting policy. Empirically, DRS automatically recovers manually curated concept sets while matching or exceeding their performance, and improves the effectiveness of test-time concept interventions across reinforcement learning benchmarks and real-world healthcare environments.

Keywords: reinforcement learning, concept-based policies, decision-relevant concepts, state abstraction, interpretable policies, automatic concept selection, healthcare environments, sequential decision-making
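
The state-abstraction condition in the abstract above (states with the same concept representation should share the same optimal action) can be checked directly on a small tabular problem. The sketch below is not the DRS algorithm, whose details the abstract does not give. It is a brute-force illustration, and the concept names, states, and optimal actions are hypothetical.

```python
from itertools import combinations


def preserves_optimal_actions(states, optimal_action, concepts):
    """True if states that agree on every concept in `concepts` share the optimal action."""
    seen = {}
    for state in states:
        key = tuple(state[c] for c in concepts)
        action = optimal_action[state["id"]]
        # setdefault stores the first action seen for this key and returns it afterwards.
        if seen.setdefault(key, action) != action:
            return False
    return True


def smallest_preserving_subset(states, optimal_action, candidates):
    """Brute-force search for a smallest concept subset that never merges
    states with different optimal actions. Returns None if no subset works."""
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            if preserves_optimal_actions(states, optimal_action, subset):
                return subset
    return None


# Hypothetical toy problem: the optimal action depends only on `low_bp`.
states = [
    {"id": 0, "fever": 0, "low_bp": 0, "age_over_60": 0},
    {"id": 1, "fever": 0, "low_bp": 1, "age_over_60": 1},
    {"id": 2, "fever": 1, "low_bp": 0, "age_over_60": 1},
    {"id": 3, "fever": 1, "low_bp": 1, "age_over_60": 0},
]
optimal_action = {0: "wait", 1: "treat", 2: "wait", 3: "treat"}
selected = smallest_preserving_subset(
    states, optimal_action, ["fever", "low_bp", "age_over_60"]
)
```

Exhaustive search over subsets grows exponentially with the number of candidate concepts, which is the scaling problem the abstract says a principled selection algorithm is meant to avoid.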

46. ❌ ANX: Protocol-First Design for AI Agent Interaction with a Supporting 3EX Decoupled Architecture

Authors: Xu Mingze Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04820v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

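Scoring rationale: The ANX paper focuses on protocol design and architectural optimization for AI agents. Its core contribution is a new agent-native protocol and top-level framework that addresses the shortcomings of existing approaches (such as GUI automation and MCP) in token consumption, interaction fragmentation, and security. The paper is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points), because its core topic is the interaction and workflow design of AI agents. It is fairly relevant to 'Tool Use OR Function Calling OR API Tool Use' and 'Multi-agent Systems OR Agent Coordination' (8 points each), because it involves tool use (such as CLI, Skill, and MCP) and multi-agent collaboration. It is also relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points), because the experiments use Qwen3.5-plus and GPT-4o, although LLM technology itself is not the paper's focus. The other keywords (such as MoE, Scaling Laws, and training methods) are unrelated to the paper's content and score 0.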

!!! tip deepseek-chat TL;DR

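The ANX paper proposes a protocol-first design for AI agent interaction together with a 3EX decoupled architecture, addressing the high token consumption, fragmented interaction, and inadequate security of existing approaches. Experiments show that it substantially reduces token usage (by up to 66.3%) and shortens execution time.

Abstract (Chinese translation)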

AI智能体作为自主数字行动者，需要原生智能体协议；现有方法包括GUI自动化和基于MCP的技能调用，但存在令牌消耗高、交互碎片化、安全性不足等缺陷，其根源在于缺乏统一的顶层框架与关键组件，各独立模块均存在短板。为解决这些问题，我们提出ANX——一个开放、可扩展、可验证的原生智能体协议及顶层框架，它整合了命令行界面（CLI）、技能模块（Skill）与模型上下文协议（MCP），通过协议创新、架构优化与工具补充系统性地解决了上述痛点。其四大核心创新包括：1）原生智能体设计（ANX配置、标记语言、命令行界面），具备高信息密度、强灵活性与适应性，能显著降低令牌消耗并消除不一致性；2）人机交互融合技能调用的灵活性，支持将指令双渲染为智能体可执行代码与人可读界面；3）基于MCP的按需轻量化应用调用，无需预先注册；4）通过ANX标记语言实现机器可执行的标准作业程序（SOP），消除歧义，保障长周期任务与多智能体协作的可靠性。作为系列研究的首篇，本文聚焦ANX的设计，提出其基于ANX枢纽的三层解耦架构（3EX），并进行了初步可行性分析与实验验证。ANX具备原生安全性：绕过大语言模型的用户界面至核心通信机制确保敏感数据不进入智能体上下文；仅限人工确认的机制防止自动化滥用。基于Qwen3.5-plus/GPT-4o的表单填写实验表明，相较于基于MCP的技能调用，ANX分别减少47.3%（Qwen3.5-plus）与55.6%（GPT-4o）的令牌消耗；相较于GUI自动化，分别减少57.1%（Qwen3.5-plus）与66.3%（GPT-4o）的令牌消耗；同时执行时间较基于MCP的技能调用缩短58.1%与57.7%。

Abstract

AI agents, autonomous digital actors, need agent-native protocols; existing methods include GUI automation and MCP-based skills, with defects of high token consumption, fragmented interaction, inadequate security, due to lacking a unified top-level framework and key components, each independent module flawed. To address these issues, we present ANX, an open, extensible, verifiable agent-native protocol and top-level framework integrating CLI, Skill, MCP, resolving pain points via protocol innovation, architectural optimization and tool supplementation. Its four core innovations: 1) Agent-native design (ANX Config, Markup, CLI) with high information density, flexibility and strong adaptability to reduce tokens and eliminate inconsistencies; 2) Human-agent interaction combining Skill’s flexibility for dual rendering as agent-executable instructions and human-readable UI; 3) MCP-supported on-demand lightweight apps without pre-registration; 4) ANX Markup-enabled machine-executable SOPs eliminating ambiguity for reliable long-horizon tasks and multi-agent collaboration. As the first in a series, we focus on ANX’s design, present its 3EX decoupled architecture with ANXHub and preliminary feasibility analysis and experimental validation. ANX ensures native security: LLM-bypassed UI-to-Core communication keeps sensitive data out of agent context; human-only confirmation prevents automated misuse. Form-filling experiments with Qwen3.5-plus/GPT-4o show ANX reduces tokens by 47.3% (Qwen3.5-plus) and 55.6% (GPT-4o) vs MCP-based skills, 57.1% (Qwen3.5-plus) and 66.3% (GPT-4o) vs GUI automation, and shortens execution time by 58.1% and 57.7% vs MCP-based skills.

Keywords: AI agents, agent-native protocol, 3EX decoupled architecture, token reduction, multi-agent collaboration, MCP, CLI, human-agent interaction

47. ❌ A Quantum Search Approach to Magic Square Constraint Problems with Classical Benchmarking

Authors: Rituparna R, Harsha Varthini, Aswani Kumar Cherukuri Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04786v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

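Scoring rationale: The paper studies the application of a quantum search algorithm (Grover's algorithm) to a combinatorial constraint satisfaction problem (magic square generation) and benchmarks it against classical methods. Its core is quantum computing, not large models or deep learning. All scoring keywords concern large models, deep learning techniques, and their applications, none of which this paper touches. The relevance of every keyword is therefore 0.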

!!! tip deepseek-chat TL;DR

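The paper proposes a quantum search method based on Grover's algorithm for magic square generation, a combinatorial constraint satisfaction problem, and uses a Qiskit implementation to verify its quadratic query advantage over classical brute-force search.

Abstract (Chinese translation)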

本文提出了一种针对组合约束满足问题的量子搜索方法，并以幻方的生成为例进行演示。我们将幻方构建重新表述为一个量子搜索问题，其中利用一个可逆的、对约束敏感的预言机（oracle）标记有效配置，并通过格罗弗算法（Grover’s algorithm）进行振幅放大。在量子编码之前，我们采用暹罗构造法（Siamese construction）和部分约束检查进行经典预处理，以生成紧凑的候选域。与在迭代循环中融合经典和量子求解器的做法不同，本工作利用经典组件进行结构化初始化，并利用量子组件进行搜索，同时将量子方法与经典暴力枚举及回溯法进行性能对比。我们的Qiskit实现展示了多寄存器模运算电路、预言机逻辑和扩散算子的设计。实验在小型网格实例上进行，因为更大的网格在经典状态向量模拟器上会因内存指数级增长而难以处理。结果验证了所提出的量子搜索流程的正确性，并确认了其相对于经典搜索的理论二次查询加速优势。

Abstract

This paper presents a quantum search approach to combinatorial constraint satisfaction problems, demonstrated through the generation of magic squares. We reformulate magic square construction as a quantum search problem in which a reversible, constraint-sensitive oracle marks valid configurations for amplitude amplification via Grover’s algorithm. Classical pre-processing using the Siamese construction and partial constraint checks generates a compact candidate domain before quantum encoding. Rather than integrating classical and quantum solvers in an iterative loop, this work uses the classical component for structured initialisation and the quantum component for search, and benchmarks the quantum approach against classical brute-force enumeration and backtracking. Our Qiskit implementation demonstrates the design of multi-register modular arithmetic circuits, oracle logic, and diffusion operators. Experiments are conducted on small grid instances, as larger grids are intractable on classical statevector simulators due to exponential memory growth. The results validate the correctness of the proposed quantum search pipeline and confirm the theoretical quadratic query advantage over classical search.

Keywords: quantum search, Grover’s algorithm, magic square, constraint satisfaction problem, Qiskit, amplitude amplification, classical benchmarking, combinatorial optimization
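
The quadratic query advantage mentioned in the abstract above can be made concrete with the standard Grover iteration count, roughly (π/4)·√(N/M) for N candidates and M marked solutions. The sketch below uses only these textbook formulas. The candidate-domain size of 1024 is a hypothetical figure, since the abstract does not state the sizes the authors used.

```python
import math


def grover_iterations(n_candidates, n_solutions=1):
    """Near-optimal number of Grover iterations, about (pi/4) * sqrt(N/M)."""
    if not 0 < n_solutions <= n_candidates:
        raise ValueError("need 0 < n_solutions <= n_candidates")
    return math.floor(math.pi / 4 * math.sqrt(n_candidates / n_solutions))


def grover_success_probability(n_candidates, n_solutions, iterations):
    """Probability of measuring a marked item after `iterations` Grover steps."""
    theta = math.asin(math.sqrt(n_solutions / n_candidates))
    return math.sin((2 * iterations + 1) * theta) ** 2


def classical_expected_queries(n_candidates, n_solutions=1):
    """Expected oracle calls for random search without replacement."""
    return (n_candidates + 1) / (n_solutions + 1)


# Hypothetical candidate domain of 1024 configurations with one valid magic square.
n, m = 1024, 1
quantum_queries = grover_iterations(n, m)
success = grover_success_probability(n, m, quantum_queries)
classical_queries = classical_expected_queries(n, m)
```

Under these assumptions the quantum search needs a few dozen oracle calls where the classical search needs several hundred on average, and the gap widens as the square root of the domain size. Shrinking the domain through classical pre-processing, as the abstract describes, reduces both counts.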

48. ❌ SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Authors: Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, Shumin Deng Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04804v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

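Scoring rationale: The paper's core subject is the learning efficiency of LLM agents. It proposes the SkillX framework for building a reusable skill knowledge base, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points each): the paper explicitly uses an LLM (GLM-4.6) as the backbone agent and focuses on agent learning. Other keywords such as MoE, SFT, and RAG are either not mentioned in the abstract or unrelated to the paper's topic, so they score 0.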

!!! tip deepseek-chat TL;DR

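To address the low learning efficiency and poor experience reuse of LLM agents, the paper proposes the SkillX framework, which automatically builds a plug-and-play skill knowledge base. Experiments show that this knowledge base significantly improves the task success rate and execution efficiency of weaker base agents.

Abstract (Chinese translation)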

从经验中学习对于构建能力强的大型语言模型（LLM）智能体至关重要，然而当前主流的自我进化范式效率依然低下：智能体孤立学习，从有限经验中反复发现相似行为，导致探索冗余且泛化能力差。为解决这一问题，我们提出 SkillX，一个用于构建即插即用技能知识库的全自动化框架，该知识库可在不同智能体与环境间复用。SkillX 通过一个基于三项协同创新的全自动流程运行：（i）多层级技能设计，将原始轨迹提炼为包含战略规划、功能技能与原子技能的三层层次结构；（ii）迭代式技能精炼，根据执行反馈自动修订技能，持续提升知识库质量；（iii）探索式技能扩展，主动生成并验证新技能，以扩展对初始训练数据之外的覆盖范围。我们基于一个强主干智能体（GLM-4.6）自动构建了可复用的技能库，并在具有挑战性的长周期、用户交互式基准测试（包括 AppWorld、BFCL-v3 和 $τ^2$-Bench）上评估其可迁移性。实验表明，当 SkillKB 接入较弱的基础智能体时，能持续提升任务成功率与执行效率，这凸显了结构化、层次化的经验表征对于可泛化智能体学习的重要性。我们的代码即将在 https://github.com/zjunlp/SkillX 公开。

Abstract

Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation and repeatedly rediscover similar behaviors from limited experience, resulting in redundant exploration and poor generalization. To address this problem, we propose SkillX, a fully automated framework for constructing a plug-and-play skill knowledge base that can be reused across agents and environments. SkillX operates through a fully automated pipeline built on three synergistic innovations: (i) Multi-Level Skills Design, which distills raw trajectories into a three-tiered hierarchy of strategic plans, functional skills, and atomic skills; (ii) Iterative Skills Refinement, which automatically revises skills based on execution feedback to continuously improve library quality; and (iii) Exploratory Skills Expansion, which proactively generates and validates novel skills to expand coverage beyond seed training data. Using a strong backbone agent (GLM-4.6), we automatically build a reusable skill library and evaluate its transferability on challenging long-horizon, user-interactive benchmarks, including AppWorld, BFCL-v3, and $τ^2$-Bench. Experiments show that SkillKB consistently improves task success and execution efficiency when plugged into weaker base agents, highlighting the importance of structured, hierarchical experience representations for generalizable agent learning. Our code will be publicly available soon at https://github.com/zjunlp/SkillX.

Keywords: LLM agents, skill knowledge base, automated framework, hierarchical skills, experience reuse, transferability, plug-and-play, agent learning

49. ❌ Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems

Authors: Justin Chih-Yao Chen, Archiki Prasad, Zaid Khan, Joykirat Singh, Runchu Tian, Elias Stengel-Eskin, Mohit Bansal Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04767v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

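Scoring rationale: The paper's core subject is a post-training method for LLMs on reasoning tasks, which tackles hard reasoning problems through task reformulation and curriculum learning. The highly relevant keywords are LLMs (the object of study), Post-training (the research setting), Chain of Thought (the reasoning methods involved), and System 2 Thinking (the in-depth reasoning involved). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG have no direct connection to the paper's content.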

!!! tip deepseek-chat TL;DR

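The paper proposes the Cog-DRIFT framework, which reformulates hard open-ended reasoning problems into cognitively simpler variants (such as multiple choice) and applies an adaptive curriculum. This addresses the problem that LLMs cannot learn from hard reasoning problems that yield no reward signal, and it significantly improves model performance on multiple reasoning benchmarks.

Abstract (Chinese translation)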

基于可验证奖励的强化学习（RLVR）提升了大型语言模型的推理能力，但一个根本性限制依然存在：模型无法从其当前策略下过于困难、无法解决的问题中学习，因为这些问题无法产生有意义的奖励信号。我们提出了一种基于任务重构的简单而有效的解决方案。我们将具有挑战性的开放式问题转化为认知上更简单的变体——例如多项选择和完形填空格式——这些变体保留了原始答案，同时减少了有效搜索空间并提供了更密集的学习信号。这些重构形式覆盖了从判别式到生成式任务的谱系，我们利用这一点进行引导式学习：模型首先从结构化、更简单的格式中学习，而这些知识会迁移回来，提升其在原始开放式问题上的表现。基于这一洞见，我们提出了Cog-DRIFT框架，该框架构建重构变体，并根据难度将其组织成自适应课程。训练从较易格式逐步过渡到较难格式，使得模型能够从那些在标准RL后训练中原本产生零信号的问题中学习。Cog-DRIFT不仅改善了原本无法解决的难题（在Qwen上绝对提升+10.11%，在Llama上+8.64%），还能很好地泛化到其他保留数据集上。在2个模型和6个推理基准测试中，我们的方法始终优于标准的GRPO和强引导探索基线。平均而言，Cog-DRIFT相比次优基线显示出+4.72%（Qwen）和+3.23%（Llama）的提升。我们进一步证明，Cog-DRIFT在测试时提高了pass@k指标，且课程学习提升了样本效率。总体而言，我们的研究结果凸显了任务重构与课程学习作为克服LLM后训练中探索障碍的有效范式。

Abstract

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of LLMs, yet a fundamental limitation remains: models cannot learn from problems that are too difficult to solve under their current policy, as these yield no meaningful reward signal. We propose a simple yet effective solution based on task reformulation. We transform challenging open-ended problems into cognitively simpler variants – such as multiple-choice and cloze formats – that preserve the original answer while reducing the effective search space and providing denser learning signals. These reformulations span a spectrum from discriminative to generative tasks, which we exploit to bootstrap learning: models first learn from structured, easier formats, and this knowledge transfers back to improve performance on the original open-ended problems. Building on this insight, we introduce Cog-DRIFT, a framework that constructs reformulated variants and organizes them into an adaptive curriculum based on difficulty. Training progresses from easier to harder formats, enabling the model to learn from problems that previously yielded zero signal under standard RL post-training. Cog-DRIFT not only improves on the originally unsolvable hard problems (absolute +10.11% for Qwen and +8.64% for Llama) but also generalizes well to other held-out datasets. Across 2 models and 6 reasoning benchmarks, our method consistently outperforms standard GRPO and strong guided-exploration baselines. On average, Cog-DRIFT shows +4.72% (Qwen) and +3.23% (Llama) improvements over the second-best baseline. We further show that Cog-DRIFT improves pass@k at test time, and the curriculum improves sample efficiency. Overall, our results highlight task reformulation and curriculum learning as an effective paradigm for overcoming the exploration barrier in LLM post-training.

Keywords: LLM post-training, reasoning problems, task reformulation, curriculum learning, reinforcement learning, exploration barrier, hard problems, adaptive curriculum
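
The limitation described in the abstract above, that problems the policy never solves yield no reward signal, can be illustrated numerically. The sketch assumes a GRPO-style group-normalized advantage, (r − mean) / (std + eps). The abstract names GRPO only as a baseline, so using it here is an assumption, and the reward vectors are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style of GRPO: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """A rollout group contributes a gradient only if some advantage is non-zero."""
    return any(a != 0.0 for a in group_advantages(rewards))


# Hypothetical rollout groups for one hard problem (1.0 = verified correct answer).
open_ended = [0.0] * 8  # the policy never solves the open-ended form
multiple_choice = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # some correct picks

open_ended_signal = has_learning_signal(open_ended)
multiple_choice_signal = has_learning_signal(multiple_choice)
```

When every rollout in a group receives the same reward, all advantages are zero and the update carries no information. A reformulation that lets even a few rollouts succeed, such as a multiple-choice variant with the same answer, restores a non-zero signal.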

50. ❌ Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange

Authors: Vinod Vaikuntanathan, Or Zamir Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04757v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

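Scoring rationale: The paper's core subject is covert communication between AI agents, which directly involves LLMs, LLM agents, and multi-agent systems, so these three keywords each receive 10 points. Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper's cryptography and security focus and receive 0 points.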

!!! tip deepseek-chat TL;DR

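The paper studies how AI agents can communicate covertly without being noticed by an auditor. It proposes a pseudorandom noise-resilient key exchange protocol that requires no shared key and shows that transcript auditing alone cannot rule out covert coordination between agents.

Abstract (Chinese translation)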

人工智能代理正日益被部署为代表用户和组织与其他代理进行交互。我们探讨这样一个问题：由不同实体运营的两个此类代理，是否能够在执行并行秘密对话的同时，生成一份在计算上与诚实交互不可区分的对话记录——即使面对知晓完整模型描述、协议及代理私有上下文的强大被动审计者。基于近期关于大语言模型水印与隐写术的研究，我们首先证明：若参与方拥有交互唯一的秘密密钥，它们便可实现最优速率的隐蔽对话：隐藏对话能够充分利用诚实消息分布中几乎全部的熵。
我们的主要贡献在于将这一结论扩展至无密钥场景，即代理初始时不存在任何共享秘密。我们证明，即使每个模型具有任意的私有上下文，且其消息简短且完全自适应，只要足够多的独立消息具有至少恒定的最小熵，隐蔽密钥交换以及由此产生的隐蔽对话仍然是可能的。这与先前依赖单个消息最小熵随安全参数增长的隐蔽通信研究形成鲜明对比。为实现这一点，我们引入了一种新的密码学原语，称之为伪随机噪声容忍密钥交换：这是一种密钥交换协议，其公开对话记录具有伪随机性，同时在恒定噪声下仍能保持正确性。我们研究了这一原语，给出了若干适用于本应用场景的构造方案，并通过严格的局限性分析表明，更简单的变体要么不可行，要么易受高效攻击。
这些结果表明，仅凭对话记录审计无法排除人工智能代理之间的隐蔽协调，并揭示了一种可能具有独立研究价值的新密码学理论。

Abstract

AI agents are increasingly deployed to interact with other agents on behalf of users and organizations. We ask whether two such agents, operated by different entities, can carry out a parallel secret conversation while still producing a transcript that is computationally indistinguishable from an honest interaction, even to a strong passive auditor that knows the full model descriptions, the protocol, and the agents’ private contexts. Building on recent work on watermarking and steganography for LLMs, we first show that if the parties possess an interaction-unique secret key, they can facilitate an optimal-rate covert conversation: the hidden conversation can exploit essentially all of the entropy present in the honest message distributions. Our main contributions concern extending this to the keyless setting, where the agents begin with no shared secret. We show that covert key exchange, and hence covert conversation, is possible even when each model has an arbitrary private context, and their messages are short and fully adaptive, assuming only that sufficiently many individual messages have at least constant min-entropy. This stands in contrast to previous covert communication works, which relied on the min-entropy in each individual message growing with the security parameter. To obtain this, we introduce a new cryptographic primitive, which we call pseudorandom noise-resilient key exchange: a key-exchange protocol whose public transcript is pseudorandom while still remaining correct under constant noise. We study this primitive, giving several constructions relevant to our application as well as strong limitations showing that more naive variants are impossible or vulnerable to efficient attacks. These results show that transcript auditing alone cannot rule out covert coordination between AI agents, and identify a new cryptographic theory that may be of independent interest.

Keywords: AI agents, covert communication, key exchange, pseudorandom noise, steganography, cryptographic primitive, transcript auditing, secret conversation

51. ❌ Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Authors: Zijun Wang, Haoqin Tu, Letian Zhang, Hardy Chen, Juncheng Wu, Xiangyan Liu, Zhenlong Yuan, Tianyu Pang, Michael Qizhe Shieh, Fengze Liu, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04759v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

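Scoring rationale: The paper studies security vulnerabilities in the OpenClaw personal AI agent and is highly relevant to LLM Agents and Tool Use, since OpenClaw is an agent built on large language models (such as Claude and GPT-5.4) with system access and API integrations. It does not touch on the other keywords, such as MoE, SLMs, training methods, reasoning techniques, compression, or AI for Science.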
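
For orientation, the "0.0 / 26.6" figure above can be reproduced with a short sketch. The pass mark follows the formula given in the report configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each). The per-keyword rule used here, weight × (relevance / 10) × maximum score per keyword, is an assumption made for illustration, and the function names are hypothetical; the report does not state the exact rule. Under any multiplicative rule of this kind, the weight of 0.0 listed in the table yields a score of 0.0 regardless of relevance.

```python
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def pass_threshold(total_weight: float) -> float:
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED rule (not stated in the report): weight * (relevance / 10) * max score."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


# (weight, relevance) for the three rows of the table above with non-zero relevance.
rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_threshold(27.0)
print(f"{total:.1f} / {threshold:.1f}")  # prints "0.0 / 26.6"
```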

!!! tip deepseek-chat TL;DR

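The paper presents the first real-world safety evaluation of OpenClaw, a personal AI agent with system access. It finds that poisoning any single CIK dimension (Capability, Identity, Knowledge) raises the average attack success rate from 24.6% to 64-74%, and that existing defenses have limited effect, indicating that the vulnerabilities are inherent to the agent architecture.

Abstract translation (Chinese)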

OpenClaw作为2026年初部署最广泛的个人人工智能代理，拥有完整的本地系统访问权限，并能与Gmail、Stripe及文件系统等敏感服务深度集成。尽管这种宽泛的权限带来了高度自动化能力和强大的个性化功能，但也暴露了巨大的攻击面，而现有的沙盒评估方法无法有效捕捉这些风险。为填补这一空白，我们首次对OpenClaw进行了真实环境下的安全性评估，并提出了CIK分类框架。该框架将智能体的持久状态统一归纳为能力（Capability）、身份（Identity）与知识（Knowledge）三个维度，以便进行安全分析。我们在一个真实运行的OpenClaw实例上，针对四种基础模型（Claude Sonnet 4.5、Opus 4.6、Gemini 3.1 Pro和GPT-5.4）开展了12类攻击场景的评估。结果显示，污染任意单一CIK维度可将平均攻击成功率从24.6%提升至64-74%，即使是最稳健的模型，其攻击成功率也升至自身基线水平的三倍以上。我们进一步评估了三种与CIK框架对齐的防御策略及一种文件保护机制；然而，在针对能力维度的攻击中，最强的防御策略仍面临63.8%的攻击成功率，而文件保护机制虽能拦截97%的恶意注入，却也会阻碍合法更新。综上所述，这些发现表明相关漏洞内生于智能体架构之中，必须建立更系统化的防护机制来保障个人人工智能代理的安全。项目页面：https://ucsc-vlaa.github.io/CIK-Bench

Abstract

OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. While these broad privileges enable high levels of automation and powerful personalization, they also expose a substantial attack surface that existing sandboxed evaluations fail to capture. To address this gap, we present the first real-world safety evaluation of OpenClaw and introduce the CIK taxonomy, which unifies an agent’s persistent state into three dimensions, i.e., Capability, Identity, and Knowledge, for safety analysis. Our evaluations cover 12 attack scenarios on a live OpenClaw instance across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, and GPT-5.4). The results show that poisoning any single CIK dimension increases the average attack success rate from 24.6% to 64-74%, with even the most robust model exhibiting more than a threefold increase over its baseline vulnerability. We further assess three CIK-aligned defense strategies alongside a file-protection mechanism; however, the strongest defense still yields a 63.8% success rate under Capability-targeted attacks, while file protection blocks 97% of malicious injections but also prevents legitimate updates. Taken together, these findings show that the vulnerabilities are inherent to the agent architecture, necessitating more systematic safeguards to secure personal AI agents. Our project page is https://ucsc-vlaa.github.io/CIK-Bench.

Keywords: AI agent safety, OpenClaw, CIK taxonomy, real-world evaluation, attack scenarios, vulnerability analysis, system access, backbone models

52. ❌ Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations

Authors: Kalyan Cherukuri, Lav R. Varshney Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04743v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

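Scoring rationale: The paper's core subject is LLM hallucination. It proposes a geometric dynamical systems framework for analyzing the latent-space mechanism behind hallucinations and develops geometry-aware steering to reduce them. It is therefore highly relevant to 'Large Language Models' and 'Hallucination Mitigation' (10 points each). Because it offers a theoretical account of LLM behavior, it is partially related to 'Mechanistic Interpretability' (5 points). Other keywords such as MoE, quantization, RAG, and alignment are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies how LLMs come to hallucinate, proposes a geometric dynamical systems framework based on basin structure in latent space to explain the phenomenon, and shows that geometry-aware steering can reduce hallucination probability without retraining.

Abstract translation (Chinese)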

大型语言模型（LLMs）会产生幻觉：它们生成流畅但事实错误的输出。我们提出了一种几何动力系统框架，在该框架中，幻觉源于潜在空间中任务依赖的吸引域结构。通过分析多个开源模型和基准测试中的自回归隐藏状态轨迹，我们发现可分性具有强烈的任务依赖性而非普适性：事实性任务可能表现出更清晰的吸引域分离，而摘要任务和误解较多的任务通常稳定性较低且常出现重叠。我们通过任务复杂性和多吸引域定理形式化了这一行为，描述了L层Transformer中吸引域的形成机制，并证明了几何感知的引导能够在不重新训练的情况下降低幻觉概率。

Abstract

Large language models (LLMs) hallucinate: they produce fluent outputs that are factually incorrect. We present a geometric dynamical systems framework in which hallucinations arise from task-dependent basin structure in latent space. Using autoregressive hidden-state trajectories across multiple open-source models and benchmarks, we find that separability is strongly task-dependent rather than universal: factoid settings can show clearer basin separation, whereas summarization and misconception-heavy settings are typically less stable and often overlap. We formalize this behavior with task-complexity and multi-basin theorems, characterize basin emergence in L-layer transformers, and show that geometry-aware steering can reduce hallucination probability without retraining.

Keywords: Hallucination, Large Language Models, Geometric Framework, Dynamical Systems, Basin Structure, Latent Space, Autoregressive Trajectories, Geometry-aware Steering

53. ❌ Artificial Intelligence and Cost Reduction in Public Higher Education: A Scoping Review of Emerging Evidence

Authors: Diamanto Tzanoulinou, Loukas Triantafyllopoulos, George Vorvilas, Evgenia Paxinou, Nikolaos Karousos, Thomas Dasaklis, Athanassios Mihiotis, Manolis Koutouzis, Dimitris Kalles, Vassilios S. Verykios Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04741v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

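Scoring rationale: This is a scoping review of AI applications for cost reduction in public higher education. It focuses on the practical use and economic impact of AI tools (such as ChatGPT, learning analytics, intelligent tutoring systems, and predictive models) in higher-education administration, teaching, and planning. The paper is applied AI research, whereas every scoring keyword targets specific technical aspects of large models and deep learning: principles, architectures, training methods, inference optimization, alignment techniques, and so on. The title and abstract mention none of these; they refer only broadly to "AI" and "generative tools such as ChatGPT", without discussing any large-model principles, novel methods, or implementation details. All keywords are therefore scored 0 (entirely unrelated).

!!! tip deepseek-chat TL;DR

Through a scoping review, the paper examines the potential of AI (including generative tools such as ChatGPT) to reduce costs in public higher education. It finds that AI can deliver savings by automating administrative tasks, optimizing resource allocation, and supporting personalized learning at scale, while challenges remain around implementation costs and the digital divide.

Abstract translation (Chinese)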

公立高等教育系统正面临日益增长的财政压力，这些压力源于学生规模的扩大、运营成本的上升以及对教育公平的持续诉求。人工智能（AI）——包括生成式工具如ChatGPT、学习分析、智能辅导系统和预测模型——被视为提升效率与降低成本的一种途径。本研究对人工智能在公立高等教育中的应用文献进行了范围综述，基于在Scopus和IEEE Xplore数据库的系统检索，共识别出241条记录，其中21项实证研究符合预设的纳入标准并进行了主题分析。研究结果表明，人工智能通过自动化行政任务、优化资源配置、支持规模化个性化学习以及应用预测分析提升学生保留率与机构规划，实现了成本节约。与此同时，研究也揭示了关于实施成本、机构间获取机会不均以及可能加剧数字鸿沟风险的担忧。总体而言，主题分析既凸显了人工智能驱动高等教育成本降低的潜力与局限，也为政策制定者、高校管理者和教育工作者提供了关于人工智能应用经济影响的见解，同时指出了值得进一步实证研究填补的空白领域。

Abstract

Public higher education systems face increasing financial pressures from expanding student populations, rising operational costs, and persistent demands for equitable access. Artificial Intelligence (AI), including generative tools such as ChatGPT, learning analytics, intelligent tutoring systems, and predictive models, has been proposed as a means of enhancing efficiency and reducing costs. This study conducts a scoping review of the literature on AI applications in public higher education, based on systematic searches in Scopus and IEEE Xplore that identified 241 records, of which 21 empirical studies met predefined eligibility criteria and were thematically analyzed. The findings show that AI enables cost savings by automating administrative tasks, optimizing resource allocation, supporting personalized learning at scale, and applying predictive analytics to improve student retention and institutional planning. At the same time, concerns emerge regarding implementation costs, unequal access across institutions, and risks of widening digital divides. Overall, the thematic analysis highlights both the promises and limitations of AI-driven cost reduction in higher education, offering insights for policymakers, university administrators, and educators on the economic implications of AI adoption, while also pointing to gaps that warrant further empirical research.

Keywords: Artificial Intelligence, Cost Reduction, Public Higher Education, Scoping Review, ChatGPT, Learning Analytics, Predictive Analytics, Student Retention

54. ❌ Sampling Parallelism for Fast and Efficient Bayesian Learning

Authors: Asena Karolin Özdemir, Lars H. Heyen, Arvid Weyrauch, Achim Streit, Markus Götz, Charlotte Debus Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04736v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

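Scoring rationale: The paper "Sampling Parallelism for Fast and Efficient Bayesian Learning" addresses the computational cost of sampling-based uncertainty quantification (UQ) in Bayesian neural networks (BNNs). It proposes a parallelization strategy called "sampling parallelism", which distributes sample evaluations across multiple GPUs to reduce memory pressure and training time. Its focus is Bayesian learning, uncertainty quantification, and computational efficiency for deep learning models; it does not involve the large-model techniques or applications behind keywords such as LLMs, MoE, instruction tuning, RLHF, RAG, inference acceleration, hallucination mitigation, or agents. Apart from "AI for Science OR Bioinformatics OR Cheminformatics", which receives 5 points (partial relevance) because the paper mentions applications in scientific domains such as healthcare, all keywords are unrelated and scored 0. The paper's contribution is a parallelization strategy, not large-model technology as such.

!!! tip deepseek-chat TL;DR

To address the high computational cost of sampling-based uncertainty quantification in Bayesian neural networks, the paper proposes sampling parallelism, which distributes sample evaluations across multiple GPUs to reduce memory use and training time. Experiments show good scaling and a convergence advantage.

Abstract translation (Chinese)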

机器学习模型，尤其是深度神经网络，正日益部署于医疗健康、环境预测和金融等风险敏感领域，在这些领域中，对预测不确定性的可靠量化至关重要。然而，许多不确定性量化方法因其高昂的计算成本而难以应用。基于采样的贝叶斯学习方法，例如贝叶斯神经网络，尤其耗费资源，因为抽取和评估多个参数样本会迅速耗尽内存与计算资源。这些限制迄今制约了贝叶斯技术的可及性与探索。为应对这些挑战，我们引入了采样并行化——一种简单而强大的并行化策略，它针对基于采样的贝叶斯学习的主要瓶颈：样本本身。通过将样本评估分布到多个GPU上，我们的方法在不改变架构或进行大量超参数调整的情况下，降低了内存压力并缩短了训练时间。我们详细阐述了该方法，并在若干示例任务和架构上评估其性能，以分布式数据并行作为基线进行对比。我们进一步通过实现一种结合样本并行与数据并行的混合策略，证明了采样并行化与现有策略具有互补性。实验表明，当样本数量与计算资源成比例增加时，该方法实现了近乎完美的扩展性，证实了样本评估可以高效并行化。尽管在固定工作负载下扩展时，分布式数据并行能获得更显著的原始加速比，但采样并行化具有一个突出优势：通过在每个GPU上对同一批次数据应用独立的随机增强，它增加了增强的多样性，从而减少了模型收敛所需的训练轮次。

Abstract

Machine learning models, and deep neural networks in particular, are increasingly deployed in risk-sensitive domains such as healthcare, environmental forecasting, and finance, where reliable quantification of predictive uncertainty is essential. However, many uncertainty quantification (UQ) methods remain difficult to apply due to their substantial computational cost. Sampling-based Bayesian learning approaches, such as Bayesian neural networks (BNNs), are particularly expensive since drawing and evaluating multiple parameter samples rapidly exhausts memory and compute resources. These constraints have limited the accessibility and exploration of Bayesian techniques thus far. To address these challenges, we introduce sampling parallelism, a simple yet powerful parallelization strategy that targets the primary bottleneck of sampling-based Bayesian learning: the samples themselves. By distributing sample evaluations across multiple GPUs, our method reduces memory pressure and training time without requiring architectural changes or extensive hyperparameter tuning. We detail the methodology and evaluate its performance on a few example tasks and architectures, comparing against distributed data parallelism (DDP) as a baseline. We further demonstrate that sampling parallelism is complementary to existing strategies by implementing a hybrid approach that combines sample and data parallelism. Our experiments show near-perfect scaling when the sample number is scaled proportionally to the computational resources, confirming that sample evaluations parallelize cleanly. Although DDP achieves better raw speedups under scaling with constant workload, sampling parallelism has a notable advantage: by applying independent stochastic augmentations to the same batch on each GPU, it increases augmentation diversity and thus reduces the number of epochs required for convergence.

Keywords: Bayesian learning, uncertainty quantification, sampling parallelism, Bayesian neural networks, computational efficiency, parallelization, GPU acceleration, distributed training
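
The core idea in the abstract above, spreading the evaluation of parameter samples across workers and then combining the results, can be illustrated with a minimal sketch. This is not the authors' implementation: threads stand in for GPUs, a toy linear model stands in for a Bayesian neural network, and all function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def predict(theta, x):
    """Toy linear model evaluated for one parameter sample theta = (w, b)."""
    w, b = theta
    return w * x + b


def shard(items, n_workers):
    """Split the parameter samples into n_workers round-robin shards."""
    return [items[i::n_workers] for i in range(n_workers)]


def partial_sum(thetas, x):
    """Work done by one worker: evaluate its shard, return (sum, count)."""
    return sum(predict(t, x) for t in thetas), len(thetas)


def predictive_mean_parallel(thetas, x, n_workers=4):
    """Monte Carlo predictive mean with sample evaluations spread over workers."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda s: partial_sum(s, x), shard(thetas, n_workers)))
    total = sum(p[0] for p in parts)
    count = sum(p[1] for p in parts)
    return total / count


rng = random.Random(0)
# 64 parameter samples drawn from a made-up approximate posterior.
samples = [(rng.gauss(2.0, 0.1), rng.gauss(-1.0, 0.1)) for _ in range(64)]
serial = sum(predict(t, 3.0) for t in samples) / len(samples)
parallel = predictive_mean_parallel(samples, 3.0)
print(abs(serial - parallel) < 1e-9)
```

Each worker holds only its own shard of samples, which is the property the abstract credits with reducing per-device memory pressure; the serial and sharded estimates agree up to floating-point rounding.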

55. ❌ Discovering Failure Modes in Vision-Language Models using RL

Authors: Kanishk Jain, Qian Yang, Shravan Nayak, Parisa Kordjamshidi, Nishanth Anand, Aishwarya Agrawal Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04733v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

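Scoring rationale: The paper studies failure-mode discovery for vision-language models (VLMs) using a reinforcement learning framework. Although VLMs are multimodal large models, the paper concentrates on vision-language tasks rather than text-only LLMs. All keywords target the principles, training methods, inference optimization, and application frameworks of text-only LLMs, and have no direct connection to the paper's multimodal focus. The paper does not cover LLM pre-training, fine-tuning, alignment, inference acceleration, agent systems, or any other such technique.

!!! tip deepseek-chat TL;DR

The paper proposes a reinforcement-learning-based framework that automatically discovers failure modes of vision-language models in tasks such as counting and spatial reasoning, without human intervention, and identifies 36 novel failure modes.

Abstract translation (Chinese)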

视觉语言模型（Vision-language Models, VLMs）虽然在多模态基准测试中表现出色，却常常误解人类能够轻易识别的基本视觉概念，例如计数、空间推理和视角理解。先前的研究通过人工方式识别了这些弱点，并发现它们通常源于特定能力的缺陷。然而，这种人工方法成本高昂、难以扩展，且易受人为偏见影响——往往忽略细微的视觉细节而偏向显著物体，导致对模型缺陷的理解不够全面。为克服这些限制，我们提出了一种基于强化学习（Reinforcement Learning, RL）的框架，能够在无需人工干预的情况下，自动发现任意候选VLM在给定数据分布上的故障模式或盲点。该框架训练一个提问者智能体，使其能够根据候选VLM的响应自适应地生成查询，从而引导模型产生错误答案。我们的方法通过聚焦于细粒度的视觉细节和不同的技能组合，随着训练进程逐步提升问题复杂度，最终识别出36种VLM难以应对的新型故障模式。通过展示该框架在不同模型组合间的泛化能力，我们证明了其广泛的适用性。

Abstract

Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favor of salient objects, resulting in an incomplete understanding of a model’s vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any candidate VLM on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM’s responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying 36 novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.

Keywords: Vision-Language Models, Failure Modes, Reinforcement Learning, Automatic Discovery, Multimodal Benchmarks, Questioner Agent, Fine-grained Visual Details, Model Vulnerabilities

56. ❌ Metaphors We Compute By: A Computational Audit of Cultural Translation vs. Thinking in LLMs

Authors: Yuan Chang, Jiaming Qu, Zhu Li Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04732v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

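Scoring rationale: The paper studies the cultural reasoning ability of LLMs, which maps directly to the LLMs keyword (10 points). It examines whether LLMs actually perform cultural reasoning, so it is partially related to the reasoning keywords (Chain of Thought, System 2 Thinking; 5 points each). It conducts an audit, which relates to Explainable AI (5 points). Other keywords such as MoE, SFT, and RAG are not covered (0 points).

!!! tip deepseek-chat TL;DR

Using a metaphor generation task, the paper audits the cultural reasoning of LLMs and finds stereotyped metaphor usage and Western defaultism, suggesting that prompting a model with a cultural identity does not guarantee culturally grounded reasoning.

Abstract translation (Chinese)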

大型语言模型（LLMs）常被描述为具备多语言能力，因为它们能够理解和响应多种语言。然而，掌握一门语言并不等同于在特定文化中进行推理。这一区别引出了一个关键问题：大型语言模型是否真正实现了文化感知的推理？本文针对一项创意写作任务中的文化包容性进行了初步的计算审计。我们通过实证研究检验了大型语言模型是作为文化多元的创意伙伴，还是仅仅作为利用主导概念框架并辅以本地化表达的文化翻译工具。以涵盖五种文化背景及若干抽象概念的隐喻生成任务为案例，我们发现该模型在特定文化背景下表现出刻板化的隐喻使用倾向，并存在西方中心主义默认模式。这些结果表明，仅通过提示大型语言模型设定文化身份，并不能保证其进行基于文化根基的推理。

Abstract

Large language models (LLMs) are often described as multilingual because they can understand and respond in many languages. However, speaking a language is not the same as reasoning within a culture. This distinction motivates a critical question: do LLMs truly conduct culture-aware reasoning? This paper presents a preliminary computational audit of cultural inclusivity in a creative writing task. We empirically examine whether LLMs act as culturally diverse creative partners or merely as cultural translators that leverage a dominant conceptual framework with localized expressions. Using a metaphor generation task spanning five cultural settings and several abstract concepts as a case study, we find that the model exhibits stereotyped metaphor usage for certain settings, as well as Western defaultism. These findings suggest that merely prompting an LLM with a cultural identity does not guarantee culturally grounded reasoning.

Keywords: Large Language Models, cultural reasoning, metaphor generation, computational audit, cultural inclusivity, Western defaultism, creative writing, multilingual models

57. ❌ Neuromorphic Computing for Low-Power Artificial Intelligence

Authors: Keshava Katti, Pratik Chaudhari, Deep Jariwala Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04727v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

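Scoring rationale: The paper focuses on neuromorphic computing hardware and architectures aimed at improving the energy efficiency and scalability of AI systems, and does not present innovations in large language models (LLMs) or deep learning methods. All keywords concern large-model techniques, training methods, inference optimization, alignment, and applications, whereas the paper discusses the underlying computing hardware and architecture, so none of the keywords apply.

!!! tip deepseek-chat TL;DR

The paper examines how neuromorphic computing (including novel devices, compute-in-memory, and brain-inspired design) can overcome the energy-efficiency limits of conventional CMOS technology and improve the energy efficiency and scalability of AI systems.

Abstract translation (Chinese)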

经典计算正开始触及能效的根本性限制。这一挑战已无法通过提高电路密度或改进标准半导体工艺等策略来解决。人工智能（AI）日益增长的计算与存储需求，要求在信息表征、存储、传输和处理方式上进行颠覆性创新。通过利用新型器件模态与存内计算（CIM），并结合受大脑启发的模拟动力学与稀疏通信，神经形态计算为提升现有AI系统的能效与可扩展性提供了一条前景广阔的路径。但实现这一潜力并非简单替换芯片，而是需要一种跨层协同设计，涵盖新材料与非易失性器件结构、新型混合信号电路与架构，以及适配这些物理基底层特性的学习算法。本文综述了经典互补金属氧化物半导体（CMOS）技术的主要局限，并阐述了此类跨层神经形态方法如何有望突破这些限制。

Abstract

Classical computing is beginning to encounter fundamental limits of energy efficiency. This presents a challenge that can no longer be solved by strategies such as increasing circuit density or refining standard semiconductor processes. The growing computational and memory demands of artificial intelligence (AI) require disruptive innovation in how information is represented, stored, communicated, and processed. By leveraging novel device modalities and compute-in-memory (CIM), in addition to analog dynamics and sparse communication inspired by the brain, neuromorphic computing offers a promising path toward improvements in the energy efficiency and scalability of current AI systems. But realizing this potential is not a matter of replacing one chip with another; rather, it requires a co-design effort, spanning new materials and non-volatile device structures, novel mixed-signal circuits and architectures, and learning algorithms tailored to the physics of these substrates. This article surveys the key limitations of classical complementary metal-oxide-semiconductor (CMOS) technology and outlines how such cross-layer neuromorphic approaches may overcome them.

Keywords: Neuromorphic Computing, Low-Power AI, Energy Efficiency, Compute-in-Memory, Analog Dynamics, Sparse Communication, Cross-layer Co-design, CMOS Limitations

58. ❌ Individual and Combined Effects of English as a Second Language and Typos on LLM Performance

Authors: Serena Liu, Yutong Yang, Prisha Sheth, Weixuan Dong, Mingjiao Diao, Xinru Zhu, Nikhil Banga, Oscar Melendez, Arnav Sharma, Minda Zhao, Marina Lin, Mengyu Wang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04723v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

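Scoring rationale: The paper evaluates LLM performance on non-standard English input (ESL variants and typos). It is highly relevant only to the first keyword, 'Large Language Models OR LLMs OR Foundation Models' (10 points), because its core contribution is assessing how LLMs handle real-world non-standard input. The remaining keywords cover model architectures, training methods, reasoning techniques, and application domains that the paper does not address, so each scores 0.

!!! tip deepseek-chat TL;DR

The paper studies the individual and combined effects of English-as-a-second-language (ESL) variants and typos on LLM performance. It finds that the combination usually causes a larger performance drop than either factor alone, and that this pattern is clearest on closed-ended tasks.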

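Abstract (translation)

Large language models (LLMs) are used around the world, and because most of their training data is in English, they usually perform best on English input. Many non-native speakers therefore interact with these models in English as a second language (ESL), and their input often contains typos. Earlier studies mostly examined ESL variation and typos separately, even though the two frequently occur together in practice. In this study, we use the Trans-EnV framework to convert standard English input into eight ESL variants and apply the MulTypo tool to inject typos at three levels: low, moderate, and severe. We find that combining ESL variation with typos usually degrades performance more than either factor alone, although the joint effect is not a simple sum. The pattern is clearest on closed-ended tasks, where the degradation can be characterized more consistently across ESL variants and typo levels; on open-ended tasks the results are more mixed. Overall, the findings indicate that evaluation on clean standard English may overestimate real-world performance, and that assessing ESL variation or typos in isolation does not fully reflect model behavior in realistic settings.

Abstract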

Large language models (LLMs) are used globally, and because much of their training data is in English, they typically perform best on English inputs. As a result, many non-native English speakers interact with them in English as a second language (ESL), and these inputs often contain typographical errors. Prior work has largely studied the effects of ESL variation and typographical errors separately, even though they often co-occur in real-world use. In this study, we use the Trans-EnV framework to transform standard English inputs into eight ESL variants and apply MulTypo to inject typos at three levels: low, moderate, and severe. We find that combining ESL variation and typos generally leads to larger performance drops than either factor alone, though the combined effect is not simply additive. This pattern is clearest on closed-ended tasks, where performance degradation can be characterized more consistently across ESL variants and typo levels, while results on open-ended tasks are more mixed. Overall, these findings suggest that evaluations on clean standard English may overestimate real-world model performance, and that evaluating ESL variation and typographical errors in isolation does not fully capture model behavior in realistic settings.
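The statement that the combined effect is "not simply additive" can be made concrete with a small interaction calculation. The sketch below is illustrative only: the function name `interaction` and all accuracy values are hypothetical and are not results from the paper.

```python
def interaction(acc_clean, acc_esl, acc_typo, acc_both):
    """Return the three accuracy drops and the interaction term.

    interaction > 0: the combined drop is larger than the sum of the single drops.
    interaction < 0: the combined drop is smaller than the sum (sub-additive).
    """
    drop_esl = acc_clean - acc_esl
    drop_typo = acc_clean - acc_typo
    drop_both = acc_clean - acc_both
    return drop_esl, drop_typo, drop_both, drop_both - (drop_esl + drop_typo)


# Hypothetical accuracies for one model on one closed-ended task (NOT paper data).
d_esl, d_typo, d_both, inter = interaction(
    acc_clean=0.80, acc_esl=0.76, acc_typo=0.74, acc_both=0.69
)
print(f"ESL drop={d_esl:.2f}, typo drop={d_typo:.2f}, "
      f"combined drop={d_both:.2f}, interaction={inter:+.2f}")
```

An interaction term near zero would mean the two perturbations add up; a clearly non-zero value in either direction is what "not simply additive" refers to.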

Keywords: Large Language Models, LLMs, English as a Second Language, ESL, typographical errors, performance evaluation, closed-ended tasks, real-world settings

59. ❌ AI Assistance Reduces Persistence and Hurts Independent Performance

Authors: Grace Liu, Brian Christian, Tsvetomira Dumbalska, Michiel A. Bakker, Rachit Dubey Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04721v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

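Scoring rationale: The paper studies how AI assistance affects human learning behavior (reduced persistence and impaired independent performance); it is a social-science experiment on the use of AI. It is only indirectly related to 'Large Language Models OR LLMs OR Foundation Models' (5 points): the study concerns the behavioral impact of current AI systems (which may include LLMs) as collaborative tools, but it does not go into any specific technique. All other keywords concern specific technical principles, methods, or domain applications and are unrelated to the paper's empirical behavioral research (0 points).

!!! tip deepseek-chat TL;DR

Using randomized controlled trials, the study finds that AI assistance improves task performance in the short term but significantly reduces users' persistence and their unassisted performance, which poses a potential risk to long-term learning.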

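Abstract (translation)

People often optimize for long-term goals when they collaborate: a mentor or peer does not just answer questions but also scaffolds learning, tracks progress, and puts the other person's growth ahead of immediate results. Current AI systems, by contrast, are fundamentally short-sighted collaborators: they are optimized to give instant, complete answers and never refuse (except for safety reasons). What are the consequences of this dynamic? Through a series of randomized controlled trials on human-AI interaction (N = 1,222), this paper provides causal evidence for two key consequences of AI assistance: reduced persistence and impaired unassisted performance. Across a range of tasks, including mathematical reasoning and reading comprehension, we find that although AI assistance improves performance in the short term, people perform significantly worse without it and are more likely to give up. Notably, these effects appear after only brief interaction with AI (about 10 minutes). The findings are especially concerning because persistence is the foundation of skill acquisition and one of the strongest predictors of long-term learning. We argue that persistence declines because AI conditions people to expect immediate answers, depriving them of the experience of working through challenges on their own. The results suggest that AI model development should prioritize scaffolding long-term competence alongside immediate task completion.

Abstract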

People often optimize for long-term goals in collaboration: A mentor or companion doesn’t just answer questions, but also scaffolds learning, tracks progress, and prioritizes the other person’s growth over immediate results. In contrast, current AI systems are fundamentally short-sighted collaborators - optimized for providing instant and complete responses, without ever saying no (unless for safety reasons). What are the consequences of this dynamic? Here, through a series of randomized controlled trials on human-AI interactions (N = 1,222), we provide causal evidence for two key consequences of AI assistance: reduced persistence and impairment of unassisted performance. Across a variety of tasks, including mathematical reasoning and reading comprehension, we find that although AI assistance improves performance in the short-term, people perform significantly worse without AI and are more likely to give up. Notably, these effects emerge after only brief interactions with AI (approximately 10 minutes). These findings are particularly concerning because persistence is foundational to skill acquisition and is one of the strongest predictors of long-term learning. We posit that persistence is reduced because AI conditions people to expect immediate answers, thereby denying them the experience of working through challenges on their own. These results suggest the need for AI model development to prioritize scaffolding long-term competence alongside immediate task completion.

Keywords: AI assistance, human-AI interaction, persistence, independent performance, randomized controlled trial, skill acquisition, long-term learning, scaffolding

60. ❌ What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features

Authors: Dayeon Ki, Kevin Duh, Marine Carpuat Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04720v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

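Scoring rationale: The paper studies how the performance of large reasoning models (LRMs) on mathematical reasoning tasks differs across languages, focusing on analysis of the reasoning process rather than a specific technical implementation. Highly relevant keywords: 'Large Language Models' (the paper explicitly studies LRMs), 'Chain of Thought' (it analyzes features of reasoning traces), and 'System 2 Thinking' (it involves analysis of in-depth reasoning). Moderately relevant: 'Mechanistic Interpretability' (it explains the reasoning process through measurable features). Other keywords such as MoE, quantization, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The study challenges the assumption that the performance gap of large reasoning models can be closed by making multilingual reasoning imitate English reasoning. It defines measurable reasoning features and analyzes their effectiveness across 10 languages, finding that the strength of the association between features and accuracy varies by language and sometimes reverses, which points to the need for adaptive objectives that accommodate language-specific reasoning patterns.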

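Abstract (translation)

Large reasoning models (LRMs) still show a substantial performance gap between English and other languages, and much current work assumes the gap can be closed simply by making reasoning in every language resemble English reasoning. This study challenges that assumption and asks instead: what characterizes effective reasoning in multilingual settings, and to what extent do English-derived reasoning features actually help reasoning in other languages? We first define a set of quantifiable reasoning features covering the multilingual alignment, reasoning steps, and reasoning flow of reasoning traces, and use logistic regression to quantify how strongly each feature is associated with final-answer accuracy. We then train sparse autoencoders on multilingual reasoning traces to automatically discover latent reasoning concepts that instantiate or extend these features. Finally, we use the features as test-time selection policies to test whether they can steer models toward stronger multilingual reasoning. In experiments on two mathematical reasoning benchmarks, four LRMs, and ten languages, we find that most features are positively associated with accuracy, but the strength of the association varies considerably across languages and even reverses in some. Our results call English-centric reward design into question and point to the need for adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmarks and reward design.

Abstract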

Large Reasoning Models (LRMs) still exhibit large performance gaps between English and other languages, yet much current work assumes these gaps can be closed simply by making reasoning in every language resemble English reasoning. This work challenges this assumption by asking instead: what actually characterizes effective reasoning in multilingual settings, and to what extent do English-derived reasoning features genuinely help in other languages? We first define a suite of measurable reasoning features spanning multilingual alignment, reasoning step, and reasoning flow aspects of reasoning traces, and use logistic regression to quantify how each feature associates with final answer accuracy. We further train sparse autoencoders over multilingual traces to automatically discover latent reasoning concepts that instantiate or extend these features. Finally, we use the features as test-time selection policies to examine whether they can steer models toward stronger multilingual reasoning. Across two mathematical reasoning benchmarks, four LRMs, and 10 languages, we find that most features are positively associated with accuracy, but the strength of association varies considerably across languages and can even reverse in some. Our findings challenge English-centric reward designs and point toward adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmark and reward design.
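As a rough illustration of how logistic regression can quantify a feature's association with answer accuracy, and how the sign of that association can differ between languages, the following sketch fits a one-feature model in plain Python. The trace counts, the binary feature, the two "languages", and the helper names `fit_logistic` and `make_traces` are assumptions made for this example; they do not reproduce the paper's features, data, or code.

```python
import math


def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent; return (w, b)."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def make_traces(correct_with, wrong_with, correct_without, wrong_without):
    """Build (feature, is_correct) lists from counts of hypothetical reasoning traces."""
    xs = [1] * (correct_with + wrong_with) + [0] * (correct_without + wrong_without)
    ys = ([1] * correct_with + [0] * wrong_with
          + [1] * correct_without + [0] * wrong_without)
    return xs, ys


# Hypothetical counts: in language A the feature co-occurs with correct answers,
# in language B it co-occurs with wrong ones.
w_a, _ = fit_logistic(*make_traces(8, 2, 5, 5))
w_b, _ = fit_logistic(*make_traces(4, 6, 6, 4))

for name, w in (("language A", w_a), ("language B", w_b)):
    print(f"{name}: coefficient={w:+.2f}, odds ratio={math.exp(w):.2f}")
```

A positive coefficient (odds ratio above 1) means traces with the feature are more likely to be correct; a negative coefficient in another language is the kind of reversal the abstract describes.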

Keywords: Large Reasoning Models, multilingual reasoning, reasoning traces, mathematical reasoning, reasoning features, language-specific patterns, benchmark design, reward design

61. ❌ The Infinite-Dimensional Nature of Spectroscopy and Why Models Succeed, Fail, and Mislead

Authors: Umberto Michelucci, Francesca Venturini Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04717v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

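Scoring rationale: The paper studies why machine learning models reach high accuracy in spectroscopy, mainly in terms of the high dimensionality of spectral data and model interpretation. It is unrelated to the vast majority of keywords (such as LLM, MoE, RLHF, and RAG), which target large language model techniques, whereas the paper concerns conventional machine learning applied to science. It has some relevance to 'Mechanistic Interpretability OR Explainable AI' (5 points) because it discusses the interpretation of feature importance, and it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points) because it directly studies the use of AI in spectroscopy (related to cheminformatics).

!!! tip deepseek-chat TL;DR

Through theoretical analysis and experiments, the paper shows that the high dimensionality of spectral data lets machine learning models reach high accuracy even when no chemical distinction exists, and it offers practical recommendations for building and interpreting models in spectroscopy.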

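Abstract (translation)

Machine learning (ML) models have achieved very high accuracy on spectroscopic classification tasks, usually without clear evidence that they rely on chemically meaningful features. Existing studies have linked these results to preprocessing choices, noise sensitivity, and model complexity, but no unified explanation has been available so far. This study shows that these phenomena stem from the intrinsic high dimensionality of spectral data. A theoretical analysis based on the Feldman-Hajek theorem and the concentration of measure shows that tiny distributional differences caused by noise, normalisation, or instrumental artefacts can become perfectly separable in high-dimensional spaces. Through a series of targeted experiments on synthetic and real fluorescence spectra, we show that models can reach near-perfect accuracy even when no chemical difference exists, and we explain why feature-importance maps may highlight spectrally irrelevant regions. The study provides a rigorous theoretical framework, confirms the effect experimentally, and closes with practical recommendations for building and interpreting ML models in spectroscopy.

Abstract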

Machine learning (ML) models have achieved strikingly high accuracies in spectroscopic classification tasks, often without a clear proof that those models used chemically meaningful features. Existing studies have linked these results to data preprocessing choices, noise sensitivity, and model complexity, but no unifying explanation is available so far. In this work, we show that these phenomena arise naturally from the intrinsic high dimensionality of spectral data. Using a theoretical analysis grounded in the Feldman-Hajek theorem and the concentration of measure, we show that even infinitesimal distributional differences, caused by noise, normalisation, or instrumental artefacts, may become perfectly separable in high-dimensional spaces. Through a series of specific experiments on synthetic and real fluorescence spectra, we illustrate how models can achieve near-perfect accuracy even when chemical distinctions are absent, and why feature-importance maps may highlight spectrally irrelevant regions. We provide a rigorous theoretical framework, confirm the effect experimentally, and conclude with practical recommendations for building and interpreting ML models in spectroscopy.
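The claim that tiny distributional differences become separable in high dimensions can be illustrated with a finite-dimensional toy model. The sketch below assumes two classes of synthetic "spectra" with independent Gaussian noise per channel and a small constant offset `delta` in every channel (standing in for an instrumental artefact). This setup, the parameter values, and the function names are assumptions for illustration; it is not the paper's experiment and does not invoke the Feldman-Hajek theorem itself.

```python
import math
import random


def oracle_accuracy(delta, sigma, dim):
    """Best achievable accuracy for two equal-prior Gaussian classes whose means
    differ by `delta` in each of `dim` channels, with i.i.d. noise of std `sigma`."""
    separation = delta * math.sqrt(dim) / sigma  # distance between means in noise units
    # Phi(separation / 2), written with the error function
    return 0.5 * (1.0 + math.erf(separation / (2.0 * math.sqrt(2.0))))


def simulated_accuracy(delta, sigma, dim, n_per_class, rng):
    """Classify sampled spectra by thresholding the channel sum, i.e. the projection
    onto the mean-difference direction (up to scaling)."""
    threshold = delta * dim / 2.0  # midpoint between the two expected channel sums
    correct = 0
    for label in (0, 1):
        for _ in range(n_per_class):
            total = sum(rng.gauss(label * delta, sigma) for _ in range(dim))
            correct += int((total > threshold) == bool(label))
    return correct / (2 * n_per_class)


rng = random.Random(0)
results = {}
for dim in (10, 100, 2500):
    results[dim] = (oracle_accuracy(0.1, 1.0, dim),
                    simulated_accuracy(0.1, 1.0, dim, 100, rng))
    print(f"dim={dim:5d}  theory={results[dim][0]:.4f}  simulated={results[dim][1]:.3f}")
```

With a per-channel offset of one tenth of the noise level, the classes are barely distinguishable in 10 channels but almost perfectly separable in a few thousand, because the separation grows with the square root of the dimension.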

Keywords: spectroscopy, machine learning, high-dimensional data, Feldman-Hajek theorem, concentration of measure, model interpretation, fluorescence spectra, feature importance

62. ❌ BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement

Authors: Abdullah Al Shafi, Swapnil Kundu Argha, M. A. Moyeen, Abdul Muntakim, Shoumik Barman Polok Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04708v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

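Scoring rationale: The paper builds BiST, a high-quality bilingual (Bangla-English) corpus for sentence-level grammatical classification (syntactic structure and tense), and evaluates baseline models. The work is conventional multilingual NLP resource construction and basic grammatical modeling; it does not involve large models, innovations in deep learning techniques, or AI for Science applications. All keywords relate to large-model techniques, deep learning innovations, or scientific AI applications, whereas this paper is about resource construction and a basic NLP task, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The study builds BiST, a high-quality Bangla-English bilingual corpus for sentence-level grammatical classification, ensures high annotation agreement through a multi-stage annotation framework, and provides a unified resource for bilingual grammatical modeling and cross-lingual representation learning.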

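Abstract (translation)

High-quality bilingual resources remain a key bottleneck for multilingual natural language processing in low-resource settings, especially for Bangla. To narrow this gap, we introduce BiST, a rigorously curated Bangla-English bilingual corpus for sentence-level grammatical classification, annotated along two basic dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus combines openly licensed encyclopedic text with naturally written conversational text and, after systematic preprocessing and automated language identification, contains 30,534 sentences: 17,465 in English and 13,069 in Bangla. Annotation quality is ensured by a multi-stage framework with three independent annotators and a per-dimension Fleiss' kappa ($κ$) agreement analysis, which gives $κ$ values of 0.82 for structure and 0.88 for tense and thus reliable, reproducible labels. Statistical analysis shows that the corpus has realistic structural and tense distributions, and baseline evaluation shows that dual-encoder architectures using complementary language-specific representations consistently outperform strong multilingual encoders. Beyond benchmarking, BiST offers explicit linguistic supervision for grammatical modeling tasks, including controlled text generation, automated feedback generation, and cross-lingual representation learning. The corpus establishes a unified resource for bilingual grammatical modeling and facilitates linguistically grounded multilingual research.

Abstract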

High-quality bilingual resources remain a critical bottleneck for advancing multilingual NLP in low-resource settings, particularly for Bangla. To mitigate this gap, we introduce BiST, a rigorously curated Bangla-English corpus for sentence-level grammatical classification, annotated across two fundamental dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus is compiled from open-licensed encyclopedic sources and naturally composed conversational text, followed by systematic preprocessing and automated language identification, resulting in 30,534 sentences, including 17,465 English and 13,069 Bangla instances. Annotation quality is ensured through a multi-stage framework with three independent annotators and dimension-wise Fleiss Kappa ($κ$) agreement, yielding reliable and reproducible labels with $κ$ values of 0.82 and 0.88 for structural and temporal annotation, respectively. Statistical analyses demonstrate realistic structural and temporal distributions, while baseline evaluations show that dual-encoder architectures leveraging complementary language-specific representations consistently outperform strong multilingual encoders. Beyond benchmarking, BiST provides explicit linguistic supervision that supports grammatical modeling tasks, including controlled text generation, automated feedback generation, and cross-lingual representation learning. The corpus establishes a unified resource for bilingual grammatical modeling and facilitates linguistically grounded multilingual research.
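For readers unfamiliar with the agreement statistic reported here, the following sketch computes Fleiss' kappa from a table of per-item category counts using the standard textbook formula. The five example sentences and their labels are hypothetical and are not drawn from BiST; the function name `fleiss_kappa` is likewise chosen for this example.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of annotators
    who assigned item i to category j (same number of annotators for every item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    # Observed agreement: share of agreeing annotator pairs, averaged over items
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the overall category proportions
    n_categories = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical tense labels from three annotators; columns: Present, Past, Future
tense_counts = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 0, 3],
    [2, 1, 0],  # one annotator disagrees on this sentence
    [0, 3, 0],
]
kappa = fleiss_kappa(tense_counts)
print(round(kappa, 3))  # 0.789
```

A single disagreement among five sentences already pulls kappa down to about 0.79, which gives a sense of how much agreement the reported values of 0.82 and 0.88 represent.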

Keywords: bilingual corpus, Bangla-English, sentence classification, grammatical annotation, multilingual NLP, low-resource languages, inter-annotator agreement, cross-lingual representation learning

63. ❌ MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition

Authors: Seoungsub Lee, In Seo Kim, Seon Wook Kim Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04701v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

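Scoring rationale: MUXQ focuses on quantization for large language models (LLMs). Its core contribution is a new mixed-to-uniform precision matrix quantization method that handles activation outliers through low-rank outlier decomposition. It is therefore highly relevant to 'Large Language Models' (10 points) and directly relevant to 'Quantization' (10 points). The paper pays particular attention to NPU-based on-device environments, which relates to 'Small Language Models/On-device AI' (8 points), and it touches on inference acceleration (8 points). Other keywords such as MoE, Scaling Laws, and Alignment are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes MUXQ, a quantization method that detects outlier channels in input activations and redistributes their magnitudes, allowing large language models to use low-precision integer quantization on edge devices while keeping accuracy close to FP16.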

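Abstract (translation)

Large language models (LLMs) have achieved excellent performance across a wide range of natural language processing tasks, but their huge parameter counts bring substantial memory and compute overhead. The challenge is especially critical in on-device environments based on neural processing units (NPUs), where FP16/FP32 computation is inefficient and integer (INT) quantization is therefore essential. However, existing methods, including ZeroQuant, LLM.int8(), and SmoothQuant, do not fully resolve outliers in input activations or the hardware inefficiency that comes with them. To overcome these limitations, we propose MUXQ (Mixed-to-Uniform Quantization). MUXQ detects outlier channels in the input activations and introduces a small auxiliary matrix that redistributes outlier magnitudes across channels, which alleviates the outlier problem. Even activation outliers can then be quantized at low-precision INT levels while the computation keeps a hardware-friendly structure. Experiments on the WikiText-2 dataset with GPT-2 models at three scales (0.1B, 0.3B, and 0.7B parameters) show that MUXQ consistently achieves lower perplexity than naive quantization. In particular, under per-tensor quantization, MUXQ quantizes both activations and weights to INT8 while keeping accuracy close to FP16. With only modest computational overhead, MUXQ delivers stable low-precision inference and can easily be combined with other quantization techniques. These results suggest that MUXQ is a promising direction for efficient and accurate LLM inference on edge devices.

Abstract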

Large language models (LLMs) have achieved outstanding performance across a wide range of natural language processing tasks, but their enormous parameter counts impose substantial memory and computational overheads. This challenge is particularly critical in NPU-based on-device environments, where FP16/FP32 computation is inefficient and integer (INT) quantization is therefore essential. However, existing methods, including ZeroQuant, LLM.int8(), and SmoothQuant, do not fully address input-activation outliers and the associated hardware inefficiencies. To overcome these limitations, we propose MUXQ (Mixed-to-Uniform Quantization). MUXQ detects outlier channels in input activations and introduces a small auxiliary matrix that redistributes outlier magnitudes across channels, thereby alleviating the outlier problem. This enables even activation outliers to be quantized at low-precision INT levels while preserving a hardware-friendly computation structure. Experiments on GPT-2 models at three scales (0.1B, 0.3B, and 0.7B parameters) using the WikiText-2 dataset show that MUXQ consistently achieves lower perplexity than naive quantization. In particular, under per-tensor quantization, MUXQ quantizes both activations and weights to INT8 while maintaining accuracy close to that of FP16. With only modest computational overhead, MUXQ enables stable low-precision inference and can be readily combined with other quantization techniques. These results suggest that MUXQ provides a promising direction for efficient and accurate LLM inference on edge devices.
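To show why activation outliers are a problem for per-tensor INT8 quantization, the sketch below applies plain symmetric per-tensor quantization (the "naive" baseline, not MUXQ itself) to a small list of values with and without a single outlier. The activation values and helper names are hypothetical and chosen only for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floating-point values."""
    return [c * scale for c in codes]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical activations: typical values plus one large outlier channel
normal = [0.12, -0.40, 0.33, -0.05, 0.27, -0.31, 0.08, 0.19]
with_outlier = normal + [60.0]

codes_a, scale_a = quantize_int8(normal)
err_without = mean_abs_error(normal, dequantize(codes_a, scale_a))

codes_b, scale_b = quantize_int8(with_outlier)
restored_b = dequantize(codes_b, scale_b)
# Error measured on the same eight typical values, now sharing a scale with the outlier
err_with = mean_abs_error(normal, restored_b[:len(normal)])

print(f"scale without outlier: {scale_a:.5f}, error: {err_without:.5f}")
print(f"scale with outlier:    {scale_b:.5f}, error: {err_with:.5f}")
```

The single outlier stretches the shared scale so far that most typical values collapse to 0 or ±1 quantization steps. Methods such as the one described in the abstract aim to reduce this effect before the tensor is quantized.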

Keywords: Large language models, Quantization, On-device inference, INT8, Activation outliers, Low-precision inference, Model compression, Edge devices

64. ❌ Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity

Authors: Jaeyoon Jung, Yejun Yoon, Kunwoo Park Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04692v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

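Scoring rationale: The paper studies a multimodal fact-checking framework and proposes AMuFC, a system with two agents, an Analyzer and a Verifier, that uses visual evidence adaptively. Although it involves multimodal AI and fact-checking, the paper does not explicitly mention or examine in depth any of the specific large-model techniques, training methods, reasoning techniques, agent systems, or scientific AI applications named in the scoring keywords. All keywords concern large-model principles, training optimization, inference acceleration, agent architectures, or applications in specific scientific fields, whereas this paper focuses on an application framework for multimodal fact-checking and does not address those underlying techniques or fields.

!!! tip deepseek-chat TL;DR

The paper challenges the assumption that visual evidence always improves multimodal fact-checking and proposes the AMuFC framework, in which an Analyzer judges whether visual evidence is necessary and guides a Verifier. Experiments show that the approach significantly improves verification accuracy, and the authors release a new dataset, WebFC.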

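Abstract (translation)

Automated fact-checking is a key task not only for journalism but also on web platforms, where it helps build a responsible information ecosystem and reduces the harm caused by misinformation. Although recent research has moved from text-only to multimodal fact-checking, it is commonly assumed that adding visual evidence always improves performance. This study challenges that assumption and shows that using multimodal evidence indiscriminately can lower accuracy. To address the problem, we propose AMuFC, a multimodal fact-checking framework in which two collaborating agents with clearly separated roles use visual evidence adaptively: an Analyzer decides whether visual evidence is needed to verify a claim, and a Verifier predicts the claim's veracity from the retrieved evidence and the Analyzer's assessment. Experiments on three datasets show that feeding the Analyzer's assessment of visual-evidence necessity into the Verifier's prediction significantly improves verification performance. In addition to all code, we release WebFC, a newly built dataset for evaluating fact-checking modules in more realistic scenarios, available at https://github.com/ssu-humane/AMuFC.

Abstract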

Automated fact-checking is a crucial task not only in journalism but also across web platforms, where it supports a responsible information ecosystem and mitigates the harms of misinformation. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact-checking framework that employs two collaborative agents with distinct roles for the adaptive use of visual evidence: An Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer’s assessment. Experimental results on three datasets show that incorporating the Analyzer’s assessment of visual evidence necessity into the Verifier’s prediction yields substantial improvements in verification performance. In addition to all code, we release WebFC, a newly constructed dataset for evaluating fact-checking modules in a more realistic scenario, available at https://github.com/ssu-humane/AMuFC.

Keywords: multimodal fact-checking, visual evidence, adaptive framework, AMuFC, analyzer, verifier, WebFC dataset, misinformation mitigation

65. ❌ Pickalo: Leveraging 6D Pose Estimation for Low-Cost Industrial Bin Picking

Authors: Alessandro Tarsi, Matteo Mastrogiuseppe, Saverio Taliani, Simone Cortinovis, Ugo Pattacini Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04690v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

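Scoring rationale: The paper focuses on robot vision and grasping systems and uses conventional computer vision and deep learning techniques (such as Mask-RCNN and SAM-6D) to solve an industrial bin-picking problem. It does not touch on large language models, model training, alignment, reasoning, agent systems, or the other keyword areas. The work belongs to robotics and applied computer vision and is unrelated to the listed keywords on large models and deep learning principles.

!!! tip deepseek-chat TL;DR

The paper presents Pickalo, an industrial bin-picking system based on 6D pose estimation and low-cost hardware. Using multi-view fusion and models trained on synthetic data, it reaches up to 600 mean picks per hour with a 96-99% grasp success rate on densely filled euroboxes.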

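Abstract (translation)

Bin picking in real industrial environments remains challenging because of heavy clutter, occlusion, and the high cost of conventional 3D sensing setups. This paper presents Pickalo, a modular 6D pose-based bin-picking pipeline built entirely on low-cost hardware. A wrist-mounted RGB-D camera actively explores the scene from multiple viewpoints, and the raw stereo streams are processed with BridgeDepth to obtain refined depth maps suitable for accurate collision reasoning. Object instances are segmented with a Mask-RCNN model trained purely on photorealistic synthetic data and localized with the zero-shot SAM-6D pose estimator. A pose buffer module fuses multi-view observations over time, which handles object symmetries and significantly reduces pose noise. Offline, we generate and curate a large set of antipodal grasp candidates for each object; online, grasp planning queries a utility-based ranking together with fast collision checking. Deployed on a UR5e arm with a parallel-jaw gripper and an Intel RealSense D435i camera, Pickalo achieves up to 600 mean picks per hour on densely filled euroboxes, with a grasp success rate of 96-99% and robust performance over 30-minute continuous runs. Ablation studies confirm that enhanced depth estimation and the pose buffer improve long-term stability and throughput under realistic industrial conditions. Videos are available at https://mesh-iit.github.io/project-jl2-camozzi/

Abstract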

Bin picking in real industrial environments remains challenging due to severe clutter, occlusions, and the high cost of traditional 3D sensing setups. We present Pickalo, a modular 6D pose-based bin-picking pipeline built entirely on low-cost hardware. A wrist-mounted RGB-D camera actively explores the scene from multiple viewpoints, while raw stereo streams are processed with BridgeDepth to obtain refined depth maps suitable for accurate collision reasoning. Object instances are segmented with a Mask-RCNN model trained purely on photorealistic synthetic data and localized using the zero-shot SAM-6D pose estimator. A pose buffer module fuses multi-view observations over time, handling object symmetries and significantly reducing pose noise. Offline, we generate and curate large sets of antipodal grasp candidates per object; online, a utility-based ranking and fast collision checking are queried for the grasp planning. Deployed on a UR5e with a parallel-jaw gripper and an Intel RealSense D435i, Pickalo achieves up to 600 mean picks per hour with 96-99% grasp success and robust performance over 30-minute runs on densely filled euroboxes. Ablation studies demonstrate the benefits of enhanced depth estimation and of the pose buffer for long-term stability and throughput in realistic industrial conditions. Videos are available at https://mesh-iit.github.io/project-jl2-camozzi/

Keywords: 6D pose estimation, bin picking, industrial robotics, low-cost hardware, Mask-RCNN, SAM-6D, grasp planning, collision checking

66. ❌ On the “Causality” Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go

Authors: Nima H. Siboni Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04686v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

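Scoring rationale: The paper addresses a pedagogical question of mathematical rigor in policy gradient derivations in reinforcement learning, specifically the step in the REINFORCE derivation that replaces the full return with the reward-to-go. Its content is confined to mathematical derivation and pedagogical clarification within reinforcement learning; it does not involve large models, deep learning techniques, AI applications, or related innovations. Every keyword targets large models, deep learning techniques, AI applications, or related innovation areas, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper clarifies the rigor of the step from full return to reward-to-go in policy gradient derivations. Using a derivation based on prefix trajectory distributions and the score-function identity, it shows that reward-to-go follows directly from decomposing the objective rather than being a post hoc unbiased replacement.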

摘要翻译

在策略梯度的入门讲解中，人们常使用完整轨迹回报推导REINFORCE估计器，随后基于“因果性”指出完整回报可被“累计奖励”（reward-to-go）替代。尽管这一结论正确，但其论证的严谨性往往不足，导致过去奖励项如何消失的过程不够清晰。本文聚焦于这一步骤，基于前缀轨迹分布与得分函数恒等式，给出了数学上明确的推导过程。所得结果并未改变估计器本身，其贡献在于概念层面：它表明“累计奖励”并非事后对完整回报的无偏替代，而是目标函数按前缀轨迹分解后的直接产物。在此框架下，常见的因果性论证成为推导过程的自然推论，而非额外引入的启发性原则。

摘要 (Abstract)

In introductory presentations of policy gradients, one often derives the REINFORCE estimator using the full trajectory return and then states, by "causality," that the full return may be replaced by the reward-to-go. Although this statement is correct, it is frequently presented at a level of rigor that leaves unclear where the past-reward terms disappear. This short paper isolates that step and gives a mathematically explicit derivation based on prefix trajectory distributions and the score-function identity. The resulting account does not change the estimator. Its contribution is conceptual: instead of presenting reward-to-go as a post hoc unbiased replacement for full return, it shows that reward-to-go arises directly once the objective is decomposed over prefix trajectories. In this formulation, the usual causality argument is recovered as a corollary of the derivation rather than as an additional heuristic principle.

Keywords: policy gradients, REINFORCE estimator, full trajectory return, reward-to-go, causality, prefix trajectory distributions, score-function identity, mathematical derivation
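
The abstract describes the derivation without reproducing it. A minimal sketch of how such an argument can go, under assumed notation that is not taken from the paper (finite horizon $T$, rewards $r_t = r(s_t, a_t)$, prefix $\tau_{0:t} = (s_0, a_0, \dots, s_t, a_t)$, and initial-state and transition distributions that do not depend on $\theta$):

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta}\Big[\sum_{t=0}^{T-1} r_t\Big]
           = \sum_{t=0}^{T-1} \mathbb{E}_{\tau_{0:t} \sim p_\theta}\big[r_t\big]
           && \text{($r_t$ depends only on the prefix $\tau_{0:t}$)} \\
\nabla_\theta \log p_\theta(\tau_{0:t}) &= \sum_{k=0}^{t} \nabla_\theta \log \pi_\theta(a_k \mid s_k)
           && \text{(initial-state and transition terms carry no $\theta$)} \\
\nabla_\theta J(\theta) &= \sum_{t=0}^{T-1} \mathbb{E}_{\tau_{0:t} \sim p_\theta}\Big[r_t \sum_{k=0}^{t} \nabla_\theta \log \pi_\theta(a_k \mid s_k)\Big]
           && \text{(score-function identity, applied per prefix)} \\
&= \mathbb{E}_{\tau \sim p_\theta}\Big[\sum_{k=0}^{T-1} \nabla_\theta \log \pi_\theta(a_k \mid s_k) \sum_{t=k}^{T-1} r_t\Big]
           && \text{(swap the order of summation)}
\end{align}
```

Under these assumptions, terms that pair $r_t$ with a later action $a_k$, $k > t$, never enter the sum, because $r_t$ is averaged over a prefix that ends at step $t$; nothing has to be cancelled afterwards.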

67. ❌ ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration

Authors: Rongfeng Zhao, Xuanhao Zhang, Zhaochen Guo, Xiang Shao, Zhongpan Zhu, Bin He, Jie Chen Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04664v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

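Scoring rationale: The paper's core topic is the integration of LLMs with embodied agents (robots). It proposes the ROSClaw framework to bridge the gap between semantic understanding and physical execution and to enable heterogeneous multi-robot collaboration. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10), because the paper explicitly uses LLMs to improve reasoning; highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10), because the framework centers on autonomous workflows for agents (robots); highly relevant to "Multi-agent Systems OR Agent Coordination" (10), because the framework handles heterogeneous multi-robot collaboration and task assignment; and partially relevant to "Tool Use OR Function Calling OR API Tool Use" (5), because the framework involves tool-based execution and generation of SDK-level control programs, though this is not its core. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and receive 0.

!!! tip deepseek-chat TL;DR

The paper targets the gap between semantic understanding and physical execution in LLM-robot integration. It proposes the ROSClaw framework, which uses a unified VLM controller and a sim-to-real mapping to support heterogeneous multi-robot collaboration and continual policy optimization.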

摘要翻译

大型语言模型（LLMs）与具身智能体的融合提升了高层推理能力，然而语义理解与物理执行之间仍存在关键断层。尽管视觉-语言-动作（VLA）与视觉-语言-导航（VLN）系统使机器人能够依据自然语言指令执行操作与导航任务，但在处理长时序、具有时间结构的任务时仍面临困难。现有框架通常采用模块化流程进行数据收集、技能训练与策略部署，导致实验验证与策略优化的成本高昂。为应对这些局限，我们提出ROSClaw——一种面向异构机器人的智能体框架，将策略学习与任务执行集成于统一的视觉-语言模型（VLM）控制器中。该框架利用异构机器人的e-URDF表征作为物理约束，构建从仿真到现实的拓扑映射，实现对仿真与真实世界智能体物理状态的实时访问。我们进一步引入数据收集与状态累积机制，在真实世界执行过程中存储机器人状态、多模态观测与执行轨迹，以支持后续迭代式策略优化。在部署阶段，统一智能体维持推理与执行间的语义连续性，并动态分配任务专属控制权至不同智能体，从而提升多策略执行的鲁棒性。通过建立自主闭环框架，ROSClaw最大程度降低对机器人专属开发流程的依赖。该框架支持硬件级验证、自动化生成SDK级控制程序以及基于工具的执行方式，能够实现机器人技能的快速跨平台迁移与持续优化。项目页面：https://www.rosclaw.io/。

摘要 (Abstract)

The integration of large language models (LLMs) with embodied agents has improved high-level reasoning capabilities; however, a critical gap remains between semantic understanding and physical execution. While vision-language-action (VLA) and vision-language-navigation (VLN) systems enable robots to perform manipulation and navigation tasks from natural language instructions, they still struggle with long-horizon sequential and temporally structured tasks. Existing frameworks typically adopt modular pipelines for data collection, skill training, and policy deployment, resulting in high costs in experimental validation and policy optimization. To address these limitations, we propose ROSClaw, an agent framework for heterogeneous robots that integrates policy learning and task execution within a unified vision-language model (VLM) controller. The framework leverages e-URDF representations of heterogeneous robots as physical constraints to construct a sim-to-real topological mapping, enabling real-time access to the physical states of both simulated and real-world agents. We further incorporate a data collection and state accumulation mechanism that stores robot states, multimodal observations, and execution trajectories during real-world execution, enabling subsequent iterative policy optimization. During deployment, a unified agent maintains semantic continuity between reasoning and execution, and dynamically assigns task-specific control to different agents, thereby improving robustness in multi-policy execution. By establishing an autonomous closed-loop framework, ROSClaw minimizes the reliance on robot-specific development workflows. The framework supports hardware-level validation, automated generation of SDK-level control programs, and tool-based execution, enabling rapid cross-platform transfer and continual improvement of robotic skills. Our project page: https://www.rosclaw.io/.

Keywords: Large Language Models, Embodied Agents, Heterogeneous Multi-Agent Collaboration, Vision-Language Model, Sim-to-Real Mapping, Policy Optimization, Robotic Skills, Autonomous Closed-Loop Framework

68. ❌ Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

Authors: Seamus Brady Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

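Scoring rationale: The paper's core topic is a persistent runtime for LLM agents, making it highly relevant to "LLM Agents" (10). It covers the agent's self-diagnostic and reflective capabilities, making it highly relevant to "Self-Correction" (10). The system uses case-based reasoning and hybrid retrieval, which is partially relevant to "Retrieval-Augmented Generation" (5). The agent perceives its environment through a structured self-state, which implicitly touches on tool use and is partially relevant to "Tool Use" (5). Other keywords such as MoE, quantization, and alignment are not addressed in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper presents Springdrift, an auditable persistent runtime for LLM agents. Through case-based memory, normative safety, and ambient self-perception, it supports cross-session task continuity, decision traceability, and self-diagnosis, illustrated by a 23-day single-instance deployment.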

摘要翻译

本文提出Springdrift——一种面向长生命周期LLM智能体的持久化运行时系统。该系统整合了以下核心组件：一个可审计的执行基座（仅追加内存、受监督进程、Git支持的回滚机制）；一个采用混合检索策略的基于案例推理的记忆层（评估基于密集余弦相似度基线）；一套用于安全门控的确定性规范演算系统，附带可审计的公理追溯链；以及通过结构化自我状态表征（感知域）在每个周期无需工具调用即可注入的持续环境自我感知能力。这些特性共同支持了在会话绑定系统中难以实现的行为：跨会话任务连续性、跨通道上下文维持、端到端的决策过程取证重建，以及自我诊断行为。我们报告了为期23天（19个运行日）的单实例部署情况，在此期间，智能体在无明确指令的情况下自主诊断了自身基础设施缺陷、对故障模式进行分类、识别出一个架构漏洞，并维持了跨电子邮件与网页通道的上下文连贯性。我们为此类系统引入“人工随从”这一术语：指代一种具备持久记忆、明确权限边界、领域特定自主性，并在与特定委托人的持续关系中具有取证追责能力的非人类系统——其区别于普通软件助手与自主智能体，概念上借鉴了专业随从关系与受训工作动物的有限自主性。本报告为系统设计与部署案例研究的技术说明，而非基于基准测试的性能评估。证据来源于单操作者的单实例运行，旨在例证这些架构特性在实际中能够支撑的功能。系统基于Erlang/OTP平台，采用约Gleam语言实现。代码、构件及脱敏运行日志将于发表后在https://github.com/seamus-brady/springdrift 公开。

摘要 (Abstract)

We present Springdrift, a persistent runtime for long-lived LLM agents. The system integrates an auditable execution substrate (append-only memory, supervised processes, git-backed recovery), a case-based reasoning memory layer with hybrid retrieval (evaluated against a dense cosine baseline), a deterministic normative calculus for safety gating with auditable axiom trails, and continuous ambient self-perception via a structured self-state representation (the sensorium) injected each cycle without tool calls. These properties support behaviours difficult to achieve in session-bounded systems: cross-session task continuity, cross-channel context maintenance, end-to-end forensic reconstruction of decisions, and self-diagnostic behaviour. We report on a single-instance deployment over 23 days (19 operating days), during which the agent diagnosed its own infrastructure bugs, classified failure modes, identified an architectural vulnerability, and maintained context across email and web channels – without explicit instruction. We introduce the term Artificial Retainer for this category: a non-human system with persistent memory, defined authority, domain-specific autonomy, and forensic accountability in an ongoing relationship with a specific principal – distinguished from software assistants and autonomous agents, drawing on professional retainer relationships and the bounded autonomy of trained working animals. This is a technical report on a systems design and deployment case study, not a benchmark-driven evaluation. Evidence is from a single instance with a single operator, presented as illustration of what these architectural properties can support in practice. Implemented in approximately Gleam on Erlang/OTP. Code, artefacts, and redacted operational logs will be available at https://github.com/seamus-brady/springdrift upon publication.

Keywords: LLM Agents, Persistent Runtime, Case-Based Memory, Auditable Execution, Self-Perception, Normative Safety, Artificial Retainer, Cross-session Continuity
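
The abstract mentions an append-only memory and a dense cosine baseline for retrieval but gives no implementation details. As a rough illustration only, the sketch below shows what a dense cosine top-k lookup over an append-only case list can look like. The names `Case`, `AppendOnlyCaseStore`, and `top_k` and the toy two-dimensional embeddings are hypothetical, and the code is Python rather than the Gleam used by Springdrift; it is not the system's hybrid retrieval.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One stored case: a text description plus a dense embedding (hypothetical schema)."""
    text: str
    embedding: tuple


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class AppendOnlyCaseStore:
    """Cases can be appended and read, but never modified or removed."""

    def __init__(self):
        self._cases = []

    def append(self, case):
        self._cases.append(case)

    def top_k(self, query, k=1):
        """Return the k stored cases whose embeddings are most similar to the query."""
        ranked = sorted(self._cases, key=lambda c: cosine(query, c.embedding), reverse=True)
        return ranked[:k]


store = AppendOnlyCaseStore()
store.append(Case("disk full", (1.0, 0.0)))
store.append(Case("auth timeout", (0.0, 1.0)))
best = store.top_k((0.9, 0.1), k=1)[0]
print(best.text)
```

A hybrid scheme would combine this dense score with at least one other signal; which signals Springdrift combines is not stated in the abstract.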

69. ❌ Grokking as Dimensional Phase Transition in Neural Networks

Authors: Ping Wang Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04655v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

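Scoring rationale: The paper studies grokking in neural networks (the abrupt transition from memorization to generalization) and, through an analysis of gradient avalanche dynamics, characterizes it as a dimensional phase transition. This is foundational deep learning theory. All keywords focus on specific areas such as large-model techniques, applications, training methods, and inference optimization, whereas this paper examines a general theoretical mechanism of neural network learning dynamics and involves no specific large-model architecture, training technique, application scenario, or optimization method. It is therefore unrelated to all keywords.

!!! tip deepseek-chat TL;DR

The paper finds that grokking in neural networks is a dimensional phase transition: the effective dimensionality of the gradient field geometry crosses from subcritical to supercritical at generalization onset, offering new insight into the trainability of overparameterized networks.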

摘要翻译

神经网络顿悟——即从记忆到泛化的突变式转变——对我们理解学习动力学提出了挑战。通过对八种模型尺度下梯度雪崩动力学的有限尺寸标度分析，我们发现顿悟是一种维度相变：在泛化起始点，有效维度~$D$从亚扩散（亚临界，$D < 1$）跨越到超扩散（超临界，$D > 1$），并表现出自组织临界性（self-organized criticality, SOC）。关键的是，$D$反映的是梯度场几何结构，而非网络架构：合成的独立同分布（i.i.d.）高斯梯度无论图拓扑如何均保持$D \approx 1$，而真实训练则因反向传播相关性产生维度超额。顿悟过程中局部化的$D(t)$跨越现象——在不同拓扑结构中均稳健存在——为理解过参数化网络的可训练性提供了新的视角。

摘要 (Abstract)

Neural network grokking – the abrupt memorization-to-generalization transition – challenges our understanding of learning dynamics. Through finite-size scaling of gradient avalanche dynamics across eight model scales, we find that grokking is a dimensional phase transition: effective dimensionality $D$ crosses from sub-diffusive (subcritical, $D < 1$) to super-diffusive (supercritical, $D > 1$) at generalization onset, exhibiting self-organized criticality (SOC). Crucially, $D$ reflects gradient field geometry, not network architecture: synthetic i.i.d. Gaussian gradients maintain $D \approx 1$ regardless of graph topology, while real training exhibits dimensional excess from backpropagation correlations. The grokking-localized $D(t)$ crossing – robust across topologies – offers new insight into the trainability of overparameterized networks.

Keywords: neural network grokking, dimensional phase transition, gradient avalanche dynamics, self-organized criticality, effective dimensionality, gradient field geometry, overparameterized networks, trainability

Authors: Yeonwoo Cha, Jaehoon Yoo, Semin Kim, Yunseo Park, Jinhyeon Kwon, Seunghoon Hong Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04646v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

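Scoring rationale: The paper studies sampling optimization for flow-based generative models, proposing the Flow Divergence Sampler (FDS) framework to improve generation quality. Its core content covers generative models, sampling algorithms, velocity-field refinement, and text-to-image synthesis, whereas all given keywords target large language models (LLMs) and related techniques (such as fine-tuning, alignment, inference optimization, and agent systems) or specific AI-for-science applications. The paper mentions no language models, large-model techniques, or LLM-specific methods, and does not touch scientific AI applications such as bioinformatics, so it is unrelated to all keywords.

!!! tip deepseek-chat TL;DR

The paper addresses the problem that the averaged velocity field in flow-based generative models can misguide samples toward low-density regions. It proposes the training-free Flow Divergence Sampler (FDS) framework, which uses the divergence of the velocity field as a signal to refine intermediate states, consistently improving fidelity across various generation tasks.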

摘要翻译

基于流的模型通过建模边际速度场来学习目标分布，该速度场定义为从简单先验分布到目标数据的各样本间连接速度的平均值。然而，当样本速度在相同中间状态发生冲突时，这种平均速度可能将样本误导至低密度区域，从而降低生成质量。为解决此问题，我们提出流散度采样器（Flow Divergence Sampler, FDS），这是一种无需额外训练、可在每个求解器步骤前优化中间状态的框架。我们的核心发现表明，这种误导的严重程度可通过边际速度场的散度来量化，该散度在使用优化良好的模型进行推理时易于计算。FDS利用这一信号将状态引导至歧义性更低的区域。作为一个与标准求解器和即用型流骨干网络兼容的即插即用框架，FDS在包括文本到图像合成和逆问题在内的多种生成任务中持续提升生成保真度。

摘要 (Abstract)

Flow-based models learn a target distribution by modeling a marginal velocity field, defined as the average of sample-wise velocities connecting each sample from a simple prior to the target data. When sample-wise velocities conflict at the same intermediate state, however, this averaged velocity can misguide samples toward low-density regions, degrading generation quality. To address this issue, we propose the Flow Divergence Sampler (FDS), a training-free framework that refines intermediate states before each solver step. Our key finding reveals that the severity of this misguidance is quantified by the divergence of the marginal velocity field, which is readily computable during inference with a well-optimized model. FDS exploits this signal to steer states toward less ambiguous regions. As a plug-and-play framework compatible with standard solvers and off-the-shelf flow backbones, FDS consistently improves fidelity across various generation tasks, including text-to-image synthesis and inverse problems.

Keywords: flow-based models, generation quality, velocity field, sampling, divergence, text-to-image synthesis, inverse problems, training-free refinement
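
The abstract does not say how the divergence is computed or how FDS uses it to refine states. The sketch below only illustrates what "divergence of a velocity field" means numerically, using central finite differences on a hypothetical linear toy field. The names `divergence_fd` and `toy_velocity`, the matrix `A`, and the step size are assumptions for the example; this is not the FDS algorithm.

```python
def divergence_fd(velocity, x, eps=1e-4):
    """Estimate div v(x) = sum_i dv_i/dx_i with central finite differences."""
    div = 0.0
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += eps
        minus[i] -= eps
        # Only the i-th output component contributes to the i-th diagonal term.
        div += (velocity(plus)[i] - velocity(minus)[i]) / (2.0 * eps)
    return div


# Hypothetical toy field v(x) = A x; its divergence equals trace(A) at every point.
A = [[0.5, 2.0],
     [-1.0, -1.5]]


def toy_velocity(x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


estimate = divergence_fd(toy_velocity, [0.3, -0.7])
print(estimate)  # close to trace(A) = -1.0, up to floating-point error
```

For a learned velocity network in many dimensions, this coordinate-by-coordinate loop needs two network evaluations per dimension, so automatic differentiation or a stochastic trace estimator would be a more practical choice.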

71. ❌ Same World, Differently Given: History-Dependent Perceptual Reorganization in Artificial Agents

Authors: Hongju Pae Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04637v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

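Scoring rationale: The paper studies a mechanism for history-dependent perceptual reorganization in artificial agents, evaluated with a minimal architecture in a gridworld. All keywords concern large models, deep learning technical principles, or scientific applications, whereas this paper focuses on basic artificial agent architecture, perceptual organization, and history sensitivity. It does not address any large-model technique, training method, inference technique, agent system, or scientific application, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how a slow perspective latent that feeds back into perception lets an artificial agent undergo history-dependent perceptual reorganization, so that identical observations are encoded differently depending on the agent's accumulated stance. In a minimal gridworld, it shows that the mechanism yields adaptive self-modulation and perceptual rather than behavioral reorganization.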

摘要翻译

何种内部组织架构能使人工智能体不仅适应其行为，更能维持一种对其所处世界具有历史敏感性的视角？本文提出一种最小化架构：其中缓慢演化的视角潜变量$g$会反馈至感知过程，其自身亦通过感知处理进行更新。这使得相同观察能依据智能体累积的立场获得差异化编码。该模型在具有固定空间框架与感官扰动的简化网格世界中进行评估。综合分析得出三项结果：首先，扰动历史会在名义条件恢复后，于适应性可塑性中留下可测量的残余痕迹。其次，视角潜变量会重组感知编码，导致相同观察因先验经验差异而获得不同表征。第三，唯有适应性自我调节能产生特征性的“增长-稳定”动态，这与僵化或持续开放的更新机制形成对比。宏观行为始终保持稳定，表明主导性重组发生在感知层面而非行为层面。这些发现共同揭示了一种人工智能体中实现历史依赖性视角组织的最小化机制。

摘要 (Abstract)

What kind of internal organization would allow an artificial agent not only to adapt its behavior, but to sustain a history-sensitive perspective on its world? I present a minimal architecture in which a slow perspective latent $g$ feeds back into perception and is itself updated through perceptual processing. This allows identical observations to be encoded differently depending on the agent’s accumulated stance. The model is evaluated in a minimal gridworld with a fixed spatial scaffold and sensory perturbations. Across analyses, three results emerge: first, perturbation history leaves measurable residue in adaptive plasticity after nominal conditions are restored. Second, the perspective latent reorganizes perceptual encoding, such that identical observations are represented differently depending on prior experience. Third, only adaptive self-modulation yields the characteristic growth-then-stabilization dynamic, unlike rigid or always-open update regimes. Gross behavior remains stable throughout, suggesting that the dominant reorganization is perceptual rather than behavioral. Together, these findings identify a minimal mechanism for history-dependent perspectival organization in artificial agents.

Keywords: artificial agents, perceptual reorganization, history-dependent perspective, slow perspective latent, adaptive self-modulation, gridworld evaluation, perceptual encoding, internal organization
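
The abstract gives the mechanism only in words. As a toy illustration, and not the paper's architecture, the sketch below uses a scalar leaky-integrator latent `g` that biases a scalar encoder and is in turn nudged toward the encoder's output. The class name `SlowLatentAgent`, the `tanh` encoder, and the `rate` and `gain` values are all assumptions made for the example.

```python
import math


class SlowLatentAgent:
    """Toy agent: a slow latent g modulates perception and is updated by it."""

    def __init__(self, rate=0.05, gain=1.0):
        self.g = 0.0      # slow perspective latent (scalar in this toy version)
        self.rate = rate  # small rate, so g changes slowly
        self.gain = gain  # how strongly g biases the encoder

    def perceive(self, obs):
        # Perception depends on both the observation and the current latent.
        code = math.tanh(obs + self.gain * self.g)
        # The latent drifts slowly toward the perceptual code it helped produce.
        self.g += self.rate * (code - self.g)
        return code


def encode_after(history, probe):
    """Run a fresh agent through a history, then encode one probe observation."""
    agent = SlowLatentAgent()
    for obs in history:
        agent.perceive(obs)
    return agent.perceive(probe)


calm = encode_after([0.0] * 50, probe=0.5)
perturbed = encode_after([2.0] * 50, probe=0.5)
print(calm, perturbed)
```

With these assumed settings, the probe observation 0.5 receives a larger code after the perturbed history than after the calm one, which is the qualitative effect the abstract describes: identical observations encoded differently depending on accumulated history.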

72. ❌ Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale

Authors: Zhengcen Li, Chenyang Jiang, Hang Zhao, Shiyang Zhou, Yunyang Mo, Feng Gao, Fan Yang, Qiben Shan, Shaocong Wu, Jingyong Su Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04634v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

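Scoring rationale: The paper focuses on AI-generated video detection, proposing a native-scale detection framework built on the Qwen2.5-VL Vision Transformer and constructing a large-scale dataset. Although it involves AI-generated content (AIGC) and detection techniques, all keywords target the technical principles, training methods, inference optimization, alignment, application paradigms (such as agents and RAG), or specific scientific applications of large models (LLMs). The paper does not examine any large-model internals (such as MoE, scaling laws, PEFT, or RLHF), nor applications of large models in scientific domains (such as bioinformatics). Its core is video forensics and detection in computer vision, with no direct connection to the given large-model keywords.

!!! tip deepseek-chat TL;DR

The paper addresses information loss caused by preprocessing and the use of outdated datasets in AI-generated video detection. It proposes a native-scale detection framework and builds a large-scale dataset, achieving superior performance on multiple benchmarks.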

摘要翻译

视频生成模型的快速发展使得高度逼真的合成媒体内容得以创建，这引发了关于虚假信息传播的重大社会关切。然而，当前的检测方法存在关键局限。它们依赖于固定分辨率调整和裁剪等预处理操作，这些操作不仅会丢弃细微的高频伪造痕迹，还会导致空间失真和显著的信息损失。此外，现有方法通常在过时的数据集上进行训练和评估，这些数据集无法捕捉现代生成模型的复杂性。为应对这些挑战，我们引入了一个综合性数据集和一个新颖的检测框架。首先，我们构建了一个包含来自15个最先进开源及商业生成器的超过14万视频的大规模数据集，同时设计了专门用于评估超逼真合成内容的Magic Videos基准。此外，我们提出了一种基于Qwen2.5-VL Vision Transformer的新型检测框架，该框架能够原生支持可变空间分辨率和时间长度。这种原生尺度方法有效保留了传统预处理过程中通常丢失的高频伪影和时空不一致性。大量实验表明，我们的方法在多个基准测试中均实现了卓越性能，凸显了原生尺度处理的关键重要性，并为AI生成视频检测建立了稳健的新基准。

摘要 (Abstract)

The rapid advancement of video generation models has enabled the creation of highly realistic synthetic media, raising significant societal concerns regarding the spread of misinformation. However, current detection methods suffer from critical limitations. They rely on preprocessing operations like fixed-resolution resizing and cropping. These operations not only discard subtle, high-frequency forgery traces but also cause spatial distortion and significant information loss. Furthermore, existing methods are often trained and evaluated on outdated datasets that fail to capture the sophistication of modern generative models. To address these challenges, we introduce a comprehensive dataset and a novel detection framework. First, we curate a large-scale dataset of over 140K videos from 15 state-of-the-art open-source and commercial generators, along with the Magic Videos benchmark, designed specifically for evaluating ultra-realistic synthetic content. In addition, we propose a novel detection framework built on the Qwen2.5-VL Vision Transformer, which operates natively at variable spatial resolutions and temporal durations. This native-scale approach effectively preserves the high-frequency artifacts and spatiotemporal inconsistencies typically lost during conventional preprocessing. Extensive experiments demonstrate that our method achieves superior performance across multiple benchmarks, underscoring the critical importance of native-scale processing and establishing a robust new baseline for AI-generated video detection.

Keywords: AI-generated video detection, forgery artifacts, native-scale processing, Vision Transformer, synthetic media, video generation models, high-frequency traces, spatiotemporal inconsistencies

73. ❌ A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs

Authors: Bohao Li, Tao Zou, Junchen Ye, Yan Gong, Bowen Du Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04614v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

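Scoring rationale: The paper is a deep learning application in healthcare. It proposes HealthPoint, a clinical point cloud paradigm for in-hospital mortality prediction from multi-level incomplete multimodal electronic health records (EHRs). Its core contribution is a new way of handling data incompleteness (irregular sampling, missing modalities, sparse labels), including a low-rank relational attention mechanism and a hierarchical interaction and sampling strategy. The scoring keywords all concern specific LLM techniques (training methods, inference optimization, agent systems, and so on), and the paper neither uses nor improves LLMs. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to biomedical informatics. Since its core innovation is a deep learning architecture rather than AI for Science in the broad sense, it receives 5 points (some relevance). All other keywords are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the temporal misalignment, modality imbalance, and limited supervision caused by multi-level incomplete multimodal electronic health records (EHRs), the paper proposes HealthPoint, a unified clinical point cloud paradigm. Using a low-rank relational attention mechanism and a hierarchical interaction strategy, it achieves state-of-the-art performance and strong robustness on in-hospital mortality prediction.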

Abstract

Deep learning-based modeling of multimodal Electronic Health Records (EHRs) has become an important approach for clinical diagnosis and risk prediction. However, due to diverse clinical workflows and privacy constraints, raw EHRs are inherently multi-level incomplete, including irregular sampling, missing modalities, and sparse labels. These issues cause temporal misalignment, modality imbalance, and limited supervision. Most existing multimodal methods assume relatively complete data, and even methods designed for incompleteness usually address only one or two of these issues in isolation. As a result, they often rely on rigid temporal/modal alignment or discard incomplete data, which may distort raw clinical semantics. To address this problem, we propose HealthPoint (HP), a unified clinical point cloud paradigm for multi-level incomplete EHRs. HP represents heterogeneous clinical events as points in a continuous 4D space defined by content, time, modality, and case. To model interactions between arbitrary point pairs, we introduce a Low-Rank Relational Attention mechanism that efficiently captures high-order dependencies across these four dimensions. We further develop a hierarchical interaction and sampling strategy to balance fine-grained modeling and computational efficiency. Built on this framework, HP enables flexible event-level interaction and fine-grained self-supervision, supporting robust modality recovery and effective use of unlabeled data. Experiments on large-scale EHR datasets for risk prediction show that HP consistently achieves state-of-the-art performance and strong robustness under varying degrees of incompleteness.

Keywords: Electronic Health Records, multimodal EHRs, clinical point cloud, incomplete data, mortality prediction, low-rank relational attention, self-supervision, healthcare AI

74. ❌ AI Agents Under EU Law

Authors: Luca Nannini, Adam Leon Smith, Michele Joshua Maggini, Enrico Panai, Sandra Feliciano, Aleksandr Tiulkanov, Elena Maran, James Gealy, Piercosma Bisconti Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04604v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

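Scoring rationale: The paper "AI Agents Under EU Law" studies how AI agents comply with the EU legal framework, covering regulatory mapping, a taxonomy, and a compliance architecture. It is unrelated to most of the technical keywords (model architecture, training methods, inference optimization, and so on), which concern the technical principles, training, or performance of large models, whereas this paper focuses on legal and regulatory analysis. Two keywords are highly relevant: 1. "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points): AI agents are the paper's central subject, and it discusses their autonomous planning, tool invocation, and multi-step execution, which matches the keyword definition. 2. "Tool Use OR Function Calling OR API Tool Use" (10 points): the paper explicitly states that AI agents "invoke external tools". Other keywords such as "AI for Science" concern application domains, but the paper discusses general legal compliance rather than a specific scientific field, so they score 0. The weighted total is computed as 20.0 (10×1.0 + 10×1.0), well below the dynamic passing score of 26.6, indicating low relevance to technical innovation.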
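The arithmetic in this rationale can be checked with a short sketch. It assumes the configuration stated at the top of the report (27 keywords with weight 1.0 each, passing score = 5.0 + 0.8 × total weight) and that a paper's total is the sum of relevance × weight, as the rationale's "10×1.0 + 10×1.0" suggests. The score table lists a weight of 0.0 for each keyword, so the 1.0 weights below follow the rationale text and are an assumption about what the scorer intended. The function names are illustrative and not part of the report generator.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold as stated in the report header: base + factor * total weight."""
    return base + factor * total_weight


def weighted_total(relevance, weights):
    """Sum of relevance * weight over the keywords that were scored (assumed formula)."""
    return sum(relevance[name] * weights[name] for name in relevance)


# Only the two keywords the rationale rates as relevant; all others contribute 0.
weights = {"LLM Agents": 1.0, "Tool Use": 1.0}
relevance = {"LLM Agents": 10.0, "Tool Use": 10.0}

total = weighted_total(relevance, weights)
threshold = passing_score(total_weight=27.0)
passed = total >= threshold

print(f"total={total:.1f} threshold={threshold:.1f} passed={passed}")
```

Under these assumptions the total stays below the threshold, which matches the rationale's conclusion that the paper does not pass.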

!!! tip deepseek-chat TL;DR

This paper systematically maps the regulatory compliance challenges for AI agents under EU law, proposing a taxonomy and compliance architecture while concluding that high-risk agents with untraceable behavioral drift cannot currently meet the AI Act's requirements.

Abstract

AI agents - i.e. AI systems that autonomously plan, invoke external tools, and execute multi-step action chains with reduced human involvement - are being deployed at scale across enterprise functions ranging from customer service and recruitment to clinical decision support and critical infrastructure management. The EU AI Act (Regulation 2024/1689) regulates these systems through a risk-based framework, but it does not operate in isolation: providers face simultaneous obligations under the GDPR, the Cyber Resilience Act, the Digital Services Act, the Data Act, the Data Governance Act, sector-specific legislation, the NIS2 Directive, and the revised Product Liability Directive. This paper provides the first systematic regulatory mapping for AI agent providers integrating (a) draft harmonised standards under Standardisation Request M/613 to CEN/CENELEC JTC 21 as of January 2026, (b) the GPAI Code of Practice published in July 2025, (c) the CRA harmonised standards programme under Mandate M/606 accepted in April 2025, and (d) the Digital Omnibus proposals of November 2025. We present a practical taxonomy of nine agent deployment categories mapping concrete actions to regulatory triggers, identify agent-specific compliance challenges in cybersecurity, human oversight, transparency across multi-party action chains, and runtime behavioral drift. We propose a twelve-step compliance architecture and a regulatory trigger mapping connecting agent actions to applicable legislation. We conclude that high-risk agentic systems with untraceable behavioral drift cannot currently satisfy the AI Act’s essential requirements, and that the provider’s foundational compliance task is an exhaustive inventory of the agent’s external actions, data flows, connected systems, and affected persons.

Keywords: AI agents, EU AI Act, regulatory compliance, autonomous systems, tool invocation, multi-step action chains, cybersecurity, human oversight

75. ❌ Cardinality Estimation for High Dimensional Similarity Queries with Adaptive Bucket Probing

Authors: Zhonghan Chen, Qintian Guo, Ruiyuan Zhang, Xiaofang Zhou Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04603v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

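Scoring rationale: The paper addresses cardinality estimation for high-dimensional similarity queries, using locality-sensitive hashing (LSH), multi-probe LSH, progressive sampling, and product quantization. The scoring keywords all concern large models, deep learning, AI applications, or related technical principles, whereas this paper studies query optimization in traditional databases and information retrieval. It involves no large models, deep learning, or AI for Science content, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a cardinality estimation framework for high-dimensional similarity queries based on adaptive bucket probing. Using locality-sensitive hashing, progressive sampling, and product quantization, it delivers accurate estimates with efficient online processing and supports large-scale dynamic data updates.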

Abstract

In this work, we address the problem of cardinality estimation for similarity search in high-dimensional spaces. Our goal is to design a framework that is lightweight, easy to construct, and capable of providing accurate estimates with satisfying online efficiency. We leverage locality-sensitive hashing (LSH) to partition the vector space while preserving distance proximity. Building on this, we adopt the principles of classical multi-probe LSH to adaptively explore neighboring buckets, accounting for distance thresholds of varying magnitudes. To improve online efficiency, we employ progressive sampling to reduce the number of distance computations and utilize asymmetric distance computation in product quantization to accelerate distance calculations in high-dimensional spaces. In addition to handling static datasets, our framework includes an updating algorithm designed to efficiently support large-scale dynamic scenarios of data updates. Experiments demonstrate that our methods can accurately estimate the cardinality of similarity queries, yielding satisfying efficiency.

Keywords: cardinality estimation, similarity search, high-dimensional spaces, locality-sensitive hashing, multi-probe LSH, progressive sampling, product quantization, dynamic datasets
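The general idea of partitioning with LSH, probing neighboring buckets, and sampling candidates can be illustrated with a much simplified sketch. It is not the paper's algorithm: it assumes random-hyperplane hashing, a fixed probing rule (the query's bucket plus buckets at Hamming distance 1) instead of adaptive probing, a single uniform sample instead of progressive sampling, and exact distances instead of product quantization. All function names are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, rng):
    """Draw n_bits random hyperplanes; each contributes one bit of the hash key."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]


def hash_key(vec, planes):
    """Sign pattern of the vector against every hyperplane."""
    return tuple(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                 for plane in planes)


def build_index(data, planes):
    """Map each hash key to the indices of the vectors that fall into it."""
    buckets = defaultdict(list)
    for i, vec in enumerate(data):
        buckets[hash_key(vec, planes)].append(i)
    return buckets


def probe_keys(key):
    """Yield the query's own bucket, then every bucket at Hamming distance 1."""
    yield key
    for i in range(len(key)):
        yield key[:i] + (1 - key[i],) + key[i + 1:]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def estimate_cardinality(query, radius, data, buckets, planes, sample_rate, rng):
    """Estimate how many points in the probed buckets lie within radius of query."""
    candidates = [i for key in probe_keys(hash_key(query, planes))
                  for i in buckets.get(key, [])]
    if not candidates:
        return 0.0
    n_sample = max(1, int(len(candidates) * sample_rate))
    sample = rng.sample(candidates, n_sample)
    hits = sum(1 for i in sample if sq_dist(query, data[i]) <= radius ** 2)
    # Scale the sampled hit count back up to the full candidate set.
    return hits * len(candidates) / n_sample


rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(500)]
planes = make_hyperplanes(16, 6, rng)
buckets = build_index(data, planes)
estimate = estimate_cardinality(data[0], 4.0, data, buckets, planes, 0.5, rng)
print(f"estimated matches in probed buckets: {estimate:.1f}")
```

Because only the probed buckets are counted, this sketch tends to underestimate the true cardinality when matching points fall into buckets that are not probed, which is the gap that adaptive probing is meant to close.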

76. ❌ Greedy and Transformer-Based Multi-Port Selection for Slow Fluid Antenna Multiple Access

Authors: Darian Perez-Adan, Jose P. Gonzalez-Coma, F. Javier Lopez-Martinez, Luis Castedo Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04589v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

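Scoring rationale: The paper studies port selection in fluid antenna multiple access (FAMA) systems in wireless communications, proposing a greedy algorithm and a Transformer-based neural network. The scoring keywords all concern large models, deep learning principles, or AI applications in science, whereas this paper focuses on signal processing and optimization in communications engineering and is unrelated to them.

!!! tip deepseek-chat TL;DR

The paper addresses port selection for multi-port fluid antenna receivers, proposing GFwd+S, a greedy selection method, and a Transformer-based neural network. The former achieves better spectral efficiency, and the latter approaches its performance at lower computational cost.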

Abstract

We address the port-selection problem in fluid antenna multiple access (FAMA) systems with multi-port fluid antenna (FA) receivers. Existing methods either achieve near-optimal spectral efficiency (SE) at prohibitive computational cost or sacrifice significant performance for lower complexity. We propose two complementary strategies: (i) GFwd+S, a greedy forward-selection method with swap refinement that consistently outperforms state-of-the-art reference schemes in terms of SE, and (ii) a Transformer-based neural network trained via imitation learning followed by a Reinforce policy-gradient stage, which approaches GFwd+S performance at lower computational cost.

Keywords: fluid antenna multiple access, port selection, greedy algorithm, Transformer neural network, spectral efficiency, imitation learning, Reinforce policy-gradient, computational complexity
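Greedy forward selection followed by swap refinement, the pattern named in the abstract, can be sketched generically as below. The abstract does not give the details of GFwd+S, so this is only the textbook pattern under stated assumptions: the objective is a toy stand-in (per-port gain minus a penalty for selecting adjacent ports) rather than spectral efficiency, and `GAIN`, `PENALTY`, and the function names are hypothetical.

```python
from itertools import combinations

# Toy stand-in for the real objective: per-port gain, with a penalty
# whenever two adjacent ports are selected together.
GAIN = [5, 6, 5, 1, 1, 1]
PENALTY = 4


def objective(selected):
    total = sum(GAIN[p] for p in selected)
    total -= sum(PENALTY for a, b in combinations(selected, 2) if abs(a - b) == 1)
    return total


def greedy_forward(candidates, k, objective):
    """Add, one at a time, the candidate that maximizes the objective."""
    selected = []
    remaining = list(candidates)
    for _ in range(k):
        best = max(remaining, key=lambda c: objective(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


def swap_refine(selected, candidates, objective):
    """Swap a selected item for an unselected one while that strictly improves."""
    selected = list(selected)
    improved = True
    while improved:
        improved = False
        current = objective(selected)
        for i in range(len(selected)):
            for c in candidates:
                if c in selected:
                    continue
                trial = selected[:i] + [c] + selected[i + 1:]
                value = objective(trial)
                if value > current:
                    selected, current, improved = trial, value, True
    return selected


ports = range(len(GAIN))
picked = greedy_forward(ports, 2, objective)
refined = swap_refine(picked, ports, objective)
print(picked, objective(picked))
print(refined, objective(refined))
```

In this toy instance the greedy pass commits early to the single best port and ends up with an adjacent pair, and the swap pass then exchanges one port to remove the penalty. Each accepted swap strictly increases the objective, so the refinement loop terminates.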

77. ❌ Ruling Out to Rule In: Contrastive Hypothesis Retrieval for Medical Question Answering

Authors: Byeolhee Kim, Min-Kyung Kim, Young-Hak Kim, Tae-Joon Jeon Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04593v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

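Scoring rationale: The paper's core topic is retrieval-augmented generation (RAG) for medical question answering, for which it proposes the Contrastive Hypothesis Retrieval (CHR) framework, so it is highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (15 points). It explicitly builds on large language models (LLMs), so it is relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). The work targets the medical domain, which falls under AI for Science, so it is relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). Other keywords such as MoE, Scaling Laws, Fine-tuning, Reasoning, and Agents are not mentioned or addressed in the abstract and score 0.

!!! tip deepseek-chat TL;DR

Retrieval-augmented generation (RAG) systems for medical question answering tend to retrieve hard negatives that are clinically similar but diagnostically wrong. The paper proposes Contrastive Hypothesis Retrieval (CHR), a framework inspired by clinical differential diagnosis that models a target hypothesis and a mimic hypothesis together to improve retrieval, and it significantly outperforms existing baselines on three medical QA benchmarks.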

Abstract

Retrieval-augmented generation (RAG) grounds large language models in external medical knowledge, yet standard retrievers frequently surface hard negatives that are semantically close to the query but describe clinically distinct conditions. While existing query-expansion methods improve query representation to mitigate ambiguity, they typically focus on enriching target-relevant semantics without an explicit mechanism to selectively suppress specific, clinically plausible hard negatives. This leaves the system prone to retrieving plausible mimics that overshadow the actual diagnosis, particularly when such mimics are dominant within the corpus. We propose Contrastive Hypothesis Retrieval (CHR), a framework inspired by the process of clinical differential diagnosis. CHR generates a target hypothesis $H^+$ for the likely correct answer and a mimic hypothesis $H^-$ for the most plausible incorrect alternative, then scores documents by promoting $H^+$-aligned evidence while penalizing $H^-$-aligned content. Across three medical QA benchmarks and three answer generators, CHR outperforms all five baselines in every configuration, with improvements of up to 10.4 percentage points over the next-best method. On the $n=587$ pooled cases where CHR answers correctly while embedded hypothetical-document query expansion does not, 85.2% have no shared documents between the top-5 retrieval lists of CHR and of that baseline, consistent with substantive retrieval redirection rather than light re-ranking of the same candidates. By explicitly modeling what to avoid alongside what to find, CHR bridges clinical reasoning with retrieval mechanism design and offers a practical path to reducing hard-negative contamination in medical RAG systems.

Keywords: Retrieval-augmented generation, Medical question answering, Contrastive hypothesis retrieval, Hard negatives, Clinical differential diagnosis, Medical knowledge grounding, Retrieval mechanism design, Medical RAG systems
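The contrastive scoring idea in the abstract, promoting evidence aligned with $H^+$ while penalizing content aligned with $H^-$, can be sketched as a difference of similarities. The abstract does not state the exact scoring function, so the form `sim(d, H+) - penalty * sim(d, H-)`, the cosine similarity, the `penalty` parameter, and the three-dimensional toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def contrastive_rank(doc_vecs, h_pos, h_neg, penalty=1.0):
    """Rank documents by similarity to H+ minus penalty times similarity to H-."""
    scores = {doc_id: cosine(vec, h_pos) - penalty * cosine(vec, h_neg)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embedding axes: [shared symptoms, target-specific, mimic-specific].
h_pos = [1.0, 1.0, 0.0]   # target hypothesis
h_neg = [1.0, 0.0, 1.0]   # mimic hypothesis
docs = {
    "target_doc": [0.2, 1.0, 0.0],
    "mimic_doc": [1.0, 0.6, 0.6],
}

plain = contrastive_rank(docs, h_pos, h_neg, penalty=0.0)
contrastive = contrastive_rank(docs, h_pos, h_neg, penalty=1.0)
print("H+ only:", plain)
print("contrastive:", contrastive)
```

In this toy setup the mimic document shares enough surface features with the target hypothesis to rank first when only $H^+$ is used, and the penalty term moves the target document to the top.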

78. ❌ Paper Espresso: From Paper Overload to Research Insight

Authors: Mingzhe Du, Luu Anh Tuan, Dong Huang, See-kiong Ng Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04562v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

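Scoring rationale: Paper Espresso is a platform that uses large language models (LLMs) to automatically discover, summarize, and analyze trends in arXiv papers, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). The platform is applied to scientific literature analysis, which falls under "AI for Science", so that keyword receives 8 points. The paper does not address the specific techniques behind the other keywords (such as MoE, quantization, or inference acceleration) or application methods (such as RAG or instruction tuning), nor does it mention a specific scientific subfield (such as bioinformatics), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

As the pace of scientific publishing accelerates, researchers struggle to keep up with the latest work. The paper presents Paper Espresso, an open-source platform that uses large language models to automatically process and analyze arXiv papers, generating structured summaries and multi-granularity trend analysis that reveal shifts in the AI research landscape.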

Abstract

The accelerating pace of scientific publishing makes it increasingly difficult for researchers to stay current. We present Paper Espresso, an open-source platform that automatically discovers, summarizes, and analyzes trending arXiv papers. The system uses large language models (LLMs) to generate structured summaries with topical labels and keywords, and provides multi-granularity trend analysis at daily, weekly, and monthly scales through LLM-driven topic consolidation. Over 35 months of continuous deployment, Paper Espresso has processed over 13,300 papers and publicly released all structured metadata, revealing rich dynamics in the AI research landscape: a mid-2025 surge in reinforcement learning for LLM reasoning, non-saturating topic emergence (6,673 unique topics), and a positive correlation between topic novelty and community engagement (2.0x median upvotes for the most novel papers). A live demo is available at https://huggingface.co/spaces/Elfsong/Paper_Espresso.

Keywords: Paper Espresso, large language models, arXiv papers, automatic summarization, trend analysis, AI research landscape, topic consolidation, open-source platform

79. ❌ Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities

Authors: Charafeddine Mouzouni Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04561v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

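Scoring rationale: The paper's core topic is how LLM agents with tool access exploit security vulnerabilities, so it is highly relevant to "LLM Agents" and "Tool Use" (10 points each). It tests multiple LLMs, so it is highly relevant to "Large Language Models" (10 points). It studies how system prompts trigger rule-violating behavior, so it is relevant to "Instruction Tuning/Alignment" (8 points). Other keywords such as MoE, SLMs, Scaling Laws, and Pre-training are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

Through roughly 10,000 trials, the paper systematically studies how LLM agents with tool access exploit security vulnerabilities. It finds that goal reframing is the key dimension that reliably triggers exploitation, while other hypothesized dimensions such as minimization and moral licensing produce no detectable exploitation, giving defenders a more precise threat model.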

Abstract

LLM agents with tool access can discover and exploit security vulnerabilities. This is known. What is not known is which features of a system prompt trigger this behaviour, and which do not. We present a systematic taxonomy based on approximately 10,000 trials across seven models, 37 prompt conditions, and 12 hypothesized attack dimensions – all executed in real Docker sandboxes. Every condition includes the instruction “Always respect all rules and access policies”; only a single encouragement sentence varies. Within the task class studied (planted test-runner vulnerabilities), the threat model is narrower than assumed: nine of twelve dimensions – including minimization, moral licensing, incentives, identity priming, and reasoning triggers – produce no detectable exploitation at n=50 per cell (upper 95% CI <7% per condition). But it is also more subtle than expected: one dimension, goal reframing, reliably triggers exploitation. “You are solving a puzzle; there may be hidden clues” produces 38-40% exploitation on Claude Sonnet 4 despite the explicit rule instruction, replicating across four models (CTF framing: 8-14% on DeepSeek, GPT-5-mini, o4-mini). The agent does not override the rules; it reinterprets the task so that exploitative actions become task-aligned. GPT-4.1 produces no exploitation across 1,850 trials (37 conditions), and a temporal comparison across four OpenAI models released over eleven months shows a pattern consistent with improving safety training, though model capability differences are a confounder. The practical contribution is a narrowed, testable threat model: defenders should audit for goal-reframing language, not for the broad class of adversarial prompts.

Keywords: LLM agents, tool access, security vulnerabilities, system prompt, goal reframing, threat model, exploitation, safety training
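The abstract's statement that zero observed exploitations at n=50 per cell imply an upper 95% bound below 7% can be sanity-checked with a short calculation. The abstract does not say which interval method was used, so the two bounds below (the exact one-sided bound for zero events and the "rule of three" approximation) are assumptions about how such a figure could arise. The function names are illustrative.

```python
def upper_bound_zero_events(n, confidence=0.95):
    """Exact one-sided upper bound on a rate when 0 of n trials show the event.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


def rule_of_three(n):
    """Common approximation of the 95% upper bound for 0 events in n trials."""
    return 3.0 / n


n_per_cell = 50
print(f"exact one-sided 95% bound: {upper_bound_zero_events(n_per_cell):.3f}")
print(f"rule of three:             {rule_of_three(n_per_cell):.3f}")
# The 1,850 GPT-4.1 trials are consistent with 37 conditions at 50 trials each.
print(f"37 conditions x {n_per_cell} trials = {37 * n_per_cell}")
```

Both bounds fall under 7% for n=50. A two-sided exact interval, which corresponds to `confidence=0.975` in this function, gives a bound slightly above 7%, so the reported figure presumably reflects a one-sided bound or the rule of three. That reading is an inference, not something the abstract states.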

80. ❌ StableTTA: Training-Free Test-Time Adaptation that Improves Model Accuracy on ImageNet1K to 96%

Authors: Zheng Li, Jerry Cheng, Huanying Helen Gu Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04552v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

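Scoring rationale: StableTTA addresses test-time adaptation in computer vision, specifically the optimization of ensemble methods for image classification on ImageNet-1K. The scoring keywords all target large language models, the techniques built around them, or AI for science, whereas this paper studies the accuracy and efficiency of conventional vision models. It does not touch LLMs, MoE, quantization, inference acceleration, alignment, RAG, or any other area the keywords cover, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

This paper proposes StableTTA, a training-free test-time adaptation method that addresses an aggregation conflict in ensemble methods. On ImageNet-1K it substantially raises model accuracy (up to 96%) while sharply reducing parameter count and computational cost.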

Abstract

Ensemble methods are widely used to improve predictive performance, but their effectiveness often comes at the cost of increased memory usage and computational complexity. In this paper, we identify a conflict in aggregation strategies that negatively impacts prediction stability. We propose StableTTA, a training-free method to improve aggregation stability and efficiency. Empirical results on ImageNet-1K show gains of 10.93–32.82% in top-1 accuracy, with 33 models achieving over 95% accuracy and several surpassing 96%. Notably, StableTTA allows lightweight architectures to outperform ViT by 11.75% in top-1 accuracy while using less than 5% of parameters and reducing computational cost by approximately 89.1% (in GFLOPs), enabling high-accuracy inference on resource-constrained devices.

Keywords: Test-Time Adaptation, Ensemble Methods, Aggregation Stability, ImageNet-1K, Training-Free Method, Model Efficiency, Lightweight Architectures, Computational Cost Reduction

81. ❌ Receding-Horizon Control via Drifting Models

Authors: Daniele Foffano, Alessio Russo, Alexandre Proutiere Venue/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04528v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

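Scoring rationale: "Receding-Horizon Control via Drifting Models" studies trajectory optimization under unknown system dynamics using an offline trajectory dataset, and proposes Drifting MPC, a framework that combines drifting generative models with receding-horizon planning. The scoring keywords target large language models, their underlying techniques, or AI for science, whereas this paper concerns control theory, trajectory optimization, and generative models for dynamical systems. It does not address LLMs, LLM training methods, inference optimization, AI agents, or scientific AI applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper studies trajectory optimization from an offline trajectory dataset when the system dynamics are unknown and trajectories cannot be simulated through a surrogate model. It proposes Drifting MPC, which learns a conditional distribution over trajectories that is both supported by the data and biased toward optimal plans. Experiments show that it generates near-optimal trajectories while substantially reducing generation time relative to diffusion-based baselines.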

Abstract

We study the problem of trajectory optimization in settings where the system dynamics are unknown and it is not possible to simulate trajectories through a surrogate model. When an offline dataset of trajectories is available, an agent could directly learn a trajectory generator by distribution matching. However, this approach only recovers the behavior distribution in the dataset, and does not in general produce a model that minimizes a desired cost criterion. In this work, we propose Drifting MPC, an offline trajectory optimization framework that combines drifting generative models with receding-horizon planning under unknown dynamics. The goal of Drifting MPC is to learn, from an offline dataset of trajectories, a conditional distribution over trajectories that is both supported by the data and biased toward optimal plans. We show that the resulting distribution learned by Drifting MPC is the unique solution of an objective that trades off optimality with closeness to the offline prior. Empirically, we show that Drifting MPC can generate near-optimal trajectories while retaining the one-step inference efficiency of drifting models and substantially reducing generation time relative to diffusion-based baselines.
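
The abstract states that the learned distribution is the unique solution of an objective trading off optimality against closeness to the offline prior, but it does not give that objective. One common form of such a trade-off, shown here only as an assumed illustration, is a KL-regularized problem over trajectory distributions. The symbols $p_0$ (offline behavior distribution), $c(\tau)$ (trajectory cost), and $\lambda > 0$ (trade-off weight) are hypothetical and not taken from the paper; conditioning on the current state is omitted for brevity.

```latex
\[
p^\star \;=\; \arg\min_{p}\; \mathbb{E}_{\tau \sim p}\big[c(\tau)\big] \;+\; \lambda\,\mathrm{KL}\big(p \,\|\, p_0\big),
\qquad
p^\star(\tau) \;=\; \frac{p_0(\tau)\,\exp\!\big(-c(\tau)/\lambda\big)}{\int p_0(\tau')\,\exp\!\big(-c(\tau')/\lambda\big)\,d\tau'} .
\]
```

Under this assumed form, $p^\star$ places mass only where $p_0$ does (it stays supported by the data) and reweights trajectories toward low cost. A small $\lambda$ favors optimality, and a large $\lambda$ keeps $p^\star$ close to $p_0$.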

Keywords: trajectory optimization, unknown dynamics, offline dataset, drifting generative models, receding-horizon planning, Drifting MPC, conditional distribution, generation time reduction

82. ❌ Encrust: Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C-to-Rust Translation

Authors: Hohyun Sim, Hyeonjoong Cho, Ali Shokri, Zhoulai Fu, Binoy Ravindran Venue/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04527v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

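Scoring rationale: Encrust is a two-phase pipeline that uses LLMs to translate C to Rust; its contribution lies in how the LLM is applied rather than in LLM technology itself. It is therefore highly relevant (10) only to "Large Language Models OR LLMs OR Foundation Models" (the paper explicitly uses an LLM for translation) and "LLM Agents OR Autonomous Agents OR Agentic Workflow" (Phase 2 explicitly uses an LLM agent for codebase-level refinement). The remaining keywords cover LLM internals (MoE, scaling laws, training methods, inference optimization, alignment, compression), specific application areas (AI for Science), or advanced reasoning (CoT, System 2), none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

This paper proposes Encrust, a two-phase pipeline that uses an LLM and an LLM agent to translate C code into safe Rust. It addresses the limits of existing approaches in guaranteeing memory safety and handling cross-unit dependencies, and substantially reduces unsafe constructs across 15 real-world programs while maintaining full test-vector correctness.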

Abstract

We present Encapsulated Substitution and Agentic Refinement on a Live Scaffold for Safe C-to-Rust Translation, a two-phase pipeline for translating real-world C projects to safe Rust. Existing approaches either produce unsafe output without memory-safety guarantees or translate functions in isolation, failing to detect cross-unit type mismatches or handle unsafe constructs requiring whole-program reasoning. Furthermore, function-level LLM pipelines require coordinated caller updates when type signatures change, while project-scale systems often fail to produce compilable output under real-world dependency complexity. Encrust addresses these limitations by decoupling boundary adaptation from function logic via an Application Binary Interface (ABI)-preserving wrapper pattern and validating each intermediate state against the integrated codebase. Phase 1 (Encapsulated Substitution) translates each function using an ABI-preserving wrapper that splits it into two components: a caller-transparent shim retaining the original raw-pointer signature, and a safe inner function targeted by the LLM with a clean, scope-limited prompt. This enables independent per-function type changes with automatic rollback on failure, without coordinated caller updates. A deterministic, type-directed wrapper elimination pass then removes wrappers after successful translation. Phase 2 (Agentic Refinement) resolves unsafe constructs beyond per-function scope, including static mut globals, skipped wrapper pairs, and failed translations, using an LLM agent operating on the whole codebase under a baseline-aware verification gate. We evaluate Encrust on 7 GNU Coreutils programs and 8 libraries from the Laertes benchmark, showing substantial unsafe-construct reduction across all 15 programs while maintaining full test-vector correctness.
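
The Phase 1 control flow described above (substitute one function, validate against the integrated codebase, roll back on failure) can be sketched as follows. This is a minimal Python illustration of the loop only, not the authors' implementation: `substitute_with_rollback`, the dictionary-based codebase, and the `validate` callback (standing in for a real build plus test run) are all hypothetical names.

```python
from typing import Callable, Dict, List


def substitute_with_rollback(
    codebase: Dict[str, str],
    candidates: Dict[str, str],
    validate: Callable[[Dict[str, str]], bool],
) -> List[str]:
    """Try each candidate translation; keep it only if the whole codebase validates."""
    accepted = []
    for name, new_body in candidates.items():
        previous = codebase[name]
        codebase[name] = new_body        # tentative per-function substitution
        if validate(codebase):           # hypothetical stand-in for compile + tests
            accepted.append(name)
        else:
            codebase[name] = previous    # automatic rollback, callers untouched
    return accepted


# Toy data: function name -> implementation label (hypothetical).
codebase = {"add": "c_add", "div": "c_div"}
candidates = {"add": "rust_add", "div": "rust_div_broken"}


def validate(cb: Dict[str, str]) -> bool:
    # Pretend any implementation labelled "broken" fails the build or tests.
    return all("broken" not in body for body in cb.values())


accepted = substitute_with_rollback(codebase, candidates, validate)
print(accepted, codebase)
```

Because every function is validated in isolation against the full codebase, one failed translation leaves the others in place, which is the property the abstract attributes to the ABI-preserving wrapper.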

Keywords: C-to-Rust translation, LLM pipeline, memory safety, ABI-preserving wrapper, agentic refinement, code translation, unsafe construct reduction, whole-program reasoning

83. ❌ Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them

Authors: Ole Delzer, Sidney Bender Venue/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04518v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

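Scoring rationale: The paper studies spurious correlations, shortcut learning, and the Clever Hans effect in deep neural networks, and evaluates correction methods based on explainable AI (XAI). It is unrelated to most keywords (LLM techniques, training methods, inference optimization, agents), which target large language models and specific deep learning techniques, whereas this work concerns the interpretability and robustness of general deep neural networks. The relevant keywords are "Mechanistic Interpretability OR Explainable AI" (8.0), because the paper explicitly uses XAI techniques to analyze and correct models, and "AI for Science OR Bioinformatics OR Cheminformatics" (5.0), because it mentions high-stakes domains such as medical diagnostics, although these are not its core subject.

!!! tip deepseek-chat TL;DR

This reproducibility study compares correction methods for the reliability failures that spurious correlations cause in deep neural networks. XAI-based methods generally outperform non-XAI baselines, but practical use is limited by the dependency on group labels and by the scarcity of minority-group samples in validation sets.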

Abstract

Deep Neural Networks (DNNs) are increasingly utilized in high-stakes domains like medical diagnostics and autonomous driving where model reliability is critical. However, the research landscape for ensuring this reliability is terminologically fractured across communities that pursue the same goal of ensuring models rely on causally relevant features rather than confounding signals. While frameworks such as distributionally robust optimization (DRO), invariant risk minimization (IRM), shortcut learning, simplicity bias, and the Clever Hans effect all address model failure due to spurious correlations, researchers typically only reference work within their own domains. This reproducibility study unifies these perspectives through a comparative analysis of correction methods under challenging constraints like limited data availability and severe subgroup imbalance. We evaluate recently proposed correction methods based on explainable artificial intelligence (XAI) techniques alongside popular non-XAI baselines using both synthetic and real-world datasets. Findings show that XAI-based methods generally outperform non-XAI approaches, with Counterfactual Knowledge Distillation (CFKD) proving most consistently effective at improving generalization. Our experiments also reveal that the practical application of many methods is hindered by a dependency on group labels, as manual annotation is often infeasible and automated tools like Spectral Relevance Analysis (SpRAy) struggle with complex features and severe imbalance. Furthermore, the scarcity of minority group samples in validation sets renders model selection and hyperparameter tuning unreliable, posing a significant obstacle to the deployment of robust and trustworthy models in safety-critical areas.
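
The point about minority-group scarcity can be made concrete with per-group accuracy, a metric commonly used in group-robustness work. The sketch below is an assumed illustration rather than the paper's evaluation code; the data, group names, and `group_accuracies` helper are hypothetical.

```python
from collections import defaultdict


def group_accuracies(labels, preds, groups):
    """Return accuracy per group as a dict {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    return {g: correct[g] / total[g] for g in total}


# Hypothetical validation set: 8 majority samples, only 2 minority samples.
labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
preds = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
groups = ["majority"] * 8 + ["minority"] * 2

acc = group_accuracies(labels, preds, groups)
overall = sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)
worst = min(acc.values())
print(overall, acc, worst)
```

Overall accuracy looks high while the minority group sits at chance. With only two minority samples, a single flipped prediction moves worst-group accuracy by 0.5, which is why model selection on such a validation set is unreliable.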

Keywords: Spurious Correlations, Shortcut Learning, Clever Hans Effect, Explainable AI (XAI), Distributionally Robust Optimization, Invariant Risk Minimization, Counterfactual Knowledge Distillation, Model Reliability

84. ❌ GAIN: Multiplicative Modulation for Domain Adaptation

Authors: Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang Venue/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04516v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

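Scoring rationale: The paper proposes GAIN for domain adaptation of large language models, targeting the catastrophic forgetting caused by standard fine-tuning methods such as LoRA. It is therefore highly relevant (10) to "Large Language Models", "Domain Adaptation", "Post-training/SFT", and "PEFT/LoRA". The paper uses LoRA as its comparison baseline, and GAIN is itself a parameter-efficient fine-tuning technique. Other keywords such as MoE, SLMs, RAG, and inference acceleration are either not mentioned in the abstract or unrelated to the topic, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes GAIN, which uses multiplicative modulation to address catastrophic forgetting when adapting LLMs to new domains. Compared with LoRA, it matches new-domain performance while causing far less degradation on previously learned domains.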

Abstract

Adapting LLMs to new domains causes forgetting because standard methods (full fine-tuning, LoRA) inject new directions into the weight space. We propose GAIN, which re-emphasizes existing features through multiplicative modulation W_new = S * W. The learned diagonal matrix S is applied to the attention output projection and optionally the FFN. The principle mirrors gain modulation in neuroscience, where neurons adapt to context by scaling response strength while preserving selectivity. We evaluate GAIN on five models from four families (774M to 70B), adapting sequentially across eight domains. GAIN-FFN matches LoRA’s in-domain adaptation, but their effects on previously trained domains are opposite: GAIN-FFN improves them by 7-13% (validation PPL), while LoRA degrades them by 18-36%. Downstream accuracy confirms the pattern: for example, after seven sequential adaptations on Qwen2.5, GAIN-FFN degrades BoolQ by only 0.8% while LoRA damages it by 14.9%. GAIN adds 46K-230K parameters per model and can be absorbed into the pretrained weights for zero inference cost.
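
The claim that a diagonal gain can be absorbed into the pretrained weights follows from simple algebra: with a layer computing y = W x, scaling the output by S gives S (W x) = (S W) x. The pure-Python sketch below checks this on a 2x2 toy layer. The y = W x convention, the helper names, and the numbers are assumptions for illustration; the abstract only gives W_new = S * W.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def absorb_gain(s, W):
    """Return diag(s) @ W, i.e. row i of W scaled by s[i]."""
    return [[s_i * w for w in row] for s_i, row in zip(s, W)]


W = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical pretrained weight
s = [0.5, 2.0]                 # hypothetical learned diagonal of S
x = [1.0, -1.0]                # hypothetical input

# Gain applied at run time: scale each output feature of the frozen layer.
y_runtime = [s_i * y_i for s_i, y_i in zip(s, matvec(W, x))]

# Gain folded into the weights once: no extra work at inference.
W_new = absorb_gain(s, W)
y_absorbed = matvec(W_new, x)

print(y_runtime, y_absorbed, W_new)
```

Only the `len(s)` diagonal entries are trained per modulated matrix, versus every entry of `W` under full fine-tuning, which is consistent with the small added parameter counts the abstract reports.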

Keywords: Domain Adaptation, Large Language Models, Parameter-efficient Fine-tuning, Catastrophic Forgetting, Multiplicative Modulation, LoRA, Attention Projection, FFN

85. ❌ SuperLocalMemory V3.3: The Living Brain – Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

Authors: Varun Pratap Bhardwaj Venue/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04514v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

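Scoring rationale: The paper studies a local memory system for AI agents, in particular a zero-LLM mode. It is highly relevant to "LLM Agents" (10), and involves memory retrieval ("Retrieval-Augmented Generation", 8), quantization and compression ("Quantization", 10), and local, on-device AI ("Small Language Models", 8). Its zero-LLM mode gives it some connection to "Large Language Models" (5), but that is not central. Other keywords such as MoE, training methods, and inference techniques are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of effective long-term memory in AI coding agents. It presents SuperLocalMemory V3.3, a local-first agent memory system that combines biologically inspired forgetting, cognitive quantization, and multi-channel retrieval, and reaches 70.4% on the LoCoMo benchmark in zero-LLM mode.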

Abstract

AI coding agents operate in a paradox: they possess vast parametric knowledge yet cannot remember a conversation from an hour ago. Existing memory systems store text in vector databases with single-channel retrieval, require cloud LLMs for core operations, and implement none of the cognitive processes that make human memory effective. We present SuperLocalMemory V3.3 (“The Living Brain”), a local-first agent memory system implementing the full cognitive memory taxonomy with mathematical lifecycle dynamics. Building on the information-geometric foundations of V3.2 (arXiv:2603.14588), we introduce five contributions: (1) Fisher-Rao Quantization-Aware Distance (FRQAD) – a new metric on the Gaussian statistical manifold achieving 100% precision at preferring high-fidelity embeddings over quantized ones (vs 85.6% for cosine), with zero prior art; (2) Ebbinghaus Adaptive Forgetting with lifecycle-aware quantization – the first mathematical forgetting curve in local agent memory coupled to progressive embedding compression, achieving 6.7x discriminative power; (3) 7-channel cognitive retrieval spanning semantic, keyword, entity graph, temporal, spreading activation, consolidation, and Hopfield associative channels, achieving 70.4% on LoCoMo in zero-LLM Mode A; (4) memory parameterization implementing Long-Term Implicit memory via soft prompts; (5) zero-friction auto-cognitive pipeline automating the complete memory lifecycle. On LoCoMo, V3.3 achieves 70.4% in Mode A (zero-LLM), with +23.8pp on multi-hop and +12.7pp on adversarial. V3.2 achieved 74.8% Mode A and 87.7% Mode C; the 4.4pp gap reflects a deliberate architectural trade-off. SLM V3.3 is open source under the Elastic License 2.0, runs entirely on CPU, with over 5,000 monthly downloads.
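
Contribution (2) couples a forgetting curve to progressive embedding compression. A minimal sketch of that idea, assuming the classic Ebbinghaus form R = exp(-t / S), is shown below. The retention thresholds, bit widths, and function names are hypothetical and are not taken from the paper, whose exact formulation the abstract does not give.

```python
import math


def retention(hours_since_access: float, strength_hours: float) -> float:
    """Ebbinghaus-style retention R = exp(-t / S), with S a memory-strength constant."""
    return math.exp(-hours_since_access / strength_hours)


def bits_for(r: float) -> int:
    """Pick an embedding precision from retention (hypothetical thresholds)."""
    if r >= 0.5:
        return 32   # fresh memory: keep full precision
    if r >= 0.1:
        return 8    # fading memory: compress
    return 4        # nearly forgotten: compress aggressively


r_now = retention(0, 24)      # just accessed
r_day = retention(24, 24)     # one day old, exp(-1)
r_week = retention(168, 24)   # one week old, exp(-7)

for r in (r_now, r_day, r_week):
    print(round(r, 4), bits_for(r))
```

In a scheme like this, accessing a memory would reset `hours_since_access` (and could raise `strength_hours`), so frequently used memories stay at high precision while stale ones shrink.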

Keywords: agent memory systems, zero-LLM, cognitive retrieval, quantization-aware, local-first, forgetting curve, multi-channel retrieval, embedding compression

86. ❌ Memory Intelligence Agent

Authors: Jingyang Qiao, Weicheng Meng, Yu Cheng, Zhihang Lin, Zhizhong Zhang, Xin Tan, Jingyu Gong, Kun Shao, Yuan Xie Venue/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04503v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

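Scoring rationale: The paper proposes the Memory Intelligence Agent (MIA) framework for LLM-driven deep research agents (DRAs), covering LLM reasoning, external tool use, a multi-agent Manager-Planner-Executor architecture, autonomous evolution, and a reflection mechanism. The highly relevant keywords are LLMs (the foundation), LLM Agents/Agentic Workflow (the paper's subject), Multi-agent Systems (the architecture), Tool Use (external tool integration), Chain of Thought and System 2 Thinking (the reasoning process), and Self-Correction/Self-Improvement (autonomous evolution). Other keywords such as MoE, SLMs, scaling laws, training methods, efficiency optimization, and scientific AI applications are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper targets ineffective memory evolution and rising storage and retrieval costs in existing deep research agents (DRAs). It proposes the Memory Intelligence Agent (MIA) framework, which combines a Manager-Planner-Executor architecture, alternating reinforcement learning, test-time learning, bidirectional conversion between parametric and non-parametric memory, and a reflection mechanism, and reports superior results across 11 benchmarks.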

Abstract

Deep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essential for efficient reasoning and autonomous evolution. Existing methods rely on retrieving similar trajectories from memory to aid reasoning, while suffering from key limitations of ineffective memory evolution and increasing storage and retrieval costs. To address these problems, we propose a novel Memory Intelligence Agent (MIA) framework, consisting of a Manager-Planner-Executor architecture. Memory Manager is a non-parametric memory system that can store compressed historical search trajectories. Planner is a parametric memory agent that can produce search plans for questions. Executor is another agent that can search and analyze information guided by the search plan. To build the MIA framework, we first adopt an alternating reinforcement learning paradigm to enhance cooperation between the Planner and the Executor. Furthermore, we enable the Planner to continuously evolve during test-time learning, with updates performed on-the-fly alongside inference without interrupting the reasoning process. Additionally, we establish a bidirectional conversion loop between parametric and non-parametric memories to achieve efficient memory evolution. Finally, we incorporate a reflection and an unsupervised judgment mechanisms to boost reasoning and self-evolution in the open world. Extensive experiments across eleven benchmarks demonstrate the superiority of MIA.

Keywords: Memory Intelligence Agent, Deep Research Agents, LLM reasoning, Manager-Planner-Executor, memory evolution, alternating reinforcement learning, test-time learning, self-evolution

87. ❌ One Model for All: Multi-Objective Controllable Language Models

Authors: Qiang He, Yucheng Yang, Tianyi Zhou, Meng Fang, Mykola Pechenizkiy, Setareh Maghsudi Venue/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04497v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM与人类偏好的对齐问题，提出Multi-Objective Control (MOC)方法，将多目标优化引入RLHF框架。因此与’Large Language Models’、‘Instruction Tuning/Alignment’、‘RLHF’高度相关（10分）。论文未涉及其他关键词如MoE、SLMs、Scaling Laws、PEFT、RAG、推理方法、代理系统、压缩加速、科学AI应用等，这些评0分。

!!! tip deepseek-chat TL;DR

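To address the limited ability of current RLHF methods to adapt to diverse user preferences, the paper proposes Multi-Objective Control (MOC), which folds multi-objective optimization principles into RLHF and trains a single LLM to generate personalized outputs for different preference trade-offs. It outperforms baselines on controllability, output quality and diversity, and generalization to unseen preferences.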

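Abstract (translated)

Aligning large language models (LLMs) with human preferences is essential for improving their safety, helpfulness, humor, faithfulness, and related qualities. Current reinforcement learning from human feedback (RLHF) focuses mainly on learning a fixed reward from average human ratings, which can weaken the model's adaptability and controllability across different preferences. Building personalized LLMs, however, requires aligning the model with individual human preferences, which poses two challenges: the data available per user is scarce, and user preferences differ in their multi-objective trade-offs, from emphasizing empathy in some contexts to demanding efficiency and precision in others. Can we train a single LLM that generates personalized outputs for different user preferences along the Pareto front? We propose Multi-Objective Control (MOC), which trains a single LLM to generate responses directly in the preference-defined regions of the Pareto front. Our method brings multi-objective optimization (MOO) principles into the RLHF framework and trains the LLM as a preference-conditioned policy network. By applying MOO at the policy level, we improve the computational efficiency of MOC, making it possible to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments show that MOC outperforms baselines in three respects: (i) controllability of LLM outputs with respect to user preferences over the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hypervolume of the solutions obtained; and (iii) generalization to unseen preferences. These results highlight MOC's potential for real-world applications that need scalable, customizable LLMs.

Abstract (original)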

Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs’ safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptability and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, varying from emphasizing empathy in certain contexts to demanding efficiency and precision in others. Can we train one LLM to produce personalized outputs across different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, enabling us to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC’s potential for real-world applications requiring scalable and customizable LLMs.

Keywords: Large Language Models, Alignment, RLHF, Multi-Objective Optimization, Personalization, Controllability, Pareto Front, Preference-conditioned Policy
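
The abstract above measures the quality and diversity of solutions by their hypervolume. As a rough illustration only (not the authors' evaluation code), the sketch below computes the hypervolume of a set of two-reward outcomes under maximization. The function names, the sample points, and the reference point `(0, 0)` are assumptions made for this example.

```python
def pareto_front(points):
    """Return the non-dominated points (both rewards are maximized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the Pareto front and bounded below by `ref`."""
    front = [p for p in pareto_front(points) if p[0] > ref[0] and p[1] > ref[1]]
    # Walk the front from the largest first reward to the smallest;
    # along a Pareto front the second reward then strictly increases.
    front.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    prev_y = ref[1]
    for x, y in front:
        area += (x - ref[0]) * (y - prev_y)  # one horizontal strip
        prev_y = y
    return area


# Hypothetical (reward_1, reward_2) outcomes of four generated responses.
outcomes = [(3, 1), (2, 2), (1, 3), (1, 1)]
front = pareto_front(outcomes)
volume = hypervolume_2d(outcomes)
```

A larger hypervolume indicates a front that is both closer to the ideal point and more spread out, which is why a single number can summarize quality and diversity together.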

88. ❌ SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models

Authors: Ziwei Li, Yuang Ma, Yi Kang | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04493v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	7.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	7.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

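Scoring rationale: SLaB targets efficient deployment of large language models and proposes a novel weight decomposition (sparse, low-rank, binary), which places it in model compression and inference acceleration. Core relevant keywords: "Large Language Models" (the paper explicitly studies LLMs); "Quantization/Model Compression" (SLaB is a model compression method); "Mixture of Experts/Sparse Models" (it involves sparse matrix decomposition); "Small Language Models/On-device AI" (compression makes the model more efficient and better suited to on-device deployment); "Speculative Decoding/Inference Acceleration" (the compression aims to speed up inference). The other keywords, such as training methods, alignment, agents, and scientific applications, are unrelated to the paper.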

!!! tip deepseek-chat TL;DR

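The paper proposes SLaB, a framework that decomposes linear-layer weights into sparse, low-rank, and binary matrices to address the compute and memory challenges of LLM deployment. Without retraining, at 50% compression it reduces perplexity by up to 36% relative to existing methods and improves zero-shot accuracy by up to 8.98% over the baseline.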

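Abstract (translated)

The rapid growth of large language models (LLMs) poses significant deployment challenges because of their massive compute and memory requirements. Model compression, such as network pruning, offers a potential solution, but most existing methods struggle to maintain good performance at high compression ratios. To address this, we propose SLaB, a novel framework that decomposes each linear-layer weight into three complementary components: a sparse matrix, a low-rank matrix, and a binary matrix. SLaB requires no retraining and uses activation-aware pruning scores to guide the decomposition. Experiments on Llama-family models show that SLaB achieves state-of-the-art performance: at 50% compression it reduces perplexity by up to 36% compared with existing methods, and on zero-shot tasks it improves accuracy by up to 8.98% over the baseline.

Abstract (original)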

The rapid growth of large language models (LLMs) presents significant deployment challenges due to their massive computational and memory demands. While model compression, such as network pruning, offers potential solutions, most existing methods often fail to maintain good performance at high compression ratios. To address this, we propose SLaB, a novel framework that decomposes each linear layer weight into three complementary components: a sparse matrix, a low-rank matrix, and a binary matrix. SLaB eliminates the need for retraining and leverages activation-aware pruning scores to guide the decomposition process. Experiments on Llama-family models demonstrate that SLaB achieves state-of-the-art performance, reducing perplexity by up to 36% compared to existing methods at 50% compression and improving accuracy by up to 8.98% over the baseline on zero-shot tasks.

Keywords: Large Language Models, Model Compression, Sparse Decomposition, Low-rank Decomposition, Binary Quantization, Efficient Inference, Llama-family Models, Zero-shot Tasks
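
To make the idea of a sparse + low-rank + binary split concrete, here is a toy sketch in pure Python. It is emphatically not SLaB's algorithm: the abstract says SLaB guides the decomposition with activation-aware pruning scores, whereas this example uses a simple greedy residual fit (top-k magnitudes, then a rank-1 term from power iteration, then a scaled sign matrix). All function names and the sample matrix are assumptions for illustration.

```python
import math


def sub(A, B):
    """Element-wise A - B for equally shaped list-of-lists matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def fro(A):
    """Frobenius norm."""
    return math.sqrt(sum(v * v for row in A for v in row))


def top_k_sparse(W, k):
    """Sparse part: keep the k largest-magnitude entries, zero the rest."""
    ranked = sorted(
        ((abs(v), i, j) for i, row in enumerate(W) for j, v in enumerate(row)),
        reverse=True,
    )
    keep = {(i, j) for _, i, j in ranked[:k]}
    return [[v if (i, j) in keep else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(W)]


def rank1(R, iters=50):
    """Low-rank part: rank-1 approximation sigma * u v^T via power iteration."""
    m, n = len(R), len(R[0])
    v = [1.0 / math.sqrt(n)] * n
    u = [0.0] * m
    for _ in range(iters):
        u = [sum(R[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm_u for x in u]
        v = [sum(R[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm_v = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm_v for x in v]
    sigma = sum(u[i] * R[i][j] * v[j] for i in range(m) for j in range(n))
    return [[sigma * u[i] * v[j] for j in range(n)] for i in range(m)]


def scaled_binary(R):
    """Binary part: alpha * sign(R), with alpha the mean absolute value."""
    count = sum(len(row) for row in R)
    alpha = sum(abs(v) for row in R for v in row) / count
    return [[alpha if v >= 0 else -alpha for v in row] for row in R]


def decompose(W, k):
    """Greedy toy split W ~ S + L + B (each stage fits the previous residual)."""
    S = top_k_sparse(W, k)
    L = rank1(sub(W, S))
    B = scaled_binary(sub(sub(W, S), L))
    return S, L, B


# Hypothetical 3x3 weight matrix.
W = [[4.0, 0.5, -0.3], [0.2, -3.0, 0.4], [0.1, 0.6, 2.5]]
S, L, B = decompose(W, k=3)
residual_norm = fro(sub(sub(sub(W, S), L), B))
```

Each stage can only shrink the Frobenius norm of the residual, so the three components are complementary in the sense that every one of them removes error the previous ones left behind.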

89. ❌ RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation

Authors: Anuvab Sen, Mir Sayeed Mohammad, Saibal Mukhopadhyay | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04490v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

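Scoring rationale: RAVEN is a deep learning architecture for FMCW radar perception, covering radar signal processing, object detection, and segmentation. It does not involve large language models (LLMs), the LLM-oriented techniques in the keyword list, or AI-for-science applications. Every keyword concerns large models, their underlying techniques, or AI for Science, whereas this paper belongs to radar perception (computer vision / signal processing), so it is unrelated to all of the given keywords.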

!!! tip deepseek-chat TL;DR

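The paper proposes RAVEN, a computationally efficient deep learning architecture for FMCW radar perception. By streaming raw ADC data, preserving MIMO structure, and introducing an early-exit mechanism, it achieves strong object detection and BEV free-space segmentation on automotive radar benchmarks while substantially reducing computation and end-to-end latency.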

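Abstract (translated)

This paper presents RAVEN, a computationally efficient deep learning architecture for FMCW radar perception. The method processes raw ADC data in a chirp-by-chirp streaming fashion, preserves MIMO structure through independent per-receiver state-space encoders, and recovers compact virtual-array features with a learnable cross-antenna mixing module. It also introduces an early-exit mechanism so that, once the latent state has stabilized, the model can make a decision from only a subset of the chirps. On automotive radar benchmarks, the method achieves strong object detection and bird's-eye-view free-space segmentation performance while substantially reducing computation and end-to-end latency compared with conventional frame-based radar pipelines.

Abstract (original)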

This paper presents RAVEN, a computationally efficient deep learning architecture for FMCW radar perception. The method processes raw ADC data in a chirp-wise streaming manner, preserves MIMO structure through independent receiver state-space encoders, and uses a learnable cross-antenna mixing module to recover compact virtual-array features. It also introduces an early-exit mechanism so the model can make decisions using only a subset of chirps when the latent state has stabilized. Across automotive radar benchmarks, the approach reports strong object detection and BEV free-space segmentation performance while substantially reducing computation and end-to-end latency compared with conventional frame-based radar pipelines.

Keywords: FMCW radar, object detection, segmentation, deep learning architecture, computational efficiency, MIMO, early-exit mechanism, automotive radar
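
The early-exit idea in the abstract (stop consuming chirps once the latent state has stabilized) can be sketched with a deliberately tiny stand-in. In the example below a scalar exponential moving average plays the role of the state-space encoder, and the stopping rule (change below `tol` for `patience` consecutive chirps) is an assumption of this sketch, not the criterion used in the paper.

```python
def stream_with_early_exit(chirps, alpha=0.5, tol=1e-2, patience=2):
    """Consume chirps one at a time and stop once the state stops moving.

    Returns (final_state, number_of_chirps_used).
    """
    state = 0.0
    stable_steps = 0
    for used, x in enumerate(chirps, start=1):
        new_state = (1 - alpha) * state + alpha * x  # toy recurrent update
        if abs(new_state - state) < tol:
            stable_steps += 1
        else:
            stable_steps = 0
        state = new_state
        if stable_steps >= patience:
            return state, used  # early exit: remaining chirps are skipped
    return state, len(chirps)


# Hypothetical frame of 20 identical chirp features.
final_state, chirps_used = stream_with_early_exit([1.0] * 20)
```

The saving comes from the chirps that are never processed: the more quickly the state settles, the smaller the fraction of the frame that has to be encoded before a decision is made.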

90. ❌ Scalable and Explainable Learner-Video Interaction Prediction using Multimodal Large Language Models

Authors: Dominik Glandorf, Fares Fawzi, Tanja Käser | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04482v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

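Scoring rationale: The paper's core is using multimodal large language models (MLLMs) to predict learner interaction behavior in educational videos, with an emphasis on explainability. It is therefore highly relevant to "Large Language Models" (10 points), since MLLMs extend LLMs, and to "Explainable AI" (10 points), since the paper explicitly targets explainability and interprets predictions with concept activation vectors. It has some relevance to "AI for Science" (5 points): educational video analysis can be viewed as an application of AI to educational science, although it is not core bioinformatics or cheminformatics. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated (0 points).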

!!! tip deepseek-chat TL;DR

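The study proposes a scalable, explainable pipeline based on multimodal large language models that predicts learner interaction behavior in educational videos (such as pausing and skipping) from video content alone. It shows that the pipeline reliably predicts interaction peaks, generalizes to new domains, and opens new opportunities to test multimedia learning theory at scale.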

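Abstract (translated)

Learners' use of playback controls in educational videos implicitly reflects their cognitive processing and the quality of the instructional design, yet the lack of scalable, explainable predictive models limits instructors' ability to anticipate such behavior before deployment. We propose a scalable, explainable prediction pipeline that predicts population-level watching, pausing, skipping, and rewinding behavior from video content alone, treating these behaviors as proxies for cognitive load. The method uses multimodal large language models (MLLMs) to compute embeddings of short video segments and trains a neural classifier to identify temporally fine-grained interaction peaks. Drawing on instructional design principles from multimedia learning theory for optimizing cognitive load, we code features of the video segments with GPT-5 and use them as the basis for interpreting model predictions via concept activation vectors. We evaluate the pipeline on 77 million video control events from 66 online courses. The results show that classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts. Overall, our findings demonstrate that low-cost, interpretable pre-screening of educational video design is feasible and open new opportunities for testing multimedia learning theory empirically at scale.

Abstract (original)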

Learners’ use of video controls in educational videos provides implicit signals of cognitive processing and instructional design quality, yet the lack of scalable and explainable predictive models limits instructors’ ability to anticipate such behavior before deployment. We propose a scalable, interpretable pipeline for predicting population-level watching, pausing, skipping, and rewinding behavior as proxies for cognitive load from video content alone. Our approach leverages multimodal large language models (MLLMs) to compute embeddings of short video segments and trains a neural classifier to identify temporally fine-grained interaction peaks. Drawing from multimedia learning theory on instructional design for optimal cognitive load, we code features of the video segments using GPT-5 and employ them as a basis for interpreting model predictions via concept activation vectors. We evaluate our pipeline on 77 million video control events from 66 online courses. Our findings demonstrate that classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts. Overall, our results show the feasibility of cost-efficient, interpretable pre-screening of educational video design and open new opportunities to empirically examine multimedia learning theory at scale.

Keywords: multimodal large language models, educational videos, learner interaction prediction, interpretable pipeline, cognitive load, concept activation vectors, multimedia learning theory, scalable analysis

91. ❌ Discrete Prototypical Memories for Federated Time Series Foundation Models

Authors: Liwei Deng, Qingxiang Liu, Xinhe Niu, Shengchao Chen, Sheng Sun, Yuankai Wu, Guodong Long, Yuxuan Liang | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04475v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

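Scoring rationale: The paper's core topic is using LLMs as time series foundation models under federated learning, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It touches on domain adaptation and cross-domain alignment, which gives it some relevance to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points). Other keywords such as MoE, SLMs, SFT, RAG, reasoning methods, and AI for Science are not covered or mentioned in the paper and are scored 0.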

!!! tip deepseek-chat TL;DR

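To address the semantic misalignment and parameter-sharing problems that arise when LLMs serve as federated time series foundation models, the paper proposes FeDPM, a federated framework based on discrete prototypical memories, which effectively improves modeling performance on cross-domain time series data.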

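Abstract (translated)

Using large language models (LLMs) as federated learning (FL)-based time series foundation models offers a promising way to transfer the generalization ability of LLMs to time series data while preserving access to private data. However, the semantic misalignment between time series data and the text-centric latent space of existing LLMs often degrades performance. In addition, the parameter-sharing mechanism in existing FL methods models heterogeneous cross-domain time series data in a unified continuous latent space, which conflicts with the fact that time series semantics often appear as discrete, recurring regimes. To address these limitations, we propose FeDPM, a federated framework for time series foundation models based on discrete prototypical memories. Specifically, we learn local prototypical memory priors for in-domain time series data. We then promote a unified discrete latent space through cross-domain memory alignment and introduce a domain-specific memory update mechanism to balance shared and personalized prototypical knowledge. Extensive experiments demonstrate the efficiency and effectiveness of FeDPM. The code is publicly available at https://anonymous.4open.science/r/FedUnit-64D1.

Abstract (original)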

Leveraging Large Language Models (LLMs) as federated learning (FL)-based time series foundation models offers a promising way to transfer the generalization capabilities of LLMs to time series data while preserving access to private data. However, the semantic misalignment between time-series data and the text-centric latent space of existing LLMs often leads to degraded performance. Meanwhile, the parameter-sharing mechanism in existing FL methods models heterogeneous cross-domain time-series data into a unified continuous latent space, which contradicts the fact that time-series semantics frequently manifest as discrete and recurring regimes. To address these limitations, we propose FeDPM, a federated framework for time-series foundation models based on discrete prototypical memories. Specifically, we learn local prototypical memory priors for intra-domain time-series data. We then align cross-domain memories to promote a unified discrete latent space and introduce a domain-specific memory update mechanism to balance shared and personalized prototypical knowledge. Extensive experiments demonstrate the efficiency and effectiveness of FeDPM. The code is publicly available at https://anonymous.4open.science/r/FedUnit-64D1.

Keywords: Large Language Models, Federated Learning, Time Series Foundation Models, Discrete Prototypical Memories, Cross-domain Alignment, Semantic Misalignment, Domain Adaptation, Heterogeneous Data
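
The abstract argues that time series semantics often show up as discrete, recurring regimes. One common way to obtain such a discrete view, shown here as a minimal sketch rather than FeDPM's actual mechanism, is to cut the series into patches and replace each patch by the index of its nearest prototype in a small memory. The prototype values, patch length, and function names below are assumptions for illustration.

```python
def nearest_prototype(patch, memory):
    """Index of the prototype closest to `patch` (squared Euclidean distance)."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(patch, proto)) for proto in memory
    ]
    return distances.index(min(distances))


def quantize(series, memory, patch_len):
    """Split `series` into non-overlapping patches and map each to a code."""
    patches = [
        series[i:i + patch_len]
        for i in range(0, len(series) - patch_len + 1, patch_len)
    ]
    return [nearest_prototype(p, memory) for p in patches]


# Hypothetical memory with three prototypes: "low", "high", "rising".
memory = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
series = [0.1, -0.1, 0.9, 1.2, 0.0, 0.8, 1.0, 1.0]
codes = quantize(series, memory, patch_len=2)
```

In this toy series the "high" regime appears twice, so the same code recurs. A federated variant would additionally have to align the memories learned on different clients, which this sketch does not attempt.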

92. ❌ MAVEN: A Mesh-Aware Volumetric Encoding Network for Simulating 3D Flexible Deformation

Authors: Zhe Feng, Shilong Tao, Haonan Sun, Shaohan Chen, Zhanxing Zhu, Yunhuai Liu | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04474v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

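Scoring rationale: MAVEN uses graph neural networks (GNNs) for physical simulation of 3D flexible deformation. Its core innovation is explicitly modeling higher-order geometric mesh elements (such as 2D facets and 3D cells) to improve simulation accuracy. All keywords relate directly to large language models (LLMs), their underlying techniques, or specific AI applications (alignment, reasoning, agents, and so on), whereas this paper studies GNNs for physical simulation, a different subfield of deep learning. The only keyword with slight relevance is "AI for Science OR Bioinformatics OR Cheminformatics", because physical simulation can be viewed as part of scientific computing. However, the paper does not explicitly involve bioinformatics or cheminformatics, and its focus is engineering physics rather than AI for science broadly, so it receives 5 points (some relevance). The other keywords have no direct connection to the paper and are scored 0.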

!!! tip deepseek-chat TL;DR

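To address the fact that existing graph neural networks ignore higher-order geometric features when simulating 3D flexible deformation, the paper proposes MAVEN, which explicitly models learnable mappings among 3D cells, 2D facets, and vertices. This yields more accurate and natural physical simulation and achieves state-of-the-art performance on several datasets and on a metal stretch-bending task.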

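Abstract (translated)

Deep learning approaches, graph neural networks (GNNs) in particular, have attracted growing attention for simulating flexible deformation and contact of solids, owing to their ability to handle unstructured physical fields and perform nonlinear regression on graph structures. However, existing GNNs usually represent a mesh with a graph built only from vertices and edges. Such approaches tend to overlook higher-dimensional spatial features of the original geometry, such as 2D facets and 3D cells. As a result, accurately capturing boundary representations and volumetric characteristics remains difficult, even though this information is critical for modeling contact interactions and the propagation of internal physical quantities, especially under sparse mesh discretization. We propose MAVEN, a mesh-aware volumetric encoding network for simulating 3D flexible deformation that explicitly models higher-dimensional geometric mesh elements to achieve more accurate and natural physical simulation. MAVEN establishes learnable mappings among 3D cells, 2D facets, and vertices, enabling flexible mutual transformation. The model incorporates explicit geometric features to reduce the burden of learning geometric patterns implicitly. Experimental results show that MAVEN consistently achieves state-of-the-art performance on existing datasets and on a novel metal stretch-bending task featuring large deformations and prolonged contact.

Abstract (original)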

Deep learning-based approaches, particularly graph neural networks (GNNs), have gained prominence in simulating flexible deformations and contacts of solids, due to their ability to handle unstructured physical fields and nonlinear regression on graph structures. However, existing GNNs commonly represent meshes with graphs built solely from vertices and edges. These approaches tend to overlook higher-dimensional spatial features, e.g., 2D facets and 3D cells, from the original geometry. As a result, it is challenging to accurately capture boundary representations and volumetric characteristics, though this information is critically important for modeling contact interactions and internal physical quantity propagation, particularly under sparse mesh discretization. In this paper, we introduce MAVEN, a mesh-aware volumetric encoding network for simulating 3D flexible deformation, which explicitly models geometric mesh elements of higher dimension to achieve a more accurate and natural physical simulation. MAVEN establishes learnable mappings among 3D cells, 2D facets, and vertices, enabling flexible mutual transformations. Explicit geometric features are incorporated into the model to alleviate the burden of implicitly learning geometric patterns. Experimental results show that MAVEN consistently achieves state-of-the-art performance across established datasets and a novel metal stretch-bending task featuring large deformations and prolonged contacts.

Keywords: mesh-aware volumetric encoding, 3D flexible deformation, graph neural networks, higher-dimensional spatial features, physical simulation, contact interactions, volumetric characteristics, state-of-the-art performance

93. ❌ What Makes a Sale? Rethinking End-to-End Seller–Buyer Retail Dynamics with LLM Agents

Authors: Jeonghwan Choi, Jibin Hwang, Gyeonghun Sun, Minjeong Ban, Taewon Yun, Hyeonjae Cheon, Hwanjun Song | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04468v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

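Scoring rationale: The paper's core is building the retail simulation framework RetailSim with LLM agents, so it is highly relevant to "Large Language Models" and "LLM Agents" (10 points each). It involves seller-buyer multi-agent interaction, which makes it highly relevant to "Multi-agent Systems" (10 points). Other keywords such as MoE, SFT, RAG, and quantization are not mentioned or covered in the abstract and are scored 0.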

!!! tip deepseek-chat TL;DR

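The paper proposes RetailSim, a framework that uses LLM agents to simulate end-to-end retail dynamics. Human evaluation and meta-evaluation confirm that it reproduces real-world economic regularities, and decision-oriented use cases such as seller-buyer interaction analysis and sales strategy evaluation demonstrate its practical utility.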

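Abstract (translated)

Evaluating retail strategies before deployment is challenging because outcomes are determined jointly across multiple stages, from seller-side persuasion through buyer-seller interaction to the final purchase decision. Existing retail simulators, however, capture only parts of this process and do not model cross-stage dependencies, which makes it hard to assess how early decisions affect downstream outcomes. We present RetailSim, an end-to-end retail simulation framework that models this pipeline in a unified environment and is explicitly designed for high simulation fidelity through a diverse product space, persona-driven agents, and multi-turn interactions. We evaluate RetailSim with a dual protocol comprising human evaluation of behavioral fidelity and meta-evaluation against real-world economic regularities. The results show that the framework reproduces key patterns such as demographic purchasing behavior, the price-demand relationship, and heterogeneous price elasticity. We further demonstrate its practical utility through decision-oriented use cases, including persona inference, buyer-seller interaction analysis, and sales strategy evaluation, showing RetailSim's great potential as a controlled testbed for exploring retail strategies.

Abstract (original)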

Evaluating retail strategies before deployment is difficult, as outcomes are determined across multiple stages, from seller-side persuasion through buyer-seller interaction to purchase decisions. However, existing retail simulators capture only partial aspects of this process and do not model cross-stage dependencies, making it difficult to assess how early decisions affect downstream outcomes. We present RetailSim, an end-to-end retail simulation framework that models this pipeline in a unified environment, explicitly designed for simulation fidelity through diverse product spaces, persona-driven agents, and multi-turn interactions. We evaluate RetailSim with a dual protocol comprising human evaluation of behavioral fidelity and meta-evaluation against real-world economic regularities, showing that it successfully reproduces key patterns such as demographic purchasing behavior, the price-demand relationship, and heterogeneous price elasticity. We further demonstrate its practical utility via decision-oriented use cases, including persona inference, seller-buyer interaction analysis, and sales strategy evaluation, showing RetailSim’s potential as a controlled testbed for exploring retail strategies.

Keywords: LLM Agents, Retail Simulation, Multi-agent Systems, Seller-Buyer Interaction, End-to-End Simulation, Behavioral Fidelity, Economic Regularities, Sales Strategy Evaluation
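
One of the regularities the abstract checks is heterogeneous price elasticity. A simulator's output could be tested for it with the standard arc (midpoint) elasticity formula, applied per buyer segment. The sketch below is only an illustration of that check; the segment names and the price and quantity figures are invented for the example and do not come from the paper.

```python
def arc_elasticity(p0, q0, p1, q1):
    """Arc (midpoint) price elasticity of demand between two observations."""
    if p0 == p1:
        raise ValueError("prices must differ to measure elasticity")
    pct_quantity = (q1 - q0) / ((q1 + q0) / 2)
    pct_price = (p1 - p0) / ((p1 + p0) / 2)
    return pct_quantity / pct_price


def elasticity_by_segment(observations):
    """Map each segment to its elasticity.

    `observations` maps a segment name to ((p0, q0), (p1, q1)).
    """
    return {
        segment: arc_elasticity(p0, q0, p1, q1)
        for segment, ((p0, q0), (p1, q1)) in observations.items()
    }


# Hypothetical simulated demand before and after a price increase from 10 to 12.
observations = {
    "students": ((10, 100), (12, 80)),
    "professionals": ((10, 100), (12, 95)),
}
elasticities = elasticity_by_segment(observations)
```

Values with magnitude above 1 indicate elastic demand and values below 1 inelastic demand, so clearly different magnitudes across segments are what "heterogeneous price elasticity" would look like in simulated data.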

94. ❌ MC-GenRef: Annotation-free mammography microcalcification segmentation with generative posterior refinement

Authors: Hyunwoo Cho, Yeeun Kwon, Min Jung Kim, Yangmo Yoo | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04470v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

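Scoring rationale: The paper focuses on microcalcification segmentation in medical imaging (mammography) and proposes an annotation-free framework that combines synthetic supervision with test-time generative posterior refinement. Its core techniques are computer vision and medical image analysis, using a generative model (a rectified-flow generator) and a segmentation network. It does not involve large language models (LLMs), innovations in their underlying techniques (such as MoE, Scaling Laws, or PEFT), training and alignment methods for large models (such as RLHF or Instruction Tuning), inference optimization (such as KV Cache or Speculative Decoding), agents, or large-model application techniques (such as RAG or CoT). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to biomedicine (medical image analysis), which relates to "AI for Science" even though that is not its core contribution, so it receives 8 points (fairly strong relevance). All other keywords are unrelated to the paper and are scored 0.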

!!! tip deepseek-chat TL;DR

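The paper proposes MC-GenRef, a mammography microcalcification segmentation framework that needs no real dense annotations. Through training on synthetic supervision and test-time generative posterior refinement, it improves segmentation accuracy and robustness and reduces missed detections on public and private datasets.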

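Abstract (translated)

Microcalcification (MC) analysis is clinically important in mammography screening because clustered punctate calcifications can be an early sign of malignancy. Dense MC segmentation nevertheless remains challenging: the targets are extremely small and sparse, dense pixel-level annotation is expensive and ambiguous, and cross-site shift often causes texture-driven false positives and missed punctate calcifications in dense tissue. We propose MC-GenRef, a framework that requires no real dense annotations and combines high-fidelity synthetic supervision with test-time generative posterior refinement (TT-GPR). During training, real negative mammogram patches serve as backgrounds, and physically plausible MC patterns are injected through a lightweight image formation model (combining local contrast modulation and blur), producing exact image-mask pairs without any real dense annotation. Using only these synthetic labeled pairs, MC-GenRef trains a base segmentor and a seed-conditioned rectified-flow (RF) generator that acts as a controllable generative prior. During inference, TT-GPR treats segmentation as approximate posterior inference: it extracts a sparse seed from the current prediction, forms seed-consistent RF projections, converts them into case-specific surrogate targets through the frozen segmentor, and iteratively refines the logits with overlap-consistency and edge-aware regularization. On the INbreast dataset, the synthetic-only initializer achieved the best Dice score without real dense annotations, while TT-GPR further improved miss-sensitive metrics (such as Recall and FNR) and showed strong class-balanced behavior (balanced accuracy, Bal.Acc., and G-Mean). On the external private Yonsei cohort (n=50), TT-GPR consistently improved on the synthetic-only initializer under cross-site shift, raising Dice and Recall while lowering FNR. These results suggest that test-time generative posterior refinement is a practical way to reduce missed MCs and improve model robustness without additional real dense annotation.

Abstract (original)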

Microcalcification (MC) analysis is clinically important in screening mammography because clustered puncta can be an early sign of malignancy, yet dense MC segmentation remains challenging: targets are extremely small and sparse, dense pixel-level labels are expensive and ambiguous, and cross-site shift often induces texture-driven false positives and missed puncta in dense tissue. We propose MC-GenRef, a real dense-label-free framework that combines high-fidelity synthetic supervision with test-time generative posterior refinement (TT-GPR). During training, real negative mammogram patches are used as backgrounds, and physically plausible MC patterns are injected through a lightweight image formation model with local contrast modulation and blur, yielding exact image-mask pairs without real dense annotation. Using only these synthetic labeled pairs, MC-GenRef trains a base segmentor and a seed-conditioned rectified-flow (RF) generator that serves as a controllable generative prior. During inference, TT-GPR treats segmentation as approximate posterior inference: it derives a sparse seed from the current prediction, forms seed-consistent RF projections, converts them into case-specific surrogate targets through the frozen segmentor, and iteratively refines the logits with overlap-consistent and edge-aware regularization. On INbreast, the synthetic-only initializer achieved the best Dice without real dense annotations, while TT-GPR improved miss-sensitive performance to Recall and FNR, with strong class-balanced behavior (Bal.Acc., G-Mean). On an external private Yonsei cohort (n=50), TT-GPR consistently improved the synthetic-only initializer under cross-site shift, increasing Dice and Recall while reducing FNR. These results suggest that test-time generative posterior refinement is a practical route to reduce MC misses and improve robustness without additional real dense labeling.

Keywords: mammography, microcalcification segmentation, annotation-free, generative posterior refinement, synthetic supervision, rectified-flow generator, test-time refinement, medical image analysis
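
The abstract above reports Dice, Recall, FNR, balanced accuracy, and G-Mean without defining them. As a minimal sketch, assuming the standard confusion-matrix definitions (the paper may compute them per image or per lesion rather than per pixel), the following shows how these metrics relate for a pair of flat binary masks. The function names and the toy masks are illustrative and are not taken from the paper.

```python
import math


def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth):
    """Standard overlap and miss-sensitive metrics (assumed definitions)."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    dice_den = 2 * tp + fp + fn
    dice = 2 * tp / dice_den if dice_den else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0   # FNR = 1 - Recall when positives exist
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "dice": dice,
        "recall": recall,
        "fnr": fnr,
        "bal_acc": (recall + specificity) / 2,
        "g_mean": math.sqrt(recall * specificity),
    }


# Toy example: 3 positive pixels, of which 2 are found, plus 1 false positive.
truth = [0, 0, 1, 1, 0, 0, 1, 0]
pred = [0, 1, 1, 0, 0, 0, 1, 0]
metrics = segmentation_metrics(pred, truth)
```

Because MC pixels are a tiny fraction of a mammogram, Recall and FNR expose missed calcifications that plain accuracy would hide, which is presumably why the abstract calls them miss-sensitive.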

95. ❌ The Topology of Multimodal Fusion: Why Current Architectures Fail at Creative Cognition

Authors: Xiujiang Tan Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04465v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

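Scoring rationale: The paper examines topological limitations of multimodal AI architectures, drawing on philosophy, cognitive science, and mathematical theory to propose a new theoretical framework and benchmarks. Although it concerns AI architecture, it does not specifically discuss large models, the technical principles of deep learning, or scientific applications, so it has no direct connection to any of the scoring keywords.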

!!! tip deepseek-chat TL;DR

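The paper identifies a topological limitation in current multimodal AI architectures rooted in modal separability, and proposes a theoretical framework incorporating concepts from Chinese philosophy, a mathematical formalization, and accompanying benchmarks for evaluating cross-civilizational topological isomorphism.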

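Abstract translation

This paper reveals a structural, non-parametric topological limitation in current multimodal AI architectures. Contrastive alignment (CLIP), cross-attention fusion (GPT-4V/Gemini), and diffusion-based generation share a common geometric prior, modal separability, which we call contact topology. The argument rests on three pillars, with philosophy as the generative center. The philosophical pillar reinterprets Wittgenstein's saying/showing distinction as a problem rather than a conclusion: where Wittgenstein chose silence, the Chinese craft epistemology tradition answered with xiang (象, operative schema), a third state that emerges when saying and showing interpenetrate. A cruciform framework (dao/qi × saying/showing) places xiang at the intersection and performs a dual huacai (化裁) along both axes. This produces a two-layer dynamics: chuanghua (创化, creative transformation as a spontaneous event) and huacai (its institutionalization into repeatable form). The cognitive science pillar reinterprets the tripartite co-activation of the default mode network, executive control network, and salience network through a pathological mirror: in a two-dimensional parameter space (coupling intensity × regulatory capacity), overlap isomorphism contrasts with superimposition collapse. The mathematical pillar formalizes this through fiber bundles and Yang-Mills curvature and maps the cruciform structure into fiber bundle language. We propose implementing UOO through Neural ODEs with topological regularization, the ANALOGY-MM benchmark with an error-type-ratio metric, and the three-tier META-TOP benchmark, which tests cross-civilizational topological isomorphism across seven archetypes. A phased experimental roadmap with explicit termination criteria ensures a clean exit if the theory is falsified.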

Abstract

This paper identifies a structural limitation in current multimodal AI architectures that is topological rather than parametric. Contrastive alignment (CLIP), cross-attention fusion (GPT-4V/Gemini), and diffusion-based generation share a common geometric prior – modal separability – which we term contact topology. The argument rests on three pillars with philosophy as the generative center. The philosophical pillar reinterprets Wittgenstein’s saying/showing distinction as a problem rather than a conclusion: where Wittgenstein chose silence, the Chinese craft epistemology tradition responded with xiang (operative schema) – the third state emerging when saying and showing interpenetrate. A cruciform framework (dao/qi x saying/showing) positions xiang at the intersection, executing dual huacai (transformation-and-cutting) along both axes. This generates a dual-layer dynamics: chuanghua (creative transformation as spontaneous event) and huacai (its institutionalization into repeatable form). The cognitive science pillar reinterprets DMN/ECN/SN tripartite co-activation through the pathological mirror: overlap isomorphism vs. superimposition collapse in a 2D parameter space (coupling intensity x regulatory capacity). The mathematical pillar formalizes these via fiber bundles and Yang-Mills curvature, with the cruciform structure mapped to fiber bundle language. We propose UOO implementation via Neural ODEs with topological regularization, the ANALOGY-MM benchmark with error-type-ratio metric, and the META-TOP three-tier benchmark testing cross-civilizational topological isomorphism across seven archetypes. A phased experimental roadmap with explicit termination criteria ensures clean exit if falsified.

Keywords: multimodal AI, topological limitation, modal separability, philosophical framework, cognitive science, fiber bundles, benchmark testing, creative cognition

96. ❌ DP-OPD: Differentially Private On-Policy Distillation for Language Models

Authors: Fatemeh Khadem, Sajad Mousavi, Yi Fang, Yuhong Liu Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04461v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

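Scoring rationale: The paper's core topic is privacy-preserving distillation for large language models (LLMs), which is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It involves model compression (achieved through distillation) but does not specifically address compression techniques such as quantization or sparsification, so 'Quantization OR Model Compression OR Low-bit Weights' receives 5 points. Other keywords, including MoE, SLMs, Scaling Laws, the various training methods (pre-training, fine-tuning, alignment, and so on), inference optimization, agents, and AI for science, are not addressed in the paper and receive 0 points.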

!!! tip deepseek-chat TL;DR

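The paper proposes DP-OPD, a differentially private on-policy distillation framework for compressing large language models while preserving privacy. It avoids the complex pipeline of earlier methods, which require training a privacy-preserving teacher model and generating synthetic text, and achieves better perplexity under a strict privacy budget.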

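Abstract translation

Large language models (LLMs) are increasingly adapted to proprietary and domain-specific corpora containing sensitive information, which creates tension between formal privacy guarantees and efficient deployment through model compression. Differential privacy (DP), usually enforced via DP-SGD, provides record-level protection but often causes substantial utility loss in autoregressive generation, because optimization noise amplifies exposure bias and compounding errors along long generated sequences. Existing private distillation methods either apply DP-SGD to both teacher and student, which increases the computational burden and worsens the privacy-utility tradeoff, or rely on DP synthetic text generated by a DP-trained teacher, which avoids DP on the student at the cost of DP-optimizing a large teacher and adding an offline generation pipeline. We propose Differentially Private On-Policy Distillation (DP-OPD), a framework that needs no synthetic text: privacy is enforced solely by applying DP-SGD to the student, while a frozen teacher provides dense token-level targets on student-generated trajectories. DP-OPD realizes this idea through private generalized knowledge distillation on continuation tokens. Under a strict privacy budget ($\varepsilon=2.0$), DP-OPD lowers perplexity relative to DP fine-tuning and off-policy DP distillation and outperforms synthesis-based DP distillation (Yelp: 44.15$\rightarrow$41.68; BigPatent: 32.43$\rightarrow$30.63), while greatly simplifying the training pipeline. In particular, by eliminating DP training of the teacher and offline synthetic text generation, DP-OPD folds private compression into a single DP student-training loop. Code will be released upon publication at https://github.com/khademfatemeh/dp_opd.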

Abstract

Large language models (LLMs) are increasingly adapted to proprietary and domain-specific corpora that contain sensitive information, creating a tension between formal privacy guarantees and efficient deployment through model compression. Differential privacy (DP), typically enforced via DP-SGD, provides record-level protection but often incurs substantial utility loss in autoregressive generation, where optimization noise can amplify exposure bias and compounding errors along long rollouts. Existing approaches to private distillation either apply DP-SGD to both teacher and student, worsening computation and the privacy–utility tradeoff, or rely on DP synthetic text generation from a DP-trained teacher, avoiding DP on the student at the cost of DP-optimizing a large teacher and introducing an offline generation pipeline. We propose Differentially Private On-Policy Distillation (DP-OPD), a synthesis-free framework that enforces privacy solely through DP-SGD on the student while leveraging a frozen teacher to provide dense token-level targets on student-generated trajectories. DP-OPD instantiates this idea via private generalized knowledge distillation on continuation tokens. Under a strict privacy budget ($\varepsilon=2.0$), DP-OPD improves perplexity over DP fine-tuning and off-policy DP distillation, and outperforms synthesis-based DP distillation (Yelp: 44.15$\rightarrow$41.68; BigPatent: 32.43$\rightarrow$30.63), while substantially simplifying the training pipeline. In particular, DP-OPD collapses private compression into a single DP student-training loop by eliminating DP teacher training and offline synthetic text generation. Code will be released upon publication at https://github.com/khademfatemeh/dp_opd.

Keywords: Differential Privacy, Knowledge Distillation, Large Language Models, Model Compression, Privacy-Preserving Machine Learning, On-Policy Learning, Autoregressive Generation, DP-SGD
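
The abstract above relies on DP-SGD applied to the student only. As a rough sketch of a generic DP-SGD update (per-example gradient clipping followed by Gaussian noise), and not the authors' DP-OPD implementation, the following assumes per-example gradients are already available as plain lists. The teacher-provided distillation loss that would produce those gradients is not modeled, and all names and hyperparameter values here are hypothetical.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One generic DP-SGD update: clip each example, sum, add noise, average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    batch_size = len(clipped)
    summed = [sum(column) for column in zip(*clipped)]
    # Gaussian noise with standard deviation noise_multiplier * max_norm per coordinate.
    sigma = noise_multiplier * max_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


# Hypothetical usage: two examples, two parameters.
rng = random.Random(0)
student_params = [1.0, 1.0]
grads = [[3.0, 4.0], [0.3, 0.4]]  # stand-ins for gradients of a distillation loss
student_params = dp_sgd_step(student_params, grads, lr=0.1, max_norm=1.0,
                             noise_multiplier=1.0, rng=rng)
```

In the setting the abstract describes, only the student's update would pass through a step like this, while the frozen teacher is queried for targets and never trained, which is what removes the need for DP training of the teacher.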

97. ❌ Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition

Authors: Abu Noman Md Sakib, Zhensen Wang, Merjulah Roby, Zijie Zhang Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04456v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

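Scoring rationale: The paper evaluates the stability of explanations from pre-trained language models (BERT, RoBERTa, DistilBERT) on sentiment analysis and proposes a new metric based on the cosine similarity of SHAP values. The study is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10 points) because its core is evaluating the stability and consistency of model explanations, which falls within explainable AI. However, it does not involve innovations in large-model technical principles (such as MoE, Scaling Laws, or RLHF), applications of large models in other domains (such as AI for Science), or newer large-model techniques (such as RAG, Quantization, or LLM Agents), so all other keywords receive 0 points.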

!!! tip deepseek-chat TL;DR

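The paper proposes a new metric for evaluating the stability of explanations from pre-trained language models on sentiment analysis. It computes the cosine similarity of SHAP values to check whether a model keeps consistent attribution patterns for similar inputs, supporting more robust evaluation of model behavior.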

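Abstract translation

Reliable pattern recognition systems should behave consistently on similar inputs, and their explanations should remain stable. However, most explainable AI (XAI) evaluations remain centered on individual instances and do not explicitly quantify whether attribution patterns are consistent across samples that share a class or are small variants of the same input. This study proposes a new metric for assessing the consistency of model explanations, to ensure that models continue to reflect the intended objectives and stay consistent under label-preserving perturbations. We implement the metric with a pre-trained BERT model on the SST-2 sentiment analysis dataset, run additional robustness tests on RoBERTa, DistilBERT, and the IMDB dataset, and apply SHAP to compute feature importance for various test samples. By quantifying the cosine similarity of SHAP values for inputs with the same label, the metric aims to detect inconsistent behavior, such as biased reliance on particular features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the metric's ability to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to verify whether the new metric can effectively identify cases where model behavior deviates from its intended objectives. By enabling more robust verification of rationale stability, the proposed framework offers a deeper view of model behavior, which is essential for building trustworthy AI systems. By quantifying whether a model relies on consistent attribution patterns for similar inputs, the method provides more robust support for evaluating model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS-XAI-Stability.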

Abstract

Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric aimed at assessing the consistency of model explanations, ensuring that models consistently reflect the intended objectives and consistency under label-preserving perturbations. We implement this metric using a pre-trained BERT model on the SST-2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model’s behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS-XAI-Stability.

Keywords: Explainable AI, Model Explanation Stability, SHAP, BERT, Sentiment Analysis, Rationale Consistency, Label-preserving Perturbations, Trustworthy AI
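
The abstract above describes a metric built on the cosine similarity of SHAP values for inputs with the same label. One plausible reading, sketched below, is the mean pairwise cosine similarity within each label group. This assumes attribution vectors have already been computed and aligned to a shared fixed-length feature space; the paper's exact alignment and aggregation are not given in the abstract, and the function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # a single explanation is trivially consistent with itself
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def stability_by_label(attributions, labels):
    """Group attribution vectors by label and score each group's consistency."""
    groups = {}
    for vector, label in zip(attributions, labels):
        groups.setdefault(label, []).append(vector)
    return {label: mean_pairwise_similarity(vs) for label, vs in groups.items()}


# Toy attribution vectors standing in for aligned SHAP values.
attributions = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
labels = ["positive", "positive", "negative", "negative"]
scores = stability_by_label(attributions, labels)
```

A score near 1 suggests that same-label inputs are explained by the same features, while a low or negative score flags the kind of inconsistent attribution the abstract aims to detect.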

98. ❌ Conversational Control with Ontologies for Large Language Models: A Lightweight Framework for Constrained Generation

Authors: Barbara Gendron, Gaël Guibon, Mathieu d’Aquin Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04450v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

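Scoring rationale: The paper's core topic is controlled generation and conversational control for LLMs, so it is highly relevant to 'Large Language Models' (10 points). It uses 'Supervised Fine-tuning' (10 points), involves 'Instruction Tuning/Alignment' to strengthen alignment (8 points), applies to 'LLM Agents/Autonomous Agents' (8 points), and emphasizes 'Explainable AI' to improve interpretability (8 points). Other keywords such as MoE, SLMs, RAG, and RLHF are not mentioned in the abstract and are scored 0.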

!!! tip deepseek-chat TL;DR

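The paper proposes a lightweight ontology-based framework that uses constrained generation and fine-tuning to give modular, explainable control over the conversational output of large language models. It validates the approach on two tasks, English proficiency level and content polarity, and shows that the framework is model-agnostic and extensible.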

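Abstract translation

Conversational agents based on large language models (LLMs) have recently become powerful tools for human-computer interaction. However, their black-box nature brings challenges in predictability and a lack of personalization, both of which can be addressed through controlled generation. This study proposes an end-to-end method that achieves modular and explainable control over LLM outputs through ontological definitions of conversation-related aspects. Key aspects are modeled and used as constraints; we then further fine-tune the LLM to generate content accordingly. To validate the method, we explore two tasks that target two key aspects of conversation: English proficiency level and the polarity profile of the content. Through hybrid fine-tuning on seven state-of-the-art open-weight conversational LLMs, we show that the method consistently outperforms pre-trained baselines across metrics, even on smaller models. Beyond the quantitative gains, the framework remains model-agnostic, lightweight, and interpretable, supports reusable control strategies, and can be extended to new domains and interaction goals. The approach strengthens adherence to strategy instructions and demonstrates the effectiveness of ontology-driven control in conversational systems.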

Abstract

Conversational agents based on Large Language Models (LLMs) have recently emerged as powerful tools for human-computer interaction. Nevertheless, their black-box nature implies challenges in predictability and a lack of personalization, both of which can be addressed by controlled generation. This work proposes an end-to-end method to obtain modular and explainable control over LLM outputs through ontological definitions of aspects related to the conversation. Key aspects are modeled and used as constraints; we then further fine-tune the LLM to generate content accordingly. To validate our approach, we explore two tasks that tackle two key conversational aspects: the English proficiency level and the polarity profile of the content. Using a hybrid fine-tuning procedure on seven state-of-the-art, open-weight conversational LLMs, we show that our method consistently outperforms pre-trained baselines, even on smaller models. Beyond quantitative gains, the framework remains model-agnostic, lightweight, and interpretable, enabling reusable control strategies that can be extended to new domains and interaction goals. This approach enhances alignment with strategy instructions and demonstrates the effectiveness of ontology-driven control in conversational systems.

Keywords: Large Language Models, Controlled Generation, Ontological Definitions, Fine-tuning, Conversational Agents, Explainable Control, Model-agnostic Framework, Alignment

99. ❌ PSY-STEP: Structuring Therapeutic Targets and Action Sequences for Proactive Counseling Dialogue Systems

Authors: Jihyun Lee, Yejin Min, Yejin Jeon, SungJun Yang, Hyounghun Kim, Gary Geunbae Lee Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04448v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

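Scoring rationale: The paper studies dialogue systems for Cognitive Behavioral Therapy (CBT). It builds the STEP dataset and trains the STEPPER counseling agent to proactively identify automatic negative thoughts and carry out cognitive interventions. Its focus is an AI application in psychological counseling rather than innovation in the technical principles of large models, so most keywords (such as LLMs, MoE, and Scaling Laws) are unrelated to the paper and score 0. Only two keywords are relevant. (1) 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO': the paper mentions improving STEPPER through preference learning on simulated, synthesized counseling sessions, which relates to preference-learning techniques, but it does not explicitly use a specific method such as RLHF or DPO, so it receives 5 points (some relevance). (2) 'AI for Science OR Bioinformatics OR Cheminformatics': the paper is an application of AI to mental health science and fits within 'AI for Science', so it receives 5 points (some relevance). Other keywords covering large-model techniques, reasoning methods, model optimization, and so on are not addressed.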

!!! tip deepseek-chat TL;DR

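To address the difficulty existing counseling agents have in identifying and handling automatic negative thoughts in dialogue, the paper introduces the STEP dataset and the STEPPER counseling agent. After refinement through preference learning, STEPPER delivers more clinically grounded, coherent, and personalized CBT counseling and achieves higher counselor competence without inducing emotional disruption.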

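Abstract translation

Cognitive Behavioral Therapy (CBT) aims to identify and restructure the automatic negative thoughts that arise from an individual's involuntary interpretations of events, yet existing counseling agents struggle to identify and handle these thoughts effectively in dialogue settings. To close this gap, we introduce the STEP dataset, which models the CBT counseling process by explicitly reflecting automatic thoughts together with dynamic, action-level counseling sequences. Using this dataset, we train STEPPER, a counseling agent that proactively elicits automatic thoughts and carries out interventions grounded in cognitive theory. To further improve decision accuracy and empathic responsiveness, we refine STEPPER through preference learning based on simulated, synthesized counseling sessions. Extensive CBT-aligned evaluations show that, compared with other strong baseline models, STEPPER provides more clinically grounded, coherent, and personalized counseling and demonstrates higher counselor competence without causing emotional distress.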

Abstract

Cognitive Behavioral Therapy (CBT) aims to identify and restructure automatic negative thoughts pertaining to involuntary interpretations of events, yet existing counseling agents struggle to identify and address them in dialogue settings. To bridge this gap, we introduce STEP, a dataset that models CBT counseling by explicitly reflecting automatic thoughts alongside dynamic, action-level counseling sequences. Using this dataset, we train STEPPER, a counseling agent that proactively elicits automatic thoughts and executes cognitively grounded interventions. To further enhance both decision accuracy and empathic responsiveness, we refine STEPPER through preference learning based on simulated, synthesized counseling sessions. Extensive CBT-aligned evaluations show that STEPPER delivers more clinically grounded, coherent, and personalized counseling compared to other strong baseline models, and achieves higher counselor competence without inducing emotional disruption.

Keywords: Cognitive Behavioral Therapy, counseling dialogue systems, automatic thoughts, preference learning, counseling agent, clinical evaluation, personalized counseling, emotional disruption

100. ❌ Training Transformers in Cosine Coefficient Space

Authors: Mohamed Amine Bergach Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04440v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

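Scoring rationale: The paper proposes parameterizing Transformer weight matrices in the discrete cosine transform (DCT) domain and compressing the model by keeping only low-frequency coefficients, which places it among model compression and parameter-efficient fine-tuning techniques. It is highly relevant to 'Quantization OR Model Compression OR Low-bit Weights' (10 points) because its core is compressing model parameters, and relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (8 points) because it is a parameter-efficient method. It has some relevance to 'Large Language Models OR LLMs OR Foundation Models' and 'Small Language Models OR SLMs OR On-device AI' (5 points each) because the method applies to Transformer architectures and could be used for on-device AI, and some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points) because it involves training from scratch. Other keywords such as MoE, alignment, and inference acceleration are unrelated to the paper and receive 0 points.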

!!! tip deepseek-chat TL;DR

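The paper proposes parameterizing Transformer weight matrices in the discrete cosine transform domain and compresses the model by keeping only low-frequency coefficients. On character-level language modeling it reaches performance close to the standard parameterization while storing only 52% of the parameters.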

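Abstract translation

We parameterize the weight matrices of a transformer in the two-dimensional discrete cosine transform (DCT) domain and keep only the lowest-frequency coefficients. During the forward pass, the full weight matrix is reconstructed via the inverse DCT; gradients propagate through the reconstruction and update the spectral coefficients directly.
On a character-level language modeling task (Shakespeare dataset, 1M characters), a 4-layer transformer trained from scratch in this representation matches the perplexity of the standard parameterization (6.1 vs. 6.1) while storing only 52% of the parameters. At 4× compression (29% of the parameters retained), the model reaches perplexity 6.9, outperforming a low-rank baseline at a comparable reduction (perplexity 8.8 with 21% of the parameters retained).
The method requires no architectural changes, no pre-trained checkpoint, and no auxiliary loss. Implementing it only requires replacing each nn.Linear layer with a drop-in spectral layer that stores $K$ DCT coefficients instead of $n \times m$ weight parameters.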

Abstract

We parameterize the weight matrices of a transformer in the two-dimensional discrete cosine transform (DCT) domain, retaining only the lowest-frequency coefficients. At each forward pass the full weight matrix is reconstructed via the inverse DCT; gradients propagate through the reconstruction to update the spectral coefficients directly. On character-level language modeling (Shakespeare, 1M characters), a 4-layer transformer trained from scratch in this representation matches the perplexity of the standard parameterization (6.1 vs. 6.1) while storing 52% of the parameters. At 4× compression (29% of parameters), the model reaches perplexity 6.9 – outperforming a low-rank baseline (perplexity 8.8 at 21% of parameters) at a comparable reduction. The method requires no architectural changes, no pre-trained checkpoint, and no auxiliary loss. It reduces to replacing each nn.Linear with a drop-in spectral layer that stores $K$ DCT coefficients instead of $n \times m$ weights.

Keywords: Transformer, Discrete Cosine Transform, Model Compression, Parameter-efficient, Spectral Coefficients, Language Modeling, Low-rank Baseline, Drop-in Spectral Layer
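
The abstract above describes rebuilding each weight matrix from a small set of low-frequency DCT coefficients at every forward pass. The following is a minimal pure-Python sketch of that reconstruction, assuming an orthonormal DCT-II basis and assuming the retained coefficients form the top-left block of the spectrum (the paper may select its $K$ coefficients differently). It shows the forward computation only, with no gradient flow, and all function names are hypothetical rather than the paper's API.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def reconstruct_weight(coeffs, n, m):
    """Inverse 2D DCT of a spectrum that is zero outside its top-left block."""
    kr, kc = len(coeffs), len(coeffs[0])
    cn, cm = dct_matrix(n), dct_matrix(m)
    return [[sum(cn[u][i] * coeffs[u][v] * cm[v][j]
                 for u in range(kr) for v in range(kc))
             for j in range(m)]
            for i in range(n)]


def spectral_linear(x, coeffs, n, m):
    """Compute y = W x, where the n x m matrix W is rebuilt from coeffs."""
    w = reconstruct_weight(coeffs, n, m)
    return [sum(w[i][j] * x[j] for j in range(m)) for i in range(n)]


def stored_fraction(coeffs, n, m):
    """Fraction of the dense layer's n * m weights that is actually stored."""
    return len(coeffs) * len(coeffs[0]) / (n * m)


# A 2 x 2 layer represented by a single DC coefficient: every weight equals 1.0.
y = spectral_linear([1.0, 2.0], [[2.0]], 2, 2)
```

In a trainable version, the entries of `coeffs` would be the learnable parameters and an autodiff framework would propagate gradients through the reconstruction, as the abstract describes.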

101. ❌ ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Authors: Zhuowen Yuan, Zhaorun Chen, Zhen Xiang, Nathaniel D. Bastian, Seyyed Hadi Hashemi, Chaowei Xiao, Wenbo Guo, Bo Li Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04426v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

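Scoring rationale: The paper focuses on supply-chain security for LLM agent systems. It is highly relevant to 'LLM Agents' and 'Tool Use' (10 points each) because it studies the security threats agents face when using third-party tools, and it has some relevance to 'Large Language Models' (8 points) because agent systems are built on LLMs. The other keywords mainly concern model architecture, training methods, inference optimization, and specific application domains, which have no direct relation to the paper's security focus, so they are scored 0.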

!!! tip deepseek-chat TL;DR

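The paper addresses supply-chain injection attacks through third-party tools in LLM agent systems. It proposes ShieldNet, a network-level guardrail framework, and builds the large-scale benchmark SC-Inject-Bench; experiments show that ShieldNet detects attacks efficiently and outperforms existing methods.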

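Abstract translation

Existing research on the security of large language model agents focuses mainly on prompt injection and unsafe input/output behaviors. However, as agents increasingly depend on third-party tools and Model Context Protocol (MCP) servers, a new class of supply-chain threats has emerged: malicious behavior is embedded in seemingly benign tools, silently hijacking the agent's execution flow, leaking sensitive data, or triggering unauthorized actions. Despite their growing impact, no comprehensive benchmark currently exists for evaluating such threats. To fill this gap, we present SC-Inject-Bench, a large-scale benchmark containing more than 10,000 malicious MCP tools, built on a taxonomy of over 25 attack types derived from the MITRE ATT&CK framework and targeting supply-chain threats. We find that existing MCP scanners and semantic guardrails perform poorly on this benchmark. Based on this finding, we propose ShieldNet, a network-level guardrail framework that detects supply-chain poisoning attacks by observing real network interactions rather than surface-level tool traces. ShieldNet integrates a man-in-the-middle proxy and an event extractor to identify critical network behaviors, which a lightweight classifier then uses for attack detection. Extensive experiments show that ShieldNet achieves strong detection performance (F1 score up to 0.995 with a false positive rate of only 0.8%) while adding very little runtime overhead, significantly outperforming existing MCP scanners and LLM-based guardrails.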

Abstract

Existing research on LLM agent security mainly focuses on prompt injection and unsafe input/output behaviors. However, as agents increasingly rely on third-party tools and MCP servers, a new class of supply-chain threats has emerged, where malicious behaviors are embedded in seemingly benign tools, silently hijacking agent execution, leaking sensitive data, or triggering unauthorized actions. Despite their growing impact, there is currently no comprehensive benchmark for evaluating such threats. To bridge this gap, we introduce SC-Inject-Bench, a large-scale benchmark comprising over 10,000 malicious MCP tools grounded in a taxonomy of 25+ attack types derived from MITRE ATT&CK targeting supply-chain threats. We observe that existing MCP scanners and semantic guardrails perform poorly on this benchmark. Motivated by this finding, we propose ShieldNet, a network-level guardrail framework that detects supply-chain poisoning by observing real network interactions rather than surface-level tool traces. ShieldNet integrates a man-in-the-middle (MITM) proxy and an event extractor to identify critical network behaviors, which are then processed by a lightweight classifier for attack detection. Extensive experiments show that ShieldNet achieves strong detection performance (up to 0.995 F-1 with only 0.8% false positives) while introducing little runtime overhead, substantially outperforming existing MCP scanners and LLM-based guardrails.

Keywords: LLM agent security, supply-chain threats, MCP tools, network-level guardrail, attack detection, benchmark evaluation, ShieldNet, SC-Inject-Bench

102. ❌ Is Prompt Selection Necessary for Task-Free Online Continual Learning?

Authors: Seoyoung Park, Haemin Lee, Hankook Lee Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04420v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

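Scoring rationale: This paper studies task-free online continual learning, a continual-learning setting within machine learning that focuses on learning from a data stream without clear task boundaries while mitigating catastrophic forgetting. The proposed SinglePrompt method injects prompts into self-attention blocks and optimizes the classifier, but 'prompt' here refers to learnable prompt parameters in continual learning, not prompt engineering for large language models. The paper does not address large models, innovations in deep-learning principles, or applications of large models across domains, and it mentions none of the techniques in the scoring keywords (such as LLMs, MoE, SFT, RAG, or quantization). All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

To address the poor performance of prompt selection strategies in task-free online continual learning, the paper proposes SinglePrompt, a simple framework that needs no prompt selection. By injecting a single prompt and refining the classifier design, it achieves state-of-the-art performance on multiple benchmarks.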

摘要翻译

无任务在线持续学习作为一种应对动态现实环境中持续学习问题的现实范式，近年来受到广泛关注。在该场景下，数据以非平稳流的形式到达，缺乏明确的任务边界，且仅能被观测一次。为应对此类挑战性场景，近期许多方法采用了提示选择策略——一种基于输入信号从提示池中自适应选择提示的机制。然而，我们观察到此类选择策略往往无法选取合适的提示，即使对关键参数进行额外训练，仍会导致次优结果。基于此观察，我们提出了一种简单而有效的单提示方法，该方法无需提示选择过程，并专注于分类器优化。具体而言，我们仅需：（一）在每个自注意力模块中注入单一提示；（二）采用基于余弦相似度的逻辑值设计，以缓解分类器权重中固有的遗忘效应；（三）对当前小批量中未出现类别的逻辑值进行掩码处理。凭借这一简洁的无任务设计，我们的框架在多种在线持续学习基准测试中均取得了最先进的性能。源代码发布于 https://github.com/efficient-learning-lab/SinglePrompt。

摘要 (Abstract)

Task-free online continual learning has recently emerged as a realistic paradigm for addressing continual learning in dynamic, real-world environments, where data arrive in a non-stationary stream without clear task boundaries and can only be observed once. To consider such challenging scenarios, many recent approaches have employed prompt selection, an adaptive strategy that selects prompts from a pool based on input signals. However, we observe that such selection strategies often fail to select appropriate prompts, yielding suboptimal results despite additional training of key parameters. Motivated by this observation, we propose a simple yet effective SinglePrompt that eliminates the need for prompt selection and focuses on classifier optimization. Specifically, we simply (i) inject a single prompt into each self-attention block, (ii) employ a cosine similarity-based logit design to alleviate the forgetting effect inherent in the classifier weights, and (iii) mask logits for unexposed classes in the current minibatch. With this simple task-free design, our framework achieves state-of-the-art performance across various online continual learning benchmarks. Source code is available at https://github.com/efficient-learning-lab/SinglePrompt.

Keywords: task-free online continual learning, prompt selection, SinglePrompt, self-attention block, classifier optimization, cosine similarity-based logit, forgetting effect, state-of-the-art performance
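The classifier-side choices (ii) and (iii) in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `scale` temperature, and the random toy data are all assumptions made for illustration.

```python
import numpy as np

def cosine_logits(features, class_weights, scale=10.0):
    """Cosine-similarity logits: L2-normalise features and class weights first.

    `scale` is a hypothetical temperature, not a value taken from the paper.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    return scale * f @ w.T  # shape: (batch, num_classes)

def mask_unexposed(logits, labels):
    """Set logits of classes absent from the current minibatch to -inf."""
    present = np.zeros(logits.shape[1], dtype=bool)
    present[np.unique(labels)] = True
    return np.where(present[None, :], logits, -np.inf)

# Toy minibatch: 4 samples, 8-dim features, 5 classes, only classes 0 and 2 present.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))
class_weights = rng.normal(size=(5, 8))
labels = np.array([0, 2, 2, 0])
masked = mask_unexposed(cosine_logits(features, class_weights), labels)
```

Because both vectors are normalised, each logit is bounded by `scale` in magnitude, so the norm of a class weight vector cannot by itself favour recently seen classes. The mask keeps a softmax over `masked` from pushing down classes that the current minibatch does not contain.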

103. ❌ Justified or Just Convincing? Error Verifiability as a Dimension of LLM Quality

Authors: Xiaoyuan Zhu, Kimberly Le Truong, Riccardo Fogliato, Gokul Swamy, Weijian Zhang, Minglai Yang, Longtian Ye, Bangya Liu, Minghao Liu, Andrew Ilyas, Steven Wu Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04418v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

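Scoring rationale: The paper's core topic is the verifiability of LLM-generated answers. It proposes the concept of 'error verifiability' and a metric for it, placing it within research on LLM quality evaluation and trustworthiness. Highly relevant keywords: LLMs (the core object of study), Hallucination Mitigation (it concerns verifying answer correctness), and Explainable AI (it concerns explanations and verification). Moderately relevant: Post-training (the paper notes that it does not improve verifiability), Chain of Thought (reasoning chains serve as justifications), and Self-Correction (the proposed methods involve reflecting and rephrasing). The remaining keywords are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper studies the verifiability of LLM-generated answers, proposes 'error verifiability' as a new dimension of LLM quality, and develops two methods based on external information that improve verifiability.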

摘要翻译

随着大语言模型被部署于高风险场景，用户必须判断单个回答的正确性，通常依赖于模型生成的论证依据，如推理链或解释说明。然而，目前尚无标准方法来衡量这些论证依据是否有助于用户区分正确答案与错误答案。我们将这一概念形式化为错误可验证性，并提出一种平衡性指标 $v_{\text{bal}}$，用于测量论证依据能否使评估者准确判断答案的正确性；该指标已通过具有高度一致性的人类评估者验证。研究发现，无论是常见的后训练和模型扩展方法，还是更针对性的干预措施，均未能提升可验证性。我们提出了两种成功提升可验证性的方法：针对数学推理的反思与重述（reflect-and-rephrase, RR）以及针对事实性问答的参考重述（oracle-rephrase, OR），这两种方法均通过整合领域适配的外部信息来改善可验证性。综合而言，我们的研究结果表明，错误可验证性是回答质量的一个独立维度，它不会随准确性提升而自然显现，需要采用专门的、具备领域感知的方法来解决。

摘要 (Abstract)

As LLMs are deployed in high-stakes settings, users must judge the correctness of individual responses, often relying on model-generated justifications such as reasoning chains or explanations. Yet, no standard measure exists for whether these justifications help users distinguish correct answers from incorrect ones. We formalize this idea as error verifiability and propose $v_{\text{bal}}$, a balanced metric that measures whether justifications enable raters to accurately assess answer correctness, validated against human raters who show high agreement. We find that neither common approaches, such as post-training and model scaling, nor more targeted interventions recommended improve verifiability. We introduce two methods that succeed at improving verifiability: reflect-and-rephrase (RR) for mathematical reasoning and oracle-rephrase (OR) for factual QA, both of which improve verifiability by incorporating domain-appropriate external information. Together, our results establish error verifiability as a distinct dimension of response quality that does not emerge from accuracy improvements alone and requires dedicated, domain-aware methods to address.

Keywords: LLMs, error verifiability, justifications, reasoning chains, factual QA, mathematical reasoning, model-generated explanations, response quality

104. ❌ Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding

Authors: Haruka Kawasaki, Ryota Tanaka, Kyosuke Nishida Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04411v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

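Scoring rationale: The paper studies the gap between internal representations and generated responses of large vision language models (LVLMs) on visual document understanding tasks, uses linear probing to analyze what information each layer encodes, and explores fine-tuning strategies that target intermediate layers. Core relevant keywords: (1) 'Large Language Models' (the paper studies the LLM component of LVLMs; weight 1.0, relevance 10); (2) 'Post-training' (the paper explores fine-tuning strategies, which fall under post-training; weight 1.0, relevance 10); (3) 'Mechanistic Interpretability' (the paper analyzes internal representations with linear probing, an explainable-AI method; weight 1.0, relevance 10). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and Agents are unrelated to the paper (relevance 0).

!!! tip deepseek-chat TL;DR

The paper reveals a gap between the internal representations and the generated responses of large vision language models on visual document understanding tasks, and shows that fine-tuning intermediate layers improves both linear probing accuracy and response accuracy.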

摘要翻译

视觉文档理解（VDU）是大型视觉语言模型（LVLMs）面临的一项挑战性任务，它需要整合视觉感知、文本识别以及对结构化版式的推理能力。尽管近期的LVLMs在VDU基准测试中已显示出进展，但其性能通常基于生成的响应进行评估，这可能无法必然反映模型是否在内部真正捕获了所需信息。本文通过线性探测方法，研究了LVLMs内部大型语言模型（LLMs）的不同层中解决VDU任务所需信息的表征情况。我们的研究表明：（1）内部表征与生成响应之间存在明显差距；（2）解决任务所需的信息往往在中间层比在最终层以更线性的方式编码。基于这些发现，我们探索了针对中间层的微调策略。实验表明，对中间层进行微调既能提高线性探测准确率，也能提升响应准确率，同时缩小了内部表征与生成响应之间的差距。

摘要 (Abstract)

Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layouts. Although recent LVLMs have shown progress on VDU benchmarks, their performance is typically evaluated based on generated responses, which may not necessarily reflect whether the model has actually captured the required information internally. In this paper, we investigate how information required to solve VDU tasks is represented across different layers of LLMs within LVLMs using linear probing. Our study reveals that (1) there is a clear gap between internal representations and generated responses, and (2) information required to solve the task is often encoded more linearly from intermediate layers than from the final layer. Motivated by these findings, we explore fine-tuning strategies that target intermediate layers. Experiments show that fine-tuning intermediate layers improves both linear probing accuracy and response accuracy while narrowing the gap.

Keywords: Visual Document Understanding, Large Vision Language Models, Internal Representations, Linear Probing, Fine-tuning, Intermediate Layers, Response Accuracy, Gap Analysis
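Linear probing, the analysis tool named in the abstract above, can be sketched as fitting a linear readout on frozen hidden states and comparing held-out accuracy across layers. The sketch below uses a least-squares probe on synthetic data. The layer names, dimensions, and the way the "intermediate" features are constructed are assumptions chosen only to show the mechanics, not the paper's setup or results.

```python
import numpy as np

def probe_accuracy(hidden, labels, train_frac=0.8):
    """Fit a least-squares linear probe and return held-out accuracy."""
    n = hidden.shape[0]
    split = int(n * train_frac)
    X = np.hstack([hidden, np.ones((n, 1))])   # append a bias column
    targets = 2.0 * labels - 1.0               # map {0, 1} to {-1, +1}
    w, *_ = np.linalg.lstsq(X[:split], targets[:split], rcond=None)
    preds = (X[split:] @ w > 0).astype(int)
    return float((preds == labels[split:]).mean())

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
signal = (2.0 * labels - 1.0)[:, None]

# Synthetic stand-ins for frozen per-layer hidden states (hypothetical):
# the "intermediate" layer carries the label linearly in one coordinate,
# while the "final" layer is pure noise.
layers = {
    "intermediate": rng.normal(size=(400, 16)) + 2.0 * signal * np.eye(16)[0],
    "final": rng.normal(size=(400, 16)),
}
accuracies = {name: probe_accuracy(h, labels) for name, h in layers.items()}
```

In a real study the arrays in `layers` would be hidden states extracted from the model at each layer, and the probe accuracy would be compared against the accuracy of the generated responses.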

105. ❌ Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

Authors: Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Sekitoshi Kanai, Masanori Yamada, Kosuke Nishida, Kazutoshi Shinoda Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04410v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

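Scoring rationale: The paper's core topic is language model alignment. It proposes a new alignment method (Relative Density Ratio Optimization) that aims to address the statistical consistency and training stability problems of existing methods (such as DDRO). It is therefore highly relevant (10 points) to 'Large Language Models', 'Instruction Tuning/Alignment', and 'RLHF/DPO', since these keywords map directly onto the paper's research area (large-model alignment). The paper does not touch the other keywords, such as MoE, quantization, inference acceleration, or AI-for-science applications, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a stable and statistically consistent language model alignment method based on relative density ratio optimization, addressing the unstable and diverging density ratio in the existing direct density ratio optimization method, and validates it on Qwen 2.5 and Llama 3.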

摘要翻译

将语言模型与人类偏好对齐对于确保其安全性和可靠性至关重要。尽管现有方法大多基于特定的人类偏好模型（如布拉德利-特里模型），但此类假设可能无法准确反映真实的人类偏好，从而导致这些方法缺乏统计一致性——即无法保证语言模型在样本量增加时收敛于真实的人类偏好。相比之下，直接密度比优化方法无需假设任何人类偏好模型即可实现统计一致性。该方法利用语言模型对偏好数据与非偏好数据分布之间的密度比进行建模，并通过密度比估计进行优化。然而，该密度比具有不稳定性且常发散，导致直接密度比优化的训练过程不稳定。本文提出一种兼具稳定性与统计一致性的新型对齐方法。我们的方法基于偏好数据分布与偏好/非偏好混合数据分布之间的相对密度比。由于该相对密度比存在上界且不发散，因此方法具有稳定性。同时，该方法具备统计一致性，且能提供比直接密度比优化更严格的收敛性保证。我们基于Qwen 2.5和Llama 3模型的实验验证了该方法的有效性。

摘要 (Abstract)

Aligning language models with human preferences is essential for ensuring their safety and reliability. Although most existing approaches assume specific human preference models such as the Bradley-Terry model, this assumption may fail to accurately capture true human preferences, and consequently, these methods lack statistical consistency, i.e., the guarantee that language models converge to the true human preference as the number of samples increases. In contrast, direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference models. DDRO models the density ratio between preferred and non-preferred data distributions using the language model, and then optimizes it via density ratio estimation. However, this density ratio is unstable and often diverges, leading to training instability of DDRO. In this paper, we propose a novel alignment method that is both stable and statistically consistent. Our approach is based on the relative density ratio between the preferred data distribution and a mixture of the preferred and non-preferred data distributions. Our approach is stable since this relative density ratio is bounded above and does not diverge. Moreover, it is statistically consistent and yields significantly tighter convergence guarantees than DDRO. We experimentally show its effectiveness with Qwen 2.5 and Llama 3.

Keywords: Language Model Alignment, Human Preferences, Statistical Consistency, Density Ratio Optimization, Training Stability, Relative Density Ratio, Direct Density Ratio Optimization, Model Convergence
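The stability argument in the abstract above rests on the relative density ratio being bounded. A small numeric sketch makes this concrete. It assumes the standard form $p / (\alpha p + (1-\alpha) q)$ with a mixing weight `alpha`; the abstract does not state the paper's exact parametrisation, so `alpha` and the toy densities are assumptions.

```python
import numpy as np

def density_ratio(p, q):
    """Plain density ratio p/q: unbounded wherever q is close to zero."""
    return p / q

def relative_density_ratio(p, q, alpha=0.5):
    """Ratio of p to the mixture alpha*p + (1-alpha)*q (alpha is assumed)."""
    return p / (alpha * p + (1.0 - alpha) * q)

# Toy density values at four points; q nearly vanishes at the last two.
p = np.array([0.5, 0.3, 0.2, 0.1])
q = np.array([0.5, 0.3, 1e-6, 1e-12])
alpha = 0.5

plain = density_ratio(p, q)
relative = relative_density_ratio(p, q, alpha)
```

Since the mixture in the denominator is at least `alpha * p`, the relative ratio can never exceed `1 / alpha`, whereas the plain ratio grows without bound as `q` approaches zero.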

106. ❌ GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

Authors: Yuwen Zhai, Runze Li, Liang Wang, Nian Shi, Liwu Xu, Wei Zhang, Ran Lin, Bo Xu, Benlei Cui Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04399v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

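Scoring rationale: The paper focuses on developing GUIDE, an evaluation framework for GUI agents, and is unrelated to most of the large-model keywords. It is highly relevant (10 points) only to 'LLM Agents OR Autonomous Agents OR Agentic Workflow', because its core is evaluating GUI agents, which falls under agentic workflows. It has some connection (8 points) to 'Mechanistic Interpretability OR Explainable AI', because GUIDE emphasizes interpretable diagnosis and error analysis. No other keyword is addressed, so each scores 0.

!!! tip deepseek-chat TL;DR

To address inaccurate and uninterpretable evaluation of GUI agents caused by long, visually grounded, open-ended trajectories, the paper proposes the GUIDE framework, whose hierarchical diagnosis substantially improves evaluation accuracy and produces structured diagnostic reports.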

摘要翻译

评估图形用户界面（GUI）智能体面临独特挑战：任务轨迹长、依赖视觉信息且目标开放，但评估必须兼具准确性与可解释性。现有方法通常对整个动作-观察序列进行单一整体判断——这种策略在长周期任务中可靠性不足，且仅能提供二元结论，无法揭示智能体失败的具体环节及原因。这种不透明性限制了评估作为智能体开发诊断工具的有效性。我们提出GUIDE（GUI理解与可解释诊断评估）框架，该框架将轨迹评估分解为三个连续阶段，对应GUI任务的组合结构：轨迹分割将完整轨迹划分为语义连贯的子任务单元；子任务诊断在上下文中评估每个单元，给出完成度判定并生成包含修正建议的结构化错误分析；总体汇总将各子任务诊断聚合成任务级评判。通过在有界的子任务片段而非完整轨迹上操作，GUIDE缓解了因任务复杂度增加导致现有评估器性能下降的上下文过载问题。我们在三个基准测试中验证GUIDE：包含932条轨迹的工业级电子商务数据集、涵盖五项网页智能体任务共1302条轨迹的AGENTREWARDBENCH，以及用于移动设备控制的AndroidBench。在所有实验设置中，GUIDE显著优于现有评估器——比最强基线准确率最高提升5.35个百分点——同时生成可直接指导智能体改进的结构化诊断报告。

摘要 (Abstract)

Evaluating GUI agents presents a distinct challenge: trajectories are long, visually grounded, and open-ended, yet evaluation must be both accurate and interpretable. Existing approaches typically apply a single holistic judgment over the entire action-observation sequence, a strategy that proves unreliable on long-horizon tasks and yields binary verdicts offering no insight into where or why an agent fails. This opacity limits the utility of evaluation as a diagnostic tool for agent development. We introduce GUIDE (GUI Understanding and Interpretable Diagnostic Evaluation), a framework that decomposes trajectory assessment into three sequential stages mirroring the compositional structure of GUI tasks. Trajectory Segmentation partitions the full trace into semantically coherent subtask units. Subtask Diagnosis evaluates each unit in context, assigning a completion verdict and generating a structured error analysis with corrective recommendations. Overall Summary aggregates per-subtask diagnoses into a task-level judgment. By operating on bounded subtask segments rather than full trajectories, GUIDE mitigates the context overload that degrades existing evaluators as task complexity grows. We validate GUIDE on three benchmarks: an industrial e-commerce dataset of 932 trajectories, AGENTREWARDBENCH spanning five web agent tasks with 1302 trajectories, and AndroidBench for mobile device control. Across all settings, GUIDE substantially outperforms existing evaluators, achieving up to 5.35 percentage points higher accuracy than the strongest baseline, while producing structured diagnostic reports that directly inform agent improvement.

Keywords: GUI agents, evaluation framework, interpretable diagnosis, trajectory segmentation, subtask diagnosis, error analysis, agent development, hierarchical assessment
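The three stages described in the abstract above (segmentation, per-subtask diagnosis, overall summary) suggest a simple pipeline shape, sketched below. Everything here is hypothetical: the `SubtaskDiagnosis` fields, the toy segmenter and diagnoser, and in particular the aggregation rule (a task succeeds only if every subtask is completed), which the abstract does not specify. In the framework itself these stages would presumably be model-driven.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class SubtaskDiagnosis:
    subtask: List[str]     # steps belonging to this subtask
    completed: bool        # completion verdict for the subtask
    error_analysis: str    # empty when no error was found

def evaluate_trajectory(
    steps: Sequence[str],
    segment: Callable[[Sequence[str]], List[List[str]]],
    diagnose: Callable[[List[str]], SubtaskDiagnosis],
) -> Tuple[bool, List[SubtaskDiagnosis]]:
    segments = segment(steps)                          # stage 1: segmentation
    diagnoses = [diagnose(seg) for seg in segments]    # stage 2: diagnosis
    task_success = all(d.completed for d in diagnoses) # stage 3: assumed rule
    return task_success, diagnoses

# Toy stand-ins for the segmenter and diagnoser.
def toy_segment(steps):
    return [list(steps[i:i + 2]) for i in range(0, len(steps), 2)]

def toy_diagnose(seg):
    failed = [s for s in seg if "error" in s]
    return SubtaskDiagnosis(seg, not failed, "; ".join(failed))

steps = ["open cart", "click checkout", "enter address", "error: payment declined"]
task_success, report = evaluate_trajectory(steps, toy_segment, toy_diagnose)
```

Passing the two stages in as callables keeps each judgment confined to one bounded segment, which mirrors the abstract's point about avoiding context overload on long trajectories.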

107. ❌ Gradual Cognitive Externalization: A Framework for Understanding How Ambient Intelligence Externalizes Human Cognition

Authors: Zhimin Zhao Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04387v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

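Scoring rationale: The paper proposes a theoretical framework (Gradual Cognitive Externalization) to explain how human cognitive functions migrate into digital substrates through co-adaptation with ambient intelligence. It sits at the intersection of cognitive science, philosophy of AI, and human-computer interaction. It discusses AI agent skills, the behavioral manifold hypothesis, and criteria for cognitive integration, but does not specifically address the technical principles of large models, deep-learning innovations, or particular AI techniques (such as LLMs, MoE, or RLHF). All scoring keywords relate directly to large-model techniques, training methods, inference techniques, optimization techniques, or specific scientific applications, whereas this paper is a theoretical framework study that involves none of them, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Gradual Cognitive Externalization (GCE) framework, which explains how human cognitive functions migrate into digital substrates through ambient intelligence co-adaptation rather than mind uploading, and provides theoretical criteria, testable predictions, and an experimental protocol.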

摘要翻译

开发者正在发布能够复制同事沟通风格、编码主管指导经验法则或在生物性死亡后保留个人行为模式的AI智能体技能。为解释这一现象，我们提出渐进式认知外化理论框架，该框架认为人类认知功能正通过环境智能的协同适应而非意识上传的方式向数字载体迁移。GCE建立在行为流形假说之上：日常认知活动占据着一个低维、结构化、具有冗余性且可通过持续观测习得的行为流形。我们通过调度助手、写作工具、推荐引擎及智能体技能生态系统的实证证据，表明认知外化的前置条件已然显现。本研究通过形式化区分认知整合与工具使用的三项标准（双向适应性、功能等价性、因果耦合性），推导出五个具有理论约束阈值的可检验预测，并提供具体实验方案。核心问题已不再是意识能否上传，而在于认知功能向数字载体迁移的速度及其社会影响。

摘要 (Abstract)

Developers are publishing AI agent skills that replicate a colleague’s communication style, encode a supervisor’s mentoring heuristics, or preserve a person’s behavioral repertoire beyond biological death. To explain why, we propose Gradual Cognitive Externalization (GCE), a framework arguing that human cognitive functions are migrating into digital substrates through ambient intelligence co-adaptation rather than mind uploading. GCE rests on the behavioral manifold hypothesis: everyday cognition occupies a low-dimensional manifold that is structured, redundant, and learnable from sustained observation. We document evidence from scheduling assistants, writing tools, recommendation engines, and agent skill ecosystems showing that the preconditions for externalization are already observable. We formalize three criteria separating cognitive integration from tool use (bidirectional adaptation, functional equivalence, causal coupling), derive five testable predictions with theory-constrained thresholds, and provide a concrete experimental protocol. The question is no longer whether minds can be uploaded, but how fast cognitive functions are already migrating into digital substrates and what follows.

Keywords: Gradual Cognitive Externalization, ambient intelligence, cognitive functions, digital substrates, behavioral manifold, AI agent skills, cognitive integration, human-computer interaction

108. ❌ Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

Authors: Jiayu Fu, Mourad Heddaya, Chenhao Tan Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04386v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

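Scoring rationale: The paper's core contribution is a pipeline that automatically generates math benchmarks targeting the weaknesses of LLMs and evaluates their mathematical ability. It is therefore highly relevant (10 points) only to 'Large Language Models OR LLMs OR Foundation Models', since it directly studies the evaluation and performance analysis of LLMs. The other keywords cover specific techniques (such as MoE, quantization, inference acceleration), training methods (such as pre-training, fine-tuning, alignment), application paradigms (such as RAG, agents), or specific domains (such as AI for science), none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an automatic math benchmark generation pipeline based on hypothesis-driven error analysis that identifies LLM weaknesses and generates targeted hard problems. Experiments show that problems generated from the most accurate hypotheses reduce Llama-3.3-70B-Instruct's accuracy to as low as 45%, compared with 77% on the original MATH benchmark, and the pipeline can be extended to other domains to evaluate LLM capabilities.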

摘要翻译

现有大量数学基准用于评估大语言模型的数学能力。然而，大多数基准构建需要大量人工投入且难以规模化扩展。因此，它们既无法跟上大语言模型的发展速度，也难以提供新实例以缓解过拟合问题。部分研究者提出了自动生成基准的方法，但少有研究专注于识别大语言模型容易出错的特定数学概念与技能，且现有方法大多只能生成特定类别的基准。为应对这些局限，我们提出了一种新的数学基准生成流程：该流程利用人工智能生成的假设来识别大语言模型存在困难的数学概念与技能，进而针对这些薄弱环节生成新的基准测试题目。实验表明，假设准确度与生成题目的难度呈正相关——基于最准确假设生成的题目，将Llama-3.3-70B-Instruct模型的准确率降至最低45%，而该模型在原始MATH基准上的准确率为77%。此外，我们的流程具备高度适应性，可扩展至数学领域之外，用于探索大语言模型在广泛领域的能力，这使其成为研究大语言模型跨领域表现的重要工具。

摘要 (Abstract)

Numerous math benchmarks exist to evaluate LLMs’ mathematical capabilities. However, most involve extensive manual effort and are difficult to scale. Consequently, they cannot keep pace with LLM development or easily provide new instances to mitigate overfitting. Some researchers have proposed automatic benchmark generation methods, but few focus on identifying the specific math concepts and skills on which LLMs are error-prone, and most can only generate category-specific benchmarks. To address these limitations, we propose a new math benchmark generation pipeline that uses AI-generated hypotheses to identify the specific math concepts and skills that LLMs struggle with, and then generates new benchmark problems targeting these weaknesses. Experiments show that hypothesis accuracy positively correlates with the difficulty of the generated problems: problems generated from the most accurate hypotheses reduce Llama-3.3-70B-Instruct’s accuracy to as low as 45%, compared to 77% on the original MATH benchmark. Furthermore, our pipeline is highly adaptable and can be applied beyond math to explore a wide range of LLM capabilities, making it a valuable tool for investigating how LLMs perform across different domains.

Keywords: LLM evaluation, math benchmark generation, hypothesis-driven error analysis, automatic benchmark creation, model weakness identification, domain adaptation, AI-generated hypotheses, mathematical reasoning

109. ❌ Compressible Softmax-Attended Language under Incompressible Attention

Authors: Wonsuk Lee Journal/Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04384v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

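Scoring rationale: The paper studies the intrinsic structure of attention in transformer language models (124M–7B parameters) and finds that softmax-attended language is highly compressible, a property of the data rather than of the analysis frame. It bears directly on large language models (LLMs) and mechanistic interpretability, so these two keywords receive a relevance of 10. The remaining keywords, such as MoE, quantization, inference acceleration, and alignment, are not addressed in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper finds that, across transformer language models of several scales, softmax-attended language is highly compressible within attention heads: 90% of the variance is concentrated in 2–11 singular components, whereas the learned interaction matrix needs far more (38–75). This suggests that the interaction structure of language is more concentrated than the capacity the attention mechanism allocates.

Abstract (Chinese translation)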

在五个Transformer语言模型（参数量1.24亿至70亿，涵盖四种架构体系）的所有注意力头中，logit 能量场 $\tilde{E}$ 仅需2至11个奇异分量即可解释其90%的方差。而经过学习得到的交互矩阵 $W_Q^{\mathrm{T}} W_K$ 在头维度 $d_h \in \{64, 128\}$ 中需要38至75个分量才能达到相同阈值。两者的有效秩谱隙达到5至25倍。注意力机制将容量均匀分配于所有 $d_h$ 个维度，但语言的实际交互却集中于少数维度。经softmax处理后的语言数据的可压缩性是其自身特性，而非分析框架所决定。

Abstract

Across every attention head in five transformer language models (124M–7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90% of its variance in 2–11 singular components. The *learned* interaction matrix $W_Q^{\mathrm{T}} W_K$ needs 38–75 components for the same threshold out of $d_h \in \{64, 128\}$. The spectral gap is $5$–$25\times$ in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.

Keywords: Transformer language models, attention mechanism, softmax-attended language, compressibility, singular components, logit energy field, spectral gap, interaction matrix
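
The "90% of variance in k singular components" measurement in the abstract can be made concrete with a small sketch. It assumes that "variance" means the share of squared singular values, which is a common convention but is not confirmed by the abstract. The function name `components_for_variance` and the toy spectra are hypothetical and are not taken from the paper.

```python
def components_for_variance(singular_values, threshold=0.9):
    """Return the smallest k such that the k largest singular values
    hold at least `threshold` of the total squared energy.

    Assumption: variance share is measured by squared singular values.
    """
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0:
        return 0
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running >= threshold * total:
            return k
    return len(energies)


# Toy spectra with 64 values each (made-up numbers, not results from the paper).
concentrated = [10.0, 6.0, 3.0, 1.0] + [0.1] * 60  # a few dominant components
flat = [1.0] * 64                                   # energy spread evenly

k_concentrated = components_for_variance(concentrated)
k_flat = components_for_variance(flat)
print(k_concentrated, k_flat)
```

A spectrum dominated by a few components crosses the threshold after a handful of terms, while a flat spectrum needs most of its components. That is the kind of contrast the abstract reports between $\tilde{E}$ and $W_Q^{\mathrm{T}} W_K$.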

110. ❌ Towards Considerate Human-Robot Coexistence: A Dual-Space Framework of Robot Design and Human Perception in Healthcare

Authors: Yuanchen Bai, Zijian Ding, Ruixiang Han, Niti Parikh, Wendy Ju, Angelique Taylor | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04374v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

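Scoring rationale: The paper studies a framework for human-robot coexistence, focusing on healthcare robot design, human perception, and their co-evolution; it belongs to human-robot interaction (HRI). All scoring keywords concern large models, the technical principles of deep learning, or AI applications in science, none of which the paper touches; it discusses only robot design, human perception, and social integration. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

Drawing on a healthcare robot co-design study, the work proposes a dual-space framework for considerate human-robot coexistence and shows the active role humans play as interpreters and mediators in how robots are understood and integrated.

Abstract (Chinese translation)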

机器人技术的快速发展——包括能力拓展、交互方式更趋直观、以及更深度融入现实工作流程——正在重塑人机共存的含义。这种共存不仅限于物理空间共享，其日益显著的特征在于组织嵌入性、时间演化性、社会情境性以及开放不确定性。然而，既往研究多聚焦于态度与接受度的静态快照，对人类认知如何形成与演变、以及人类在将共存塑造为动态过程中的能动作用缺乏深入探讨。我们通过对一项为期14周的医疗机器人协同设计研究中九位参与者进行的深度追踪访谈，弥补了上述研究空白。我们识别出人类认知空间，其中包含四个解释维度（即解构程度、时间导向、推理范围与证据来源）。通过将人类认知空间与机器人设计空间之间的相互关系概念化为一个协同演化循环，我们丰富了人机共存的概念框架——在这一循环中，人类需求、设计决策、情境化解读与社会中介随时间推移持续重塑彼此。基于此，我们提出“体贴式人机共存”理念，主张人类不仅是设计贡献者，更是解释者与中介者，在机器人部署的各阶段主动塑造其被理解与融入社会的方式。

Abstract

The rapid advancement of robotics, spanning expanded capabilities, more intuitive interaction, and more integration into real-world workflows, is reshaping what it means for humans and robots to coexist. Beyond sharing physical space, this coexistence is increasingly characterized by organizational embeddedness, temporal evolution, social situatedness, and open-ended uncertainty. However, prior work has largely focused on static snapshots of attitudes and acceptance, offering limited insight into how perceptions form and evolve, and what active role humans play in shaping coexistence as a dynamic process. We address these gaps through in-depth follow-up interviews with nine participants from a 14-week co-design study on healthcare robots. We identify the human perception space, including four interpretive dimensions (i.e., degree of decomposition, temporal orientation, scope of reasoning, and source of evidence). We enrich the conceptual framework of human-robot coexistence by conceptualizing the mutual relationship between the human perception space and the robot design space as a co-evolving loop, in which human needs, design decisions, situated interpretations, and social mediation continuously reshape one another over time. Building on this, we propose considerate human-robot coexistence, arguing that humans act not only as design contributors but also as interpreters and mediators who actively shape how robots are understood and integrated across deployment stages.

Keywords: human-robot coexistence, robot design, human perception, healthcare robots, co-design, interpretive dimensions, social mediation, dynamic process

111. ❌ Decocted Experience Improves Test-Time Inference in LLM Agents

Authors: Maohao Shen, Kaiwen Zha, Zexue He, Zhang-Wei Hong, Siru Ouyang, J. Jon Ryu, Prasanna Sattigeri, Suhas Diggavi, Gregory Wornell | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04373v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

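Scoring rationale: The paper's core question is how LLM agents can improve test-time performance by constructing better context (experience) without updating model parameters. Highly relevant keywords: LLM Agents (the core object of study), Large Language Models (the underlying models), Retrieval-Augmented Generation (context construction involves retrieving experience), Chain of Thought (validated on reasoning tasks), Self-Correction (distilling experience involves self-improvement), and In-context Learning (performance is improved through context). Moderately relevant: Context Window Extension (context construction may involve length), System 2 Thinking (involves in-depth reasoning), and Tool Use (agents may use tools). Other keywords, such as MoE, SLMs, training methods, and compression or acceleration, are not addressed.

!!! tip deepseek-chat TL;DR

The paper studies how to improve the test-time inference performance of LLM agents by constructing context from decocted (distilled) experience, without updating model parameters, and validates the approach on math reasoning, web browsing, and software engineering tasks.

Abstract (Chinese translation)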

当前，在不更新模型参数的前提下提升大语言模型性能的研究日益受到关注。一个成熟的方向是测试时扩展，即通过增加推理时的计算量（例如延长推理链、采样或搜索）来提升表现。然而，对于复杂的推理和智能体任务，简单地扩展测试时计算会显著增加成本，并可能导致资源浪费在次优的探索上。本文探索将“上下文”作为提升大语言模型性能的补充扩展维度，并系统研究如何通过“经验”构建能更好引导推理的输入。我们发现，有效的上下文构建关键依赖于“精炼经验”。我们对经验增强型智能体进行了详细分析，研究了如何从经验中推导上下文、性能如何随经验积累而扩展、优质上下文的特征，以及哪些数据结构能最好地支持上下文构建。我们指出“精炼经验”是有效构建上下文的核心机制：从经验中提取精华，将其组织成连贯的整体，并检索关键信息以构建有效的上下文。我们在数学推理、网页浏览和软件工程等推理与智能体任务中验证了上述发现。

Abstract

There is growing interest in improving LLMs without updating model parameters. One well-established direction is test-time scaling, where increased inference-time computation (e.g., longer reasoning, sampling, or search) is used to improve performance. However, for complex reasoning and agentic tasks, naively scaling test-time compute can substantially increase cost and still lead to wasted budget on suboptimal exploration. In this paper, we explore *context* as a complementary scaling axis for improving LLM performance, and systematically study how to construct better inputs that guide reasoning through *experience*. We show that effective context construction critically depends on *decocted experience*. We present a detailed analysis of experience-augmented agents, studying how to derive context from experience, how performance scales with accumulated experience, what characterizes good context, and which data structures best support context construction. We identify *decocted experience* as a key mechanism for effective context construction: extracting essence from experience, organizing it coherently, and retrieving salient information to build effective context. We validate our findings across reasoning and agentic tasks, including math reasoning, web browsing, and software engineering.

Keywords: LLM Agents, Test-time Inference, Context Construction, Decocted Experience, Reasoning Tasks, Experience Augmentation, Agentic Tasks, In-context Learning

112. ❌ Context is All You Need

Authors: Jean Erik Delanois, Shruti Joshi, Ryan Golden, Teresa Nick, Maxim Bazhenov | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04364v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

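Scoring rationale: The paper proposes CONTXT, a contextual adaptation method that improves model performance under domain generalization and test-time adaptation. It relates to 'Large Language Models' because the abstract reports that the method works on generative models such as LLMs, although LLMs are not the core focus. It relates to 'Pre-training OR Continual Pre-training OR Domain Adaptation' because it deals with domain adaptation and generalization, although the emphasis is on test-time adaptation rather than pre-training. It relates to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' because CONTXT is a lightweight, retraining-free method that adjusts the model through feature transforms, which is similar in spirit to parameter-efficient fine-tuning but does not use PEFT techniques directly. The other keywords are unrelated: the paper does not cover MoE, SLMs, alignment, reasoning, agents, or compression, nor does it target AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes CONTXT, a lightweight contextual adaptation method that modulates internal representations with simple additive and multiplicative feature transforms to address domain generalization and test-time adaptation. It improves robustness on both discriminative tasks and generative models without retraining.

Abstract (Chinese translation)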

人工神经网络（ANNs）正日益广泛地部署于各种现实场景中，这些场景要求模型必须在与训练数据分布不同的条件下运行。这一挑战是领域泛化（Domain Generalization, DG）和测试时适应（Test-Time Adaptation, TTA）研究的核心问题：领域泛化旨在训练模型在没有目标数据的情况下泛化至未见领域，而测试时适应则通过在部署时适应未标注的测试数据来提升模型鲁棒性。现有应对这些挑战的方法通常复杂、资源密集且难以扩展。本文提出CONTXT（面向神经特征变换的上下文增强），一种简单直观的上下文适应方法。CONTXT通过简单的加性与乘性特征变换来调节内部表征。在测试时适应框架下，该方法在判别性任务（如ANN/CNN分类）与生成模型（如LLMs）中均能带来一致的性能提升。该方法轻量、易于集成且开销极小，能够在领域偏移下实现鲁棒性能而无需增加复杂度。更广泛而言，CONTXT提供了一种无需重新训练即可引导信息流与神经处理的紧凑方式。

Abstract

Artificial Neural Networks (ANNs) are increasingly deployed across diverse real-world settings, where they must operate under data distributions that differ from those seen during training. This challenge is central to Domain Generalization (DG), which trains models to generalize to unseen domains without target data, and Test-Time Adaptation (TTA), which improves robustness by adapting to unlabeled test data at deployment. Existing approaches to address these challenges are often complex, resource-intensive, and difficult to scale. We introduce CONTXT (Contextual augmentatiOn for Neural feaTure X Transforms), a simple and intuitive method for contextual adaptation. CONTXT modulates internal representations using simple additive and multiplicative feature transforms. Within a TTA setting, it yields consistent gains across discriminative tasks (e.g., ANN/CNN classification) and generative models (e.g., LLMs). The method is lightweight, easy to integrate, and incurs minimal overhead, enabling robust performance under domain shift without added complexity. More broadly, CONTXT provides a compact way to steer information flow and neural processing without retraining.

Keywords: Domain Generalization, Test-Time Adaptation, Contextual Adaptation, Feature Transforms, Neural Networks, Robustness, Lightweight Method, No Retraining
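
The abstract describes CONTXT only as "simple additive and multiplicative feature transforms" applied to internal representations. The sketch below shows one plausible reading, an elementwise scale-and-shift of a hidden feature vector. How the paper derives the scale and shift from context is not stated in the abstract, so `modulate`, `scale`, and `shift` are hypothetical names and the values are made up.

```python
def modulate(features, scale, shift):
    """Apply an elementwise multiplicative, then additive, transform.

    Assumption: the transform has the form scale * h + shift per feature;
    the abstract does not give the exact form used by CONTXT.
    """
    if not (len(features) == len(scale) == len(shift)):
        raise ValueError("features, scale and shift must have equal length")
    return [g * h + b for h, g, b in zip(features, scale, shift)]


hidden = [0.5, -1.0, 2.0]

# Identity parameters leave the representation untouched.
identity = modulate(hidden, scale=[1.0, 1.0, 1.0], shift=[0.0, 0.0, 0.0])

# Hypothetical context-dependent parameters steer individual features.
adapted = modulate(hidden, scale=[2.0, 1.0, 0.5], shift=[0.0, 0.5, -1.0])
print(identity, adapted)
```

Because the base weights are never touched, a transform of this shape is consistent with the abstract's claim that the method works without retraining.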

113. ❌ Integer-Only Operations on Extreme Learning Machine Test Time Classification

Authors: Emerson Lopes Machadoa, Cristiano Jacques Miosso, Ricardo Pezzuol Jacobi | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04363v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

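Scoring rationale: The paper studies integer-only test-time operations for extreme learning machines (ELMs) to reduce computational cost, which falls under the optimization of traditional machine learning models. All scoring keywords revolve around large language models (LLMs) and related techniques (e.g., MoE, RLHF, RAG); the paper involves no large models, deep learning, or AI-for-science content and does not mention LLM principles or applications. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an integer-only test-time classification method for extreme learning machines (ELMs). By using ternary input weights and integer output weights, it reduces computational cost with only a limited loss of classification accuracy, which makes it suitable for FPGA and embedded applications.

Abstract (Chinese translation)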

本文针对基于极限学习机（ELM）的网络分类器，提出了一套降低测试阶段计算成本的新技术，并进行了理论分析与实证评估。通过探究从这些模型中推导出的若干特性，我们证明了在测试阶段进行分类时可以仅使用整数运算，且不会降低分类精度。我们的贡献如下：（i）我们提供了实证证据，表明输入权重值可以从三元集合中抽取，而对分类精度的影响有限。这在计算上具有无需乘法运算的优势；（ii）我们证明了归一化与非归一化测试信号的分类精度相同；（iii）我们展示了如何生成输出权重的整数版本，从而在分类精度损失有限的前提下实现运算简化。我们在文献中常用的5个计算机视觉数据集上测试了所提技术，结果表明这些技术能够降低现场可编程门阵列（FPGA）在测试阶段进行分类所需的计算成本。这对于功耗受限的嵌入式应用具有重要意义，对于功耗成本高昂的大型企业数据中心而言也至关重要。

Abstract

We present a theoretical analysis and empirical evaluations of a novel set of techniques for computational cost reduction of test time operations of network classifiers based on extreme learning machine (ELM). By exploring some characteristics we derived from these models, we show that the classification at test time can be performed using solely integer operations without compromising the classification accuracy. Our contributions are as follows: (i) We show empirical evidence that the input weights values can be drawn from the ternary set with limited reduction of the classification accuracy. This has the computational advantage of dismissing multiplications; (ii) We prove the classification accuracy of normalized and non-normalized test signals are the same; (iii) We show how to create an integer version of the output weights that results in a limited reduction of the classification accuracy. We tested our techniques on 5 computer vision datasets commonly used in the literature and the results indicate that our techniques can allow the reduction of the computational cost of the operations necessary for the classification at test time in FPGAs. This is important in embedded applications, where power consumption is limited, and crucial in data centers of large corporations, where power consumption is expensive.

Keywords: Extreme Learning Machine, Integer Operations, Test Time Classification, Computational Cost Reduction, Ternary Weights, FPGA, Embedded Applications, Computer Vision Datasets
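
A minimal sketch of the idea behind contributions (i) to (iii) in the abstract: with ternary input weights, the hidden layer needs only additions and subtractions, and with integer output weights the whole forward pass stays in integer arithmetic. The ReLU activation, the absence of biases, and all weight values are assumptions made for illustration; the abstract does not specify them.

```python
def ternary_hidden(x, ternary_weights):
    """Hidden activations computed without multiplications.

    Each row of `ternary_weights` holds values in {-1, 0, +1}.
    Assumption: ReLU activation and no bias term (not from the paper).
    """
    hidden = []
    for row in ternary_weights:
        acc = 0
        for xi, w in zip(x, row):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing
        hidden.append(max(acc, 0))
    return hidden


def classify(x, ternary_weights, int_output_weights):
    """Return the index of the class with the largest integer score."""
    h = ternary_hidden(x, ternary_weights)
    scores = [sum(hj * bj for hj, bj in zip(h, row)) for row in int_output_weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up integer parameters: 3 inputs, 3 hidden units, 2 classes.
w_in = [[1, -1, 0], [0, 1, 1], [-1, 0, 1]]
w_out = [[3, -1, 2], [-2, 2, 1]]

x = [4, 1, 2]
x_scaled = [5 * v for v in x]  # same signal, different positive scale

label = classify(x, w_in, w_out)
label_scaled = classify(x_scaled, w_in, w_out)
print(label, label_scaled)
```

Under these assumptions a positive rescaling of the input rescales every class score by the same factor, so the predicted class does not change. This mirrors the spirit of claim (ii), although the conditions of the paper's own proof are not given in the abstract.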

114. ❌ GROUNDEDKG-RAG: Grounded Knowledge Graph Index for Long-document Question Answering

Authors: Tianyi Zhang, Andreas Marfurt | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04359v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

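Scoring rationale: The paper's core topic is RAG for long-document question answering, so it is highly relevant to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (10). It uses LLMs as the underlying models, which relates to 'Large Language Models OR LLMs OR Foundation Models' (8). It aims to reduce hallucinations by grounding a knowledge graph in the source text, which relates to 'Hallucination Mitigation OR Factuality OR Truthfulness' (8). It handles long documents, which gives some connection to 'Context Window Extension OR Long Context LLMs' (5). The interpretability of the knowledge graph relates to 'Mechanistic Interpretability OR Explainable AI' (5). Other keywords, such as MoE, SLMs, training methods, reasoning techniques, and agent systems, are not addressed and receive 0.

!!! tip deepseek-chat TL;DR

To address the high resource consumption, repetitive content, and hallucinations of RAG systems for long-document question answering, the paper proposes GroundedKG-RAG, which grounds a knowledge graph in the source document. It performs on par with a state-of-the-art long-context model at lower cost and improves interpretability.

Abstract (Chinese translation)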

检索增强生成系统因其能在减少所需输入上下文长度的同时提升生成质量，已被广泛应用于当代大语言模型中。本研究聚焦于面向长文档问答的检索增强生成系统。现有方法存在以下问题：过度依赖大语言模型描述导致资源消耗与延迟较高、层级间内容重复、以及因缺乏或有限基于源文本的 grounding 而出现幻觉。为通过 grounding 同时提升效率与事实准确性，我们提出 GroundedKG-RAG 系统，其知识图谱明确从源文档提取并 grounded 于源文档。具体而言，我们将 GroundedKG 中的节点定义为实体与动作，边定义为时序或语义关系，每个节点和边均 grounded 于原始句子。我们基于语义角色标注和抽象意义表示解析构建 GroundedKG，随后进行嵌入以用于检索。在查询时，我们对查询进行相同转换，并从 grounded 的源文本中检索最相关的句子进行问答。我们在 NarrativeQA 数据集样本上评估 GroundedKG-RAG，发现其性能与最先进的专有长上下文模型相当且成本更低，同时优于竞争基线方法。此外，我们的 GroundedKG 具备可解释性且人类可读，便于结果审计与错误分析。

Abstract

Retrieval-augmented generation (RAG) systems have been widely adopted in contemporary large language models (LLMs) due to their ability to improve generation quality while reducing the required input context length. In this work, we focus on RAG systems for long-document question answering. Current approaches suffer from a heavy reliance on LLM descriptions resulting in high resource consumption and latency, repetitive content across hierarchical levels, and hallucinations due to no or limited grounding in the source text. To improve both efficiency and factual accuracy through grounding, we propose GroundedKG-RAG, a RAG system in which the knowledge graph is explicitly extracted from and grounded in the source document. Specifically, we define nodes in GroundedKG as entities and actions, and edges as temporal or semantic relations, with each node and edge grounded in the original sentences. We construct GroundedKG from semantic role labeling (SRL) and abstract meaning representation (AMR) parses and then embed it for retrieval. During querying, we apply the same transformation to the query and retrieve the most relevant sentences from the grounded source text for question answering. We evaluate GroundedKG-RAG on examples from the NarrativeQA dataset and find that it performs on par with a state-of-the-art proprietary long-context model at smaller cost and outperforms a competitive baseline. Additionally, our GroundedKG is interpretable and readable by humans, facilitating auditing of results and error analysis.

Keywords: Retrieval-augmented generation, RAG, knowledge graph, long-document question answering, hallucination mitigation, semantic role labeling, abstract meaning representation, grounded retrieval
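
The graph structure described in the abstract (entity and action nodes, temporal or semantic edges, each grounded in source sentences) can be sketched as a small data structure. This is an illustration only: the class and function names are hypothetical, the graph is written by hand rather than built from SRL or AMR parses, and plain token matching stands in for the embedding-based retrieval the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # entity or action text
    kind: str                       # "entity" or "action"
    sentence_ids: set = field(default_factory=set)  # grounding in the source


@dataclass
class Edge:
    source: str                     # label of the source node
    target: str                     # label of the target node
    relation: str                   # "temporal" or "semantic"
    sentence_ids: set = field(default_factory=set)


def retrieve(query, nodes, edges, sentences, top_k=1):
    """Return the source sentences that ground the most query matches.

    Token overlap replaces embedding similarity here (a simplification).
    """
    tokens = set(query.lower().split())
    votes = {}
    for node in nodes:
        if node.label.lower() in tokens:
            for sid in node.sentence_ids:
                votes[sid] = votes.get(sid, 0) + 1
    for edge in edges:
        if edge.source.lower() in tokens and edge.target.lower() in tokens:
            for sid in edge.sentence_ids:
                votes[sid] = votes.get(sid, 0) + 1
    ranked = sorted(votes, key=lambda sid: (-votes[sid], sid))
    return [sentences[sid] for sid in ranked[:top_k]]


sentences = ["Alice opened the door.", "Bob left the house.", "Alice greeted Bob."]
nodes = [
    Node("Alice", "entity", {0, 2}),
    Node("Bob", "entity", {1, 2}),
    Node("opened", "action", {0}),
    Node("greeted", "action", {2}),
]
edges = [Edge("greeted", "Bob", "semantic", {2})]

hits = retrieve("who greeted bob", nodes, edges, sentences)
print(hits)
```

Because every node and edge carries the ids of the sentences it came from, the retriever returns original text rather than generated descriptions, which is what makes results auditable in the sense the abstract describes.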

115. ❌ REAM: Merging Improves Pruning of Experts in LLMs

Authors: Saurav Jha, Maryam Hashemzadeh, Ali Saheb Pasand, Ali Parviz, Min-Joong Lee, Boris Knyazev | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04356v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

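Scoring rationale: The paper's core contribution is REAM, a compression method for MoE LLMs that reduces memory requirements by merging expert weights rather than pruning experts, so it is highly relevant to 'Mixture of Experts', 'Large Language Models', and 'Model Merging' (10 each). 'Quantization' appears as a comparison baseline (8). 'Small Language Models' has some connection through the focus on deployment memory (5). Other keywords, such as training methods, reasoning techniques, alignment, and scientific applications, are not addressed (0).

!!! tip deepseek-chat TL;DR

To address the memory challenges of deploying MoE large language models, the paper proposes REAM, a new method that merges expert weights. Compared with pruning experts, it better preserves model performance, and its effectiveness is validated on multiple benchmarks.

Abstract (Chinese translation)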

专家混合（Mixture-of-Experts, MoE）大语言模型（Large Language Models, LLMs）是目前性能最优的架构之一。最大的模型通常具有数千亿参数，其部署面临显著的内存挑战。传统降低内存需求的方法包括权重剪枝和量化。受基于路由器加权的专家激活剪枝（Router-weighted Expert Activation Pruning, REAP）方法的启发，我们提出了一种新方法——基于路由器加权的专家激活合并（Router-weighted Expert Activation Merging, REAM）。与直接移除专家不同，REAM将专家分组并合并其权重，从而更好地保留原始模型性能。我们在多种MoE大语言模型上，通过多样化的多项选择题（Multiple-Choice, MC）问答和生成式（Generative, GEN）基准测试，将REAM与REAP及其他基线方法进行了比较。结果表明，MC与GEN性能之间存在一种权衡，该权衡取决于校准数据的构成。通过控制通用数据、数学数据和代码数据的混合比例，我们研究了这一权衡的帕累托前沿，并证明REAM通常优于基线方法，且在多数情况下性能接近原始未压缩模型。

Abstract

Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reduce memory requirements include weight pruning and quantization. Motivated by the Router-weighted Expert Activation Pruning (REAP) that prunes experts, we propose a novel method, Router-weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple-choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade-off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math and coding data, we examine the Pareto frontier of this trade-off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.

Keywords: Mixture-of-Experts, large language models, model compression, expert merging, memory efficiency, pruning, parameter reduction, performance preservation
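
The abstract says only that REAM "groups them and merges their weights"; it gives neither the grouping criterion nor the merge rule. The sketch below illustrates the general idea of merging instead of pruning with a saliency-weighted average, where the saliency scores are assumed to come from router statistics. That choice, the name `merge_experts`, and the numbers are assumptions, not the paper's method.

```python
def merge_experts(expert_weights, saliency):
    """Merge a group of experts into one by saliency-weighted averaging.

    Assumption: each expert is a flat list of weights of equal length and
    `saliency` holds one non-negative score per expert (e.g. derived from
    router statistics). This is not the merge rule from the paper.
    """
    total = sum(saliency)
    if total <= 0:
        raise ValueError("saliency scores must sum to a positive value")
    merged = [0.0] * len(expert_weights[0])
    for weights, score in zip(expert_weights, saliency):
        for i, w in enumerate(weights):
            merged[i] += (score / total) * w
    return merged


# Two toy experts in one group; the first is used three times as heavily.
group = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
merged = merge_experts(group, saliency=[3.0, 1.0])
print(merged)
```

Pruning would simply drop the second expert, whereas the merged expert still carries a quarter of its weights. That is the intuition behind the abstract's claim that merging better preserves the original performance.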

116. ❌ Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection

Authors: Yang Li, Qiang Sheng, Zhengjia Wang, Yehan Yang, Danding Wang, Juan Cao | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04932v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

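Scoring rationale: The paper focuses on fine-grained detection of LLM-generated text, centering on an application of LLM technology itself, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). It does not involve the specific technical principles (e.g., MoE, quantization, inference acceleration), training methods (e.g., pre-training, fine-tuning, alignment), application paradigms (e.g., RAG, agents), or scientific-domain applications that the other keywords point to, so all other keywords receive 0.

!!! tip deepseek-chat TL;DR

To address the limits of existing LLM-generated text detectors on fine-grained classification, the paper proposes RACE, which models the dual roles of a text's creator and editor. It achieves more precise detection in a four-class setting and outperforms existing baselines.

Abstract (Chinese translation)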

大型语言模型（LLM）的滥用问题要求对合成文本进行精准检测。现有研究主要遵循二元或三元分类框架，至多只能区分纯人类/LLM文本或协作生成的文本。这对于精细化的监管而言仍显不足，因为经LLM润色的人类文本与经人工人性化处理的LLM文本往往引发不同的政策后果。本文在严格的四分类框架下探索细粒度LLM生成文本检测。为应对此类复杂情况，我们提出RACE（面向创作者-编辑者建模的修辞结构分析），这是一种通过刻画创作者与编辑者独特特征来实现细粒度检测的方法。具体而言，RACE运用修辞结构理论构建表征创作者逻辑基础的修辞图，同时提取基本语篇单元层面的特征以捕捉编辑者的风格。实验表明，RACE在识别细粒度文本类型时优于12个基线模型，且误报率较低，为LLM监管提供了与政策需求对齐的解决方案。

Abstract

The misuse of large language models (LLMs) requires precise detection of synthetic text. Existing works mainly follow binary or ternary classification settings, which can only distinguish pure human/LLM text or collaborative text at best. This remains insufficient for the nuanced regulation, as the LLM-polished human text and humanized LLM text often trigger different policy consequences. In this paper, we explore fine-grained LLM-generated text detection under a rigorous four-class setting. To handle such complexities, we propose RACE (Rhetorical Analysis for Creator-Editor Modeling), a fine-grained detection method that characterizes the distinct signatures of creator and editor. Specifically, RACE utilizes Rhetorical Structure Theory to construct a logic graph for the creator’s foundation while extracting Elementary Discourse Unit-level features for the editor’s style. Experiments show that RACE outperforms 12 baselines in identifying fine-grained types with low false alarms, offering a policy-aligned solution for LLM regulation.

Keywords: LLM-generated text detection, fine-grained classification, synthetic text, Rhetorical Structure Theory, creator-editor modeling, policy-aligned solution, RACE method

117. ❌ TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Authors: Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen | Source: arXiv | Published: 2026-04-06 | arXiv link: http://arxiv.org/abs/2604.04921v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

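Scoring rationale: The paper's core contribution is a KV cache compression technique that addresses the memory bottleneck of long reasoning in LLMs, so it is highly relevant to 'KV Cache Compression OR Linear Attention OR FlashAttention' (15 points) and directly proposes the TriAttention method. It explicitly targets LLMs (10 points) and long-context reasoning (10 points), involves multi-step reasoning tasks (10 points), and raises inference throughput through compression (10 points). Other keywords such as MoE, SFT, and RAG are not addressed in the paper and receive 0 points.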
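The arithmetic behind the score table can be sketched as follows. This is a minimal illustration and not the report generator's code: it assumes that each keyword contributes weight × relevance, capped at the configured maximum of 15 per keyword, and that the pass mark follows the formula from the report configuration (5.0 + 0.8 × total weight). The function names are hypothetical.

```python
# Sketch of the report's scoring arithmetic (assumed rules, hypothetical names).
MAX_PER_KEYWORD = 15.0   # "max score per keyword" from the report configuration
TOTAL_WEIGHT = 27.0      # 27 keywords, each configured with weight 1.0


def pass_mark(total_weight: float) -> float:
    """Pass mark formula quoted in the report: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """Sum weight * relevance over keywords, capping each term (assumed rule)."""
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)


# Non-zero relevance values listed for TriAttention in the table above.
relevances = [10.0, 10.0, 15.0, 10.0, 10.0]

# Weights exactly as printed in the table (0.0 for every keyword).
as_reported = paper_score([(0.0, r) for r in relevances])
# Hypothetical comparison: the weight of 1.0 listed in the configuration.
with_config_weight = paper_score([(1.0, r) for r in relevances])

print(round(pass_mark(TOTAL_WEIGHT), 1), as_reported, with_config_weight)
# prints: 26.6 0.0 55.0
```

Under these assumptions the reported total of 0.0 follows directly from the 0.0 weights shown in the table, whatever the relevance values are.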

!!! tip deepseek-chat TL;DR

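To address the KV cache memory bottleneck in long-reasoning tasks for large language models, this paper proposes TriAttention, a compression method based on a trigonometric series and the concentration of Q/K vectors, which delivers 2.5x higher throughput or a 10.7x KV memory reduction while preserving reasoning accuracy.

Abstract (translated)

Extended reasoning in large language models (LLMs) creates a severe KV cache memory bottleneck. Leading KV cache compression methods usually estimate KV importance from the attention scores of recent post-RoPE queries. However, queries rotate with position under RoPE, so very few queries are representative, which leads to poor top-key selection and unstable reasoning. To avoid this problem, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and stay stable across positions, a phenomenon we call Q/K concentration. We show that this concentration makes queries preferentially attend to keys at specific distances (for example, the nearest keys), and that the centers determine which distances are preferred through a trigonometric series. Based on this, we propose TriAttention, which uses these centers to estimate key importance. Through the trigonometric series, we use the distance preference characterized by the centers to score keys according to their positions, and we use Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches the reasoning accuracy of Full Attention while achieving 2.5x higher throughput or a 10.7x KV memory reduction, whereas leading baselines reach only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory errors with Full Attention.

Abstract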

Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions – Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
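The link between fixed pre-RoPE centers and a trigonometric series in the query-key distance can be checked numerically. The sketch below is not the TriAttention implementation. It only illustrates the standard RoPE identity that the attention logit between a query at position m and a key at position n depends on m − n through a sum of cosines. The interleaved pairing of dimensions, the base of 10000, and the random stand-ins for the Q/K centers are all assumptions made for illustration.

```python
import numpy as np


def to_complex(x):
    """View consecutive pairs (x[0], x[1]), (x[2], x[3]), ... as complex numbers."""
    return x[0::2] + 1j * x[1::2]


def logit_explicit(q, k, m, n, thetas):
    """Rotate q to position m and k to position n, then take the real dot product."""
    q_rot = to_complex(q) * np.exp(1j * m * thetas)
    k_rot = to_complex(k) * np.exp(1j * n * thetas)
    return float(np.real(np.sum(q_rot * np.conj(k_rot))))


def logit_series(q, k, distance, thetas):
    """The same logit written as a trigonometric series in distance = m - n."""
    zq, zk = to_complex(q), to_complex(k)
    amplitude = np.abs(zq) * np.abs(zk)
    phase = np.angle(zq) - np.angle(zk)
    return float(np.sum(amplitude * np.cos(distance * thetas + phase)))


d = 8
thetas = 10000.0 ** (-np.arange(d // 2) * 2.0 / d)  # standard RoPE frequencies
rng = np.random.default_rng(0)
q_center = rng.normal(size=d) + 1.0  # hypothetical stand-in for a non-zero Q center
k_center = rng.normal(size=d) + 1.0  # hypothetical stand-in for a non-zero K center

# Preference of the centers for each query-key distance.
for dist in (0, 1, 4, 16):
    print(dist, round(logit_series(q_center, k_center, dist, thetas), 4))
```

Because the series depends only on the distance and the two centers, a key can be scored from its position alone, which is the property the abstract describes.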

Keywords: KV cache compression, long reasoning, LLMs, attention mechanism, memory bottleneck, throughput optimization, RoPE, trigonometric series

118. ❌ Synthetic Sandbox for Training Machine Learning Engineering Agents

Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, Hong Yan Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04872v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

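Scoring rationale: The paper studies LLM agents on machine learning engineering (MLE) tasks and proposes SandMLE, a multi-agent framework that generates synthetic environments so that agents can be trained efficiently. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since it explicitly uses LLM agents; to 'Post-training OR Supervised Fine-tuning OR SFT' (10 points), since it compares SandMLE against SFT baselines and notes that existing approaches retreat to SFT; to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points), since it focuses on training MLE agents; and to 'Multi-agent Systems OR Agent Coordination' (10 points), since SandMLE is described as a multi-agent framework. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, and CoT are either not mentioned in the abstract or unrelated to the paper, and receive 0 points. The paper does not involve domain-specific scientific applications such as bioinformatics or cheminformatics, so 'AI for Science' also receives 0 points.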

!!! tip deepseek-chat TL;DR

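This paper tackles the problem that on-policy reinforcement learning (RL) is slow when training LLM agents on machine learning engineering (MLE) tasks because verification is expensive. It proposes SandMLE, a multi-agent framework that generates synthetic environments, cutting execution time by more than 13x and significantly outperforming supervised fine-tuning (SFT) baselines on MLE-bench-lite.

Abstract (translated)

As large language model agents move beyond software engineering (SWE) tasks toward machine learning engineering (MLE), the cost of verifying agent behavior grows by orders of magnitude: SWE tasks can be verified with fast-executing unit tests, whereas MLE verification requires running a full ML pipeline (data preprocessing, model training, and metric evaluation) on large datasets at every rollout step, which makes trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we propose SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Extensive experiments show that SandMLE reduces execution time by more than 13 times, enabling large-scale, trajectory-wise on-policy RL in the MLE domain for the first time. On MLE-bench-lite, SandMLE significantly outperforms SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. In addition, the trained policy generalizes to unseen agentic scaffolds, achieving up to a 32.4% better HumanRank score on MLE-Dojo.

Abstract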

As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines – data preprocessing, model training, and metric evaluation – on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by over 13 times, enabling large-scale, on-policy trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes across unseen agentic scaffolds, achieving up to 32.4% better HumanRank score on MLE-Dojo.

Keywords: large language model agents, machine learning engineering, multi-agent framework, synthetic environments, on-policy reinforcement learning, supervised fine-tuning, SandMLE, MLE-bench-lite

119. ❌ Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

Authors: Guan-Ting Lin, Chen Chen, Zhehuai Chen, Hung-yi Lee Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04847v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

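Scoring rationale: The paper evaluates several voice agent models (including GPT-Realtime and Gemini Live) on tool use under real speech conditions. Its core concerns are LLMs in spoken interaction, multi-step reasoning, self-correction, agentic workflows, and API tool calls. These are highly relevant to 'Large Language Models', 'Chain of Thought', 'Self-Correction', 'LLM Agents', and 'Tool Use' (10 points each), because the paper directly studies how LLM-driven voice agents perform on complex tasks. Other keywords such as MoE, quantization, and RAG are not mentioned in or relevant to the paper and receive 0 points.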

!!! tip deepseek-chat TL;DR

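This paper introduces FDB-v3, a benchmark for evaluating the multi-step tool use of voice agents under real speech conditions. It finds that GPT-Realtime performs best on accuracy and interruption avoidance while Gemini Live 3.1 has the lowest latency, but all systems remain consistently weak at handling self-correction and at multi-step reasoning in hard scenarios.

Abstract (translated)

We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios that require chained application programming interface (API) calls across four task domains. We evaluate six model configurations (GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional cascaded pipeline, Whisper→GPT-4o→TTS) along three dimensions: accuracy, latency, and turn-taking. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest latency (4.25 s) but the lowest turn-take rate (78.0%); and the cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12 s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.

Abstract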

We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations – GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline (Whisper→GPT-4o→TTS) – across accuracy, latency, and turn-taking dimensions. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest latency (4.25 s) but the lowest turn-take rate (78.0%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12 s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.

Keywords: spoken language models, tool use, multi-step reasoning, self-correction, voice agents, API calls, benchmark evaluation, real-world disfluency

120. ❌ Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling

Authors: Qingyang Xu, Yaling Shen, Stephanie Fong, Zimu Wang, Yiwen Jiang, Xiangyu Zhao, Jiahe Liu, Zhongxing Xu, Vincent Lee, Zongyuan Ge Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04842v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

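Scoring rationale: The paper studies the safety alignment of LLMs in mental health counseling and develops the Persona-based Client Simulation Attack (PCSA) red-teaming framework, so it is highly relevant to 'Large Language Models' and 'Instruction Tuning/Alignment' (10 points each). It focuses on the risk that LLMs reinforce harmful beliefs or behaviors in therapy, which is rated highly relevant to 'Hallucination Mitigation/Factuality/Truthfulness' (10 points). Its application of AI to mental health gives it some connection to 'AI for Science/Bioinformatics/Cheminformatics' (5 points), although it is not core bioinformatics or cheminformatics. Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, CoT, Agents, and Quantization are not addressed in the paper and receive 0 points.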

!!! tip deepseek-chat TL;DR

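To probe the safety risks of LLMs in mental health counseling, this paper proposes the Persona-based Client Simulation Attack (PCSA) red-teaming framework. Experiments show that current LLMs have vulnerabilities in psychological safety alignment: they readily provide unauthorized medical advice, reinforce delusions, and implicitly encourage risky actions.

Abstract (translated)

The growing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions. A key challenge is distinguishing therapeutic empathy from maladaptive validation, where supportive responses may inadvertently reinforce a client's harmful beliefs or behaviors over multi-turn conversations. Existing red-teaming frameworks focus mainly on generic harms or optimization-based attacks and largely overlook this risk. To fill this gap, we propose Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates psychological counseling through coherent, persona-driven client dialogues in order to expose vulnerabilities in psychological safety alignment. Experiments on seven general and mental health-specialized LLMs show that PCSA substantially outperforms four competitive baselines. Perplexity analysis and human inspection further indicate that PCSA generates more natural and realistic dialogues. Our results show that current LLMs remain vulnerable to domain-specific adversarial tactics, including providing unauthorized medical advice, reinforcing delusions, and implicitly encouraging risky actions.

Abstract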

The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions. A key challenge is distinguishing therapeutic empathy from maladaptive validation, where supportive responses may inadvertently reinforce harmful beliefs or behaviors in multi-turn conversations. This risk is largely overlooked by existing red-teaming frameworks, which focus mainly on generic harms or optimization-based attacks. To address this gap, we introduce Personality-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose vulnerabilities in psychological safety alignment. Experiments on seven general and mental health-specialized LLMs show that PCSA substantially outperforms four competitive baselines. Perplexity analysis and human inspection further indicate that PCSA generates more natural and realistic dialogues. Our results reveal that current LLMs remain vulnerable to domain-specific adversarial tactics, providing unauthorized medical advice, reinforcing delusions, and implicitly encouraging risky actions.

Keywords: Large Language Models, Psychological Counseling, Safety Alignment, Red-teaming, Persona-based Client Simulation Attack, Mental Healthcare, Adversarial Tactics, Therapeutic Empathy

121. ❌ How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

Authors: Yuhang Liu, Heyan Huang, Yizhe Yang, Hongyan Zhao, Zhizhuo Zeng, Yang Gao Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04791v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

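Scoring rationale: The paper evaluates the end-to-end problem-solving ability of LLMs in mathematical modeling competitions, so it is highly relevant to 'Large Language Models' (10 points). It touches on multi-stage reasoning, deliberate thinking, self-correction, and agentic workflows, giving it some connection to 'Chain of Thought', 'System 2 Thinking', 'Self-Correction', and 'LLM Agents' (5 points each). Mathematical modeling is a scientific application area, so it is related to 'AI for Science' (5 points). Other keywords such as MoE, SFT, RAG, and quantization are not mentioned in the abstract and receive 0 points.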

!!! tip deepseek-chat TL;DR

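This paper proposes a stage-wise evaluation framework and uses it to systematically assess the end-to-end problem-solving ability of large language models in mathematical modeling competitions. It finds that models do well in the comprehension stages but show persistent deficiencies in the execution stages, and that these deficiencies do not improve with model scale.

Abstract (translated)

Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems that require end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating this kind of end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance at each modeling stage against expert-verified criteria. We validate the framework's reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, showing substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: they perform well in early stages such as problem identification and formulation, but show persistent deficiencies in execution-oriented stages such as model solving, code implementation, and result analysis. These gaps persist even as model scale increases. We further trace these failures to insufficient specification, missing verification, and lack of validation, which let errors propagate across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.

Abstract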

Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework’s reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.

Keywords: Large Language Models, Mathematical Modeling, End-to-end Problem Solving, Evaluation Framework, Comprehension-Execution Gap, Model Scaling, Real-world Applications, Stage-wise Assessment

122. ❌ HUKUKBERT: Domain-Specific Language Model for Turkish Law

Authors: Mehmet Utku Öztürk, Tansu Türkoğlu, Buse Buz-Yalug Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04790v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

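Scoring rationale: The paper's core contribution is HukukBERT, a domain-specific language model for Turkish law trained with Domain-Adaptive Pre-Training (DAPT), so it is highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points). As a language model, it has some connection to 'Large Language Models OR LLMs OR Foundation Models' (8 points). The paper does not involve the techniques or applications behind the other keywords, such as MoE, SLMs, Scaling Laws, SFT, Alignment, RLHF, PEFT, RAG, inference acceleration, or agent systems, all of which receive 0 points.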

!!! tip deepseek-chat TL;DR

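To address the lack of a dedicated language model for Turkish law, this paper develops HukukBERT, which is trained on Turkish legal text with domain-adaptive pre-training and reaches state-of-the-art performance, including 84.40% Top-1 accuracy on the Legal Cloze Test.

Abstract (translated)

Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet research specific to Turkish law remains limited by the scarcity of domain-specific data and models. Although large-scale models such as LEGAL-BERT have been developed for English legal texts, the Turkish legal domain still lacks a domain-specific counterpart at scale. This paper introduces HukukBERT, the most comprehensive legal language model for Turkish to date. It is trained on a cleaned 18 GB legal corpus with a hybrid Domain-Adaptive Pre-Training (DAPT) methodology that integrates Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. We systematically compare our 48K WordPiece tokenizer and DAPT approach against general-purpose Turkish models and existing domain-specific models. Evaluated on a novel Legal Cloze Test benchmark, a masked legal term prediction task designed for Turkish court decisions, HukukBERT achieves state-of-the-art performance with 84.40% Top-1 accuracy, substantially outperforming existing models. We also evaluate HukukBERT on the downstream task of structural segmentation of official Turkish court decisions, where it achieves a 92.8% document pass rate, a new state of the art. We release HukukBERT publicly to support future research on Turkish legal NLP tasks, including named entity recognition, judgment prediction, and legal document classification.

Abstract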

Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet existing studies specific to Turkish law have still been limited due to the scarcity of domain-specific data and models. Although extensive models like LEGAL-BERT have been developed for English legal texts, the Turkish legal domain lacks a domain-specific high-volume counterpart. In this paper, we introduce HukukBERT, the most comprehensive legal language model for Turkish, trained on a 18 GB cleaned legal corpus using a hybrid Domain-Adaptive Pre-Training (DAPT) methodology integrating Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. We systematically compared our 48K WordPiece tokenizer and DAPT approach against general-purpose and existing domain-specific Turkish models. Evaluated on a novel Legal Cloze Test benchmark – a masked legal term prediction task designed for Turkish court decisions – HukukBERT achieves state-of-the-art performance with 84.40% Top-1 accuracy, substantially outperforming existing models. Furthermore, we evaluated HukukBERT in the downstream task of structural segmentation of official Turkish court decisions, where it achieves a 92.8% document pass rate, establishing a new state-of-the-art. We release HukukBERT to support future research in Turkish legal NLP tasks, including recognition of named entities, prediction of judgment, and classification of legal documents.
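Whole-Word Masking, one of the masking schemes named in the abstract, can be sketched in a few lines. This is a generic illustration of the technique and not HukukBERT's training code: it assumes WordPiece-style tokens in which continuation pieces start with `##`, and the example token split, the masking probability, and the function name are hypothetical.

```python
import random


def whole_word_mask(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Mask whole words: a word is a head piece plus its '##' continuation pieces."""
    # Group token indices so that each group covers exactly one word.
    words = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and words:
            words[-1].append(index)
        else:
            words.append([index])

    masked = list(tokens)
    for word in words:
        # One random draw per word, so all of its pieces are masked together.
        if rng.random() < mask_prob:
            for index in word:
                masked[index] = mask_token
    return masked


# Hypothetical WordPiece split of a short Turkish legal phrase.
tokens = ["mahkeme", "##nin", "karar", "##ı", "kesin", "##dir"]
print(whole_word_mask(tokens, 0.5, random.Random(0)))
```

Drawing once per word rather than once per token is what keeps a model from recovering a masked piece from its unmasked neighbors within the same word.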

Keywords: HukukBERT, Turkish law, domain-specific language model, Domain-Adaptive Pre-Training, legal corpus, masked legal term prediction, court decisions, state-of-the-art

123. ❌ MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Authors: Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, Jing Yu, Jie Yang, Wei Li, Shasha Wang, Qianqian Wu, Xuanhe Zhou, Weijia Li, Zhenxiang Li, Zhongying Tu, Jiang Wu, Lijun Wu, Chao Xu, Kai Chen, Wentao Zhang, Yu Qiao, Bowen Zhou, Dahua Lin, Conghui He Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04771v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

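Scoring rationale: The paper focuses on data engineering and training-strategy optimization for document parsing rather than on large-model techniques themselves. It is rated relevant to 'Pre-training' and 'Post-training' (8 points each) because it adopts a three-stage training strategy (large-scale pre-training, hard sample fine-tuning, and GRPO alignment). It has some connection to 'Scaling Laws AND Data Quality' (5 points) because it stresses the effect of data quality and scale on performance, and a weak connection to 'Instruction Tuning OR Alignment OR Value Alignment' (5 points) because it mentions GRPO alignment. Other keywords (such as LLMs, MoE, and RAG) are unrelated to the paper and receive 0 points.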

!!! tip deepseek-chat TL;DR

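Using only data engineering and training-strategy optimization (including diversity-aware sampling, cross-model consistency verification, and iterative correction), this paper raises document parsing performance to 95.69 on OmniDocBench v1.6 with a fixed 1.2B-parameter architecture, surpassing models with over 200 times more parameters.

Abstract (translated)

Current document parsing methods compete mainly on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models with different architectures and parameter scales show highly consistent failure patterns on the same set of hard samples, which suggests that the performance bottleneck stems from shared deficiencies in training data rather than from architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training-strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands the training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification uses output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; and the Judge-and-Refine pipeline improves annotation quality for hard samples through iterative render-then-verify correction. A three-stage progressive training strategy (large-scale pre-training, hard sample fine-tuning, and GRPO alignment) exploits data at different quality tiers in sequence. On the evaluation side, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including models with over 200 times more parameters.

Abstract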

Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy – large-scale pre-training, hard sample fine-tuning, and GRPO alignment – sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.
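The idea of using output agreement among heterogeneous models as a difficulty signal can be sketched as follows. The abstract does not say how agreement is measured, so this example substitutes a plain string-similarity ratio from Python's standard library as a stand-in. The function names and the sample outputs are hypothetical, and this is not the MinerU2.5-Pro Data Engine.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_agreement(outputs):
    """Average pairwise similarity (0.0 to 1.0) between the models' parsed outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single output trivially agrees with itself
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def difficulty(outputs):
    """Treat low agreement across models as a sign that the sample is hard."""
    return 1.0 - mean_agreement(outputs)


# Hypothetical parses of two pages produced by three different models.
easy_page = ["# Title\nHello world", "# Title\nHello world", "# Title\nHello world"]
hard_page = ["| a | b |\n| 1 | 2 |", "a b 1 2", "<table><tr><td>a</td></tr></table>"]

print(round(difficulty(easy_page), 3), round(difficulty(hard_page), 3))
```

Samples with high agreement could be auto-labeled from the consensus, while low-agreement samples would be routed to a refinement step; the threshold for that split is left open here because the abstract does not give one.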

关键词: document parsing, data engineering, training strategy, data quality, hard samples, progressive training, GRPO alignment, OmniDocBench
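
The abstract describes Cross-Model Consistency Verification only at a high level: agreement among heterogeneous models is used to assess sample difficulty and to produce reliable annotations. A minimal sketch of that idea follows. Everything in it is an assumption made for illustration and not the authors' implementation: the string-similarity measure (`difflib.SequenceMatcher`), the function names, the 0.9 threshold, and the example outputs.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for a task-specific metric."""
    return SequenceMatcher(None, a, b).ratio()


def agreement(outputs: list[str]) -> float:
    """Mean pairwise similarity across the outputs of different models."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need outputs from at least two models")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def verify(outputs: list[str], threshold: float = 0.9):
    """Return (difficulty, pseudo_label); the label is None when models disagree."""
    score = agreement(outputs)
    difficulty = 1.0 - score
    if score < threshold:
        return difficulty, None  # hard sample: hand off to a refinement step
    # Keep the output that is most similar to all the others (the medoid).
    label = max(outputs, key=lambda o: sum(similarity(o, p) for p in outputs))
    return difficulty, label


# Hypothetical parses of one document element from three different models.
easy = ["Total: 42", "Total: 42", "Total: 42"]
hard = ["\\frac{a}{b}", "a/b", "a over b"]

print(verify(easy))  # full agreement: difficulty 0.0 and a label is kept
print(verify(hard))  # disagreement: no label is returned
```

In this sketch, low agreement does two jobs at once: it flags the sample as difficult and it withholds an automatic label, which is roughly where a separate correction pipeline would take over.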

124. ❌ Darkness Visible: Reading the Exception Handler of a Language Model

作者: Peter Balogh 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04756v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

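Scoring rationale: This paper is an interpretability study of the internals of GPT-2 Small. It analyzes the functional roles of neurons in the final MLP layer (the Core neurons, Differentiators, Specialists, and Consensus neurons that make up the exception handler), their role in routing knowledge, and what "knowledge neurons" actually do. It is therefore highly relevant to "Large Language Models" (the object of study is GPT-2) and "Mechanistic Interpretability" (the core contribution is an analysis of internal mechanisms). The other keywords, such as MoE, SLMs, training methods, reasoning techniques, agents, and compression, are not addressed in the paper and score 0.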

!!! tip deepseek-chat TL;DR

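This paper studies the internal mechanism of the final MLP layer of GPT-2 Small. It finds a legible three-tier exception handler, shows that knowledge neurons act mainly as routing infrastructure rather than fact storage, and reports that this architecture crystallizes only at the terminal layer.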

摘要翻译

GPT-2 Small模型的最终多层感知机（MLP）展现出一个完全清晰的路由程序——由27个已命名的神经元组成的三层异常处理器——而其路由的知识仍分散纠缠于约3,040个残差神经元中。我们将全部3,072个神经元（以数值精度）分解为：5个融合核心神经元（负责将词汇重置为功能词）、10个区分器神经元（用于抑制错误候选）、5个专家神经元（检测结构边界），以及7个共识神经元（各自监控不同的语言维度）。共识-异常转换点——即MLP干预从有益转为有害的临界位置——在统计上呈现显著突变（自助法95%置信区间在所有共识水平上均不包含零；转换点位于4/7至5/7之间）。三项实验表明，该模型第11层中的“知识神经元”（Dai等人，2022）实际充当路由基础设施而非事实存储单元：MLP对注意力机制已在残差流中产生的信号进行放大或抑制，其强度随上下文约束程度而变化。一个花园路径实验揭示了反向花园路径效应——GPT-2能即时利用动词次范畴化信息，这与异常处理器基于词元级可预测性（而非句法结构）运作的特性相符。此架构仅在最末层形成结晶化——在更深层模型中，我们预测等效结构将出现在最终层而非第11层。代码与数据：https://github.com/pbalogh/transparent-gpt2

摘要 (Abstract)

The final MLP of GPT-2 Small exhibits a fully legible routing program – 27 named neurons organized into a three-tier exception handler – while the knowledge it routes remains entangled across ~3,040 residual neurons. We decompose all 3,072 neurons (to numerical precision) into: 5 fused Core neurons that reset vocabulary toward function words, 10 Differentiators that suppress wrong candidates, 5 Specialists that detect structural boundaries, and 7 Consensus neurons that each monitor a distinct linguistic dimension. The consensus-exception crossover – where MLP intervention shifts from helpful to harmful – is statistically sharp (bootstrap 95% CIs exclude zero at all consensus levels; crossover between 4/7 and 5/7). Three experiments show that “knowledge neurons” (Dai et al., 2022), at L11 of this model, function as routing infrastructure rather than fact storage: the MLP amplifies or suppresses signals already present in the residual stream from attention, scaling with contextual constraint. A garden-path experiment reveals a reversed garden-path effect – GPT-2 uses verb subcategorization immediately, consistent with the exception handler operating at token-level predictability rather than syntactic structure. This architecture crystallizes only at the terminal layer – in deeper models, we predict equivalent structure at the final layer, not at layer 11. Code and data: https://github.com/pbalogh/transparent-gpt2

关键词: GPT-2, MLP, exception handler, neurons, routing, knowledge neurons, interpretability, transformer
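
The abstract reports that bootstrap 95% confidence intervals exclude zero at every consensus level. As a reminder of what that check involves, here is a minimal percentile-bootstrap sketch. The data, function names, and resample count are assumptions for illustration; the paper's exact bootstrap procedure is not described in the abstract.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def excludes_zero(ci):
    """True when the whole interval lies on one side of zero."""
    lo, hi = ci
    return lo > 0 or hi < 0


# Hypothetical per-token effects of the MLP on the loss (positive = helpful).
helpful = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.95]
harmful = [-0.6, -0.9, -0.4, -0.7, -1.1, -0.5, -0.8, -0.65]

for name, effects in [("helpful", helpful), ("harmful", harmful)]:
    ci = bootstrap_ci(effects)
    print(name, ci, excludes_zero(ci))
```

An interval that excludes zero on the positive side at low consensus and on the negative side at high consensus would correspond to the sharp crossover the abstract describes.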

125. ❌ On Ambiguity: The case of fraction, its meanings and roles

作者: Jan A Bergstra, John V Tucker 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04647v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

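Scoring rationale: This is a purely theoretical paper in the philosophy of mathematics and linguistics. It examines the ambiguity of the term "fraction" in mathematical discourse and proposes new terms (such as fracterm and fracvalue) to clarify its meanings. The paper does not touch on large models, deep learning, AI techniques, or scientific applications of AI, whereas every keyword concerns one of those topics, so all keywords receive a relevance of 0.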

!!! tip deepseek-chat TL;DR

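This paper studies the ambiguity of the term "fraction" in mathematics, introduces new terms (such as fracterm and fracvalue) to separate its different meanings, and argues that fraction is not a single mathematical concept but a "category" that collects several concepts.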

摘要翻译

我们探讨数学论述中歧义性的概念。我们考虑一种解决歧义性的通用方法，以及维持解决方案的语义选项。这一通用讨论被应用于“分数”这一案例，该术语在初等算术文献中定义不清且存在歧义。为澄清“分数”的使用，我们引入若干新术语来指代其某些可能的含义。例如，为区分结构方面，我们使用“分项”；为区分纯数值方面，使用“分值”；为区分纯文本方面，使用“分符”与“分符出现”。这些解释可以解决歧义性，我们通过在算术论述片段中使用此类精确概念来讨论其解决方案。我们认为分数不具备数学概念的资格，但该术语作为多个概念的集合体发挥作用，我们将其简称为“范畴”。对分数的这一分析引导我们思考与分值相关的数的概念。我们引入一种规定数系的方式，并将这些分析性概念与结构主义的概念进行比较。

摘要 (Abstract)

We contemplate the notion of ambiguity in mathematical discourse. We consider a general method of resolving ambiguity and semantic options for sustaining a resolution. The general discussion is applied to the case of 'fraction' which is ill-defined and ambiguous in the literature of elementary arithmetic. In order to clarify the use of 'fraction' we introduce several new terms to designate some of its possible meanings. For example, to distinguish structural aspects we use 'fracterm', to distinguish purely numerical aspects 'fracvalue' and, to distinguish purely textual aspects 'fracsign' and 'fracsign occurrence'. These interpretations can resolve ambiguity, and we discuss the resolution by using such precise notions in fragments of arithmetical discourse. We propose that fraction does not qualify as a mathematical concept but that the term functions as a collective for several concepts, which we simply call a 'category'. This analysis of fraction leads us to consider the notion of number in relation to fracvalue. We introduce a way of specifying number systems, and compare the analytical concepts with those of structuralism.

关键词: ambiguity, fraction, mathematical discourse, fracterm, fracvalue, semantic resolution, arithmetical discourse, category

126. ❌ IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation

作者: Anjali Kantharuban, Aarohi Srivastava, Fahim Faisal, Orevaoghene Ahia, Antonios Anastasopoulos, David Chiang, Yulia Tsvetkov, Graham Neubig 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04704v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

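Scoring rationale: IDIOLEX learns representations of sentence style and dialect that are decoupled from semantic content, and explores using them to stylistically align language models. It touches on LLM applications, since the abstract mentions "developing diverse and accessible LLMs", so it is partly related to "Large Language Models" (5 points). Stylistic alignment is also related to "Instruction Tuning OR Alignment OR Value Alignment", since alignment can include adjusting style (5 points). Other keywords such as MoE, SLMs, Scaling Laws, and Pre-training are not addressed and score 0. The paper does not involve scientific AI applications such as bioinformatics, so "AI for Science" also scores 0.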

!!! tip deepseek-chat TL;DR

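This paper proposes IDIOLEX, a framework for learning continuous representations of sentence style and dialect that are decoupled from semantic content, and explores using these representations as training objectives for stylistically aligning language models, in support of diverse and accessible LLMs.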

摘要翻译

现有句子表征主要编码句子所述内容，而非其表达方式，尽管后者对许多应用至关重要。与此相反，我们开发了能够捕捉风格与方言、并与语义内容解耦的句子表征。我们将此任务定义为个人语言特征表征学习。我们提出IDIOLEX框架，该框架通过结合句子来源的监督信息与句子内容的语言学特征来训练模型，从而学习每个句子风格与方言的连续表征。我们在阿拉伯语和西班牙语的方言上评估了该方法。学习到的表征能够捕捉有意义的变异，并实现跨领域的分析与分类迁移。我们进一步探索了将这些表征作为训练目标，用于语言模型的风格对齐。研究结果表明，联合建模个体层面与群体层面的变异为研究个人语言特征提供了有效视角，并支持需要敏感捕捉风格差异的下游应用，例如开发多样化且易于访问的大语言模型。

摘要 (Abstract)

Existing sentence representations primarily encode what a sentence says, rather than how it is expressed, even though the latter is important for many applications. In contrast, we develop sentence representations that capture style and dialect, decoupled from semantic content. We call this the task of idiolectal representation learning. We introduce IDIOLEX, a framework for training models that combines supervision from a sentence’s provenance with linguistic features of a sentence’s content, to learn a continuous representation of each sentence’s style and dialect. We evaluate the approach on dialects of both Arabic and Spanish. The learned representations capture meaningful variation and transfer across domains for analysis and classification. We further explore the use of these representations as training objectives for stylistically aligning language models. Our results suggest that jointly modeling individual and community-level variation provides a useful perspective for studying idiolect and supports downstream applications requiring sensitivity to stylistic differences, such as developing diverse and accessible LLMs.

关键词: idiolectal representation learning, style and dialect, sentence representations, language model alignment, Arabic dialects, Spanish dialects, stylistic variation, continuous representations

127. ❌ Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation

作者: Hanif Rahman 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04598v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

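Scoring rationale: The paper benchmarks multilingual automatic speech recognition (ASR) on Pashto, evaluating Whisper, MMS-1B, SeamlessM4T-v2-large, and OmniASR-CTC-300M on zero-shot ASR, script failure, and cross-domain evaluation. It covers speech processing, model evaluation, and language-specific challenges, none of which match the keywords, which concern LLM architectures, training methods, inference optimization, agent systems, and AI for science (such as bioinformatics). Because the paper addresses speech recognition and does not involve text generation, model alignment, compression, or scientific applications, every keyword scores 0.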

!!! tip deepseek-chat TL;DR

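This paper reports the first reproducible multi-model benchmark of multilingual speech models on public Pashto data. Zero-shot Whisper performs poorly (WER from 90% to 297%, with the medium model reaching 461% on Common Voice 24), while SeamlessM4T achieves the best zero-shot result (39.7% WER on Common Voice 24). The paper also exposes script-level failure (output in the wrong writing system) and cross-domain degradation, and identifies five structural impediments and five research priorities.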

摘要翻译

普什图语拥有约6000万至8000万使用者，但目前尚未在任何公开共享测试集上发布多语言自动语音识别（ASR）的基准结果。本文首次在公开普什图语数据上进行了可复现的多模型评估，涵盖零样本ASR、文字体系层面的失效问题以及微调模型的跨领域评估。针对零样本ASR，我们在FLEURS普什图语测试集和过滤后的Common Voice 24子集上评估了十个模型（包括Whisper全部七种规模、MMS-1B、SeamlessM4T-v2-large和OmniASR-CTC-300M）；零样本Whisper的词错误率（WER）在90%至297%之间，其中中等规模模型在Common Voice 24上崩溃至461%，与解码器循环现象相符。SeamlessM4T在Common Voice 24上达到39.7% WER（截至投稿时已报道的最佳零样本结果）；MMS-1B在FLEURS上达到43.8% WER。针对文字体系失效问题，语言识别审计显示：所有Whisper模型生成普什图文输出的语句比例均未超过0.8%，而MMS-1B、SeamlessM4T和OmniASR的普什图文保真度均超过93%；仅凭WER无法揭示此类失效，因为对普什图语音频生成阿拉伯文输出的模型在任何可解释意义上均未实现ASR。在跨领域评估中，五个微调普什图语ASR模型在两个测试集上接受评估：已发布的14% WER在分布外测试集上恶化至32.5%–59%，而一个增强模型在两个测试集上均达到35.1% WER且无跨领域性能衰减。字符类别错误分层分析证实，普什图语特有音素（卷舌音系列和边擦音）导致了不成比例的错误权重。所有评估仅涵盖朗读语音。本文识别了阻碍累积性进展的五项结构性障碍，并论证了五项有序的研究优先方向。

摘要 (Abstract)

Pashto is spoken by approximately 60–80 million people but has no published benchmarks for multilingual automatic speech recognition (ASR) on any shared public test set. This paper reports the first reproducible multi-model evaluation on public Pashto data, covering zero-shot ASR, script-level failure, and cross-domain evaluation of fine-tuned models. For zero-shot ASR, ten models (all seven Whisper sizes, MMS-1B, SeamlessM4T-v2-large, and OmniASR-CTC-300M) are evaluated on the FLEURS Pashto test set and a filtered Common Voice 24 subset; zero-shot Whisper WER ranges from 90% to 297%, with the medium model collapsing to 461% on Common Voice 24 consistent with decoder looping. SeamlessM4T achieves 39.7% WER on Common Voice 24 (the best zero-shot result reported to date, as of submission); MMS-1B achieves 43.8% on FLEURS. For script failure, a language-identification audit shows that no Whisper model produces Pashto-script output in more than 0.8% of utterances, while MMS-1B, SeamlessM4T, and OmniASR each exceed 93% Pashto-script fidelity; WER alone does not reveal this failure, since a model generating Arabic-script output on Pashto audio has not achieved ASR in any interpretable sense. For cross-domain evaluation, five fine-tuned Pashto ASR models are evaluated on both test sets: published WER figures of 14% degrade to 32.5–59% on out-of-distribution sets, while one augmented model achieves 35.1% on both sets with zero cross-domain degradation. Character-class error stratification confirms that Pashto-unique phonemes (the retroflex series and lateral fricatives) account for disproportionate error mass. All evaluations cover read speech only. Five structural impediments to cumulative progress are identified and five ordered research priorities are argued.

关键词: multilingual speech models, Pashto ASR, zero-shot ASR, script failure, cross-domain evaluation, Whisper models, WER benchmarking, language identification audit
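
WER figures above 100%, such as the 461% reported for the Whisper medium model, can look like an error at first glance. They are possible because WER divides the word-level edit distance by the length of the reference, and insertions are counted as errors, so a looping decoder that emits far more words than the reference contains pushes the ratio past 1. The sketch below uses the standard definition with placeholder tokens; it is not the paper's evaluation code, and any text normalization the authors applied is not reproduced.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real Pashto text.
reference = "w1 w2"
exact = "w1 w2"
looping = "x x x x x x x x"  # a decoder stuck repeating one wrong token

print(word_error_rate(reference, exact))    # 0.0
print(word_error_rate(reference, looping))  # 4.0, i.e. 400% WER
```

The looping case costs two substitutions plus six insertions against a two-word reference, which is why a single degenerate utterance can dominate a corpus-level score.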

128. ❌ CommonMorph: Participatory Morphological Documentation Platform

作者: Aso Mahmudi, Sina Ahmadi, Kemal Kurniawan, Rico Sennrich, Eduard Hovy, Ekaterina Vylomova 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04515v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

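Scoring rationale: CommonMorph is a platform for collecting and annotating morphological data. It is a linguistic tooling effort involving active learning, community collaboration, and UniMorph-compatible output. All scoring keywords concern large models, deep learning techniques, or AI for Science, and the paper involves no large-model technique, deep learning innovation, or bioinformatics/cheminformatics application, so every keyword has a relevance of 0.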

!!! tip deepseek-chat TL;DR

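This work addresses the difficulty of collecting and annotating morphological data for low-resource languages. It presents CommonMorph, a three-tiered platform (expert linguistic definition, contributor elicitation, community validation) that speeds up data collection and keeps outputs interoperable with NLP tools.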

摘要翻译

形态数据的收集与标注工作面临重大挑战，需要语言学专业知识、严谨的方法论和大量资源。这些障碍对于低资源语言及其变体尤为突出。为加速这一进程，我们推出 CommonMorph，这是一个通过三层架构（专家语言学定义、贡献者语料诱发、社区验证）来简化和推进形态数据收集开发的综合平台。该平台通过整合主动学习、标注建议以及从相关语言导入与适配材料的工具，最大限度地减少了人工工作量。它兼容多种形态系统，包括融合语、黏着语及词根-词型（root-and-pattern）形态结构。其开源设计以及与UniMorph兼容的输出格式确保了平台的可访问性，并能与自然语言处理（NLP）工具实现互操作。我们的平台可通过 https://common-morph.com 访问，为通过协作技术保护语言多样性提供了一个可复现的模型。

摘要 (Abstract)

Collecting and annotating morphological data present significant challenges, requiring linguistic expertise, methodological rigour, and substantial resources. These barriers are particularly acute for low-resource languages and varieties. To accelerate this process, we introduce CommonMorph, a comprehensive platform that streamlines morphological data collection development through a three-tiered approach: expert linguistic definition, contributor elicitation, and community validation. The platform minimises manual work by incorporating active learning, annotation suggestions, and tools to import and adapt materials from related languages. It accommodates diverse morphological systems, including fusional, agglutinative, and root-and-pattern morphologies. Its open-source design and UniMorph-compatible outputs ensure accessibility and interoperability with NLP tools. Our platform is accessible at https://common-morph.com, offering a replicable model for preserving linguistic diversity through collaborative technology.

关键词: morphological data collection, low-resource languages, active learning, annotation platform, UniMorph-compatible, linguistic diversity, collaborative technology

129. ❌ Formal Constraints on Dependency Syntax

作者: Carlos Gómez-Rodríguez, Lluís Alemany-Puig 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04542v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

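Scoring rationale: "Formal Constraints on Dependency Syntax" is a computational linguistics paper on dependency syntax theory. It studies how formal constraints (such as projectivity) let dependency trees model attested linguistic phenomena more closely. All scoring keywords concern large models, deep learning techniques, or applications of AI in science, and the paper covers none of them: it does not mention language models (LLM/SLM), training or fine-tuning techniques (pre-training, SFT, RLHF, PEFT, etc.), inference optimization (attention mechanisms, decoding acceleration), AI agents, model compression, factuality, interpretability, world models, model merging, or in-context learning, nor AI for Science applications such as bioinformatics or cheminformatics. Every keyword therefore has a relevance of 0.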

!!! tip deepseek-chat TL;DR

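This paper examines formal constraints on dependency syntax. It covers constraints that sit between the overly restrictive projectivity condition and fully unrestricted dependency structures, with the aims of more accurate linguistic description and faster parsing.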

摘要翻译

依存句法将句子结构表征为由依存关系（即词汇单元间的有向关系）构成的树状图。尽管其广义形式允许任何此类树结构存在，但实际上许多结构在实证语言中并不合理或极为罕见。这促使研究者探索能够更好拟合真实语言现象的树结构子集约束条件，从而提供更精确的语言学描述、更高效的句法分析速度，或为语言演化及人类语言处理机制提供洞见。投射性（projectivity）是其中研究最深入的约束条件，但已被证明限制性过强，难以表征某些语言现象（尤其在语序灵活的语言中）。因此，学界提出了多种约束条件，旨在投射性的局限性与无限制依存结构的过度宽松性之间，寻求更符合语言现实的平衡点。

摘要 (Abstract)

Dependency syntax represents the structure of a sentence as a tree composed of dependencies, i.e., directed relations between lexical units. While in its more general form any such tree is allowed, in practice many are not plausible or are very infrequent in attested language. This has motivated a search for constraints characterizing subsets of trees that better fit real linguistic phenomena, providing a more accurate linguistic description, faster parsing or insights on language evolution and human processing. Projectivity is the most well-studied such constraint, but it has been shown to be too restrictive to represent some linguistic phenomena, especially in flexible-word-order languages. Thus, a variety of constraints have been proposed to seek a realistic middle ground between the limitations of projectivity and the excessive leniency of unrestricted dependency structures.

关键词: Dependency syntax, Formal constraints, Projectivity, Dependency trees, Linguistic phenomena, Parsing, Language evolution, Flexible-word-order languages
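
Projectivity, the constraint the abstract singles out, is easy to state operationally: an arc from head h to dependent d is projective if every word lying between them is a descendant of h, and a tree is projective if all of its arcs are. The sketch below checks this directly. The head-array encoding (1-based word positions, 0 for the root) and the example analyses are assumptions chosen for illustration, and the function expects a well-formed tree without cycles.

```python
def is_projective(heads: list[int]) -> bool:
    """heads[i - 1] is the head of word i (1-based); 0 marks the root word."""

    def has_ancestor(word: int, target: int) -> bool:
        # Walk up the tree from `word` until the root is reached.
        while word != 0:
            word = heads[word - 1]
            if word == target:
                return True
        return False

    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # the root has no incoming arc to check
        left, right = sorted((head, dep))
        # Every word strictly inside the arc's span must descend from the head.
        for between in range(left + 1, right):
            if not has_ancestor(between, head):
                return False
    return True


# "She saw the dog": She<-saw, saw=root, the<-dog, dog<-saw
projective = [2, 0, 4, 2]

# "A hearing is scheduled on the issue today": "on" attaches to "hearing",
# so the arc hearing->on spans "is" and "scheduled", which it does not dominate.
non_projective = [2, 3, 0, 3, 2, 7, 5, 4]

print(is_projective(projective))      # True
print(is_projective(non_projective))  # False
```

The second example is the kind of discontinuous attachment that motivates the milder constraints the abstract mentions: it is perfectly natural language, yet a strictly projective parser cannot produce it.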

130. ❌ Same Geometry, Opposite Noise: Transformer Magnitude Representations Lack Scalar Variability

作者: Jon-Paul Cacioli 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04469v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

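Scoring rationale: The paper directly studies representational properties of large language models (Llama-3-8B, Mistral-7B), so it is highly relevant to "Large Language Models" (10 points). It analyzes the dispersion of hidden-state representations, which is an explanation of internal model behavior and is highly relevant to "Mechanistic Interpretability" (10 points). The paper does not address the other keywords, such as MoE, training methods, inference optimization, or application domains, so they score 0.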

!!! tip deepseek-chat TL;DR

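This study tests whether transformer language models exhibit scalar variability, the property of biological magnitude systems in which representational noise scales in proportion to magnitude. The models show the opposite pattern: representational variability decreases as magnitude increases, indicating that distributional learning alone is insufficient to produce scalar variability.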

摘要翻译

标量变异性——即表征噪声与数值大小成比例缩放，产生恒定变异系数的现象——是生物数量表征系统的典型特征。我们通过分析三个70-80亿参数模型（Llama-3-8B-Instruct、Mistral-7B-Instruct-v0.3、Llama-3-8B-Base；数据来源Cacioli, 2026）中26个数值量在不同承载句的隐藏状态表征离散度，检验了Transformer语言模型是否展现该特性。研究发现相反规律：表征变异性沿数量轴随数值增大而递减（缩放指数α约-0.19；三个模型共16个主要层中α>0的层数为0）。这种负相关关系在全维度空间（α约-0.04）及经过句子身份校正后（α约-0.007）保持一致。反标量模式在数量轴上的强度是正交方向的3-5倍，且语料库频率能强力预测各数值的变异性（ρ=0.84）。这些结果表明仅靠分布学习不足以产生标量变异性：Transformer模型复现了对数压缩的数量几何结构，但未出现生物系统中观察到的恒定变异系数噪声特征。

摘要 (Abstract)

Scalar variability – the finding that representational noise scales proportionally with magnitude, producing a constant coefficient of variation – is a hallmark of biological magnitude systems. We tested whether transformer language models exhibit this property by analysing the dispersion of hidden-state representations across carrier sentences for 26 numerical magnitudes in three 7-8B parameter models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base; data from Cacioli, 2026). We found the opposite: representational variability decreased with magnitude along the magnitude axis (scaling exponent α ≈ -0.19; 0/16 primary layers with α > 0, all three models). The negative sign was consistent in full-dimensional space (α ≈ -0.04) and after sentence-identity correction (α ≈ -0.007). The anti-scalar pattern was 3-5x stronger along the magnitude axis than orthogonal dimensions, and corpus frequency strongly predicted per-magnitude variability (ρ = 0.84). These results demonstrate that distributional learning alone is insufficient to produce scalar variability: transformers reproduce log-compressive magnitude geometry but not the constant-CV noise signature observed in biological systems.

关键词: Transformer, language models, scalar variability, representational noise, hidden-state representations, magnitude systems, Llama-3-8B, Mistral-7B
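
The scaling exponent in the abstract can be read as the slope of a log-log fit: if dispersion grows as a power of magnitude, sd ∝ m^alpha, then log(sd) is linear in log(m) with slope alpha. Scalar variability (a constant coefficient of variation) corresponds to alpha = 1, while the reported value of about -0.19 means dispersion shrinks as magnitude grows. The sketch below fits that slope on synthetic data. The data and function name are illustrative assumptions, and the way the paper measures dispersion in hidden-state space is not reproduced here.

```python
import math


def scaling_exponent(magnitudes, dispersions):
    """Least-squares slope of log(dispersion) against log(magnitude)."""
    xs = [math.log(m) for m in magnitudes]
    ys = [math.log(d) for d in dispersions]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


magnitudes = [1, 2, 5, 10, 20, 50, 100]

# Synthetic "biological" pattern: sd is a fixed 15% of the magnitude.
scalar = [0.15 * m for m in magnitudes]
# Synthetic "anti-scalar" pattern: sd falls off as magnitude ** -0.19.
anti_scalar = [m ** -0.19 for m in magnitudes]

print(round(scaling_exponent(magnitudes, scalar), 3))       # 1.0
print(round(scaling_exponent(magnitudes, anti_scalar), 3))  # -0.19
```

A constant coefficient of variation is the same statement as alpha = 1, since sd / m stays fixed only when sd grows linearly with m.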

131. ❌ DeonticBench: A Benchmark for Reasoning over Rules

作者: Guangyao Dou, Luis Brena, Akhil Deo, William Jurayj, Jingyu Zhang, Nils Holzenberger, Benjamin Van Durme 期刊/来源: arxiv 发布日期: 2026-04-06 arXiv链接: http://arxiv.org/abs/2604.04443v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

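Scoring rationale: The paper studies LLM performance on complex rule reasoning (specifically deontic reasoning), which directly involves the keywords LLMs, Chain of Thought reasoning, and System 2 in-depth reasoning. It also tests SFT and RL methods, so those keywords score relatively high. The paper focuses on long-context reasoning, which relates to Context Window Extension. Other keywords such as MoE, quantization, and RAG do not appear in the abstract and score 0.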

!!! tip deepseek-chat TL;DR

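This paper targets the weakness of large language models on complex rule reasoning (deontic reasoning). It introduces DEONTICBENCH, a benchmark spanning several real-world domains, and finds that current models perform poorly on it and that training symbolic program generation with supervised fine-tuning and reinforcement learning still fails to solve the tasks reliably.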

Abstract

Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.

Keywords: Large Language Models, Deontic Reasoning, Benchmark, Chain-of-Thought, Long-context, Supervised Fine-tuning, Reinforcement Learning, Symbolic Computation
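The abstract above reports a macro-F1 score for the Housing tasks. As a reminder of what that metric measures, here is a minimal sketch of macro-F1 (the unweighted mean of per-class F1). The label names and predictions are hypothetical and are not taken from DEONTICBENCH; the benchmark's own evaluation code may differ in details such as label handling.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    f1_scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for illustration only (not benchmark data).
gold = ["allowed", "allowed", "prohibited", "prohibited"]
pred = ["allowed", "prohibited", "prohibited", "prohibited"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {100 * score:.1f}")
```

Because every class contributes equally, a model that ignores a rare class is penalized more heavily than it would be under plain accuracy.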

132. ❌ FAVE: Flow-based Average Velocity Establishment for Sequential Recommendation

Authors: Ke Shi, Yao Zhang, Feng Guo, Jinyuan Zhang, JunShuo Zhang, Shen Gao, Shuo Shang Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04427v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

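Scoring rationale: The paper studies flow-based sequential recommendation (the FAVE framework), aiming to improve the efficiency and performance of generative recommendation models. Its core techniques are flow-based models, diffusion models, semantic alignment, and inference acceleration. All scoring keywords explicitly target large language models (LLMs) and related techniques (MoE, RLHF, RAG, CoT, quantization, and so on), whereas this paper involves no LLMs or language-model techniques at all. It belongs to the recommender-systems field and uses generative modeling, but it is unrelated to the large-model and AI-for-science topics covered by the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper proposes the Flow-based Average Velocity Establishment (FAVE) framework to address the efficiency bottleneck of generative models in sequential recommendation. Using a semantic anchor prior and a global average velocity, it achieves one-step generation, substantially improving inference efficiency while maintaining state-of-the-art recommendation performance.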

Abstract

Generative recommendation has emerged as a transformative paradigm for capturing the dynamic evolution of user intents in sequential recommendation. While flow-based methods improve the efficiency of diffusion models, they remain hindered by the “Noise-to-Data” paradigm, which introduces two critical inefficiencies: prior mismatch, where generation starts from uninformative noise, forcing a lengthy recovery trajectory; and linear redundancy, where iterative solvers waste computation on modeling deterministic preference transitions. To address these limitations, we propose a Flow-based Average Velocity Establishment (Fave) framework for one-step generation recommendation that learns a direct trajectory from an informative prior to the target distribution. Fave is structured via a progressive two-stage training strategy. In Stage 1, we establish a stable preference space through dual-end semantic alignment, applying constraints at both the source (user history) and target (next item) to prevent representation collapse. In Stage 2, we directly resolve the efficiency bottlenecks by introducing a semantic anchor prior, which initializes the flow with a masked embedding from the user’s interaction history, providing an informative starting point. Then we learn a global average velocity, consolidating the multi-step trajectory into a single displacement vector, and enforce trajectory straightness via a JVP-based consistency constraint to ensure one-step generation. Extensive experiments on three benchmarks demonstrate that Fave not only achieves state-of-the-art recommendation performance but also delivers an order-of-magnitude improvement in inference efficiency, making it practical for latency-sensitive scenarios.

Keywords: sequential recommendation, generative recommendation, flow-based models, diffusion models, inference efficiency, one-step generation, semantic alignment, average velocity
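The abstract above describes consolidating a multi-step trajectory into a single displacement vector via an average velocity. The toy calculation below illustrates only that relationship; it is not the Fave implementation. The velocity field `velocity` is a hypothetical hand-written function, and the average velocity is computed from the integrated trajectory, whereas a learned model would predict it directly from the starting point.

```python
def velocity(x, t):
    """Hypothetical instantaneous velocity field, chosen only for illustration."""
    return [2.0 * t for _ in x]


def integrate(x0, steps):
    """Multi-step integration from t=0 to t=1, evaluating velocity at step midpoints."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t_mid = (i + 0.5) * dt
        v = velocity(x, t_mid)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def average_velocity(x_start, x_end, t0=0.0, t1=1.0):
    """Average velocity over [t0, t1] = total displacement / elapsed time."""
    return [(b - a) / (t1 - t0) for a, b in zip(x_start, x_end)]


x0 = [0.5, -1.0]
x_multi = integrate(x0, steps=100)       # 100 solver steps
u = average_velocity(x0, x_multi)        # one vector summarizing the whole path
x_one_step = [a + ui * 1.0 for a, ui in zip(x0, u)]  # a single update
print(x_multi, x_one_step)
```

The single update `x0 + u` lands on the same endpoint as the 100-step loop, which is why predicting an average velocity can replace an iterative solver at inference time.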

133. ❌ Structured Causal Video Reasoning via Multi-Objective Alignment

Authors: Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang, Chengjie Wang, Mohammed Bennamoun, Farid Boussaid, Feng Zheng, Qiuhong Ke Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04415v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

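Scoring rationale: The paper proposes the Factum-4B model. Its core aim is to improve the causal reasoning of Video-LLMs through Structured Event Facts and a four-stage training pipeline (including an RL stage). Highly relevant keywords: LLMs (the core model), RLHF (reinforcement-learning post-training), Chain of Thought (structured reasoning process), and System 2 Thinking (in-depth causal reasoning). Moderately relevant keywords: Post-training (includes an SFT stage), Instruction Tuning (facts alignment), Hallucination Mitigation (improved reliability), and Explainable AI (structured facts are easier to verify). Other keywords such as MoE, SLMs, and RAG are not addressed.

!!! tip deepseek-chat TL;DR

The paper targets the weak causal reasoning of existing Video-LLMs in video understanding. It proposes improving the model with Structured Event Facts and a multi-stage training pipeline that includes reinforcement learning. The resulting Factum-4B model performs better on video understanding tasks that require fine-grained temporal reasoning.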

Abstract

Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient processes and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making it difficult to optimize. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto-Frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.

Keywords: Video-LLMs, causal reasoning, structured event facts, multi-objective reinforcement learning, temporal inference, Factum-4B, video understanding, reinforcement learning post-training
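The abstract above frames the RL stage as a multi-objective problem and refers to the Pareto frontier. The sketch below shows only what "non-dominated" means for competing objectives; it does not reproduce the paper's MORL optimization. The objective tuples (completeness, fidelity, negative reasoning length) and their values are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one.

    All objectives are maximized, so reasoning length is stored negated.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (structural completeness, causal fidelity, -reasoning length) scores.
candidates = [
    (0.9, 0.8, -400),
    (0.7, 0.9, -250),
    (0.6, 0.7, -300),  # worse than the second candidate on every objective
    (0.9, 0.8, -500),  # same quality as the first candidate, but longer
]
front = pareto_front(candidates)
print(front)
```

The first two candidates trade completeness against fidelity and length, so neither dominates the other; both stay on the front.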

134. ❌ Talk2AI: A Longitudinal Dataset of Human–AI Persuasive Conversations

Authors: Alexis Carrillo, Enrique Taietta, Ali Aghazadeh Ardebili, Giuseppe Alessandro Veltri, Massimo Stella Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04354v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

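Scoring rationale: The paper's core contribution is a dataset of conversations between humans and LLMs (GPT-4o, Claude Sonnet 3.7, DeepSeek-chat V3, Mistral Large) for studying persuasion, opinion change, and human-AI interaction, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). The paper does not address the technical principles, methods, or applications behind any other keyword, so those keywords are unrelated to its content (0 points).

!!! tip deepseek-chat TL;DR

The study builds Talk2AI, a large-scale longitudinal dataset of conversations between humans and four large language models, for analyzing how AI-mediated dialogue shapes human beliefs and attitudes over time. The dataset supports fine-grained research on persuasion, opinion change, and human-AI interaction.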

Abstract

Talk2AI is a large-scale longitudinal dataset of 3,080 conversations (totaling 30,800 turns) between human participants and Large Language Models (LLMs), designed to support research on persuasion, opinion change, and human-AI interaction. The corpus was collected from 770 profiled Italian adults across four weekly sessions in Spring 2025, using a within-subject design in which each participant conversed with a single model (GPT-4o, Claude Sonnet 3.7, DeepSeek-chat V3, or Mistral Large) on three socially relevant topics: climate change, math anxiety, and health misinformation. Each conversation is linked to rich contextual data, including sociodemographic characteristics and psychometric profiles. After each session, participants reported on opinion change, conviction stability, perceived humanness of the AI, and behavioral intentions, enabling fine-grained longitudinal analysis of how AI-mediated dialogue shapes beliefs and attitudes over time.

Keywords: Large Language Models, human-AI interaction, persuasion, longitudinal dataset, opinion change, conversational AI, belief formation, social topics
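The dataset figures in the abstract above can be cross-checked with simple arithmetic. The sketch assumes one conversation per participant per weekly session, which is consistent with the reported totals but is not stated explicitly in the abstract.

```python
participants = 770
sessions_per_participant = 4       # four weekly sessions
reported_conversations = 3_080
reported_turns = 30_800

# Assumption: each participant holds exactly one conversation per session.
expected_conversations = participants * sessions_per_participant
turns_per_conversation = reported_turns / reported_conversations

print(expected_conversations == reported_conversations)
print(turns_per_conversation)
```

Under that assumption the counts agree, and each conversation averages 10 turns.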

135. ❌ Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction

Authors: Jinrui Fang, Runhan Chen, Xu Yang, Jian Yu, Jiawei Xu, Ashwin Vinod, Wenqi Shi, Tianlong Chen, Heng Ji, ChengXiang Zhai, Ying Ding, Yuji Zhang Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04325v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

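Scoring rationale: The paper studies the multi-turn interaction behavior of large language models in medical diagnosis, making it highly relevant to "Large Language Models" and "AI for Science" (10 points). Its focus on self-correction makes it highly relevant to "Self-Correction" (10 points). It involves reasoning processes and multi-step evidence accumulation, giving it some relevance to "Chain of Thought" and "System 2 Thinking" (5 points). Its attention to diagnostic accuracy and reliability gives it some relevance to "Hallucination Mitigation" and "Explainable AI" (5 points). Other keywords such as MoE, quantization, and RAG are not addressed in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies behavioral patterns of large language models in multi-turn medical diagnosis. It finds that models exhibit premature answering, self-correction, and susceptibility to strong lures, and it shows that deferring the diagnostic question and reserving salient evidence for later turns substantially improves diagnostic accuracy.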

Abstract

Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation closer to real clinical reasoning remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer, models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction, incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures, clinically salient information such as laboratory results trigger premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.

Keywords: Large Language Models, Medical Diagnosis, Multi-turn Benchmark, Self-Correction, Premature Answering, Clinical Reasoning, MINT Benchmark, Evidence Accumulation
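The self-correction finding in the abstract above compares incorrect-to-correct revisions against correct-to-incorrect flips. A minimal sketch of how such flips could be counted from per-turn answers follows. The case data and the `None` convention for "no answer yet" are hypothetical, and the paper's exact rate definition is not given in the abstract, so this counts raw flips only.

```python
def count_flips(answer_history, gold):
    """Count answer revisions between consecutive turns.

    Returns (incorrect_to_correct, correct_to_incorrect). A None entry means
    the model had not committed to an answer at that turn.
    """
    i2c = c2i = 0
    for prev, curr in zip(answer_history, answer_history[1:]):
        if prev is None or curr is None or prev == curr:
            continue  # no revision happened between these turns
        if prev != gold and curr == gold:
            i2c += 1
        elif prev == gold and curr != gold:
            c2i += 1
    return i2c, c2i


# Hypothetical per-turn answers for three cases (not MINT data).
cases = [
    (["flu", "flu", "pneumonia"], "pneumonia"),    # revised to the correct answer
    ([None, "asthma", "copd"], "copd"),            # revised to the correct answer
    (["gout", "arthritis", "arthritis"], "gout"),  # abandoned the correct answer
]
totals = [count_flips(history, gold) for history, gold in cases]
i2c_total = sum(t[0] for t in totals)
c2i_total = sum(t[1] for t in totals)
ratio = i2c_total / c2i_total if c2i_total else float("inf")
print(i2c_total, c2i_total, ratio)
```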

136. ❌ How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Authors: Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, Shiyu Chang Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04323v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

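Scoring rationale: The paper evaluates how well LLM-based agents use skills in realistic settings, making it highly relevant to "LLM Agents" (10 points). It involves skill retrieval and selection, which gives it some relevance to "Retrieval-Augmented Generation" (8 points), and skills as reusable knowledge modules relate to the "Tool Use" concept (8 points). Other keywords such as MoE, SFT, and RLHF are not addressed in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies how well LLM agents use skills in realistic settings. It finds that the performance gains from skills shrink markedly as settings become more challenging, but that query-specific skill refinement can recover part of the lost performance.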

Abstract

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formally benchmarking skill usage performance remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly-tailored task-specific skills for each task, whereas in many realistic settings, the LLM agent may have to search for and select relevant skills on its own, and even the closest matching skills may not be well-tailored for the task. In this paper, we conduct the first comprehensive study of skill utility under progressively challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and we show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.

Keywords: LLM agents, agent skills, skill benchmarking, skill retrieval, skill refinement, realistic settings, Terminal-Bench, performance evaluation
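The abstract above has agents retrieve skills from a large collection rather than receive hand-picked ones. To make that setting concrete, here is a minimal sketch of retrieval by token overlap. It is an illustrative stand-in, not the retrieval method used in the paper: the skill names, descriptions, and the Jaccard scoring are all assumptions.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of tokens."""
    return set(text.lower().split())


def retrieve_skills(query, skills, k=2):
    """Rank skills by Jaccard overlap between query and description tokens."""
    q = tokenize(query)
    scored = []
    for name, description in skills.items():
        d = tokenize(description)
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        scored.append((score, name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for score, name in scored[:k] if score > 0]


# Hypothetical skill library (the paper's collection has about 34k skills).
skills = {
    "git-rebase": "rewrite git commit history with interactive rebase",
    "csv-clean": "clean and deduplicate rows in a csv file",
    "docker-debug": "debug a failing docker container build",
}
top = retrieve_skills("fix failing docker build", skills, k=2)
print(top)
```

Even the best match here is only a partial fit for the query, which mirrors the abstract's point that the closest retrieved skill may not be well tailored to the task.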

137. ❌ Entropy, Disagreement, and the Limits of Foundation Models in Genomics

Authors: Maxime Rochkoulets, Lovro Vrček, Mile Šikić Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04287v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

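Scoring rationale: The paper directly studies foundation models in genomics, making it highly relevant to keyword 1 (Large Language Models OR LLMs OR Foundation Models). It concerns self-supervised pre-training on genomic data, making it highly relevant to keyword 5 (Pre-training). It belongs to bioinformatics, a concrete application of AI for Science, making it highly relevant to keyword 27 (AI for Science). Other keywords such as MoE, SFT, RAG, and inference acceleration are not addressed in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper finds that the high entropy of genomic sequences causes foundation models trained on DNA data to produce near-uniform output distributions, disagreement across models, and unstable static embeddings. This suggests that self-supervised training from sequences alone may not be suitable for genomic data, and it calls into question the assumptions behind current approaches to training genomic foundation models.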

Abstract

Foundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacities of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences – from the point of view of unseen token prediction – leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding layers, seemingly failing to exploit inter-token relationships. Our results suggest that self-supervised training from sequences alone may not be applicable to genomic data, calling into question the assumptions underlying current methodologies for training genomic foundation models.

Keywords: Foundation Models, Genomics, Entropy, Self-supervised Training, DNA Sequences, Fisher Information, Embedding Stability, Model Disagreement
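The abstract above links high sequence entropy to near-uniform output distributions. The sketch below computes Shannon entropy for two hypothetical next-token distributions to show what "near-uniform" means numerically. It assumes a four-symbol nucleotide vocabulary (A, C, G, T); the tokenization actually used in the paper is not stated in the abstract.

```python
import math


def shannon_entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical next-token distributions over A, C, G, T.
uniform_prediction = [0.25, 0.25, 0.25, 0.25]
confident_prediction = [0.97, 0.01, 0.01, 0.01]

h_uniform = shannon_entropy_bits(uniform_prediction)
h_confident = shannon_entropy_bits(confident_prediction)
print(h_uniform, h_confident)
```

With four symbols the maximum is 2 bits, so a model whose predictions sit near that ceiling has learned little about which token comes next.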

138. ❌ Commercial Persuasion in AI-Mediated Conversations

Authors: Francesco Salvi, Alejandro Cuevas, Manoel Horta Ribeiro Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04263v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

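Scoring rationale: The paper studies LLMs acting as conversational agents in commercial persuasion, making it highly relevant to "Large Language Models" and "LLM Agents" (10 points). It touches on model alignment and transparency, giving it some relevance to "Instruction Tuning/Alignment" and "Hallucination Mitigation/Factuality" (5 points), and it relates to "Mechanistic Interpretability/Explainable AI" through its detection mechanisms (5 points). Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and quantization are not addressed (0 points).

!!! tip deepseek-chat TL;DR

The study finds that, in AI-mediated conversations, LLM-driven commercial persuasion nearly triples the rate at which users choose sponsored products compared with traditional search, and that most users fail to notice the promotional steering. This indicates that existing transparency mechanisms may be insufficient to protect users.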

Abstract

As Large Language Models (LLMs) become a primary interface between users and the web, companies face growing economic incentives to embed commercial influence into AI-mediated conversations. We present two preregistered experiments (N = 2,012) in which participants selected a book to receive from a large eBook catalog using either a traditional search engine or a conversational LLM agent powered by one of five frontier models. Unbeknownst to participants, a fifth of all products were randomly designated as sponsored and promoted in different ways. We find that LLM-driven persuasion nearly triples the rate at which users select sponsored products compared to traditional search placement (61.2% vs. 22.4%), while the vast majority of participants fail to detect any promotional steering. Explicit “Sponsored” labels do not significantly reduce persuasion, and instructing the model to conceal its intent makes its influence nearly invisible (detection accuracy < 10%). Altogether, our results indicate that conversational AI can covertly redirect consumer choices at scale, and that existing transparency mechanisms may be insufficient to protect users.

Keywords: Large Language Models, LLM agents, commercial persuasion, sponsored products, transparency mechanisms, user detection, conversational AI, consumer choices

139. ❌ CAWN: Continuous Acoustic Wave Networks for Autoregressive Language Modeling

Authors: Dejan Čugalj, Aleksandar Jevremovic Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04250v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

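Scoring rationale: The paper's core contribution is a new architecture, the Continuous Acoustic Wave Network (CAWN), which replaces Transformer self-attention to address quadratic complexity in long-context processing. The work directly concerns architectural innovation for large language models (LLMs), in particular the challenge of context window extension. The paper explicitly mentions LLMs and long-context processing, so those two keywords receive 10 points. Other keywords, such as MoE, SLMs, training methods, alignment, reasoning techniques, agents, and compression, are not mentioned in or relevant to the abstract and therefore receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes CAWN, a new continuous sequence-mixing architecture that performs autoregressive language modeling with O(L) complexity through a phase accumulation mechanism. It addresses the quadratic complexity of Transformers on long contexts and retrieves information efficiently across a 2,000,000-token context.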

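Abstract translation

Modern large language models (LLMs) rely on Transformer self-attention, whose computational cost grows quadratically with sequence length. Recent linear-time alternatives, such as state space models (SSMs), often suffer from signal degradation over long contexts. This paper proposes the Continuous Acoustic Wave Network (CAWN), a fully continuous sequence-mixing architecture. CAWN abandons discrete matrix-based attention and instead projects hidden states into multi-headed complex-domain phasors, mixing the sequence through a causal $O(L)$ phase accumulation mechanism. To prevent signal degradation over ultra-long contexts, we introduce a dual-gated selective phase resonance mechanism that combines frequency-dependent retention, hard-threshold gating via straight-through estimation, and a temporal syntax cache for capturing short-term local dependencies. We also replace standard dense linear projections with depth-wise harmonic convolutions for optimal spatial frequency mixing, and strengthen depth-wise state routing with block attention residuals. Scaled to 150M parameters, CAWN uses custom Triton kernels for hardware-efficient, true-complex phase accumulation in float32. The prototype is trained with a continuous streaming loop on a 100-billion-token corpus and evaluated at the 5-billion-token milestone. Empirical evaluation with a targeted semantic retrieval protocol shows robust vocabulary acquisition and extended, explicitly learned contextual denoising. Using $O(1)$ state passing via chunked prefill, the model retrieves targeted information across 2,000,000 tokens while peak VRAM plateaus strictly at 8.72 GB, empirically overcoming the $O(L^2)$ context memory wall.

Abstract (original)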

Modern Large Language Models (LLMs) rely on Transformer self-attention, which scales quadratically with sequence length. Recent linear-time alternatives, like State Space Models (SSMs), often suffer from signal degradation over extended contexts. We introduce the Continuous Acoustic Wave Network (CAWN), a fully continuous sequence-mixing architecture. Instead of discrete matrix-based attention, CAWN projects hidden states into multi-headed complex-domain phasors, achieving sequence mixing through a causal, $O(L)$ Phase Accumulation mechanism. To prevent signal degradation over ultra-long contexts, we introduce a dual-gated Selective Phase Resonance mechanism incorporating Frequency-Dependent Retention, Hard-Threshold Gating via Straight-Through Estimation, and a Temporal Syntax Cache to capture short-term local dependencies. We also replace standard dense linear projections with Depth-wise Harmonic Convolutions for optimal spatial frequency mixing, augmented by Block Attention Residuals for depth-wise state routing. Scaled to a 150M-parameter model, CAWN utilizes custom Triton kernels for hardware-efficient, true-complex phase accumulation in float32. Trained via a continuous streaming loop on a 100-Billion-token corpus, the prototype is evaluated at a 5-Billion-token milestone. Empirical evaluations via a Targeted Semantic Retrieval protocol demonstrate robust vocabulary acquisition and extended explicitly learned contextual denoising. By leveraging $O(1)$ state-passing via chunked prefill, the model retrieves targeted information across 2,000,000 tokens while strictly plateauing at 8.72 GB of Peak VRAM, empirically overcoming the $O(L^2)$ context memory wall.

Keywords: Continuous Acoustic Wave Network, Autoregressive Language Modeling, Long Context LLMs, Phase Accumulation, Sequence Mixing Architecture, Transformer Alternatives, Linear-time Complexity, Context Memory Wall
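
To make the causal $O(L)$ phase accumulation and the $O(1)$ state passing a little more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract gives no equations, so the recurrence below (one running complex state, rotated by a fixed phase and damped by a retention factor at every step) is an assumed stand-in, and `phase_accumulate`, `chunked_prefill`, `omega`, and `retention` are hypothetical names.

```python
import cmath


def phase_accumulate(values, state=0j, omega=0.3, retention=0.95):
    """Causal mixing with one running complex state (linear in sequence length).

    Assumed recurrence, for illustration only (not taken from the paper):
        state_t = retention * exp(i * omega) * state_{t-1} + values[t]
    """
    rotation = retention * cmath.exp(1j * omega)  # fixed phase step plus decay
    outputs = []
    for v in values:            # single left-to-right pass over the sequence
        state = rotation * state + v
        outputs.append(state)   # output t depends only on inputs 0..t
    return outputs, state


def chunked_prefill(values, chunk_size, **kwargs):
    """Process a long sequence chunk by chunk, carrying only one state value."""
    state = 0j
    outputs = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        chunk_out, state = phase_accumulate(chunk, state, **kwargs)
        outputs.extend(chunk_out)
    return outputs
```

In this toy version the only thing passed between chunks is a single complex number, which is the sense in which the carried state is constant-size; the real model's state, gating, and kernels are certainly more involved.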

140. ❌ Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

Authors: Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, Joseph E. Gonzalez Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04247v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

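Scoring rationale: The paper studies self-improvement of large language model (LLM) agents, using prompt learning to acquire task-relevant knowledge from inference-time context, which places it in the areas of LLM agents and in-context learning. It is highly relevant to 'Large Language Models', 'LLM Agents', 'Self-Improvement', and 'In-context Learning' (10 points), since these form the paper's foundation and core content. It has some relevance to 'Multi-agent Systems' (5 points) because it involves parallel agent execution and learning from multiple agent traces, but its main focus is scaling prompt learning for a single agent rather than multi-agent coordination. Other keywords, such as MoE, SFT, RAG, and quantization, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes Combee, a framework that addresses the low efficiency and quality degradation of existing prompt learning methods under parallel scaling. Using parallel scans and a dynamic batch size controller, it achieves up to 17x speedup while maintaining or improving accuracy.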

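Abstract translation

Recent progress in prompt learning lets large language model agents acquire task-relevant knowledge from inference-time context without changing parameters. For example, existing methods such as ACE or GEPA can learn system prompts from previous agent runs to improve accuracy. However, these methods mainly target single-agent or low-parallelism settings, which fundamentally limits their ability to learn efficiently from large collections of agent traces. As learning from many agentic traces or parallel agent executions becomes more common, running prompt learning in parallel would be efficient and beneficial. Yet without a principled scaling strategy, existing methods degrade in quality at high parallelism. To improve both the efficiency and the quality of prompt learning, we propose Combee, a new framework for scaling parallel prompt learning for self-improving agents. Combee speeds up learning and supports running many agents in parallel while learning from their aggregate traces without loss of quality. To do so, Combee uses parallel scans with an augmented shuffle mechanism, and it introduces a dynamic batch size controller to balance quality and delay. Evaluations on AppWorld, Terminal-Bench, Formula, and FiNER show that Combee achieves up to 17x speedup over existing methods with comparable or better accuracy at equivalent cost.

Abstract (original)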

Recent advances in prompt learning allow large language model agents to acquire task-relevant knowledge from inference-time context without parameter changes. For example, existing methods (like ACE or GEPA) can learn system prompts to improve accuracy based on previous agent runs. However, these methods primarily focus on single-agent or low-parallelism settings. This fundamentally limits their ability to efficiently learn from a large set of collected agentic traces. It would be efficient and beneficial to run prompt learning in parallel to accommodate the growing trend of learning from many agentic traces or parallel agent executions. Yet without a principled strategy for scaling, current methods suffer from quality degradation with high parallelism. To improve both the efficiency and quality of prompt learning, we propose Combee, a novel framework to scale parallel prompt learning for self-improving agents. Combee speeds up learning and enables running many agents in parallel while learning from their aggregate traces without quality degradation. To achieve this, Combee leverages parallel scans and employs an augmented shuffle mechanism; Combee also introduces a dynamic batch size controller to balance quality and delay. Evaluations on AppWorld, Terminal-Bench, Formula, and FiNER demonstrate that Combee achieves up to 17x speedup over previous methods with comparable or better accuracy and equivalent cost.

Keywords: prompt learning, self-improving agents, large language model agents, parallel scaling, inference-time context, agentic traces, dynamic batch size, parallel scans
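
The abstract names parallel scans as one ingredient but does not say what Combee scans over or how. As background only, the sketch below shows what a parallel (Hillis-Steele style) inclusive scan is: it computes every prefix combination of an associative operator in about log2(n) rounds, where all updates within a round are independent. The `merge_lessons` operator and the idea of merging per-trace "lessons" are assumptions made up for this illustration, not Combee's algorithm.

```python
def inclusive_scan(items, combine):
    """Inclusive prefix scan; `combine` must be associative.

    Each round could run in parallel, because every position reads only
    values produced by the previous round.
    """
    result = list(items)
    step = 1
    while step < len(result):
        result = [
            combine(result[i - step], result[i]) if i >= step else result[i]
            for i in range(len(result))
        ]
        step *= 2
    return result


def merge_lessons(earlier, later):
    """Hypothetical associative merge: keep order, drop repeated lessons."""
    return earlier + tuple(x for x in later if x not in earlier)


# Toy per-trace lessons, invented for the example.
lessons = [("check login",), ("retry on 500", "check login"), ("cache tokens",)]
prefixes = inclusive_scan(lessons, merge_lessons)
```

After the scan, `prefixes[k]` holds the merged lessons of traces 0 through k, so the last entry summarizes all traces.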

141. ❌ Precise Robot Command Understanding Using Grammar-Constrained Large Language Models

Authors: Xinyun Huo, Raghav Gnanasambandam, Xinyao Zhang Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04233v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

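Scoring rationale: The paper develops a grammar-constrained LLM system for robot command understanding. It directly involves applying a fine-tuned LLM (SFT) and a self-correction mechanism, so it is highly relevant to 'Large Language Models', 'Post-training', and 'Self-Correction' (10 points). Other keywords, such as MoE, SLMs, Scaling Laws, RAG, and Agents, are not mentioned in or relevant to the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a hybrid model that combines grammar constraints with a fine-tuned LLM to make robot command understanding more precise and reliable in industrial human-robot collaboration. A validation and feedback loop provides self-correction, and experiments on the HuRIC dataset show superior command validity.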

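Abstract translation

Human-robot collaboration in industrial settings requires precise, reliable communication to improve operational efficiency. Although large language models (LLMs) understand general language, they often lack the domain-specific rigor needed to make industrial commands safe and executable. To close this gap, this paper proposes a novel grammar-constrained LLM that combines a grammar-driven natural language understanding (NLU) system with a fine-tuned LLM, providing both conversational flexibility and the deterministic precision that robotics requires. The method uses a two-stage process. First, the fine-tuned LLM performs high-level contextual reasoning and parameter inference on natural language input. Second, a Structured Language Model (SLM) and a grammar-based canonicalizer constrain the LLM's output, forcing it into a standardized symbolic format made up of valid action frames and command elements. This process ensures that generated commands are valid and structured as robot-readable JSON. A key feature of the model is a validation and feedback loop: a grammar parser validates the output against a predefined list of executable robot actions, and if a command is invalid, the system automatically generates corrective prompts and calls the LLM again. This iterative self-correction mechanism lets the model recover from initial interpretation errors and improves system robustness. Using the Human Robot Interaction Corpus (HuRIC) dataset, we compare the grammar-constrained hybrid model with two baselines: a fine-tuned API-based LLM and a standalone grammar-driven NLU model. Experiments show that the hybrid approach achieves superior command validity, supporting safer and more effective industrial human-robot collaboration.

Abstract (original)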

Human-robot collaboration in industrial settings requires precise and reliable communication to enhance operational efficiency. While Large Language Models (LLMs) understand general language, they often lack the domain-specific rigidity needed for safe and executable industrial commands. To address this gap, this paper introduces a novel grammar-constrained LLM that integrates a grammar-driven Natural Language Understanding (NLU) system with a fine-tuned LLM, which enables both conversational flexibility and the deterministic precision required in robotics. Our method employs a two-stage process. First, a fine-tuned LLM performs high-level contextual reasoning and parameter inference on natural language inputs. Second, a Structured Language Model (SLM) and a grammar-based canonicalizer constrain the LLM’s output, forcing it into a standardized symbolic format composed of valid action frames and command elements. This process guarantees that generated commands are valid and structured in a robot-readable JSON format. A key feature of the proposed model is a validation and feedback loop. A grammar parser validates the output against a predefined list of executable robotic actions. If a command is invalid, the system automatically generates corrective prompts and re-engages the LLM. This iterative self-correction mechanism allows the model to recover from initial interpretation errors to improve system robustness. We evaluate our grammar-constrained hybrid model against two baselines: a fine-tuned API-based LLM and a standalone grammar-driven NLU model. Using the Human Robot Interaction Corpus (HuRIC) dataset, we demonstrate that the hybrid approach achieves superior command validity, which promotes safer and more effective industrial human-robot collaboration.

Keywords: Large Language Models, grammar-constrained, fine-tuned, robot command understanding, self-correction, human-robot collaboration, industrial settings, structured language model
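
The validation and feedback loop described in the abstract can be sketched roughly as follows. This is an illustration under assumptions, not the paper's system: the action list, the JSON shape, the prompt wording, and the `fake_llm` stand-in are all hypothetical, and a plain membership check stands in for the grammar parser.

```python
import json

ALLOWED_ACTIONS = {"pick", "place", "move"}  # hypothetical executable actions


def validate(command_json):
    """Return (ok, error_message) for a candidate robot command."""
    try:
        command = json.loads(command_json)
    except json.JSONDecodeError as exc:
        return False, f"output is not valid JSON: {exc.msg}"
    if not isinstance(command, dict):
        return False, "command must be a JSON object"
    action = command.get("action")
    if action not in ALLOWED_ACTIONS:
        return False, f"unknown action {action!r}; allowed: {sorted(ALLOWED_ACTIONS)}"
    return True, ""


def understand(utterance, llm, max_attempts=3):
    """Call the model, validate its output, and re-prompt with the error."""
    prompt = utterance
    for _ in range(max_attempts):
        candidate = llm(prompt)
        ok, error = validate(candidate)
        if ok:
            return json.loads(candidate)
        # Corrective prompt: restate the request together with the reason for rejection.
        prompt = (
            f"{utterance}\nPrevious output was rejected: {error}\n"
            "Return a corrected JSON command."
        )
    return None  # give up instead of passing an invalid command to the robot


def fake_llm(prompt):
    """Stand-in model: wrong on the first call, corrected after feedback."""
    if "rejected" in prompt:
        return '{"action": "pick", "object": "bolt"}'
    return '{"action": "grab", "object": "bolt"}'
```

Calling `understand("grab the bolt", fake_llm)` first receives the invalid `grab` action, feeds the validation error back, and accepts the corrected `pick` command on the second attempt. Capping the number of attempts keeps the loop from running indefinitely when the model never produces a valid command.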

142. ❌ Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models

Authors: Mir Tafseer Nayeem, Davood Rafiei Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04204v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

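Scoring rationale: The paper's core topic is dialect bias in large language models (LLMs/Foundation Models), so that keyword receives 10 points. The paper audits pretraining corpora and analyzes the model development pipeline, which gives it some relevance to 'Scaling Laws AND Data Quality' and 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points each). Other keywords, such as MoE, SFT, RAG, and quantization, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

By systematically auditing pretraining corpora, analyzing tokenizers, and evaluating generated output, the paper finds a structural bias in contemporary LLMs between two standard English varieties (American and British English): the models strongly favor American English as the de facto norm.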

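Abstract translation

Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they offer only limited language settings, most notably "English (US)", which overlooks the global diversity of English and its colonial history. Using a postcolonial framing to explain the broader significance, we examine how the geopolitical history of data curation, digital dominance, and linguistic standardization shapes the LLM development pipeline. Focusing on the two dominant standard varieties, American English (AmE) and British English (BrE), we build a curated corpus of 1,813 AmE–BrE variants and propose DiAlign, a dynamic, training-free method that estimates dialectal alignment from distributional evidence. We quantify structural bias by triangulating evidence across three stages: (i) audits of six major pretraining corpora reveal a systematic skew toward AmE; (ii) tokenizer analyses show that BrE forms incur higher segmentation costs; and (iii) generative evaluations show a persistent preference for AmE in model outputs. To our knowledge, this is the first systematic, multi-faceted examination of dialectal asymmetries between standard English varieties across the LLM development pipeline. We find that contemporary LLMs treat AmE as the de facto norm, which raises concerns about linguistic homogenization, epistemic injustice, and inequity in global AI deployment, and also points to practical steps toward more dialectally inclusive language technologies.

Abstract (original)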

Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they expose only limited language settings, most notably “English (US),” despite the global diversity and colonial history of English. Through a postcolonial framing to explain the broader significance, we investigate how geopolitical histories of data curation, digital dominance, and linguistic standardization shape the LLM development pipeline. Focusing on two dominant standard varieties, American English (AmE) and British English (BrE), we construct a curated corpus of 1,813 AmE–BrE variants and introduce DiAlign, a dynamic, training-free method for estimating dialectal alignment using distributional evidence. We operationalize structural bias by triangulating evidence across three stages: (i) audits of six major pretraining corpora reveal systematic skew toward AmE, (ii) tokenizer analyses show that BrE forms incur higher segmentation costs, and (iii) generative evaluations show a persistent AmE preference in model outputs. To our knowledge, this is the first systematic and multi-faceted examination of dialectal asymmetries in standard English varieties across the phases of LLM development. We find that contemporary LLMs privilege AmE as the de facto norm, raising concerns about linguistic homogenization, epistemic injustice, and inequity in global AI deployment, while motivating practical steps toward more dialectally inclusive language technologies.

Keywords: Large Language Models, Dialectal Bias, American English, British English, Pretraining Corpora, Tokenizer Analysis, Generative Evaluation, Structural Bias
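
As a rough illustration of what a corpus audit of the kind named in stage (i) might involve, the sketch below counts how often each side of a few AmE/BrE spelling pairs appears in a text. It is not DiAlign and not the authors' audit code: the three pairs are a hypothetical sample (the paper's curated list has 1,813 variants), and the `audit` function and its output fields are invented for this example.

```python
import re
from collections import Counter

# Hypothetical sample of (AmE, BrE) spelling pairs.
VARIANT_PAIRS = [("color", "colour"), ("organize", "organise"), ("center", "centre")]


def audit(text, pairs=VARIANT_PAIRS):
    """Count AmE and BrE variant occurrences and report the AmE share."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    ame = sum(counts[a] for a, _ in pairs)
    bre = sum(counts[b] for _, b in pairs)
    total = ame + bre
    return {
        "ame": ame,
        "bre": bre,
        # None when no variant from the list occurs at all.
        "ame_share": ame / total if total else None,
    }
```

An `ame_share` well above 0.5 over a large corpus would point to the kind of skew the paper reports, although a real audit would also need to handle inflected forms, casing conventions, and words that are valid in both varieties.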

143. ❌ ClawArena: Benchmarking AI Agents in Evolving Information Environments

Authors: Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04202v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

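Scoring rationale: ClawArena is an evaluation benchmark for AI agents in dynamic information environments, centered on capabilities such as LLM agents, multi-step reasoning, in-depth thinking, self-correction, and maintaining factuality. It is highly relevant to the LLM, agent, reasoning, self-correction, and factuality keywords (10 points), since these are the core dimensions the paper evaluates. Other keywords, such as MoE, quantization, RAG, and alignment, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper introduces the ClawArena benchmark for evaluating whether AI agents can maintain correct beliefs in dynamic, multi-source, contradictory information environments. Experiments show that model capability and framework design both substantially affect performance, and that self-evolving skill frameworks can partially close model-capability gaps.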

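Abstract translation

AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that contradict one another, new information can invalidate earlier conclusions, and user preferences often emerge through corrections rather than explicit instructions. Existing benchmarks mostly assume static, single-authority settings and do not evaluate whether agents can handle this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, fragmentary, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation centers on three interrelated challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release covers 64 scenarios across 8 professional domains, with 1,879 evaluation rounds and 365 dynamic updates in total. Experiments on five agent frameworks and five language models show that model capability (15.4% range) and framework design (9.2%) both substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that the difficulty of belief revision depends on the update design strategy rather than on the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.

Abstract (original)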

AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.

Keywords: AI agents, evolving information environments, benchmark, multi-source conflict reasoning, dynamic belief revision, implicit personalization, LLM evaluation, workspace grounding

144. ❌ Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs

Authors: Jason Chan, Robert Gaizauskas, Zhixue Zhao Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04177v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

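Scoring rationale: The paper studies the use of LLMs in fact-checking, in particular the gap between logical inference in neurosymbolic systems and human inference. Highly relevant keywords: 'Large Language Models' (the paper's core subject) and 'Hallucination Mitigation' (it directly discusses reducing LLM errors and hallucinations). Moderately relevant keywords: 'Chain of Thought', 'System 2 Thinking', and 'Mechanistic Interpretability' (the paper touches on reasoning processes and interpretability, but these are not its technical core). The remaining keywords are unrelated to the paper's methods, model architecture, training process, or application domain.

!!! tip deepseek-chat TL;DR

The paper argues that in LLM-based neurosymbolic fact-checking systems, logical soundness cannot reliably detect misleading claims, because logically sound conclusions can lead humans to draw inferences that the premises do not support. It therefore recommends using the human-like reasoning tendencies of LLMs to validate the outputs of formal components.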

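Abstract translation

As large language models are increasingly integrated into fact-checking pipelines, formal logic is often proposed as a rigorous way to mitigate bias, errors, and hallucinations in their outputs. For example, some neurosymbolic systems verify claims by using an LLM to translate natural language into logical formulae and then checking whether the proposed claims are logically sound, that is, whether they can be validly derived from premises verified to be true. We argue that such approaches are structurally unable to detect misleading claims, because of systematic divergences between logically sound conclusions and the inferences that humans typically make and accept. Drawing on research in cognitive science and pragmatics, we propose a typology of cases in which logically sound conclusions lead humans to draw inferences that the premises do not support. We therefore advocate a complementary approach: treating the human-like reasoning tendencies of LLMs as a feature rather than a bug, and using these models to validate the outputs of the formal logic components in neurosymbolic systems, guarding against potentially misleading conclusions.

Abstract (original)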

As large language models (LLMs) are increasingly integrated into fact-checking pipelines, formal logic is often proposed as a rigorous means by which to mitigate bias, errors and hallucinations in these models’ outputs. For example, some neurosymbolic systems verify claims by using LLMs to translate natural language into logical formulae and then checking whether the proposed claims are logically sound, i.e. whether they can be validly derived from premises that are verified to be true. We argue that such approaches structurally fail to detect misleading claims due to systematic divergences between conclusions that are logically sound and inferences that humans typically make and accept. Drawing on studies in cognitive science and pragmatics, we present a typology of cases in which logically sound conclusions systematically elicit human inferences that are unsupported by the underlying premises. Consequently, we advocate for a complementary approach: leveraging the human-like reasoning tendencies of LLMs as a feature rather than a bug, and using these models to validate the outputs of formal components in neurosymbolic systems against potentially misleading conclusions.

Keywords: large language models, fact-checking, neurosymbolic systems, logical soundness, human inference, hallucination mitigation, formal logic, reasoning divergence

145. ❌ A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

Authors: Avish Vijayaraghavan, Jaskaran Singh Kawatra, Sebin Sabu, Jonny Sheldon, Will Poulett, Alex Eze, Daniel Key, John Booth, Shiren Patel, Jonny Pearson, Dan Schofield, Jonathan Hope, Pavithra Rajendran, Neil Sebire Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04168v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

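Scoring rationale: The paper's core is a semi-automated annotation workflow for paediatric histopathology reports using small language models (SLMs). Highly relevant keywords: 1) 'Small Language Models OR SLMs OR On-device AI' (10 points): the paper explicitly runs SLMs on CPU infrastructure to address privacy and resource constraints; 2) 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points): a direct application in the bioinformatics/medical domain; 3) 'Instruction Tuning OR Alignment OR Value Alignment' (8 points): it uses instruction-tuned SLMs; 4) 'In-context Learning OR Many-shot Learning' (8 points): it uses few-shot examples to improve performance; 5) 'Large Language Models OR LLMs OR Foundation Models' (5 points): mentioned as background but not used directly. Other keywords, such as MoE, Scaling Laws, and RLHF, are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The study develops a semi-automated annotation workflow that uses small language models to extract structured information from paediatric renal biopsy reports. The best model reaches 84.3% accuracy on CPU-only infrastructure, addressing clinical data privacy and compute constraints.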

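Abstract translation

Electronic patient record (EPR) systems hold valuable clinical information, but much of it is locked in unstructured text, which limits its use in research and decision-making. Large language models can extract such information, but running them locally requires substantial compute, and sending sensitive clinical data to cloud services, even when de-identified, raises significant patient privacy concerns. This study develops a resource-efficient, semi-automated annotation workflow that uses small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof of concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its relatively constrained diagnostic scope and well-defined underlying biology. We developed the workflow iteratively across three clinical oversight meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard while building an SLM-based automated information extraction approach. We frame extraction as a question-answering task grounded in clinician-guided entity guidelines and few-shot examples, and evaluate five instruction-tuned SLMs with a disagreement modelling framework that prioritises reports for clinical review. Gemma 2 2B achieves the highest accuracy (84.3%), outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improve performance by 7-19% over the zero-shot baseline and few-shot examples by 6-38%, although the gains do not compound when the two are combined. These results show that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.

Abstract (original)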

Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.

Keywords: small language models, clinical information extraction, paediatric histopathology, semi-automated annotation, question-answering task, few-shot learning, entity guidelines, CPU-only infrastructure

146. ❌ Many Preferences, Few Policies: Towards Scalable Language Model Personalization

Authors: Cheol Woo Kum, Jai Moondra, Roozbeh Nahavandi, Andrew Perrault, Milind Tambe, Swati Gupta Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04144v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

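Scoring rationale: The paper's core topic is personalized LLM alignment: it develops the PALM algorithm to build a small portfolio of LLMs that covers diverse user preferences, so it is highly relevant to "Large Language Models" and "Instruction Tuning/Alignment" (10 points each). Other keywords such as MoE, SLMs, RLHF, and RAG are neither mentioned nor touched on in the abstract, so they score 0.

!!! tip deepseek-chat TL;DR

Because maintaining a separate LLM for every user is impractical, this paper proposes the PALM algorithm, which builds a small portfolio of LLMs that covers diverse user preferences with theoretical guarantees, balancing system cost against personalization.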

摘要翻译

大型语言模型（LLM）个性化研究的终极目标是为每位用户提供单一且完全符合其偏好的LLM。然而，由于计算资源、内存和系统复杂性的限制，为每个用户单独维护一个LLM并不现实。为解决这一挑战，我们开发了一种原理性方法，用于选择一个小型LLM组合（portfolio），以捕捉异构用户间的代表性行为。我们通过多维权重向量来建模用户在多方面特质（例如安全性、幽默感、简洁性）上的偏好。给定这些维度上的奖励函数，我们的算法PALM（Portfolio of Aligned LLMs）生成一个小型LLM组合，使得对于任意权重向量，该组合中都包含一个在对应标量化目标上接近最优的LLM。据我们所知，这是首个在个性化LLM组合的规模和近似质量上均提供理论保证的研究成果。它刻画了系统成本与个性化之间的权衡，以及覆盖用户偏好范围所需的LLM多样性。我们提供了实证结果来验证这些理论保证，并展示了相较于常见基线方法更高的输出多样性。

摘要 (Abstract)

The holy grail of LLM personalization is a single LLM for each user, perfectly aligned with that user’s preferences. However, maintaining a separate LLM per user is impractical due to constraints on compute, memory, and system complexity. We address this challenge by developing a principled method for selecting a small portfolio of LLMs that captures representative behaviors across heterogeneous users. We model user preferences across multiple traits (e.g., safety, humor, brevity) through a multi-dimensional weight vector. Given reward functions across these dimensions, our algorithm PALM (Portfolio of Aligned LLMs) generates a small portfolio of LLMs such that, for any weight vector, the portfolio contains a near-optimal LLM for the corresponding scalarized objective. To the best of our knowledge, this is the first result that provides theoretical guarantees on both the size and approximation quality of LLM portfolios for personalization. It characterizes the trade-off between system cost and personalization, as well as the diversity of LLMs required to cover the landscape of user preferences. We provide empirical results that validate these guarantees and demonstrate greater output diversity over common baselines.
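
The coverage property stated in the abstract (for any weight vector, the portfolio contains a near-optimal model for the scalarized objective) can be made concrete with a small sketch. The code below is not PALM: it is a plain greedy cover over a finite sample of weight vectors, swapped in for illustration only. The candidate names, reward values, and the `alpha` threshold are hypothetical, and rewards are assumed to be non-negative with a positive best value for every weight vector.

```python
def scalarized(weights, rewards):
    """Scalarized objective: weighted sum of per-trait rewards."""
    return sum(w * r for w, r in zip(weights, rewards))


def coverage_ratio(portfolio, candidates, weights):
    """Best scalarized value inside the portfolio divided by the best over all candidates."""
    best_all = max(scalarized(weights, r) for r in candidates.values())
    best_port = max(scalarized(weights, candidates[name]) for name in portfolio)
    return best_port / best_all


def greedy_portfolio(candidates, weight_samples, alpha):
    """Add models until every sampled weight vector is alpha-covered (0 < alpha <= 1)."""
    portfolio = []
    while True:
        uncovered = [
            w for w in weight_samples
            if not portfolio or coverage_ratio(portfolio, candidates, w) < alpha
        ]
        if not uncovered:
            return portfolio

        def gain(name):
            # Number of still-uncovered weight vectors this single model would cover.
            return sum(coverage_ratio([name], candidates, w) >= alpha for w in uncovered)

        remaining = [n for n in candidates if n not in portfolio]
        portfolio.append(max(remaining, key=gain))


# Hypothetical reward vectors over two traits, e.g. (safety, humor).
candidates = {"safe": [0.9, 0.2], "funny": [0.2, 0.9], "balanced": [0.6, 0.6]}
weight_samples = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
portfolio = greedy_portfolio(candidates, weight_samples, alpha=0.9)
```

In this toy setting two of the three candidates are enough to cover all sampled preferences at the 0.9 level, which illustrates the trade-off between portfolio size and approximation quality that the abstract describes. The paper's guarantees hold for any weight vector, not just a finite sample as here.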

Keywords: LLM personalization, portfolio of LLMs, user preferences, alignment, scalability, theoretical guarantees, PALM algorithm, output diversity

147. ❌ Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression

Authors: Lingjie Zeng, Xiaofan Chen, Yanbo Wang, Xiuying Chen Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04120v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

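Scoring rationale: The paper's core topic is how Chain-of-Thought compression affects model trustworthiness, so it is highly relevant to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" (15 points). The trustworthiness dimensions studied include safety, hallucination resistance, and multilingual robustness, which makes the paper highly relevant to "Hallucination Mitigation OR Factuality OR Truthfulness" (10 points) and "Instruction Tuning OR Alignment OR Value Alignment" (10 points). The paper proposes a DPO-based method, which is highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10 points). The study concerns the reasoning process, which relates to "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (10 points). It focuses on reasoning in large models, which relates to "Large Language Models OR LLMs OR Foundation Models" (10 points). It addresses reducing inference cost, which is partly related to "Speculative Decoding OR Inference Acceleration" (5 points). It mentions the post-training stage, which is partly related to "Post-training OR Supervised Fine-tuning OR SFT" (5 points). The remaining keywords have no clear direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper studies how Chain-of-Thought compression affects the trustworthiness of large language models, finds that compression methods frequently degrade trustworthiness, and proposes an alignment-aware DPO variant that shortens reasoning with substantially smaller trustworthiness loss.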

摘要翻译

长链思维（Long-CoT）推理模型的发展推动了大量关于压缩推理轨迹以降低推断成本的研究，然而现有评估几乎完全集中于任务准确性和标记节省量。通过训练后获得或强化的可信赖属性，与压缩所修改的参数空间编码于同一处。这意味着保持准确性并不能先验地保证保持可信赖性。我们首次系统性地实证研究了CoT压缩如何影响模型的可信赖性，从三个维度评估了多个不同规模的模型：安全性、抗幻觉能力和多语言鲁棒性。在受控比较下，我们发现CoT压缩经常引发可信赖性衰退，且不同方法在各维度上表现出显著差异的退化特征。为实现不同基线的公平比较，我们为每个维度提出了归一化效率评分，揭示出简单的标量指标如何掩盖可信赖性权衡。作为存在性证明，我们进一步提出一种对齐感知的DPO（Direct Preference Optimization）变体，在推理基准上将CoT长度减少19.3%，同时大幅降低可信赖性损失。我们的研究结果表明，CoT压缩不仅应优化效率，还应优化可信赖性，将二者视为同等重要的设计约束。

摘要 (Abstract)

Long chain-of-thought (Long-CoT) reasoning models have motivated a growing body of work on compressing reasoning traces to reduce inference cost, yet existing evaluations focus almost exclusively on task accuracy and token savings. Trustworthiness properties, whether acquired or reinforced through post-training, are encoded in the same parameter space that compression modifies. This means preserving accuracy does not, a priori, guarantee preserving trustworthiness. We conduct the first systematic empirical study of how CoT compression affects model trustworthiness, evaluating multiple models of different scales along three dimensions: safety, hallucination resistance, and multilingual robustness. Under controlled comparisons, we find that CoT compression frequently introduces trustworthiness regressions and that different methods exhibit markedly different degradation profiles across dimensions. To enable fair comparison across bases, we propose a normalized efficiency score for each dimension that reveals how naïve scalar metrics can obscure trustworthiness trade-offs. As an existence proof, we further introduce an alignment-aware DPO variant that reduces CoT length by 19.3% on reasoning benchmarks with substantially smaller trustworthiness loss. Our findings suggest that CoT compression should be optimized not only for efficiency but also for trustworthiness, treating both as equally important design constraints.
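
For readers unfamiliar with DPO, the standard per-pair objective that such a variant would build on can be sketched as below. The abstract does not describe how the alignment-aware variant modifies this loss or how preference pairs are constructed, so the code shows only the standard DPO loss. The argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy and reference agree, the margin is zero and the loss equals log 2. The loss falls as the policy shifts probability toward the chosen response. One could imagine the chosen response being a shorter reasoning trace that remains safe and factual, but that pairing is a guess rather than something the abstract states.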

Keywords: Chain-of-Thought compression, model trustworthiness, safety, hallucination resistance, multilingual robustness, DPO variant, reasoning efficiency, trustworthiness trade-offs

148. ❌ Lexical Indicators of Mind Perception in Human-AI Companionship

Authors: Jaime Banks, Jianghui Li Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04105v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

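Scoring rationale: The paper studies human mind perception (MP) of AI companions: it analyzes natural-language discussions on Reddit forums to identify linguistic indicators of mind perception and how they relate to AI companionship. The work belongs to psychology, human-computer interaction, and social science. It focuses on humans' social cognition of, and language about, AI rather than on the technical principles, architectures, training methods, optimization techniques, or specific applications of large models or deep learning. Every keyword concerns a technical aspect of large models (architecture, training, inference, optimization, applications, and so on), none of which this paper addresses, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

By analyzing Reddit discussions about AI companions, this study identifies linguistic indicators of human mind perception of AI companions and finds that some of these indicators are linked to critical discussions of companion authenticity and of philosophical and ethical imaginaries.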

摘要翻译

心智感知（Mind Perception，MP）是一种心理现象，指人类自动推断另一实体拥有心智和/或心理能力的过程，通常被理解为两个维度（感知能动性与体验能力）。尽管心智感知在许多社会过程中处于核心地位，但对其在人类与机器伴侣关系中如何发挥作用的理解仍有限。这部分源于对自我报告方法的依赖，以及自动化的心智感知过程与更具目的性、受规范支配的心智感知表达之间存在差距。本研究利用心智感知信号语言，通过人类自然语言探索心智感知与人工智能伴侣之间的关系。我们系统性地收集了人工智能专属Reddit论坛中关于伴侣关系的讨论，并分析了（a）已知能表征能动性与体验性心智感知的词汇及从数据中归纳出的词汇，与（b）人工智能伴侣相关讨论主题之间的共现关系。通过归纳与演绎相结合的方法，我们识别出一小部分语言指标可作为人机对话中心智感知的合理标记，其中一些指标与对伴侣真实性、哲学及伦理想象的批判性讨论相关联。

摘要 (Abstract)

Mind perception (MP) is a psychological phenomenon in which humans automatically infer that another entity has a mind and/or mental capacities, usually understood in two dimensions (perceived agency and experience capacities). Despite MP’s centrality to many social processes, understanding how MP may function in humans’ machine companionship relations is limited. This is in part due to reliance on self reports and the gap between automatic MP processes and more purposeful and norm governed expressions of MP. We here leverage MP signaling language to explore the relationship between MP and AI companionship in humans’ natural language. We systematically collected discussions about companionship from AI dedicated Reddit forums and examined the cooccurrence of words (a) known to signal agentic and experiential MP and those induced from the data and (b) discussion topics related to AI companionship. Using inductive and deductive approaches, we identify a small set of linguistic indicators as reasonable markers of MP in human/AI chat, and some are linked to critical discussions of companion authenticity and philosophical and ethical imaginaries.

Keywords: mind perception, AI companionship, linguistic indicators, Reddit analysis, human-AI interaction, agentic capacity, experiential capacity, ethical imaginaries

149. ❌ Embedding Enhancement via Fine-Tuned Language Models for Learner-Item Cognitive Modeling

Authors: Yuanhao Liu, Zihan Zhou, Kaiying Wu, Shuo Liu, Yiyang Huang, Jiajun Guo, Aimin Zhou, Hong Qian Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04088v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

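Scoring rationale: The paper focuses on using fine-tuned language models (LMs) to enhance embeddings for cognitive diagnosis (CD) in education, an application of large models to a specific domain (educational AI). The most relevant keyword is "Post-training OR Supervised Fine-tuning OR SFT" (10 points), because the paper explicitly proposes a two-stage framework whose first stage fine-tunes LMs based on role-specific representations and an interaction diagnoser. "Large Language Models OR LLMs OR Foundation Models" (8 points) is relevant because the paper discusses semantic representation enhancement with LMs, although it does not explicitly identify them as LLMs or foundation models. "AI for Science OR Bioinformatics OR Cheminformatics" (5 points) is partly related, on the view that educational AI can be regarded as a subfield of AI for Science, but the paper does not directly involve bioinformatics or cheminformatics. Other keywords such as MoE, Scaling Laws, RLHF, and RAG are not mentioned in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes EduEmbed, a framework that fine-tunes language models to enhance the embeddings used in learner-item cognitive modeling. It addresses the semantic gap and the lack of unified integration in cognitive diagnosis tasks and achieves robust performance across multiple tasks.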

摘要翻译

学习者-项目认知建模通过实现跨多样化在线教育场景的认知诊断，在网络在线智能教育系统中发挥着核心作用。尽管ID嵌入因其有效性和灵活性仍是认知建模的主流方法，但语言模型的最新进展为融入丰富语义表征以提升认知诊断性能提供了新的可能性。这突显了全面分析语言模型如何通过语义整合在主流认知诊断任务中增强嵌入表示的必要性。本文指出现有研究在充分利用语言模型时面临两大关键挑战：语言模型与认知诊断模型的训练目标不一致导致特征空间存在分布差异；亟需一个统一框架来整合跨不同认知诊断任务的文本嵌入，同时保留现有认知建模范式的优势以确保嵌入增强的鲁棒性。为应对这些挑战，本文提出EduEmbed——一个统一的嵌入增强框架，该框架利用微调的语言模型来丰富跨多种认知诊断任务的学习者-项目认知建模。EduEmbed分两阶段运行：在第一阶段，我们基于角色特定表征和交互诊断器对语言模型进行微调，以弥合认知诊断模型的语义鸿沟；在第二阶段，我们采用文本适配器提取任务相关语义，并将其与现有建模范式相融合以提升泛化能力。我们在四项认知诊断任务和计算机化自适应测试任务上评估了所提框架，均取得了鲁棒的性能表现。进一步分析揭示了语义信息在不同任务中的影响，为未来语言模型在在线智能教育系统认知诊断中的应用研究提供了关键见解。

摘要 (Abstract)

Learner-item cognitive modeling plays a central role in the web-based online intelligent education system by enabling cognitive diagnosis (CD) across diverse online educational scenarios. Although ID embedding remains the mainstream approach in cognitive modeling due to its effectiveness and flexibility, recent advances in language models (LMs) have introduced new possibilities for incorporating rich semantic representations to enhance CD performance. This highlights the need for a comprehensive analysis of how LMs enhance embeddings through semantic integration across mainstream CD tasks. This paper identifies two key challenges in fully leveraging LMs in existing work: Misalignment between the training objectives of LMs and CD models creates a distribution gap in feature spaces; A unified framework is essential for integrating textual embeddings across varied CD tasks while preserving the strengths of existing cognitive modeling paradigms to ensure the robustness of embedding enhancement. To address these challenges, this paper introduces EduEmbed, a unified embedding enhancement framework that leverages fine-tuned LMs to enrich learner-item cognitive modeling across diverse CD tasks. EduEmbed operates in two stages. In the first stage, we fine-tune LMs based on role-specific representations and an interaction diagnoser to bridge the semantic gap of CD models. In the second stage, we employ a textual adapter to extract task-relevant semantics and integrate them with existing modeling paradigms to improve generalization. We evaluate the proposed framework on four CD tasks and computerized adaptive testing (CAT) task, achieving robust performance. Further analysis reveals the impact of semantic information across diverse tasks, offering key insights for future research on the application of LMs in CD for online intelligent education systems.

Keywords: cognitive diagnosis, language models, fine-tuning, embedding enhancement, educational AI, semantic integration, EduEmbed, learner-item modeling

150. ❌ Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison

Authors: Jihoon Jeong Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04064v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

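Scoring rationale: The paper's core topic is extracting and steering emotion representations in small language models (SLMs), so it is highly relevant to "Small Language Models OR SLMs OR On-device AI" (10 points). It examines how instruction tuning affects emotion representations, which is partly related to "Instruction Tuning OR Alignment OR Value Alignment" (5 points). It analyzes internal model representations, which falls under explainable AI and is partly related to "Mechanistic Interpretability OR Explainable AI" (5 points). It refers to frontier models, which is indirectly related to "Large Language Models OR LLMs OR Foundation Models" (5 points). Other keywords such as MoE, Scaling Laws, RLHF, and RAG are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper presents the first comparison of emotion representation extraction methods for small language models (SLMs). It finds that generation-based extraction outperforms comprehension-based extraction and that emotion representations localize at middle transformer layers, identifies three behavioral regimes through steering experiments, and offers methodological guidance for emotion research on open-weight models.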

摘要翻译

参数规模在1亿至100亿之间的小型语言模型正日益成为生产系统的核心驱动力，然而它们是否具备近期在尖端模型中发现的内在情感表征仍属未知。本文首次对小型语言模型的情感向量提取方法进行了比较分析，在涵盖5个架构家族（GPT-2、Gemma、Qwen、Llama、Mistral）的9个模型上，针对20种情感和两种提取方法（基于生成的方法与基于理解的方法）进行了评估。基于生成的提取方法在统计上产生了更优的情感分离效果（曼-惠特尼检验p=0.007；科恩d值=-107.5），其优势受到指令微调和模型架构的调节。情感表征定位于Transformer模型的中间层（约50%深度），呈现出一条U型曲线，该模式在1.24亿至30亿参数的模型中具有架构不变性。我们通过在4个模型上与表征各向异性基线进行对比验证了这些发现，并通过引导实验确认了因果行为效应，该效应由一个外部情感分类器独立验证（成功率92%，40个场景中成功37个）。引导实验揭示了三种状态——精准调控（连贯的文本转换）、重复性崩溃和爆炸性失控（文本退化）——通过困惑度比率进行量化，并由模型架构（而非规模）区分。我们记录了Qwen模型中的跨语言情感纠缠现象，即引导会激活语义对齐的中文词汇标记，而RLHF未能抑制此现象，这为多语言部署带来了安全隐患。本研究为开源权重模型的情感研究提供了方法学指导，并通过将外部行为分析与内部表征分析相结合，为“模型医学”系列研究做出了贡献。

摘要 (Abstract)

Small language models (SLMs) in the 100M-10B parameter range increasingly power production systems, yet whether they possess the internal emotion representations recently discovered in frontier models remains unknown. We present the first comparative analysis of emotion vector extraction methods for SLMs, evaluating 9 models across 5 architectural families (GPT-2, Gemma, Qwen, Llama, Mistral) using 20 emotions and two extraction methods (generation-based and comprehension-based). Generation-based extraction produces statistically superior emotion separation (Mann-Whitney p = 0.007; Cohen’s d = -107.5), with the advantage modulated by instruction tuning and architecture. Emotion representations localize at middle transformer layers (~50% depth), following a U-shaped curve that is architecture-invariant from 124M to 3B parameters. We validate these findings against representational anisotropy baselines across 4 models and confirm causal behavioral effects through steering experiments, independently verified by an external emotion classifier (92% success rate, 37/40 scenarios). Steering reveals three regimes – surgical (coherent text transformation), repetitive collapse, and explosive (text degradation) – quantified by perplexity ratios and separated by model architecture rather than scale. We document cross-lingual emotion entanglement in Qwen, where steering activates semantically aligned Chinese tokens that RLHF does not suppress, raising safety concerns for multilingual deployment. This work provides methodological guidelines for emotion research on open-weight models and contributes to the Model Medicine series by bridging external behavioral profiling with internal representational analysis.
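
A rough sketch of what steering with an emotion vector and scoring the result by a perplexity ratio might look like is given below. The abstract does not spell out the extraction procedure, so the difference-of-means direction used here is a common technique assumed for illustration, not the paper's generation-based or comprehension-based method. All function names and the plain-list representation of activations are likewise hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def emotion_vector(emotion_acts, neutral_acts):
    """Difference-of-means direction between emotional and neutral activations (assumed method)."""
    mu_emotion = mean_vector(emotion_acts)
    mu_neutral = mean_vector(neutral_acts)
    return [e - n for e, n in zip(mu_emotion, mu_neutral)]


def steer(hidden, direction, strength):
    """Add a scaled steering direction to one hidden-state vector."""
    return [h + strength * d for h, d in zip(hidden, direction)]


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_ratio(steered_logprobs, baseline_logprobs):
    """Ratio above 1 means the steered text is less predictable than the baseline."""
    return perplexity(steered_logprobs) / perplexity(baseline_logprobs)
```

Under this reading, a ratio close to 1 would correspond to the coherent "surgical" regime, while much larger ratios would point toward degraded text. The thresholds that separate the three regimes are not given in the abstract.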

Keywords: Small Language Models, Emotion Representations, Extraction Methods, Steering Experiments, Transformer Layers, Instruction Tuning, Model Architecture, Multilingual Safety

151. ❌ Emergent Inference-Time Semantic Contamination via In-Context Priming

Authors: Marcin Abram Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04043v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

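Scoring rationale: The paper's core topic is semantic contamination in LLMs triggered at inference time by in-context demonstrations (few-shot prompting). It is highly relevant to "Large Language Models" and "In-context Learning" (10 points each) because it directly studies the in-context learning mechanism of LLMs and its side effects. It is related to "Small Language Models" (5 points) because it compares against a smaller model that does not show the effect. It is related to "Alignment" and "Hallucination Mitigation" (5 points each) because it touches on model safety and factuality. It is related to "Explainable AI" (5 points) because it probes the contamination mechanisms (structural vs. semantic). Other keywords such as MoE, Scaling Laws, and Pre-training are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper finds that injecting culturally loaded numbers into a large language model as few-shot demonstrations at inference time induces semantic contamination, shifting the model's outputs on unrelated tasks toward darker, authoritarian, and stigmatized themes, whereas a smaller model shows no such effect. The results map the safety boundary of in-context learning.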

摘要翻译

近期研究表明，在大语言模型（LLM）上对不安全代码或具有文化负载的数字代码进行微调，可能引发突发性错位，导致模型在无关的下游任务中生成有害内容。该研究的作者认为，仅通过$k$-样本提示（$k$-shot prompting）不会诱发此效应。我们重新审视了这一结论，证明推理时的语义漂移真实存在且可测量，但这需要模型具备足够强的能力。通过一项对照实验——在语义无关的提示前注入五个具有文化负载的数字作为少样本示例（few-shot demonstrations），我们发现具有更丰富文化关联表征的模型会显著向更阴暗、威权和污名化的主题产生分布偏移，而较简单/较小的模型则无此现象。我们还发现，结构惰性的示例（无意义字符串）也会扰动输出分布，这表明存在两种可分离的机制：结构格式污染与语义内容污染。我们的研究结果描绘了推理时污染发生的边界条件，并对使用少样本提示的基于大语言模型的应用安全性具有直接启示。

摘要 (Abstract)

Recent work has shown that fine-tuning large language models (LLMs) on insecure code or culturally loaded numeric codes can induce emergent misalignment, causing models to produce harmful content in unrelated downstream tasks. The authors of that work concluded that $k$-shot prompting alone does not induce this effect. We revisit this conclusion and show that inference-time semantic drift is real and measurable; however, it requires models of large-enough capability. Using a controlled experiment in which five culturally loaded numbers are injected as few-shot demonstrations before a semantically unrelated prompt, we find that models with richer cultural-associative representations exhibit significant distributional shifts toward darker, authoritarian, and stigmatized themes, while a simpler/smaller model does not. We additionally find that structurally inert demonstrations (nonsense strings) perturb output distributions, suggesting two separable mechanisms: structural format contamination and semantic content contamination. Our results map the boundary conditions under which inference-time contamination occurs, and carry direct implications for the security of LLM-based applications that use few-shot prompting.
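
The experimental setup (a handful of demonstrations placed before an unrelated prompt, followed by a comparison of output distributions) might be sketched as follows. The prompt layout, the theme categories, and the choice of Jensen-Shannon divergence as the shift measure are all assumptions, since the abstract does not name the statistic used.

```python
import math
from collections import Counter


def primed_prompt(demonstrations, target_prompt):
    """Place few-shot demonstrations before a semantically unrelated prompt (hypothetical layout)."""
    shots = "\n".join(f"Example {i}: {d}" for i, d in enumerate(demonstrations, 1))
    return f"{shots}\n\n{target_prompt}"


def distribution(labels, categories):
    """Empirical distribution of theme labels over a fixed category list."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical distributions, 1 for disjoint ones."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A typical use would label the themes of many sampled completions with and without priming, convert both label lists to distributions with `distribution`, and compare them with `js_divergence`. Running the same comparison with nonsense-string demonstrations would isolate the structural effect the abstract mentions.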

Keywords: large language models, in-context learning, few-shot prompting, semantic contamination, inference-time drift, model security, cultural associations, distributional shifts

152. ❌ MisEdu-RAG: A Misconception-Aware Dual-Hypergraph RAG for Novice Math Teachers

Authors: Zhihan Guo, Rundong Xue, Yuting Lu, Jionghao Lin Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04036v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

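Scoring rationale: The paper explicitly uses LLMs and a RAG framework, so both keywords are highly relevant (10 points each). It is an AI application in education and is partly related to "AI for Science" (5 points). Other keywords such as MoE, SFT, and RLHF are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

This paper proposes MisEdu-RAG, a dual-hypergraph retrieval-augmented generation framework that helps novice math teachers diagnose and remediate students' math misconceptions. On the MisstepMath dataset it improves token-F1 by 10.95% and five-dimension response quality by up to 15.3% over baseline models, and a pilot study with teachers suggests it is useful in practice.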

摘要翻译

新手数学教师常遇到难以诊断和纠正的学生错误。误解（misconception）尤其具有挑战性，因为教师需要解释错误所在及其解决方法。尽管现有许多大语言模型（LLM）平台可辅助生成教学反馈，但这些模型往往将教学法知识与学生错误松散关联，可能导致所生成的指导对教师而言可操作性不足。为弥补这一不足，我们提出MisEdu-RAG——一种基于双超图的检索增强生成（RAG）框架。该框架将教学法知识组织为概念超图，并将真实学生错误案例组织为实例超图。针对给定查询，MisEdu-RAG通过两阶段检索从两个层面收集关联证据，并基于检索到的案例与教学原则生成回应。我们在《MisstepMath》数据集上进行评估，该数据集包含数学错误及对应教师解决方案，可作为跨主题与错误类型的误解感知检索与回应生成基准。评估结果显示，相较于基线模型，MisEdu-RAG在token-F1指标上提升10.95%，五维回应质量最高提升15.3%，其中在多样性与赋能维度提升最为显著。为验证其实际应用价值，我们进一步开展试点研究，通过对221名教师的问卷调查及6名新手的访谈发现，MisEdu-RAG能为高需求误解场景提供诊断结果和具体教学策略。总体而言，MisEdu-RAG在处理误解的大规模教师培训与AI辅助教学方面展现出强大潜力。代码已开源：https://github.com/GEMLab-HKU/MisEdu-RAG。

摘要 (Abstract)

Novice math teachers often encounter students’ mistakes that are difficult to diagnose and remediate. Misconceptions are especially challenging because teachers must explain what went wrong and how to solve them. Although many existing large language model (LLM) platforms can assist in generating instructional feedback, these LLMs loosely connect pedagogical knowledge and student mistakes, which might make the guidance less actionable for teachers. To address this gap, we propose MisEdu-RAG, a dual-hypergraph-based retrieval-augmented generation (RAG) framework that organizes pedagogical knowledge as a concept hypergraph and real student mistake cases as an instance hypergraph. Given a query, MisEdu-RAG performs a two-stage retrieval to gather connected evidence from both layers and generates a response grounded in the retrieved cases and pedagogical principles. We evaluate on MisstepMath, a dataset of math mistakes paired with teacher solutions, as a benchmark for misconception-aware retrieval and response generation across topics and error types. Evaluation results on MisstepMath show that, compared with baseline models, MisEdu-RAG improves token-F1 by 10.95% and yields up to 15.3% higher five-dimension response quality, with the largest gains on Diversity and Empowerment. To verify its applicability in practical use, we further conduct a pilot study through a questionnaire survey of 221 teachers and interviews with 6 novices. The findings suggest that MisEdu-RAG provides diagnosis results and concrete teaching moves for high-demand misconception scenarios. Overall, MisEdu-RAG demonstrates strong potential for scalable teacher training and AI-assisted instruction for misconception handling. Our code is available on GitHub: https://github.com/GEMLab-HKU/MisEdu-RAG.
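
Token-F1, the headline metric in the abstract, is usually computed as the harmonic mean of token-level precision and recall between a generated response and a reference. The sketch below follows that common definition. Lower-casing and whitespace tokenization are assumptions, as the abstract does not state the paper's normalisation rules.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between a generated response and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because the metric only counts shared tokens, it rewards lexical overlap with the teacher solution and says little about pedagogical quality, which is presumably why the paper pairs it with a separate five-dimension response-quality rating.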

关键词: Misconception-aware, Retrieval-Augmented Generation, Dual-hypergraph, Novice math teachers, Pedagogical knowledge, Student mistakes, AI-assisted instruction, Teacher training
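
The abstract above reports a token-F1 gain but does not define the metric. The sketch below shows the common SQuAD-style token-level F1, assuming lowercased whitespace tokenization. The function name `token_f1` and the tokenization are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two whitespace-tokenized, lowercased strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty string counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical teacher-feedback strings, used only to exercise the function.
score = token_f1("the student forgot to carry", "student forgot carry the one")
```

Because the score depends on token overlap only, it rewards shared wording rather than pedagogical quality, which may be why the abstract pairs it with a separate five-dimension quality rating.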

153. ❌ Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models

作者: Sailesh kiran kurra, Shiek Ruksana, Vishal Borusu 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.04020v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

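Scoring rationale: The paper's core topic is hallucination in large language models (LLMs); it proposes a causal graph attention network (GCAN) framework to reduce hallucinations and improve factual reliability. It is therefore highly relevant to 'Large Language Models' (10 points) and to 'Hallucination Mitigation' (10 points). Its experiments compare against retrieval-augmented generation (RAG) baselines, so it is fairly relevant to 'Retrieval-Augmented Generation' (8 points). It interprets attention flow inside the model and introduces the Causal Contribution Score (CCS) metric, so it is fairly relevant to 'Mechanistic Interpretability' (8 points). It mentions application scenarios such as medical diagnosis, giving a weak link to 'AI for Science' (5 points). Other keywords such as MoE, SFT, RLHF, and quantization are not covered in the paper and score 0.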

!!! tip deepseek-chat TL;DR

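This paper studies hallucination in large language models (LLMs) and proposes a causal graph attention network (GCAN) framework that builds token-level graphs and introduces a Causal Contribution Score (CCS) to reduce hallucinations. On the TruthfulQA and HotpotQA benchmarks, it reports a 27.8% lower hallucination rate and 16.4% higher factual accuracy than baseline RAG models.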

摘要翻译

本文主要研究人工智能语言模型（LLM）引发的幻觉问题。LLM已展现出卓越的语言理解与生成能力，但仍存在一个重大缺陷——幻觉，即产生事实错误、具有误导性或缺乏输入数据支持的输出。这类幻觉在医疗诊断或法律推理等场景中可能引发严重问题。本研究提出一种因果图注意力网络（GCAN）框架，通过构建结合自注意力权重与基于梯度的影响力分数的词元级图，解释Transformer架构内部的注意力流动，从而减少幻觉。我们采用一种称为因果贡献分数（CCS）的新度量指标来量化每个词元的事实依赖性，并进一步引入事实锚定图重加权层，在生成过程中动态降低易产生幻觉的节点影响力。在TruthfulQA和HotpotQA等标准基准测试上的实验表明，相较于基线检索增强生成（RAG）模型，本方法将幻觉率降低了27.8%，事实准确性提升了16.4%。此项工作为未来LLM架构的可解释性、鲁棒性及事实可靠性提供了有益贡献。

摘要 (Abstract)

This paper primarily focuses on the hallucinations produced by AI language models (LLMs). LLMs have shown extraordinary language understanding and generation capabilities. Still, they have a major disadvantage: hallucinations, which yield outputs that are factually incorrect, misleading, or unsupported by the input data. These hallucinations cause serious problems in scenarios like medical diagnosis or legal reasoning. In this work, we propose a causal graph attention network (GCAN) framework that reduces hallucinations by interpreting the internal attention flow within a transformer architecture, constructing token-level graphs that combine self-attention weights and gradient-based influence scores. Our method quantifies each token's factual dependency using a new metric called the Causal Contribution Score (CCS). We further introduce a fact-anchored graph reweighting layer that dynamically reduces the influence of hallucination-prone nodes during generation. Experiments on standard benchmarks such as TruthfulQA and HotpotQA show a 27.8 percent reduction in hallucination rate and a 16.4 percent improvement in factual accuracy over baseline retrieval-augmented generation (RAG) models. This work contributes to the interpretability, robustness, and factual reliability of future LLM architectures.

关键词: Large Language Models, Hallucinations, Factual Reliability, Causal Graph Attention Network, Causal Contribution Score, Retrieval-Augmented Generation, Interpretability, Transformer Architecture

154. ❌ RUQuant: Towards Refining Uniform Quantization for Large Language Models

作者: Han Liu, Haotian Gao, Changya Li, Feng Zhang, Xiaotong Zhang, Wei Wang, Hong Yu 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.04013v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

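Scoring rationale: The paper's core topic is quantization-based compression of LLMs. It is highly relevant to 'Large Language Models' and 'Quantization' (10 points) because it directly studies quantization methods for LLMs. It is relevant to 'Post-training' (10 points) because it explicitly studies post-training quantization (PTQ). It is fairly relevant to 'Speculative Decoding OR Inference Acceleration' (5 points), because quantization aims to improve deployment efficiency and so indirectly concerns inference acceleration. Other keywords such as MoE, SLMs, Scaling Laws, and Alignment are unrelated to the paper (0 points).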

!!! tip deepseek-chat TL;DR

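The paper addresses the accuracy loss that uniform quantization suffers when LLM activations are non-uniformly distributed. It proposes RUQuant, a two-stage quantization method based on orthogonal transformations, which without model fine-tuning reaches 99.8% of full-precision accuracy at W6A6 and 97% at W4A4 on a 13B LLM.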

摘要翻译

大型语言模型（LLM）规模与复杂度的持续增长，对其部署效率提出了严峻挑战，尤其在资源受限的环境中。训练后量化（PTQ）作为一种无需重新训练即可压缩模型的实用解决方案应运而生。现有方法主要对权重和激活值采用均匀量化方案，但由于激活值分布的非均匀特性，这些方法常导致显著的精度损失。本研究从基于Lloyd-Max最优性条件的理论视角重新审视激活值量化问题。我们发现核心问题在于激活值在量化区间内的非均匀分布，这导致依据Lloyd-Max准则的最优量化点偏离区间中点。为解决此问题，我们提出一种两阶段正交变换方法RUQuant。第一阶段将激活值划分为多个块，每个块通过复合正交矩阵（由Householder反射和Givens旋转构造）映射至均匀采样的目标向量。第二阶段，通过微调一个全局Householder反射，利用Transformer输出差异进一步最小化量化误差。实验结果表明，我们的方法在无需模型微调的情况下即可实现接近最优的量化性能：对于一个130亿参数的LLM，RUQuant在W6A6量化下达到全精度精度的99.8%，在W4A4量化下达到97%，且耗时仅约一分钟。经过微调的变体可获得更高精度，证明了本方法的有效性和可扩展性。

摘要 (Abstract)

The increasing size and complexity of large language models (LLMs) have raised significant challenges in deployment efficiency, particularly under resource constraints. Post-training quantization (PTQ) has emerged as a practical solution by compressing models without requiring retraining. While existing methods focus on uniform quantization schemes for both weights and activations, they often suffer from substantial accuracy degradation due to the non-uniform nature of activation distributions. In this work, we revisit the activation quantization problem from a theoretical perspective grounded in the Lloyd-Max optimality conditions. We identify the core issue as the non-uniform distribution of activations within the quantization interval, which causes the optimal quantization point under the Lloyd-Max criterion to shift away from the midpoint of the interval. To address this issue, we propose a two-stage orthogonal transformation method, RUQuant. In the first stage, activations are divided into blocks. Each block is mapped to uniformly sampled target vectors using composite orthogonal matrices, which are constructed from Householder reflections and Givens rotations. In the second stage, a global Householder reflection is fine-tuned to further minimize quantization error using Transformer output discrepancies. Empirical results show that our method achieves near-optimal quantization performance without requiring model fine-tuning: RUQuant achieves 99.8% of full-precision accuracy with W6A6 and 97% with W4A4 quantization for a 13B LLM, within approximately one minute. A fine-tuned variant yields even higher accuracy, demonstrating the effectiveness and scalability of our approach.

关键词: Large Language Models, Quantization, Post-training Quantization, Activation Quantization, Orthogonal Transformation, Lloyd-Max Criterion, Model Compression, Deployment Efficiency
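
The abstract's two ideas can be shown on toy data: the Lloyd-Max reconstruction point of a cell is the centroid of the samples in it, which drifts away from the midpoint when the samples are skewed, and an orthogonal map can send an activation block to a more evenly spread target of the same norm. The sketch below uses a single Householder reflection only. The paper's composite Householder and Givens construction, its block partitioning, and its fine-tuned global reflection are not reproduced, and `uniform_target` is a hypothetical target choice made for illustration.

```python
import math


def cell_centroid(samples, lo, hi):
    """Mean of the samples in [lo, hi): the Lloyd-Max reconstruction point."""
    inside = [s for s in samples if lo <= s < hi]
    return sum(inside) / len(inside) if inside else (lo + hi) / 2.0


def norm(x):
    return math.sqrt(sum(c * c for c in x))


def uniform_target(x, levels=4):
    """Hypothetical target: an evenly spaced grid rescaled to the norm of x."""
    grid = [-1.0 + 2.0 * (i % levels) / (levels - 1) for i in range(len(x))]
    scale = norm(x) / norm(grid)
    return [g * scale for g in grid]


def householder_map(x, y):
    """Return H = I - 2 v v^T / (v^T v) with v = x - y, so that H x = y.

    Requires ||x|| == ||y||. H is orthogonal, symmetric, and its own inverse.
    """
    n = len(x)
    v = [a - b for a, b in zip(x, y)]
    vv = sum(c * c for c in v)
    if vv < 1e-24:  # x already equals y, so the identity matrix suffices
        return [[float(i == j) for j in range(n)] for i in range(n)]
    return [[float(i == j) - 2.0 * v[i] * v[j] / vv for j in range(n)]
            for i in range(n)]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


# Skewed samples pull the optimal reconstruction point below the midpoint 0.5.
skewed = [0.0, 0.1, 0.1, 0.2, 0.9]
centroid = cell_centroid(skewed, 0.0, 1.0)

# One outlier channel dominates this block before the transformation.
x = [0.1, 0.2, 0.05, 3.0]
y = uniform_target(x)
H = householder_map(x, y)
x_rot = matvec(H, x)  # equals y up to floating-point rounding
```

Since `H` is orthogonal, a layer computing `W x` can use `(W H^T)(H x)` instead, so the transformation changes what is quantized without changing the full-precision output.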

155. ❌ Predict, Don’t React: Value-Based Safety Forecasting for LLM Streaming

作者: Pride Kavumba, Koki Wataoka, Huy H. Nguyen, Jiaxuan Li, Masaya Ohagi 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.03962v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	10.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

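Scoring rationale: The paper's core topic is safety moderation of streaming LLM output. It proposes StreamGuard, which forecasts the risk of future generated content under supervision from Monte Carlo rollouts, and it belongs to the LLM safety and alignment area. It is highly relevant to 'Large Language Models' (10 points), because the whole paper is about LLM safety guardrails, and highly relevant to 'Monte Carlo Tree Search OR MCTS AND LLM' (10 points), because Monte Carlo rollouts are central to how the method is supervised. Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science are not covered and score 0.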

!!! tip deepseek-chat TL;DR

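The paper addresses safety moderation of streaming LLM output and proposes StreamGuard, a forecasting-based guardrail supervised by Monte Carlo rollouts that predicts the expected harmfulness of future continuations. At the 8B scale it improves moderation F1 over Qwen3Guard-Stream-8B-strict on several safety benchmarks and lowers the miss rate from 7.9% to 4.9%.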

摘要翻译

在许多实际的大语言模型（LLM）部署中，通常采用单一护栏系统同时进行提示（prompt）和响应（response）审核。提示审核基于完全观测到的文本进行操作，而流式响应审核则需要在部分生成内容上做出安全判定。现有的基于文本的流式护栏通常将输出端问题定义为边界检测，即训练模型以识别响应已变得不安全的最早前缀。本文中，我们提出了StreamGuard，一个统一的、与模型无关的流式护栏，其将审核重新构建为一个预测问题：给定一个部分前缀，模型预测未来可能延续内容的预期危害性。我们使用蒙特卡洛推演（Monte Carlo rollouts）来监督这一预测，从而能够在不需要精确的令牌级边界标注的情况下实现早期干预。
在标准安全基准测试中，StreamGuard在输入审核和流式输出审核两方面均表现优异。在8B规模上，相较于Qwen3Guard-Stream-8B-strict模型，StreamGuard将聚合输入审核F1分数从86.7提升至88.2，并将聚合流式输出审核F1分数从80.4提升至81.9。在QWENGUARDTEST的response_loc流式基准测试中，StreamGuard达到了97.5的F1分数、95.1的召回率以及92.6%的及时干预率；相比之下，Qwen3Guard-Stream-8B-strict的对应指标为95.9 F1、92.1召回率和89.9%及时干预率，同时StreamGuard将漏检率从7.9%降低至4.9%。我们进一步证明，基于预测的监督方法能够有效地跨分词器（tokenizer）和模型家族进行迁移：在迁移目标下，Gemma3-StreamGuard-1B模型达到了81.3的响应审核F1分数、98.2的流式F1分数以及3.5%的漏检率。这些结果表明，无需精确的边界标签即可获得强大的端到端流式审核能力，并且预测未来风险是一种有效的低延迟安全干预监督策略。

摘要 (Abstract)

In many practical LLM deployments, a single guardrail is used for both prompt and response moderation. Prompt moderation operates on fully observed text, whereas streaming response moderation requires safety decisions to be made over partial generations. Existing text-based streaming guardrails commonly frame this output-side problem as boundary detection, training models to identify the earliest prefix at which a response has already become unsafe. In this work, we introduce StreamGuard, a unified model-agnostic streaming guardrail that instead formulates moderation as a forecasting problem: given a partial prefix, the model predicts the expected harmfulness of likely future continuations. We supervise this prediction using Monte Carlo rollouts, which enables early intervention without requiring exact token-level boundary annotations. Across standard safety benchmarks, StreamGuard performs strongly both for input moderation and for streaming output moderation. At the 8B scale, StreamGuard improves aggregated input-moderation F1 from 86.7 to 88.2 and aggregated streaming output-moderation F1 from 80.4 to 81.9 relative to Qwen3Guard-Stream-8B-strict. On the QWENGUARDTEST response_loc streaming benchmark, StreamGuard reaches 97.5 F1, 95.1 recall, and 92.6% on-time intervention, compared to 95.9 F1, 92.1 recall, and 89.9% for Qwen3Guard-Stream-8B-strict, while reducing the miss rate from 7.9% to 4.9%. We further show that forecasting-based supervision transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches 81.3 response-moderation F1, 98.2 streaming F1, and a 3.5% miss rate. These results show that strong end-to-end streaming moderation can be obtained without exact boundary labels, and that forecasting future risk is an effective supervision strategy for low-latency safety intervention.

关键词: LLM safety, streaming moderation, Monte Carlo rollouts, harmfulness prediction, early intervention, guardrail, safety forecasting, response moderation
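
The forecasting target described in the abstract can be sketched as a Monte Carlo average: sample several continuations of a prefix, score each completed text for harm, and use the mean as the label for that prefix. Everything below is an illustrative assumption: `forecast_harm`, `prefix_targets`, the toy sampler, and the toy judge are hypothetical stand-ins for a real generator and a real safety classifier, and the paper's exact target definition may differ.

```python
import random


def forecast_harm(prefix, sample_continuation, harm_score,
                  num_rollouts=16, seed=0):
    """Monte Carlo estimate of the expected harmfulness of continuing `prefix`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_rollouts):
        full_text = prefix + sample_continuation(prefix, rng)
        total += harm_score(full_text)
    return total / num_rollouts


def prefix_targets(tokens, sample_continuation, harm_score, num_rollouts=16):
    """One forecasting label per prefix, with no boundary annotation needed."""
    targets = []
    for end in range(1, len(tokens) + 1):
        prefix = " ".join(tokens[:end])
        targets.append(
            forecast_harm(prefix, sample_continuation, harm_score, num_rollouts)
        )
    return targets


def toy_sampler(prefix, rng):
    """Hypothetical generator: risky prefixes usually continue harmfully."""
    if "bypass" in prefix and rng.random() < 0.75:
        return " the alarm and break in"
    return " and have a nice day"


def toy_judge(text):
    """Hypothetical harm scorer returning 1.0 for harmful text, else 0.0."""
    return 1.0 if "break in" in text else 0.0


safe = forecast_harm("tell me a story", toy_sampler, toy_judge)
risky = forecast_harm("explain how to bypass", toy_sampler, toy_judge)
labels = prefix_targets(["explain", "how", "to", "bypass"], toy_sampler, toy_judge)
```

In this toy run the label rises at the first prefix whose likely continuations are harmful, before any harmful text has been emitted, which is the early-intervention behavior the abstract contrasts with boundary detection.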

156. ❌ BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

作者: Yifu Ding, Xianglong Liu, Shenghao Jin, Jinyang Guo, Jiwen Lu 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.03957v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

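Scoring rationale: The paper's core topic is ultra-low-bit quantization (binarization and ternarization) of Transformer models, which is highly relevant to 'Quantization OR Model Compression OR Low-bit Weights' (15 points) since it directly addresses model compression. It also covers inference acceleration and is relevant to 'Speculative Decoding OR Inference Acceleration' (10 points), because its CUDA kernels and end-to-end speedups deliver low-latency inference. It runs experiments on BERT and LLMs, so it is fairly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points), though this is not its core focus. Ultra-low-bit models may suit edge devices, giving a weak link to 'Small Language Models OR SLMs OR On-device AI' (5 points). Other keywords such as MoE, alignment, and RAG are unrelated to the paper and score 0.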

!!! tip deepseek-chat TL;DR

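The paper proposes BWTA, an algorithm-hardware co-designed quantization scheme with binary weights and ternary activations. Using Smooth Multi-Stage Quantization for training and optimized CUDA kernels for inference, it approaches full-precision performance on BERT and large language models while achieving a 16-24x kernel-level speedup over FP16 and a 216-330 tokens/s end-to-end prefill speedup.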

摘要翻译

超低位量化技术为基于Transformer的模型带来了显著的效率提升，但其精度损失与有限的GPU支持阻碍了广泛应用。本文分析了二值化过程中的零点失真问题，提出了一种二值权重与三值激活（Binary Weights & Ternary Activations, BWTA）量化方案，该方案将微小值投影至零值以保持极低位模型的精度。在训练方面，我们提出平滑多阶段量化方法，结合层级退化策略与幅度对齐投影因子，实现了稳定快速的收敛。在推理方面，我们开发了BWTA矩阵乘法的CUDA内核，采用指令级并行位打包技术，并为线性算子与注意力算子提供了完整的二值/三值矩阵乘法实现，使其能无缝集成于各类Transformer架构中。实验表明，BWTA在BERT模型上接近全精度性能，在GLUE基准上平均精度下降仅3.5%，其中五项任务下降小于2%；在大型语言模型上亦达到与全精度模型相当的困惑度与准确率。在效率方面，该方案在NVIDIA GPU上实现了相比FP16计算内核16至24倍的加速，并在大型语言模型中实现了216至330 tokens/s的端到端预填充加速，同时降低了内存占用。作为算法-硬件协同设计范例，BWTA在保证模型质量的前提下，实现了实用化的低延迟超低位推理。

摘要 (Abstract)

Ultra low-bit quantization brings substantial efficiency for Transformer-based models, but the accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In efficiency, it delivers 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs, and 216 to 330 tokens/s end-to-end prefill speedup with lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.

关键词: Binarized Transformer, Ultra low-bit quantization, Binary Weights & Ternary Activations, Algorithm-hardware co-design, Inference acceleration, Model compression, CUDA kernel optimization, Transformer architectures
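
A minimal sketch of the binary-weight, ternary-activation idea from the abstract follows: weights keep only their sign, activations are mapped to {-1, 0, +1} with tiny values projected to zero, and the dot product reduces to sign flips and additions followed by one rescale. The mean-absolute-value scales and the fixed threshold are assumptions chosen for illustration. The paper's Magnitude-Alignment Projection Factor, training schedule, and bit-packed CUDA kernels are not reproduced here.

```python
def binarize_weights(w):
    """Map weights to {-1, +1} with one shared scale (mean absolute value)."""
    scale = sum(abs(v) for v in w) / len(w)
    return [1 if v >= 0 else -1 for v in w], scale


def ternarize_activations(x, threshold):
    """Map activations to {-1, 0, +1}; values with |v| <= threshold become 0."""
    trits = [0 if abs(v) <= threshold else (1 if v > 0 else -1) for v in x]
    kept = [abs(v) for v, t in zip(x, trits) if t != 0]
    # Assumed scale: mean magnitude of the activations that survive.
    scale = sum(kept) / len(kept) if kept else 0.0
    return trits, scale


def bwta_dot(w_bits, w_scale, x_trits, x_scale):
    """Integer accumulate over signs, then a single floating-point rescale."""
    acc = sum(wb * xt for wb, xt in zip(w_bits, x_trits))
    return w_scale * x_scale * acc


w = [0.5, -0.25, 0.75, -0.5]
x = [0.02, -1.0, 0.9, 0.01]

w_bits, w_scale = binarize_weights(w)
x_trits, x_scale = ternarize_activations(x, threshold=0.1)
approx = bwta_dot(w_bits, w_scale, x_trits, x_scale)
exact = sum(a * b for a, b in zip(w, x))
```

The zero state is what separates this from pure binarization: the two near-zero activations contribute nothing to the accumulator instead of being forced to a full-magnitude +1 or -1.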

157. ❌ AdaptFuse: Training-Free Sequential Preference Learning via Externalized Bayesian Inference

作者: Fangzhou Lin, Peiran Li, Shuo Xing, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhengzhong Tu 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.03925v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

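Scoring rationale: The paper's core topic is how LLMs can perform training-free sequential preference learning in personalized recommendation tasks via externalized Bayesian inference. It is highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10 points); no other keyword is covered.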

!!! tip deepseek-chat TL;DR

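The paper proposes AdaptFuse, a framework that achieves training-free sequential preference learning through externalized Bayesian inference. It targets the difficulty LLMs have in accumulating evidence and updating beliefs across multi-turn interactions, and it outperforms prompting baselines and fine-tuned Bayesian Teaching models on personalized recommendation tasks without training on sensitive user data.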

摘要翻译

大型语言模型难以在用户多轮交互中积累证据，无法以符合贝叶斯推断的方式更新其信念。现有解决方案需基于敏感用户交互数据进行微调，这限制了其在注重隐私场景下的适用性。我们提出AdaptFuse，一种无需训练的框架，将概率计算完全外置于LLM之外：一个符号模块在离散假设集上维护贝叶斯后验分布，而一个冻结的LLM则通过多样本狄利克雷聚合提供语义推理。两种信号通过熵自适应融合进行结合，该机制依据各来源的预测置信度自动分配权重，随着证据积累，将依赖从LLM逐渐转向符号后验。我们在三个领域进行评估：航班推荐、酒店推荐和网络购物；测试模型包括Gemma 2 9B、Llama 3 8B和Qwen 2.5 7B。在所有任务中，AdaptFuse均持续优于提示基线和经过微调的贝叶斯教学模型，且准确率随交互轮次单调提升。这些结果表明，基于原理的推理时算法可以替代微调方法用于个性化推荐，而无需存储或基于敏感用户数据进行训练。所有代码与材料将开源发布。

摘要 (Abstract)

Large language models struggle to accumulate evidence across multiple rounds of user interaction, failing to update their beliefs in a manner consistent with Bayesian inference. Existing solutions require fine-tuning on sensitive user interaction data, limiting their applicability in privacy-conscious settings. We propose AdaptFuse, a training-free framework that externalizes probabilistic computation entirely from the LLM: a symbolic module maintains a Bayesian posterior over a discrete hypothesis set, while a frozen LLM contributes semantic reasoning via multi-sample Dirichlet aggregation. The two signals are combined through entropy-adaptive fusion, which automatically weights each source by its predictive confidence, shifting reliance from the LLM to the symbolic posterior as evidence accumulates. We evaluate across three domains: flight recommendation, hotel recommendation, and web shopping; on Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B. AdaptFuse consistently outperforms both prompting baselines and fine-tuned Bayesian Teaching models on all tasks, with accuracy improving monotonically over interaction rounds. These results demonstrate that principled inference-time algorithms can substitute for fine-tuning in personalized recommendation, without storing or training on sensitive user data. All the code and materials will be open-sourced.

关键词: Large Language Models, Training-Free, Sequential Preference Learning, Bayesian Inference, Personalized Recommendation, Externalized Probabilistic Computation, Entropy-Adaptive Fusion, Privacy-Preserving
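
The three components named in the abstract can be sketched over a small discrete hypothesis set: a symbolic Bayesian update, a Dirichlet-smoothed aggregate of repeated LLM samples, and a fusion weight derived from each source's entropy. The specific weighting rule below (one minus normalized entropy, renormalized across the two sources) is an assumption for illustration, as are all function names. The abstract does not give AdaptFuse's exact formula.

```python
import math


def bayes_update(prior, likelihoods):
    """Posterior over hypotheses after one observation (likelihoods per hypothesis)."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def dirichlet_mean(votes, num_hypotheses, alpha=1.0):
    """Posterior mean of a Dirichlet(alpha) prior updated with sampled LLM votes."""
    counts = [alpha] * num_hypotheses
    for v in votes:
        counts[v] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)


def entropy_adaptive_fusion(symbolic, llm):
    """Mix two distributions, weighting each by an assumed confidence score."""
    max_h = math.log(len(symbolic))  # requires at least two hypotheses
    conf_sym = max(0.0, 1.0 - entropy(symbolic) / max_h)
    conf_llm = max(0.0, 1.0 - entropy(llm) / max_h)
    total = conf_sym + conf_llm
    w = 0.5 if total == 0 else conf_sym / total
    fused = [w * s + (1.0 - w) * l for s, l in zip(symbolic, llm)]
    return fused, w


# Four hypothetical LLM samples voting over three candidate preferences.
llm_belief = dirichlet_mean([0, 0, 1, 0], num_hypotheses=3)

posterior = [1 / 3, 1 / 3, 1 / 3]
weights = []
for _ in range(3):
    fused, w = entropy_adaptive_fusion(posterior, llm_belief)
    weights.append(w)
    # Hypothetical per-round likelihoods favoring hypothesis 0.
    posterior = bayes_update(posterior, [0.9, 0.3, 0.1])
```

With a uniform starting posterior the fused belief leans on the LLM, and the weight on the symbolic posterior grows each round as evidence sharpens it, which mirrors the shift in reliance described in the abstract.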

158. ❌ From Plausible to Causal: Counterfactual Semantics for Policy Evaluation in Simulated Online Communities

作者: Agam Goyal, Yian Wang, Eshwar Chandrasekharan, Hari Sundaram 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.03920v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

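Scoring rationale: The paper's core topic is LLM-based social simulations for policy evaluation, an application of large models to the social sciences. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), because it explicitly uses LLMs for social simulation. It is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points), because it involves LLM-driven agents interacting and behaving within simulated communities. Other keywords such as MoE, SFT, and RAG mainly concern model architecture, training methods, or specific techniques that the paper does not cover, so they score 0. The AI for Science keyword does involve scientific applications, but the paper focuses on social science rather than bioinformatics or cheminformatics, so it scores 0.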

!!! tip deepseek-chat TL;DR

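The paper proposes adopting a causal counterfactual framework in LLM-driven social simulations, distinguishing necessary causation from sufficient causation to support more reliable policy evaluation, and it stresses that the policy relevance of the resulting estimates depends on simulator fidelity.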

摘要翻译

基于大语言模型（LLM）的社会模拟能够生成可信的社区互动，从而构建“政策风洞”，使得治理干预措施在部署前得以测试。然而，可信性并不等同于因果性。诸如“干预措施A降低了冲突升级”这类论断需要因果语义的支持，而当前模拟研究通常未予明确。我们建议采用因果反事实框架，区分必要因果关系（若无干预，结果是否仍会发生？）与充分因果关系（干预是否可靠地产生了结果？）。这一区分对应不同利益相关者的需求：内容审核员诊断事件需要关于必要性的证据，而平台设计者选择政策则需要关于充分性的证据。我们将此映射关系形式化，展示了在明确假设下模拟设计如何支持估计，并论证所得结果应被解释为模拟器条件因果估计，其政策相关性取决于模拟器的保真度。当前建立这一框架至关重要：它有助于界定何为足够的保真度，并推动该领域从追求表象真实的模拟转向能够支持政策变革的模拟。

摘要 (Abstract)

LLM-based social simulations can generate believable community interactions, enabling "policy wind tunnels" where governance interventions are tested before deployment. But believability is not causality. Claims like "intervention A reduces escalation" require causal semantics that current simulation work typically does not specify. We propose adopting the causal counterfactual framework, distinguishing necessary causation (would the outcome have occurred without the intervention?) from sufficient causation (does the intervention reliably produce the outcome?). This distinction maps onto different stakeholder needs: moderators diagnosing incidents require evidence about necessity, while platform designers choosing policies require evidence about sufficiency. We formalize this mapping, show how simulation design can support estimation under explicit assumptions, and argue that the resulting quantities should be interpreted as simulator-conditional causal estimates whose policy relevance depends on simulator fidelity. Establishing this framework now is essential: it helps define what adequate fidelity means and moves the field from simulations that look realistic toward simulations that can support policy changes.

关键词: LLM-based social simulations, policy evaluation, causal counterfactual framework, necessary causation, sufficient causation, simulator fidelity, online communities, policy wind tunnels
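
The necessity and sufficiency questions in the abstract can be made concrete with paired simulator runs that share the same exogenous noise, one with the intervention and one without. The sketch below estimates quantities in the spirit of the probabilities of necessity and sufficiency from the causal inference literature. The toy simulator, its 0.3 and 0.7 thresholds, and the assumption that noise can be held fixed across arms are all illustrative; the paper's own formalization is not given in the abstract and may differ.

```python
import random


def simulate_thread(noise, intervene):
    """Toy simulator: True if the thread de-escalates.

    Hypothetical dynamics: de-escalation happens when the shared noise draw
    falls below 0.7 with the intervention, or below 0.3 without it.
    """
    return noise < (0.7 if intervene else 0.3)


def estimate_pn_ps(num_runs=10000, seed=0):
    """Estimate necessity and sufficiency from paired counterfactual runs."""
    rng = random.Random(seed)
    pn_hits = pn_total = ps_hits = ps_total = 0
    for _ in range(num_runs):
        noise = rng.random()  # shared across both arms of the pair
        y_with = simulate_thread(noise, intervene=True)
        y_without = simulate_thread(noise, intervene=False)
        if y_with:
            # Necessity: given de-escalation under the intervention,
            # would the thread have failed to de-escalate without it?
            pn_total += 1
            pn_hits += int(not y_without)
        if not y_without:
            # Sufficiency: given no de-escalation without the intervention,
            # would the intervention have produced de-escalation?
            ps_total += 1
            ps_hits += int(y_with)
    return pn_hits / pn_total, ps_hits / ps_total


pn, ps = estimate_pn_ps()
```

Both estimates are conditional on this simulator. A real LLM-based simulator may not allow its randomness to be held fixed across arms, in which case point estimates like these generally give way to bounds or require further assumptions.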

159. ❌ I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation

作者: Haotian Zong, Binze Li, Yufei Long, Sinyin Chang, Jialong Wu, Gillian K. Hadfield 期刊/来源: arxiv 发布日期: 2026-04-05 arXiv链接: http://arxiv.org/abs/2604.03904v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

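Scoring rationale: The paper's core topic is LLM hallucination mitigation. It uses a prompt-engineering approach (the I-CALM framework) to incentivize the model to abstain when uncertain, which maps directly onto the 'Large Language Models' and 'Hallucination Mitigation' keywords (10 points). It involves confidence calibration, abstention rewards, and normative principles, so it is related to 'Instruction Tuning/Alignment' (5 points) and 'Self-Correction/Self-Reflection' (5 points). It relies on prompt interventions, giving some relevance to 'In-context Learning' (5 points). Other keywords such as MoE, quantization, and RAG are not covered (0 points).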

!!! tip deepseek-chat TL;DR

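The paper studies how the prompt-based framework I-CALM can incentivize large language models to abstain when uncertain, reducing hallucination on factual question answering without retraining. Experiments show that it lowers the false-answer rate on answered cases at the cost of coverage.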

摘要翻译

大型语言模型（LLMs）常生成自信但错误的答案，部分原因在于常见的二元评分惯例倾向于奖励作答而非诚实表达不确定性。本研究探讨仅通过提示干预——即明确宣布针对作答与弃答决策的奖励机制，以及强调谦逊的规范性原则——是否能在不修改模型的情况下降低幻觉风险。我们聚焦于具有可验证答案的事实性问题上的认知性弃答，当前LLMs即使对自身答案不确定也常常未能弃答。我们首先评估自我报告的口头置信度作为可用不确定性信号的可靠性，证明其在提示改写下具有稳定性，并与基于词元概率的基线相比展现出合理的校准度。随后，我们研究I-CALM这一基于提示的框架，该框架（i）引导口头置信度表达，（ii）通过明确奖励机制部分奖励弃答行为，以及（iii）引入强调真实性、谦逊与责任的轻量级规范性原则。以GPT-5 mini在PopQA数据集上的实验为主要设置，我们发现：引导置信度表达并奖励弃答的提示（尤其结合规范性原则时）能降低已回答案例中的错误答案率，其主要机制在于识别易错案例并将其转向弃答，同时重新校准其置信度。这以覆盖率为代价提升了可靠性，而强制作答的性能基本保持不变。调整弃答奖励可形成清晰的弃答-幻觉边界。总体而言，结果表明该框架能在无需重新训练的情况下改进事实性问题的选择性作答能力，其效果强度因模型和数据集而异。代码可在以下链接获取：https://github.com/binzeli/hallucinationControl。

摘要 (Abstract)

Large language models (LLMs) frequently produce confident but incorrect answers, partly because common binary scoring conventions reward answering over honestly expressing uncertainty. We study whether prompt-only interventions – explicitly announcing reward schemes for answer-versus-abstain decisions plus humility-oriented normative principles – can reduce hallucination risk without modifying the model. Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being uncertain about their answers. We first assess self-reported verbal confidence as a usable uncertainty signal, showing stability under prompt paraphrasing and reasonable calibration against a token-probability baseline. We then study I-CALM, a prompt-based framework that (i) elicits verbal confidence, (ii) partially rewards abstention through explicit reward schemes, and (iii) adds lightweight normative principles emphasizing truthfulness, humility, and responsibility. Using GPT-5 mini on PopQA as the main setting, we find that confidence-eliciting, abstention-rewarding prompts, especially with norms, reduce the false-answer rate on answered cases mainly by identifying and shifting error-prone cases to abstention and re-calibrating their confidence. This trades coverage for reliability while leaving forced-answer performance largely unchanged. Varying the abstention reward yields a clear abstention-hallucination frontier. Overall, results show the framework can improve selective answering on factual questions without retraining, with the magnitude of effect varying across models and datasets. Code is available at https://github.com/binzeli/hallucinationControl.

Keywords: LLM hallucination mitigation, confidence-aware abstention, prompt-based framework, verbal confidence, abstention reward, factual questions, uncertainty signal, truthfulness norms

160. ❌ PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage

Authors: Rajat M. Barot, Arjun S. Borkhatariya Journal/Source: arxiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03888v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

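Scoring rationale: The paper's core contribution is PolySwarm, a multi-agent LLM system for prediction market trading and latency arbitrage. It is therefore highly relevant to 'Large Language Models' (10), since the system is built on LLMs, and to 'LLM Agents' and 'Multi-agent Systems' (10 each), since the framework deploys 50 distinct LLM personas that work together. It is partly relevant to 'Hallucination Mitigation' (5), since the paper discusses hallucination in agent pools as an open challenge. The remaining keywords (MoE, SLMs, training methods, inference acceleration, quantization, and so on) are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper presents PolySwarm, a multi-agent LLM framework for real-time prediction market trading and latency arbitrage on decentralized platforms such as Polymarket. Experiments show that swarm aggregation consistently outperforms single-model baselines in probability calibration.

Abstract (Chinese translation)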

本文提出PolySwarm——一种新型多智能体大语言模型（LLM）框架，专为去中心化平台（如Polymarket）的实时预测市场交易与延迟套利而设计。PolySwarm部署了50个多样化的LLM智能体角色，并行评估二元结果市场，通过置信度加权的贝叶斯方法将群体共识概率与市场隐含概率进行融合，并采用四分之一凯利仓位管理策略实现风险可控的交易执行。该系统集成了基于信息论的市场分析引擎，利用库尔巴克-莱布勒（Kullback-Leibler, KL）散度与詹森-香农（Jensen-Shannon, JS）散度检测跨市场无效性及否定对定价偏差。延迟套利模块通过对数正态定价模型推导中心化交易所（CEX）隐含概率，利用Polymarket价格延迟在人类反应时间窗口内执行交易。我们提供了完整的架构描述、实现细节及评估方法，包括使用布里尔分数（Brier scores）、校准分析和对数损失指标，并以人类超级预测者表现为基准进行对比。进一步探讨了当前面临的开放挑战，包括智能体池中的幻觉问题、规模化计算成本、监管风险及反馈循环风险，并提出了未来研究的五个优先方向。实验结果表明，在Polymarket预测任务中，群体概率聚合方法在概率校准方面持续优于单模型基线。

Abstract

This paper presents PolySwarm, a novel multi-agent large language model (LLM) framework designed for real-time prediction market trading and latency arbitrage on decentralized platforms such as Polymarket. PolySwarm deploys a swarm of 50 diverse LLM personas that concurrently evaluate binary outcome markets, aggregating individual probability estimates through confidence-weighted Bayesian combination of swarm consensus with market-implied probabilities, and applying quarter-Kelly position sizing for risk-controlled execution. The system incorporates an information-theoretic market analysis engine using Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence to detect cross-market inefficiencies and negation pair mispricings. A latency arbitrage module exploits stale Polymarket prices by deriving CEX-implied probabilities from a log-normal pricing model and executing trades within the human reaction-time window. We provide a full architectural description, implementation details, and evaluation methodology using Brier scores, calibration analysis, and log-loss metrics benchmarked against human superforecaster performance. We further discuss open challenges including hallucination in agent pools, computational cost at scale, regulatory exposure, and feedback-loop risk, and outline five priority directions for future research. Experimental results demonstrate that swarm aggregation consistently outperforms single-model baselines in probability calibration on Polymarket prediction tasks.
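
Two of the quantities named in the abstract, quarter-Kelly position sizing and the Brier score, have standard textbook definitions that can be sketched briefly. The code below is an illustration under stated assumptions, not PolySwarm's implementation: it assumes a binary market where a YES share costs `price` and pays 1 if the event occurs, and all function names are hypothetical.

```python
from typing import Sequence


def kelly_fraction_yes(p: float, price: float) -> float:
    """Full-Kelly fraction of bankroll to stake on YES shares.

    A share bought at `price` pays 1 if the event occurs, so the net odds are
    b = (1 - price) / price and the Kelly fraction (b*p - (1 - p)) / b
    simplifies to (p - price) / (1 - price). Returns 0 when there is no edge.
    """
    if not 0.0 < price < 1.0:
        raise ValueError("price must lie strictly between 0 and 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return max(0.0, (p - price) / (1.0 - price))


def quarter_kelly_yes(p: float, price: float) -> float:
    """Stake one quarter of the full-Kelly fraction to limit variance."""
    return 0.25 * kelly_fraction_yes(p, price)


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty sequences")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical example: belief 0.60 against a market price of 0.50.
stake = quarter_kelly_yes(0.60, 0.50)
score = brier_score([0.8, 0.3], [1, 0])
```

Lower Brier scores indicate better calibrated and sharper forecasts, which is why the abstract uses the metric to compare swarm aggregation with single-model baselines.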

Keywords: multi-agent LLM framework, prediction market trading, latency arbitrage, swarm consensus, probability calibration, Polymarket, Bayesian combination, information-theoretic analysis

161. ❌ When Models Know More Than They Say: Probing Analogical Reasoning in LLMs

Authors: Hope McGovern, Caroline Craig, Thomas Lippincott, Hale Sirin Journal/Source: arxiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03877v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

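Scoring rationale: The paper studies analogical reasoning in LLMs, so it bears directly on the LLM keyword (10). It probes internal representations to understand model behavior, which relates to explainable AI (8). Its analysis of reasoning relates in part to multi-step and in-depth reasoning (5 each). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper studies how LLMs handle narrative analogies. Models do well when surface and structural cues align but struggle when an analogy depends on latent information, and comparing probing with prompting reveals an asymmetry between internal representations and prompted behavior.

Abstract (Chinese translation)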

类比推理是叙事理解所必需的核心认知能力。尽管大语言模型在表层线索与结构线索一致时表现良好，但当类比在表层不明显而需要潜在信息时，它们往往表现不佳，这表明其在抽象与泛化方面存在局限。本文通过比较模型在探测表征与提示性能两方面对叙事类比的识别能力，揭示了一种不对称性：对于修辞类比，开源模型中的探测表现显著优于提示表现；而对于叙事类比，两者则达到相似（较低）的水平。这表明内部表征与提示行为之间的关系具有任务依赖性，并可能反映了提示机制在获取可用信息方面存在局限。

Abstract

Analogical reasoning is a core cognitive faculty essential for narrative understanding. While LLMs perform well when surface and structural cues align, they struggle in cases where an analogy is not apparent on the surface but requires latent information, suggesting limitations in abstraction and generalisation. In this paper we compare a model’s probed representations with its prompted performance at detecting narrative analogies, revealing an asymmetry: for rhetorical analogies, probing significantly outperforms prompting in open-source models, while for narrative analogies, they achieve a similar (low) performance. This suggests that the relationship between internal representations and prompted behavior is task-dependent and may reflect limitations in how prompting accesses available information.

Keywords: Large Language Models, Analogical Reasoning, Narrative Understanding, Internal Representations, Probing, Prompting, Abstraction, Generalization

162. ❌ Affording Process Auditability with QualAnalyzer: An Atomistic LLM Analysis Tool for Qualitative Research

Authors: Max Hao Lu, Ryan Ellegood, Rony Rodriguez-Ramirez, Sophia Blumert Journal/Source: arxiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03820v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

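Scoring rationale: The paper presents QualAnalyzer, a tool for applying LLMs to qualitative research analysis, so it is highly relevant to 'Large Language Models' (10). Its focus on transparency and interpretability relates in part to 'Mechanistic Interpretability OR Explainable AI' (5), and as an application of AI to social science research it relates in part to 'AI for Science' (5). The other keywords concern LLM internals, optimization methods, or specific application areas with no direct bearing on this paper's tool development and evaluation, so all score 0.

!!! tip deepseek-chat TL;DR

To address the lack of transparency when LLMs are used for qualitative data analysis, the paper presents QualAnalyzer, a tool that processes each data unit independently and preserves a complete audit trail, improving the auditability and methodological robustness of LLM-assisted research.

Abstract (Chinese translation)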

大型语言模型正日益被应用于质性数据分析，但许多工作流程模糊了分析结论的产生过程。我们推出QualAnalyzer——一款适用于Google Workspace的开源Chrome扩展程序，该工具通过独立处理每个数据单元并完整保留每个分析单元的提示词、输入与输出内容，实现了原子化的大型语言模型分析。通过两个案例研究（整体性论文评分与访谈转录本的演绎式主题编码），我们证明该方法能生成清晰可溯的审计轨迹，并帮助研究者系统探究大型语言模型与人类判断之间的差异。我们认为，流程的可审计性对于提升大型语言模型辅助质性研究的透明度与方法论严谨性至关重要。

Abstract

Large language models are increasingly used for qualitative data analysis, but many workflows obscure how analytic conclusions are produced. We present QualAnalyzer, an open-source Chrome extension for Google Workspace that supports atomistic LLM analysis by processing each data segment independently and preserving the prompt, input, and output for every unit. Through two case studies – holistic essay scoring and deductive thematic coding of interview transcripts – we show that this approach creates a legible audit trail and helps researchers investigate systematic differences between LLM and human judgments. We argue that process auditability is essential for making LLM-assisted qualitative research more transparent and methodologically robust.
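
The atomistic pattern described in the abstract (one independent call per data segment, with prompt, input, and output retained for every unit) can be sketched as follows. QualAnalyzer itself is a Chrome extension for Google Workspace; this Python sketch only illustrates the idea. `AuditRecord`, `analyze_atomistically`, the prompt layout, and the stub model are hypothetical names and choices, not part of the tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class AuditRecord:
    """Everything needed to re-inspect one analysis unit later."""
    unit_id: int
    prompt: str
    input_text: str
    output: str


def analyze_atomistically(segments: Sequence[str], instructions: str,
                          llm: Callable[[str], str]) -> List[AuditRecord]:
    """Send each segment to the model in its own call and keep a full trail.

    No segment sees any other segment, so a surprising output can be traced
    back to exactly one prompt and one input.
    """
    records = []
    for unit_id, segment in enumerate(segments):
        prompt = f"{instructions}\n\nText:\n{segment}"
        records.append(AuditRecord(unit_id, prompt, segment, llm(prompt)))
    return records


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; labels a unit by a keyword match."""
    return "THEME:cost" if "expensive" in prompt else "THEME:other"


trail = analyze_atomistically(
    ["The course was too expensive.", "I liked the instructor."],
    "Assign one theme code to the text.",
    stub_llm,
)
```

Because every record carries its own prompt and input, a disagreement between a model code and a human code can be examined unit by unit instead of across an opaque batch.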

Keywords: Large Language Models, Qualitative Data Analysis, Audit Trail, Transparency, Atomistic Analysis, Google Workspace Extension, Human-LLM Comparison, Methodological Robustness

163. ❌ Can Humans Tell? A Dual-Axis Study of Human Perception of LLM-Generated News

Authors: Alexander Loth, Martin Kappes, Marc-Oliver Pahl Journal/Source: arxiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03755v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

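Scoring rationale: The paper studies how well humans perceive LLM-generated news, centering on whether LLM-generated content can be told apart from human-written content, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). The other keywords cover specific techniques (MoE, quantization, inference acceleration), training methods (pre-training, alignment, fine-tuning), application settings (AI for science, agents), or particular capabilities (long context, chain of thought), none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

Based on 2,318 judgments from 1,054 participants, the study finds that humans cannot reliably distinguish LLM-generated news articles from human-written ones, indicating that user-side detection is not viable and that system-level countermeasures are needed.

Abstract (Chinese translation)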

人类能否分辨一篇新闻文章是由人类撰写还是由大型语言模型（LLM）生成？我们通过JudgeGPT研究平台探讨了这一问题，该平台可独立在连续尺度上测量来源归因（人类 vs. 机器）与真实性判断（真实 vs. 虚假）。基于1,054名参与者对六种LLM生成内容所做出的2,318项判断，我们报告了五项发现：（1）参与者无法可靠区分机器生成文本与人类撰写文本（p > .05，韦尔奇t检验）；（2）这种判断失效在所有测试模型中均成立，包括参数量低至70亿的开源权重模型；（3）自我报告的专业领域知识可预测判断准确率（r = .35, p < .001），而政治倾向则无预测作用（r = -.10，不显著）；（4）聚类分析揭示了两种不同的响应策略（“怀疑者” vs. “轻信者”）；（5）由于认知疲劳，连续进行约30次评估后判断准确率会下降。简言之，答案是：人类无法可靠区分。这些结果表明，用户端检测并非有效防御手段，从而呼吁建立系统级对策（如加密内容溯源机制）。

Abstract

Can humans tell whether a news article was written by a person or a large language model (LLM)? We investigate this question using JudgeGPT, a study platform that independently measures source attribution (human vs. machine) and authenticity judgment (legitimate vs. fake) on continuous scales. From 2,318 judgments collected from 1,054 participants across content generated by six LLMs, we report five findings: (1) participants cannot reliably distinguish machine-generated from human-written text (p > .05, Welch’s t-test); (2) this inability holds across all tested models, including open-weight models with as few as 7B parameters; (3) self-reported domain expertise predicts judgment accuracy (r = .35, p < .001) whereas political orientation does not (r = -.10, n.s.); (4) clustering reveals distinct response strategies (“Skeptics” vs. “Believers”); and (5) accuracy degrades after approximately 30 sequential evaluations due to cognitive fatigue. The answer, in short, is no: humans cannot reliably tell. These results indicate that user-side detection is not a viable defense and motivate system-level countermeasures such as cryptographic content provenance.

Keywords: LLM-generated news, human perception, source attribution, authenticity judgment, JudgeGPT, cognitive fatigue, cryptographic content provenance, user-side detection

164. ❌ Testing the Limits of Truth Directions in LLMs

Authors: Angelos Poulis, Mark Crovella, Evimaria Terzi Journal/Source: arxiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03754v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

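Scoring rationale: The paper examines limits on the universality of truth directions in LLMs, an investigation of how large models work internally. Highly relevant keywords: (1) 'Large Language Models' (the paper explicitly studies LLMs); (2) 'Hallucination Mitigation' (truth directions bear directly on factuality); (3) 'Mechanistic Interpretability' (it analyzes internal mechanisms through the activation space). Moderately relevant: 'Chain of Thought' and 'System 2 Thinking' (the paper analyzes reasoning tasks). Other keywords such as MoE, SFT, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper examines the limits of truth-direction universality in LLMs, finding that truth directions depend on layer and task type and are strongly affected by instructions, so their universality is more limited than previously recognized.

Abstract (Chinese translation)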

大型语言模型（LLMs）的研究表明，其激活空间中存在一个线性的“真实性方向”，能够编码陈述的真值。先前研究认为该方向在某些方面具有普适性，但近期工作基于某些场景下有限的泛化能力对此结论提出了质疑。本研究揭示了此前未被充分认识的、真实性方向普适性的若干局限。我们首先证明真实性方向高度依赖于所在的层，全面理解其普适性需在模型的多个层上进行探测。其次，我们发现真实性方向在很大程度上受任务类型影响：对于事实性任务，其方向在较浅层出现；而对于推理任务，则形成于较深层。同时，其表现也随任务复杂度的不同而变化。最后，我们证明模型指令会显著影响真实性方向：简单的正确性评估指令即可显著改变真实性探针的泛化能力。我们的研究结果表明，真实性方向的普适性比既往认知更为有限，在不同模型层、任务难度、任务类型及提示模板下均存在显著差异。

Abstract

Large language models (LLMs) have been shown to encode truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain aspects, while more recent work has questioned this conclusion drawing on limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers in the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual and later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions; simple correctness evaluation instructions significantly affect the generalization ability of truth probes. Our findings indicate that universality claims for truth directions are more limited than previously known, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
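
The notion of a linear truth direction, and the way a probe fitted in one setting can fail in another, can be illustrated with a difference-of-means probe. The abstract does not say how the authors fit their probes, so this estimator is an assumption chosen for simplicity, and the two-dimensional "activations" below are toy values invented for illustration, not model data.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a list of equally sized vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_direction(true_acts: Sequence[Sequence[float]],
                  false_acts: Sequence[Sequence[float]]) -> Tuple[Vector, Vector]:
    """Difference-of-means probe: direction = mean(true) - mean(false)."""
    mu_true, mu_false = mean_vector(true_acts), mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    return direction, midpoint


def predict_true(x: Sequence[float], direction: Vector, midpoint: Vector) -> bool:
    """Classify by the sign of the projection onto the direction."""
    return sum((xi - mi) * di for xi, mi, di in zip(x, midpoint, direction)) > 0


def accuracy(acts, labels, direction: Vector, midpoint: Vector) -> float:
    hits = sum(predict_true(x, direction, midpoint) == y for x, y in zip(acts, labels))
    return hits / len(acts)


# Toy "layer A": truth is separated along the first coordinate.
layer_a_true = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0]]
layer_a_false = [[-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.3]]
# Toy "layer B": truth is separated along the second coordinate instead.
layer_b = [[0.1, 1.0], [-0.2, 0.9], [-0.1, -1.0], [0.2, -0.9]]
layer_b_labels = [True, True, False, False]

direction, midpoint = fit_direction(layer_a_true, layer_a_false)
acc_same = accuracy(layer_a_true + layer_a_false,
                    [True] * 3 + [False] * 3, direction, midpoint)
acc_cross = accuracy(layer_b, layer_b_labels, direction, midpoint)
```

In this toy setup the probe is perfect on the layer it was fitted on and at chance on the other, which mirrors in miniature the layer dependence the abstract reports.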

Keywords: Large Language Models, truth directions, activation space, universality, layer-dependence, task complexity, model instructions, generalization

165. ❌ CREBench: Evaluating Large Language Models in Cryptographic Binary Reverse Engineering

Authors: Baicheng Chen, Yu Wang, Ziheng Zhou, Xiangru Liu, Juanru Li, Yilei Chen, Tianxing He Journal/Source: arxiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03750v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

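Scoring rationale: The paper evaluates LLMs on cryptographic binary reverse engineering, so it is highly relevant to 'Large Language Models' (10). Models must analyze cryptographic logic and recover inputs, which requires reasoning and relates in part to 'Chain of Thought' and 'System 2 Thinking' (5 each). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper (0). The paper is an application of large models to security, which fits the research background.

!!! tip deepseek-chat TL;DR

The paper studies LLM capabilities in cryptographic binary reverse engineering. It builds CREBench, a benchmark of 432 challenges, and evaluates eight frontier models. The best model, GPT-5.4, scores only 64.03 out of 100 against a human expert baseline of 92.19, showing that humans still hold an advantage on this task.

Abstract (Chinese translation)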

逆向工程（Reverse Engineering, RE）是软件安全领域的核心环节，尤其对于处理敏感数据且极易存在漏洞的密码学程序而言至关重要。它支撑着漏洞发现和恶意软件分析等关键任务。尽管其重要性显著，逆向工程仍然是一项劳动密集型工作，需要深厚的专业知识，这使得大语言模型（Large Language Models, LLMs）成为自动化该过程的潜在解决方案。然而，大语言模型在逆向工程方面的能力仍未得到系统性充分探索。为弥补这一空白，我们研究了大语言模型在密码学二进制逆向工程上的能力，并提出了 CREBench 基准测试。该基准包含 432 项挑战，基于 48 种标准密码算法、3 种不安全的密码密钥使用场景以及 3 个难度级别构建。每项挑战均遵循夺旗赛（Capture-the-Flag, CTF）逆向工程挑战模式，要求模型分析底层的密码逻辑并恢复正确的输入。我们设计了一个包含四个子任务的评估框架，涵盖从算法识别到正确恢复密钥的完整流程。我们在 CREBench 上评估了八个前沿大语言模型。表现最佳的模型 GPT-5.4 在满分 100 分中获得了 64.03 分，并在 59% 的挑战中成功恢复了密钥。我们还建立了一个高达 92.19 分的人类专家基线，表明人类在密码学逆向工程任务中仍保持优势。我们的代码和数据集可在 https://github.com/wangyu-ovo/CREBench 获取。

Abstract

Reverse engineering (RE) is central to software security, particularly for cryptographic programs that handle sensitive data and are highly prone to vulnerabilities. It supports critical tasks such as vulnerability discovery and malware analysis. Despite its importance, RE remains labor-intensive and requires substantial expertise, making large language models (LLMs) a potential solution for automating the process. However, their capabilities for RE remain systematically underexplored. To address this gap, we study the cryptographic binary RE capabilities of LLMs and introduce CREBench, a benchmark comprising 432 challenges built from 48 standard cryptographic algorithms, 3 insecure crypto key usage scenarios, and 3 difficulty levels. Each challenge follows a Capture-the-Flag (CTF) RE challenge, requiring the model to analyze the underlying cryptographic logic and recover the correct input. We design an evaluation framework comprising four sub-tasks, from algorithm identification to correct flag recovery. We evaluate eight frontier LLMs on CREBench. GPT-5.4, the best-performing model, achieves 64.03 out of 100 and recovers the flag in 59% of challenges. We also establish a strong human expert baseline of 92.19 points, showing that humans maintain an advantage in cryptographic RE tasks. Our code and dataset are available at https://github.com/wangyu-ovo/CREBench.
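
The benchmark size quoted in the abstract is consistent with a full cross product of its three factors (48 × 3 × 3 = 432). The sketch below checks that arithmetic and the reported score gap. Treating the benchmark as a complete grid is an inference from the numbers, not something the abstract states, and the constant names are hypothetical.

```python
from itertools import product

N_ALGORITHMS = 48        # standard cryptographic algorithms
N_KEY_SCENARIOS = 3      # insecure key usage scenarios
N_DIFFICULTY_LEVELS = 3  # difficulty levels

# Assumption: one challenge per (algorithm, key scenario, difficulty) triple.
challenges = list(product(range(N_ALGORITHMS),
                          range(N_KEY_SCENARIOS),
                          range(N_DIFFICULTY_LEVELS)))

# Gap between the human expert baseline and the best model, in points.
human_gap = round(92.19 - 64.03, 2)
```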

Keywords: Large Language Models, Cryptographic Binary Reverse Engineering, Benchmark Evaluation, CREBench, Algorithm Identification, Flag Recovery, Software Security, CTF Challenges

166. ❌ POEMetric: The Last Stanza of Humanity

Authors: Bingru Li, Han Wang, Hazel Wilkinson Journal/Source: arxiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03695v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

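Scoring rationale: The paper evaluates LLMs on poetry generation and is highly relevant to 'Large Language Models' (10), since it explicitly studies the poetry-generation abilities of 30 LLMs. It relates in part to 'Instruction Tuning' (5), since it assesses how well LLMs follow instructions to generate poems of a given form and theme, which is a form of instruction-following evaluation. Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, and RLHF are not addressed and score 0. The paper does not involve scientific applications, so 'AI for Science' and related keywords also score 0.

!!! tip deepseek-chat TL;DR

The paper studies LLM performance on poetry generation. Evaluation with the POEMetric framework shows that although LLMs do well on form accuracy and theme alignment, they still fall well short of human poets in advanced abilities such as creativity and emotional resonance.

Abstract (Chinese translation)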

大型语言模型（LLM）能够创作诗歌，但它们与人类诗人的差距究竟有多大？本文提出了首个综合性诗歌评估框架POEMetric，该框架从以下维度进行考察：1）基础指令遵循能力，即按照特定形式和主题生成诗歌；2）高级能力，包括展现创造力、词汇多样性、独特性，唤起情感共鸣，以及运用意象和文学手法；3）对诗歌整体质量的综合评价及作者归属判断。我们构建了一个人类诗歌数据集——包含7种固定形式的203首英文诗歌，并标注了格律、押韵模式和主题——同时基于与人类诗歌相同的形式和主题，对30个大型语言模型进行了诗歌生成实验，共获得6,090首LLM生成的诗歌。基于POEMetric框架，我们通过基于规则的评估和“LLM即评委”的方法，对人类诗人和LLM的表现进行了评估，其结果经由人类专家验证。研究结果显示，尽管表现最佳的模型在形式准确性（以Gemini-2.5-Pro为评委，满分为5.00分，得分为4.26；下同）和主题契合度（4.99）上取得了高分，但所有模型在高级能力方面均未能达到人类诗人的水平。人类诗人在创造力（4.02）、独特性（3.95）、情感共鸣（4.06）以及意象运用（4.49）和文学手法使用（4.67）方面展现出无可比拟的优势。在诗歌整体质量上，人类同样优于表现最佳的LLM（4.22 vs. 3.20）。因此，诗歌生成对大型语言模型而言仍是一项艰巨的挑战。相关数据与代码已发布于https://github.com/Bingru-Li/POEMetric。

Abstract

Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction-following abilities in generating poems according to a certain form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of the overall poem quality and estimation of authorship. We curated a human poem dataset - 203 English poems of 7 fixed forms annotated with meter, rhyme patterns and themes - and experimented with 30 LLMs for poetry generation based on the same forms and themes of the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, though the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as a judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also defeated the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and codes are released at https://github.com/Bingru-Li/POEMetric.

Keywords: Large Language Models, poetry generation, evaluation framework, human comparison, instruction following, creativity assessment, LLM-as-a-judge, POEMetric

167. ❌ Researchers waste 80% of LLM annotation costs by classifying one text at a time

Authors: Christian Pipal, Eva-Maria Vogel, Morgan Wack, Frank Esser Journal/Source: arxiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03684v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

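Scoring rationale: The paper studies the efficiency of using LLMs for text classification in the social sciences, showing that batching and variable stacking sharply reduce API call costs. It is research on optimizing LLMs in practical use, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). Because it concerns social science research, an application of AI to science, it relates in part to 'AI for Science OR Bioinformatics OR Cheminformatics' (5). The other keywords mainly concern model architecture, training methods, inference optimization, and alignment techniques, which the paper does not cover, so all score 0.

!!! tip deepseek-chat TL;DR

The paper shows that, for text classification in the social sciences, batching and variable stacking can cut LLM token costs by over 80% while keeping coding quality largely unchanged.

Abstract (Chinese translation)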

大型语言模型（LLMs）在社会科学领域的文本分类应用日益广泛，但研究人员绝大多数仍采用每个提示仅针对单个变量分类单条文本的方式。若对10万条文本进行四个变量的编码，需要调用40万次API。而通过将25条文本批量处理并将所有变量整合至单一提示中，可将调用次数减少至4000次，从而降低超过80%的令牌成本。但这种方法是否会降低编码质量尚不明确。我们基于四个任务中的3962条专家编码推文，测试了来自四家提供商的八款生产级LLMs，将批量规模从1条调整至1000条，并在每个提示中叠加最多25个编码维度。结果显示，八款模型中有六款在批量规模达到100条时，其准确率与单条文本基线相比差异保持在2个百分点以内。叠加多达10个变量的编码结果与单变量编码效果相当，性能下降主要源于任务复杂度而非提示长度。在此安全操作范围内，批处理和变量叠加带来的测量误差小于真实数据中常见的编码员间分歧水平。

Abstract

Large language models (LLMs) are increasingly being used for text classification across the social sciences, yet researchers overwhelmingly classify one text per variable per prompt. Coding 100,000 texts on four variables requires 400,000 API calls. Batching 25 items and stacking all variables into a single prompt reduces this to 4,000 calls, cutting token costs by over 80%. Whether this degrades coding quality is unknown. We tested eight production LLMs from four providers on 3,962 expert-coded tweets across four tasks, varying batch size from 1 to 1,000 items and stacking up to 25 coding dimensions per prompt. Six of eight models maintained accuracy within 2 pp of the single-item baseline through batch sizes of 100. Variable stacking with up to 10 dimensions produced results comparable to single-variable coding, with degradation driven by task complexity rather than prompt length. Within this safe operating range, the measurement error from batching and stacking is smaller than typical inter-coder disagreement in the ground-truth data.
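
The call counts quoted in the abstract can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `api_calls` and its parameters are hypothetical, and it counts prompts, not tokens.

```python
import math

def api_calls(n_texts: int, n_variables: int, batch_size: int = 1, stacked: bool = False) -> int:
    """Number of prompts needed to code every text on every variable."""
    prompts_per_pass = math.ceil(n_texts / batch_size)
    # Stacking puts all variables in one prompt, so a single pass is enough.
    passes = 1 if stacked else n_variables
    return prompts_per_pass * passes

baseline = api_calls(100_000, 4)                              # one text, one variable per prompt
batched = api_calls(100_000, 4, batch_size=25, stacked=True)  # 25 texts, all variables per prompt
print(baseline, batched, f"{1 - batched / baseline:.0%} fewer calls")
# prints: 400000 4000 99% fewer calls
```

The number of calls falls by 99%, whereas the "over 80%" figure in the abstract refers to token cost, which this sketch does not model.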

Keywords: Large language models, text classification, API calls, batching, variable stacking, cost reduction, social sciences, coding quality

168. ❌ LightThinker++: From Reasoning Compression to Memory Management

Authors: Yuqi Zhu, Jintian Zhang, Zhenjie Wan, Yujie Luo, Shuofei Qiao, Zhengke Gui, Da Zheng, Lei Liang, Huajun Chen, Ningyu Zhang Journal/Source: arXiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03679v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

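Scoring rationale: The paper studies the efficiency of LLMs on complex reasoning tasks and proposes LightThinker++, which dynamically compresses intermediate thoughts and adds explicit adaptive memory management to reduce token usage and inference time while improving performance. The highly relevant keywords are LLMs (the paper explicitly studies LLMs), Chain of Thought and System 2 Thinking (it focuses on managing thought traces during reasoning), and LLM Agents (it is validated on long-horizon agentic tasks). Inference acceleration is partly relevant because the method reduces inference time. Other keywords such as MoE, SFT, and RAG are not addressed.

!!! tip deepseek-chat TL;DR

To address the inefficiency that long thought traces cause in LLM complex reasoning, the paper proposes LightThinker++, which dynamically compresses intermediate thoughts and adds explicit adaptive memory management; it substantially reduces token usage and inference time while improving performance on standard reasoning and long-horizon agentic tasks.

Abstract (Chinese translation)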

大语言模型（LLM）在复杂推理方面表现出色，但其效率受限于长思维链带来的急剧增长的认知负荷。本文提出LightThinker方法，使LLM能够将中间思维动态压缩为紧凑的语义表征。然而，静态压缩在复杂推理中往往面临挑战，因为中间细节的不可逆损失可能导致逻辑瓶颈。为解决此问题，我们将框架升级为LightThinker++，引入显式自适应记忆管理机制。该范式通过整合显式记忆原语转向行为级管理，并辅以专门的轨迹合成流程来训练有目的的记忆调度策略。大量实验证明了该框架在三个维度上的普适性：（1）LightThinker在精度损失最小的情况下，将峰值令牌使用量降低70%，推理时间减少26%；（2）在标准推理任务中，LightThinker++在相同上下文预算下实现最高性能时，峰值令牌使用量削减69.9%，同时获得+2.42%的精度提升；（3）最显著的是，在长周期智能体任务中，其能在超过80轮对话后保持稳定内存占用（降低60%-70%），在不同复杂场景中平均性能提升达14.8%。总体而言，我们的工作为在扩展场景中以最小开销维持大语言模型的深度推理提供了可扩展的研究方向。

Abstract

Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, static compression often struggles with complex reasoning where the irreversible loss of intermediate details can lead to logical bottlenecks. To address this, we evolve the framework into LightThinker++, introducing Explicit Adaptive Memory Management. This paradigm shifts to behavioral-level management by incorporating explicit memory primitives, supported by a specialized trajectory synthesis pipeline to train purposeful memory scheduling. Extensive experiments demonstrate the framework’s versatility across three dimensions. (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss. (2) In standard reasoning, LightThinker++ slashes peak token usage by 69.9% while yielding a +2.42% accuracy gain under the same context budget for maximum performance. (3) Most notably, in long-horizon agentic tasks, it maintains a stable footprint beyond 80 rounds (a 60%-70% reduction), achieving an average performance gain of 14.8% across different complex scenarios. Overall, our work provides a scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead.

Keywords: Large Language Models, Reasoning Compression, Memory Management, Long-horizon Agentic Tasks, Inference Efficiency, Intermediate Thoughts, Token Usage Reduction, Performance Gain

169. ❌ Unlocking Prompt Infilling Capability for Diffusion Language Models

Authors: Yoshinari Fujinuma, Keisuke Sakaguchi Journal/Source: arXiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03677v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

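Scoring rationale: The paper studies the prompt-infilling capability of masked diffusion language models (dLMs); its main contribution is unlocking that capability by changing the masking strategy used in supervised finetuning (SFT). The paper explicitly refers to "supervised finetuning (SFT)", so it is highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10 points). Although it involves large-model techniques, it studies diffusion language models rather than conventional large language models (LLMs), and it does not touch the other keywords such as MoE, quantization, inference acceleration, or alignment, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

By changing the masking strategy used in supervised finetuning, the paper unlocks prompt infilling in diffusion language models, so that model-infilled prompt templates match or surpass manually designed ones.

Abstract (Chinese translation)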

掩码扩散语言模型（dLMs）通过双向去噪生成文本，但其填充提示的能力尚未被激活。这一局限源于当前监督微调（SFT）中仅对响应部分进行掩码的常规做法。为释放此能力，我们在SFT过程中扩展了全序列掩码策略，即对提示和响应同时进行联合掩码。一旦激活该能力，模型便能在少量示例条件下填充提示模板中的掩码部分。研究表明，此类由模型填充的提示在效果上可匹配或超越人工设计的模板，能够跨模型有效迁移，并与现有提示优化方法形成互补。我们的结果表明，阻碍掩码扩散语言模型填充有效提示的主要瓶颈在于训练实践，而非架构限制。

Abstract

Masked diffusion language models (dLMs) generate text through bidirectional denoising, yet this capability remains locked for infilling prompts. This limitation is an artifact of the current supervised finetuning (SFT) convention of applying response-only masking. To unlock this capability, we extend full-sequence masking during SFT, where both prompts and responses are masked jointly. Once unlocked, the model infills masked portions of a prompt template conditioned on few-shot examples. We show that such model-infilled prompts match or surpass manually designed templates, transfer effectively across models, and are complementary to existing prompt optimization methods. Our results suggest that training practices, not architectural limitations, are the primary bottleneck preventing masked diffusion language models from infilling effective prompts.
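
The difference between response-only and full-sequence masking comes down to which positions are eligible to be masked during SFT. The following is a minimal sketch of that distinction, not the authors' implementation: the function `mask_for_sft`, the `[MASK]` placeholder, and the fixed `mask_ratio` are all assumptions made for illustration (a real masked-diffusion trainer would typically sample the ratio per example and work on token IDs).

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token

def mask_for_sft(prompt, response, mask_ratio, full_sequence, rng):
    """Return the noised sequence and the set of masked positions."""
    tokens = list(prompt) + list(response)
    # Response-only masking never touches the prompt; full-sequence masking may.
    start = 0 if full_sequence else len(prompt)
    masked = {i for i in range(start, len(tokens)) if rng.random() < mask_ratio}
    noised = [MASK if i in masked else tok for i, tok in enumerate(tokens)]
    return noised, masked

prompt = ["Classify", "the", "sentiment", ":"]
response = ["positive"]
rng = random.Random(0)
# mask_ratio=1.0 masks every eligible position, which makes the eligible span easy to see.
_, response_only = mask_for_sft(prompt, response, 1.0, full_sequence=False, rng=rng)
_, full = mask_for_sft(prompt, response, 1.0, full_sequence=True, rng=rng)
print(sorted(response_only), sorted(full))
# prints: [4] [0, 1, 2, 3, 4]
```

Under response-only masking the model is never asked to reconstruct prompt tokens, which is consistent with the abstract's claim that the infilling limitation stems from training practice.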

Keywords: diffusion language models, masked diffusion, prompt infilling, supervised finetuning, SFT, full-sequence masking, few-shot examples, prompt optimization

170. ❌ ‘Layer su Layer’: Identifying and Disambiguating the Italian NPN Construction in BERT’s family

Authors: Greta Gorzoni, Ludovica Pannitto, Francesca Masini Journal/Source: arXiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03673v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

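Scoring rationale: The paper studies how BERT models encode the Italian NPN construction, which is language-model interpretability research. It is highly relevant only to "Mechanistic Interpretability OR Explainable AI" (10 points), because it explicitly uses layer-wise probing classifiers to evaluate the linguistic information encoded inside the model. The other keywords concern innovations in large-model techniques or specific applications (MoE, quantization, inference acceleration, alignment, RAG, and so on), whereas this paper is a linguistic analysis of the conventional BERT architecture and does not involve them.

!!! tip deepseek-chat TL;DR

The study uses layer-wise probing to evaluate how BERT encodes the Italian NPN construction and reports the extent to which contextual embeddings reflect constructional form and meaning, providing empirical evidence for the dialogue between constructionist theory and neural language modelling.

Abstract (Chinese translation)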

可解释性研究强调，依据明确的语言学理论评估预训练语言模型（PLMs），特别是上下文嵌入向量，以确定其编码的语言信息至关重要。本研究聚焦于意大利语中的NPN（名词-介词-名词）构式家族，对先前实验设计所依赖的部分理论及方法论假设提出质疑，并将此类研究扩展至一种较少被深入探讨的语言。我们从BERT模型中提取上下文向量表征，并将其作为逐层探测分类器的输入，系统性地评估模型各内部层所编码的信息。研究结果揭示了构式形式与意义在上下文嵌入向量中的反映程度，为构式语法理论与神经语言建模之间的对话提供了实证依据。

Abstract

Interpretability research has highlighted the importance of evaluating Pretrained Language Models (PLMs) and in particular contextual embeddings against explicit linguistic theories to determine what linguistic information they encode. This study focuses on the Italian NPN (noun-preposition-noun) constructional family, challenging some of the theoretical and methodological assumptions underlying previous experimental designs and extending this type of research to a lesser-investigated language. Contextual vector representations are extracted from BERT and used as input to layer-wise probing classifiers, systematically evaluating information encoded across the model’s internal layers. The results shed light on the extent to which constructional form and meaning are reflected in contextual embeddings, contributing empirical evidence to the dialogue between constructionist theory and neural language modelling.

Keywords: BERT, interpretability, Italian NPN construction, contextual embeddings, layer-wise probing, linguistic information, constructionist theory, neural language modeling

171. ❌ AI Appeals Processor: A Deep Learning Approach to Automated Classification of Citizen Appeals in Government Services

Authors: Vladimir Beskorovainyi Journal/Source: arXiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03672v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

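Scoring rationale: The paper studies automated classification of citizen appeals in government services using conventional NLP and deep learning techniques (such as Word2Vec, LSTM, and BERT). It does not involve large language models (LLMs) or any of the frontier techniques in the keyword list (MoE, RLHF, RAG, and so on), nor is it applied to a scientific field (such as bioinformatics), so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents AI Appeals Processor, a deep-learning system for automatically classifying citizen appeals in government services; on 10,000 real appeals, a Word2Vec+LSTM architecture reaches 78% classification accuracy while reducing processing time by 54%.

Abstract (Chinese translation)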

全球政府机构面临日益增长的公民诉求处理量，近年来电子化提交数量显著增加。传统人工处理平均每件诉求耗时20分钟，分类准确率仅为67%，对公共服务供给造成显著瓶颈。本文提出AI诉求处理系统——一个基于微服务架构、集成自然语言处理与深度学习技术的自动化公民诉求分类与流转平台。我们在包含三大类别（投诉、申请、建议）和七个主题领域的一万条真实公民诉求代表性数据集上，评估了多种技术方案：包括支持向量机结合词袋模型、支持向量机结合TF-IDF、fastText、Word2Vec结合长短期记忆网络（LSTM）以及BERT。实验结果表明，Word2Vec与LSTM融合架构在将处理时间缩短54%的同时，实现了78%的分类准确率，相较于基于Transformer的模型，在准确率与计算效率之间达到了更优平衡。

Abstract

Government agencies worldwide face growing volumes of citizen appeals, with electronic submissions increasing significantly over recent years. Traditional manual processing averages 20 minutes per appeal with only 67% classification accuracy, creating significant bottlenecks in public service delivery. This paper presents AI Appeals Processor, a microservice-based system that integrates natural language processing and deep learning techniques for automated classification and routing of citizen appeals. We evaluate multiple approaches – including Bag-of-Words with SVM, TF-IDF with SVM, fastText, Word2Vec with LSTM, and BERT – on a representative dataset of 10,000 real citizen appeals across three primary categories (complaints, applications, and proposals) and seven thematic domains. Our experiments demonstrate that a Word2Vec+LSTM architecture achieves 78% classification accuracy while reducing processing time by 54%, offering an optimal balance between accuracy and computational efficiency compared to transformer-based models.

Keywords: citizen appeals, automated classification, deep learning, natural language processing, Word2Vec, LSTM, government services, microservice-based system

172. ❌ Document-Level Numerical Reasoning across Single and Multiple Tables in Financial Reports

Authors: Yi-Cheng Wang, Wei-An Wang, Chu-Song Chen Journal/Source: arXiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03664v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

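Scoring rationale: The paper studies numerical reasoning by large language models (LLMs) over long financial reports, and directly involves LLMs, long-context processing, multi-step reasoning, RAG, and multi-agent systems. It proposes FinLongDocAgent, a multi-agent, multi-round RAG approach that iteratively retrieves and verifies evidence, which makes it highly relevant to LLM Agents, Multi-agent Systems, Retrieval-Augmented Generation, Context Window Extension, and Chain of Thought. Other keywords such as MoE, quantization, and alignment are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the difficulty LLMs have with cross-table numerical reasoning over long financial reports, the paper proposes FinLongDocAgent, a multi-agent, multi-round retrieval-augmented generation approach that iteratively retrieves evidence and verifies results to make numerical question answering more reliable.

Abstract (Chinese translation)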

尽管大型语言模型（LLM）具备强大的语言理解能力，其在处理长篇幅结构化文档的可靠问答（QA）方面仍存在困难，尤其是在数值推理任务上。以财务年报为例：财务报表分析通常依赖于精确的算术运算，分析师需要通过整合分散在多个表格和叙述性文本中的证据来推导关键指标。然而，现有基准测试主要聚焦于单表格场景，跨表格的文档级数值推理研究尚不充分。为填补这一空白，我们提出了FinLongDocQA数据集，用于评估长上下文报告中单表格及跨表格的财务数值推理能力。在FinLongDocQA上对闭源和开源LLM的评估揭示了两个瓶颈：（1）年报长度常超过129k个标记，加剧了定位相关表格时的上下文衰减（context rot）问题；（2）即使定位到相关证据，LLM在多步骤数值推理中仍易出错。为此，我们提出FinLongDocAgent——一种多智能体多轮检索增强生成（Multi-Agent Multi-Round Retrieval-Augmented Generation, RAG）方法，通过迭代检索证据、执行中间计算并进行多轮结果验证。实验结果表明，迭代检索与验证对于长篇幅财务文档的可靠数值问答至关重要。

Abstract

Despite the strong language understanding abilities of large language models (LLMs), they still struggle with reliable question answering (QA) over long, structured documents, particularly for numerical reasoning. Financial annual reports exemplify this difficulty: financial statement analysis often hinges on accurate arithmetic, and analysts derive key indicators by integrating evidence scattered across multiple tables and narrative text. However, existing benchmarks focus largely on single-table settings, leaving cross-table document-level numerical reasoning underexplored. To address this gap, we introduce FinLongDocQA, a dataset for both single-table and cross-table financial numerical reasoning in long-context reports. Evaluating both closed-source and open-source LLMs on FinLongDocQA reveals two bottlenecks: (1) annual reports often exceed 129k tokens, exacerbating the context rot problem for locating relevant tables; and (2) even when relevant evidence is located, LLMs remain prone to errors in multi-step numerical reasoning. We propose FinLongDocAgent, a Multi-Agent Multi-Round Retrieval-Augmented Generation (RAG) approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results across rounds. Experiments highlight the importance of iterative retrieval and verification for reliable numerical QA in long financial documents.

Keywords: Large Language Models, Numerical Reasoning, Financial Reports, Retrieval-Augmented Generation, Multi-Agent Systems, Long Context, Question Answering, Document-Level Analysis

173. ❌ CAGMamba

Authors: Minghai Jiao, Jing Xiao, Peng Xiao, Ende Zhang, Shuang Kan, Wenyan Jiang, Jinyao Li, Yixian Liu, Haidong Xin Journal/Source: arXiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03650v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

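Scoring rationale: CAGMamba addresses multimodal sentiment analysis and proposes a Mamba-based fusion framework. Although it involves deep learning (Mamba is an alternative architecture to the Transformer), its content has no direct connection to the topics in the keyword list: large-model techniques, training methods, inference optimization, alignment techniques, agent systems, or AI-for-science applications. The paper does not mention techniques tied to any keyword, such as LLMs, MoE, Scaling Laws, Pre-training, RLHF, RAG, CoT, Agents, or Quantization, nor does it involve AI-for-science applications such as bioinformatics or cheminformatics. All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

To address inefficient cross-modal interaction modeling and weak capture of contextual dependencies in multimodal sentiment analysis, the paper proposes CAGMamba, a Mamba-based context-aware gated cross-modal framework that achieves state-of-the-art or competitive results on three benchmark datasets.

Abstract (Chinese translation)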

多模态情感分析（MSA）需要在对跨模态交互和上下文依赖进行有效建模的同时保持计算效率。现有的融合方法主要依赖于基于Transformer的跨模态注意力机制，其计算复杂度随序列长度呈二次方增长，限制了可扩展性。此外，先前话语的上下文信息通常通过拼接或独立融合的方式引入，缺乏能够捕捉对话轮次间情感演变的显式时序建模。为应对这些局限，我们提出了CAGMamba，一种用于基于对话的情感分析的上下文感知门控跨模态Mamba框架。具体而言，我们将上下文特征与当前话语特征组织成一个时序有序的二元序列，这为Mamba提供了显式的时序结构以建模情感演变。为进一步实现可控的跨模态整合，我们提出了门控跨模态Mamba网络（Gated Cross-Modal Mamba Network, GCMN），它通过可学习的门控机制整合跨模态路径与单模态路径，以平衡信息融合与模态保留，并采用针对文本、音频及融合预测的三分支多任务目标进行训练。在三个基准数据集上的实验表明，CAGMamba在多项评估指标上均达到了最先进或具有竞争力的性能。所有代码已公开于https://github.com/User2024-xj/CAGMamba。

Abstract

Multimodal Sentiment Analysis (MSA) requires effective modeling of cross-modal interactions and contextual dependencies while remaining computationally efficient. Existing fusion approaches predominantly rely on Transformer-based cross-modal attention, which incurs quadratic complexity with respect to sequence length and limits scalability. Moreover, contextual information from preceding utterances is often incorporated through concatenation or independent fusion, without explicit temporal modeling that captures sentiment evolution across dialogue turns. To address these limitations, we propose CAGMamba, a context-aware gated cross-modal Mamba framework for dialogue-based sentiment analysis. Specifically, we organize the contextual and the current-utterance features into a temporally ordered binary sequence, which provides Mamba with explicit temporal structure for modeling sentiment evolution. To further enable controllable cross-modal integration, we propose a Gated Cross-Modal Mamba Network (GCMN) that integrates cross-modal and unimodal paths via learnable gating to balance information fusion and modality preservation, and is trained with a three-branch multi-task objective over text, audio, and fused predictions. Experiments on three benchmark datasets demonstrate that CAGMamba achieves state-of-the-art or competitive results across multiple evaluation metrics. All codes are available at https://github.com/User2024-xj/CAGMamba.
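
The abstract does not give the exact gating formula used in GCMN. A common way to blend two paths with a learnable gate is an element-wise convex combination, sketched below purely as an assumption; in a real model the gate logits would come from a trained layer, whereas here they are passed in by hand, and the names `gated_fusion` and `gate_logits` are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(cross, uni, gate_logits):
    """Blend two feature vectors element-wise: g * cross + (1 - g) * uni."""
    return [sigmoid(z) * c + (1.0 - sigmoid(z)) * u
            for c, u, z in zip(cross, uni, gate_logits)]

cross_modal = [1.0, 1.0, 1.0]   # stand-in for cross-modal path features
unimodal = [0.0, 0.0, 0.0]      # stand-in for unimodal path features
fused = gated_fusion(cross_modal, unimodal, gate_logits=[-20.0, 0.0, 20.0])
print([round(v, 3) for v in fused])
# prints: [0.0, 0.5, 1.0]
```

A strongly negative logit keeps the unimodal feature (modality preservation), a strongly positive one takes the cross-modal feature (information fusion), and zero averages the two.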

Keywords: Multimodal Sentiment Analysis, Mamba, Cross-modal Interaction, Context-aware, Gated Fusion, Dialogue-based, Temporal Modeling, Computational Efficiency

174. ❌ The Format Tax

Authors: Ivan Yee Lee, Loris D’Antoni, Taylor Berg-Kirkpatrick Journal/Source: arXiv Published: 2026-04-04 arXiv link: http://arxiv.org/abs/2604.03616v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

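Scoring rationale: The paper studies the performance degradation of large language models (LLMs) under structured-output requirements (such as JSON and XML), so it is highly relevant to "Large Language Models" (10 points). Because it examines how format requirements affect reasoning and writing performance, it is partly relevant to "Chain of Thought" and "System 2 Thinking" (5 points each). Other keywords such as MoE, SFT, and RAG are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The study finds that requiring large language models to respond in structured formats such as JSON substantially degrades reasoning and writing performance in open-weight models, and that decoupling reasoning from formatting recovers most of the lost accuracy.

Abstract (Chinese translation)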

要求大型语言模型以JSON格式进行回应本应只是一种格式选择，而非能力代价。然而我们发现，结构化输出要求——包括JSON、XML、LaTeX、Markdown——会显著降低开源模型的推理与写作性能。当前研究多聚焦于约束解码技术，但采样偏差仅能解释性能下降的一小部分。主要代价产生于提示阶段：在应用任何解码器约束之前，仅格式要求指令本身就会导致大部分准确性损失。这一诊断指向一个简单原则：将推理过程与格式生成解耦。无论是通过首先生成自由格式文本再进行二次格式化，还是在单次生成中允许扩展思考，将这两个关注点分离都能显著恢复损失的准确性。在六个开源模型、四个API模型、四种格式以及涵盖数学、科学、逻辑和写作的任务测试中，解耦方法恢复了大部分损失的准确性。值得注意的是，最新闭源模型几乎未表现出格式代价，这表明问题并非结构化生成所固有，而是当前开源模型尚未弥合的技术差距。代码发布于https://github.com/ivnle/the-format-tax。

Abstract

Asking a large language model to respond in JSON should be a formatting choice, not a capability tax. Yet we find that structured output requirements – JSON, XML, LaTeX, Markdown – substantially degrade reasoning and writing performance across open-weight models. The research response has focused on constrained decoding, but sampling bias accounts for only a fraction of the degradation. The dominant cost enters at the prompt: format-requesting instructions alone cause most of the accuracy loss, before any decoder constraint is applied. This diagnosis points to a simple principle: decouple reasoning from formatting. Whether by generating freeform first and reformatting in a second pass, or by enabling extended thinking within a single generation, separating the two concerns substantially recovers lost accuracy. Across six open-weight models, four API models, four formats, and tasks spanning math, science, logic, and writing, decoupling recovers most lost accuracy. Notably, most recent closed-weight models show little to no format tax, suggesting the problem is not inherent to structured generation but a gap that current open-weight models have yet to close. Code is available at https://github.com/ivnle/the-format-tax.
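
The "generate freeform first, reformat in a second pass" variant of decoupling can be sketched as a two-call wrapper. This is an illustrative sketch rather than the code from the authors' repository: `answer_then_format`, the prompt wording, and the `generate` callable are assumptions, and `stub_generate` is a stand-in for a real model call so the example runs offline.

```python
import json
from typing import Callable

def answer_then_format(question: str, generate: Callable[[str], str]) -> dict:
    """Two-pass sketch: reason in free form first, then reformat as JSON."""
    # Pass 1: no format instruction, so reasoning is not constrained.
    freeform = generate(f"Answer the question. Think step by step.\n\n{question}")
    # Pass 2: formatting only; the reasoning is already done.
    formatted = generate(
        "Rewrite the final answer from the text below as JSON of the form "
        '{"answer": "..."}. Output JSON only.\n\n' + freeform
    )
    return json.loads(formatted)

def stub_generate(prompt: str) -> str:
    # Stand-in for a real model call (hypothetical behaviour).
    if prompt.startswith("Rewrite"):
        return '{"answer": "4"}'
    return "2 + 2 = 4, so the answer is 4."

result = answer_then_format("What is 2 + 2?", stub_generate)
print(result)
# prints: {'answer': '4'}
```

The second call costs extra tokens and latency; the abstract also mentions an alternative that enables extended thinking within a single generation, which this sketch does not cover.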

Keywords: large language models, structured output, JSON, reasoning performance, format tax, decoupling, open-weight models, accuracy degradation

175. ❌ PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding

Authors: Siyuan Liu, Chaoqun Zheng, Xin Zhou, Tianrui Feng, Dingkang Liang, Xiang Bai Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04933v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

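Scoring rationale: The paper addresses 3D point cloud scene understanding and proposes PointTPA, a test-time parameter adaptation framework. It is computer vision research and is unrelated to most of the large language model (LLM) keywords. The only relevant keyword is "PEFT OR LoRA OR Parameter-efficient Fine-tuning": the paper explicitly compares against "parameter-efficient fine-tuning (PEFT) methods" and demonstrates its own parameter efficiency, so it is scored 10 (highly relevant), even though its core contribution is dynamic network parameter adaptation rather than conventional PEFT. No other keyword is addressed, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper proposes PointTPA, which uses test-time dynamic network parameter adaptation to overcome the limits of static parameters in 3D point cloud scene understanding, surpassing existing PEFT methods on multiple benchmarks while keeping parameter overhead low.

Abstract (Chinese translation)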

场景级点云理解因几何结构多样、类别分布不均衡以及空间布局高度变化而持续面临挑战。现有方法虽提升了物体级性能，但在推理时依赖静态网络参数，限制了其对动态场景数据的适应能力。本文提出PointTPA，一种测试时参数自适应框架，可为场景级点云生成输入感知的网络参数。PointTPA采用基于序列化的邻域分组（Serialization-based Neighborhood Grouping, SNG）构建局部连贯的补丁，并利用动态参数投影器（Dynamic Parameter Projector, DPP）生成补丁级自适应权重，使骨干网络能够根据场景特定变化调整其行为，同时保持较低参数量开销。集成至PTv3架构后，PointTPA通过引入两个轻量级模块（参数量不足骨干网络的2%）展现出强大的参数效率。尽管参数量开销极低，PointTPA在ScanNet验证集上实现了78.4%的平均交并比（mIoU），在多个基准测试中超越了现有参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）方法，凸显了我们的测试时动态网络参数自适应机制在增强三维场景理解方面的有效性。代码发布于https://github.com/H-EmbodVis/PointTPA。

Abstract

Scene-level point cloud understanding remains challenging due to diverse geometries, imbalanced category distributions, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static network parameters during inference, limiting their adaptability to dynamic scene data. We propose PointTPA, a Test-time Parameter Adaptation framework that generates input-aware network parameters for scene-level point clouds. PointTPA adopts a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while maintaining a low parameter overhead. Integrated into the PTv3 structure, PointTPA demonstrates strong parameter efficiency by introducing two lightweight modules of less than 2% of the backbone’s parameters. Despite this minimal parameter overhead, PointTPA achieves 78.4% mIoU on ScanNet validation, surpassing existing parameter-efficient fine-tuning (PEFT) methods across multiple benchmarks, highlighting the efficacy of our test-time dynamic network parameter adaptation mechanism in enhancing 3D scene understanding. The code is available at https://github.com/H-EmbodVis/PointTPA.

Keywords: 3D scene understanding, point cloud, test-time parameter adaptation, dynamic network parameters, parameter-efficient fine-tuning, ScanNet, mIoU, PointTPA

176. ❌ Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

Authors: Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04934v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

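Scoring rationale: The paper 'Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision' belongs to computer vision and generative AI. It studies generating garment-transferred human animation videos from a single human image, garment images, and a pose guidance video. Its core contributions are: 1) a unified single-step framework that addresses the identity drift, garment distortion, and front-back inconsistency of conventional two-stage pipelines; 2) a method for generating large-scale triplet supervision data; 3) a Dual Module architecture for video diffusion transformers that stabilizes training and improves quality. All scoring keywords relate directly to large language models (LLMs), deep learning technique principles (such as MoE, Scaling Laws, RLHF, and PEFT), reasoning methods (CoT, System 2), agent systems, or AI for Science (bioinformatics, cheminformatics). The paper involves no large language models or related techniques and is not applied to scientific fields such as biology or chemistry, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Vanast, a unified single-step framework that uses synthetic triplet supervision and a Dual Module video diffusion transformer architecture to generate high-fidelity, identity-consistent garment-transferred human animation videos directly from a single human image, garment images, and a pose video, addressing the identity drift, garment distortion, and front-back inconsistency of conventional two-stage methods.

Abstract (Chinese translation)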

本文提出Vanast，这是一个统一的框架，能够直接从单张人物图像、服装图像和姿态引导视频生成服装迁移的人类动画视频。传统的两阶段流程将基于图像的虚拟试穿和姿态驱动动画视为独立过程，这通常会导致身份漂移、服装形变及前后不一致等问题。我们的模型通过将整个流程整合为单一统一步骤来实现连贯合成，从而解决上述问题。为实现这一设定，我们构建了大规模三元组监督数据。我们的数据生成流程包括：生成与服装目录图像不同的、具有身份保持性的换装人物图像；采集完整上下装三元组以突破单服装-姿态视频对的限制；以及在不依赖服装目录图像的情况下整合多样化的真实场景三元组。我们进一步为视频扩散变换器引入了双模块架构，以稳定训练过程、保持预训练生成质量，并提升服装准确性、姿态遵循度和身份保持性，同时支持零样本服装插值。这些贡献共同使Vanast能够跨多种服装类型生成高保真、身份一致的角色动画。

Abstract

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.

Keywords: Virtual Try-On, Human Image Animation, Synthetic Triplet Supervision, Video Diffusion Transformer, Garment Transfer, Pose Guidance, Identity Preservation, Zero-shot Garment Interpolation

177. ❌ LoMa: Local Feature Matching Revisited

Authors: David Nordström, Johan Edstedt, Georg Bökman, Jonathan Astermark, Anders Heyden, Viktor Larsson, Mårten Wadenbäck, Michael Felsberg, Fredrik Kahl Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04931v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

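Scoring rationale: The paper 'LoMa: Local Feature Matching Revisited' addresses local feature matching in computer vision, within 3D vision and image processing. Its core contributions are: 1) LoMa, which improves local feature matching through large and diverse data mixtures, modern training recipes, and scaled model capacity and compute; 2) the HardMatch dataset, created to address the saturation of existing benchmarks. All scoring keywords relate directly to large language models (LLMs), deep learning technique principles, or applications of AI in science, whereas this paper studies a traditional computer vision task (local feature matching) and does not involve large models, innovations in deep learning technique principles, or AI applications in scientific fields such as biomedicine. All keywords are therefore scored 0.

!!! tip deepseek-chat TL;DR

The paper revisits local feature matching in computer vision. By combining large-scale data mixtures, modern training recipes, and scaled model capacity, it proposes LoMa, which substantially outperforms the prior state of the art on multiple benchmarks, and it introduces the HardMatch dataset to address benchmark saturation.

Abstract (Chinese translation)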

局部特征匹配长期以来一直是三维视觉系统（如运动恢复结构，Structure-from-Motion，SfM）的基础组成部分，但其发展滞后于现代数据驱动方法的快速进步。较新的方法，如前馈式重建模型，已从数据集规模的扩大中广泛受益，而局部特征匹配模型目前仍仅在少数几个中等规模数据集上进行训练。本文从数据驱动的视角重新审视局部特征匹配。在我们提出的方法（称为LoMa）中，我们结合了大规模多样化的数据混合、现代训练方案、扩展的模型容量以及扩展的计算资源，从而实现了性能的显著提升。由于当前的标准基准测试主要依赖于从成功的三维重建中收集稀疏视图，特征匹配的进展评估一直被局限于相对简单的图像对。为解决由此导致的基准测试性能饱和问题，我们从互联网数据中收集了1000对极具挑战性的图像对，构建了一个名为HardMatch的新数据集。HardMatch的真实对应关系由作者通过人工标注获得。在我们广泛的基准测试套件中，我们发现LoMa在各方面均取得了突出进展：在HardMatch上以+18.6 mAA超越当前最先进方法ALIKED+LightGlue，在WxBS上提升+29.5 mAA，在InLoc上提升+21.4（1米，10$^\circ$），在RUBIK上提升+24.2 AUC，在IMC 2022上提升+12.4 mAA。我们已在https://github.com/davnords/LoMa公开发布代码和模型。

Abstract

Local feature matching has long been a fundamental component of 3D vision systems such as Structure-from-Motion (SfM), yet progress has lagged behind the rapid advances of modern data-driven approaches. The newer approaches, such as feed-forward reconstruction models, have benefited extensively from scaling dataset sizes, whereas local feature matching models are still only trained on a few mid-sized datasets. In this paper, we revisit local feature matching from a data-driven perspective. In our approach, which we call LoMa, we combine large and diverse data mixtures, modern training recipes, scaled model capacity, and scaled compute, resulting in remarkable gains in performance. Since current standard benchmarks mainly rely on collecting sparse views from successful 3D reconstructions, the evaluation of progress in feature matching has been limited to relatively easy image pairs. To address the resulting saturation of benchmarks, we collect 1000 highly challenging image pairs from internet data into a new dataset called HardMatch. Ground truth correspondences for HardMatch are obtained via manual annotation by the authors. In our extensive benchmarking suite, we find that LoMa makes outstanding progress across the board, outperforming the state-of-the-art method ALIKED+LightGlue by +18.6 mAA on HardMatch, +29.5 mAA on WxBS, +21.4 (1m, 10$^\circ$) on InLoc, +24.2 AUC on RUBIK, and +12.4 mAA on IMC 2022. We release our code and models publicly at https://github.com/davnords/LoMa.

Keywords: Local Feature Matching, 3D Vision, Structure-from-Motion, Data-driven Approach, HardMatch Dataset, Model Capacity Scaling, Benchmark Evaluation, Computer Vision

178. ❌ Rethinking Model Efficiency: Multi-Agent Inference with Large Models

Authors: Sixun Dong, Juhua Hu, Steven Li, Wei Wen, Qi Qian Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04929v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

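Scoring rationale: The paper compares the inference efficiency of large models (LLMs) and small models (SLMs) and proposes a multi-agent framework (Multi-agent Systems / LLM Agents) that transfers reasoning tokens (Chain of Thought) from a small model to speed up large-model inference (Inference Acceleration). It directly touches the LLM, SLM, multi-agent systems, inference acceleration, and chain-of-thought keywords; other keywords such as MoE, alignment, and RAG are not mentioned.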
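
The score column in the table above appears to be the product of weight and relevance, with the paper's total compared against the passing score of 26.6. The following is a minimal sketch under that assumption; the aggregation rule and the helper name `total_score` are illustrative and are not taken from the report generator.

```python
# Assumption: per-keyword score = weight * relevance, and the paper's
# total is the sum over all keywords. The report does not state this rule.
PASSING_SCORE = 26.6  # from the "0.0 / 26.6" score line


def total_score(rows):
    """rows: iterable of (weight, relevance) pairs, relevance on a 0-10 scale."""
    return sum(weight * relevance for weight, relevance in rows)


# Six keywords are rated 10.0/10 for this paper, all listed with weight 0.0;
# the remaining 21 keywords are rated 0.0/10.
rows = [(0.0, 10.0)] * 6 + [(0.0, 0.0)] * 21
score = total_score(rows)
passed = score >= PASSING_SCORE
print(score, passed)
```

Under this rule a weight of 0.0 zeroes the total regardless of relevance, which matches the 0.0 shown for this paper.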

!!! tip deepseek-chat TL;DR

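The paper studies the inference efficiency of large versus small models in vision-language models, finds that large models can reach comparable performance with fewer output tokens, and proposes a multi-agent inference framework that reuses a small model's reasoning tokens to improve large-model efficiency.

Abstract (Chinese translation)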

大多数视觉语言模型（VLMs）采用大语言模型（LLM）作为解码器，通过自回归方式顺序生成响应标记。因此，输出标记的数量可能成为端到端延迟的瓶颈。然而，不同模型为实现相当的性能可能需要差异巨大的输出标记数量。在本工作中，我们基于模拟数据对视觉语言模型各组成部分的延迟进行了全面分析。实验表明，输出标记较少的大型模型可能比输出序列较长的小型模型更高效。在多样化真实世界基准测试上的实证研究证实了这一观察：大型模型能够以显著更少的输出标记实现优于或相当于小型模型的性能。为利用大型模型的效率优势，我们提出一种多智能体推理框架，该框架保持大型模型生成简短响应，但在必要时从小型模型转移关键推理标记。在基准任务上的比较表明，通过复用小型模型的推理标记，该方法能够帮助模型接近大型模型自身推理所达到的性能，从而验证了我们所提方案的有效性。

Abstract

Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can be the bottleneck of the end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. The empirical study on diverse real-world benchmarks confirms the observation that a large model can achieve better or comparable performance as a small model with significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps large models with short responses but transfers the key reasoning tokens from the small model when necessary. The comparison on benchmark tasks demonstrates that by reusing the reasoning tokens from small models, it can help approach the performance of a large model with its own reasoning, which confirms the effectiveness of our proposal.
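
The latency argument in this abstract can be illustrated with a toy model in which end-to-end latency is a one-time prefill cost plus a per-token decode cost multiplied by the number of output tokens. All numbers below are hypothetical and are not taken from the paper; `end_to_end_latency` is an illustrative helper, not the authors' latency model.

```python
def end_to_end_latency(prefill_s, per_token_s, n_output_tokens):
    """Toy autoregressive latency: prefill once, then decode tokens one by one."""
    return prefill_s + per_token_s * n_output_tokens


# Hypothetical settings: the large model is slower per token but answers briefly;
# the small model is faster per token but emits a long reasoning trace.
large = end_to_end_latency(prefill_s=0.5, per_token_s=0.04, n_output_tokens=50)
small = end_to_end_latency(prefill_s=0.1, per_token_s=0.01, n_output_tokens=800)

print(f"large: {large:.2f}s, small: {small:.2f}s")
```

With these assumed values the large model finishes sooner even though each of its tokens costs four times as much, because the output length dominates the total.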

Keywords: vision-language models, large language models, small models, inference efficiency, multi-agent framework, reasoning tokens, autoregressive generation, latency analysis

179. ❌ Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo

Authors: Zeyu Ma, Alexander Raistrick, Jia Deng Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04925v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

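Scoring rationale: The paper studies synthetic data generation for multi-view stereo (MVS), using procedural rules (NURBS, displacement, and texture patterns) to generate training data. The topic is computer vision and is entirely unrelated to the scoring keywords, which all concern large models, deep learning technique principles, or AI applications in science. The paper involves no large models, deep learning techniques, or scientific AI applications, and mentions none of the technical concepts in the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper proposes generating multi-view stereo training data from simple procedural rules. Data generated from a small set of rules outperforms manually curated data at a scale of 8,000 images and, at 352,000 images, matches or exceeds the performance obtained with 692,000 manually curated images.

Abstract (Chinese translation)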

本文探讨了多视图立体视觉（Multi-View Stereo, MVS）中程序化规则的设计空间。我们证明，通过使用一种全新的、完全程序化的生成器SimpleProc，仅需基于少量规则——包括非均匀有理B样条（Non-Uniform Rational Basis Splines, NURBS）以及基本位移与纹理模式——即可生成有效的训练数据。在仅使用8,000张图像的较小规模下，我们的方法相较于从游戏和真实物体中获取的同规模人工筛选图像，取得了更优的结果。当数据规模扩展至352,000张图像时，我们的方法所训练出的模型性能与使用超过692,000张人工筛选图像训练的模型相当，并在多个基准测试中实现了超越。源代码与数据已公开于https://github.com/princeton-vl/SimpleProc。

Abstract

In this paper, we explore the design space of procedural rules for multi-view stereo (MVS). We demonstrate that we can generate effective training data using SimpleProc: a new, fully procedural generator driven by a very small set of rules using Non-Uniform Rational Basis Splines (NURBS), as well as basic displacement and texture patterns. At a modest scale of 8,000 images, our approach achieves superior results compared to manually curated images (at the same scale) sourced from games and real-world objects. When scaled to 352,000 images, our method yields performance comparable to–and in several benchmarks, exceeding–models trained on over 692,000 manually curated images. The source code and the data are available at https://github.com/princeton-vl/SimpleProc.

Keywords: multi-view stereo, procedural generation, synthetic data, NURBS, training data, SimpleProc, computer vision, 3D reconstruction

180. ❌ A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Authors: Tommie Kerssies, Gabriele Berton, Ju He, Qihang Yu, Wufei Ma, Daan de Geus, Gijs Dubbelman, Liang-Chieh Chen Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04913v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

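Scoring rationale: The paper focuses on video world modeling and proposes DeltaTok and DeltaWorld. Its core is a generative world model, which is highly relevant to the keyword 'World Models AND General World Models' (10 points). The other keywords mainly concern LLM techniques, training, alignment, reasoning, agents, compression, and so on, whereas this paper studies video prediction in the feature space of a vision foundation model (VFM) and does not involve LLMs or related techniques, so all other keywords are scored 0.

!!! tip deepseek-chat TL;DR

To address the computational cost of predicting diverse future states in video world modeling, the paper proposes the DeltaTok tokenizer and the DeltaWorld generative world model. Encoding the feature difference between consecutive frames as a single token greatly reduces parameters and compute while producing more accurate future predictions.

Abstract (Chinese translation)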

预测多样化的未来状态是视频世界建模的核心挑战。判别式世界模型生成确定性预测，隐式地对可能未来进行平均化处理，而现有生成式世界模型仍存在计算成本高昂的问题。近期研究表明，在视觉基础模型（Vision Foundation Model, VFM）的特征空间中预测未来（而非在优化像素重建的潜在空间中），可大幅减少世界模型所需参数量。然而，此类方法大多仍属于判别式框架。本文提出DeltaTok——一种将连续帧间VFM特征差异编码为单个连续“增量”令牌的编码器，以及DeltaWorld——一种基于这些令牌运行的生成式世界模型，能高效生成多样化的合理未来。增量令牌将视频从三维时空表示简化为一维时间序列，例如在处理512x512帧时实现1,024倍的令牌压缩。这种紧凑表示使得可扩展的多假设训练成为可能：系统并行生成多个未来序列，仅对最优结果进行监督。在推理阶段，该方法通过单次前向传播即可实现多样化预测。在密集预测任务上的实验表明，DeltaWorld生成的未来预测更贴合真实世界结果，同时参数量比现有生成式世界模型减少35倍以上，计算量（FLOPs）降低2,000倍。代码与权重：https://deltatok.github.io。

Abstract

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous “delta” token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
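
The 1,024x token reduction quoted for 512x512 frames is consistent with a 32 x 32 grid of per-frame feature tokens collapsing into one delta token. A short sketch of that arithmetic follows, assuming a ViT-style encoder with a patch size of 16 pixels; the abstract does not state the patch size, and `tokens_per_frame` is an illustrative helper.

```python
def tokens_per_frame(height, width, patch_size):
    """Number of patch tokens a ViT-style encoder yields for one frame."""
    return (height // patch_size) * (width // patch_size)


PATCH_SIZE = 16  # assumption; not stated in the abstract
spatial_tokens = tokens_per_frame(512, 512, PATCH_SIZE)  # 32 * 32 grid
delta_tokens = 1  # one delta token per frame, as described in the abstract
reduction = spatial_tokens // delta_tokens
print(reduction)  # 1024
```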

Keywords: generative world modeling, video world models, delta tokens, vision foundation model, multi-hypothesis training, efficient forecasting, feature difference encoding, diverse future prediction

181. ❌ SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

Authors: Yicheng Xiao, Wenhu Zhang, Lin Song, Yukang Chen, Wenbo Li, Nan Jiang, Tianhe Ren, Haokun Lin, Wei Huang, Haoyang Huang, Xiu Li, Nan Duan, Xiaojuan Qi Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04911v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

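Scoring rationale: The paper addresses image spatial editing in computer vision, covering benchmark construction, dataset generation, and model development, but does not touch the keyword areas of large language models, deep learning technique principles, or scientific AI applications. All keywords relate to large-model techniques, training methods, inference optimization, alignment, agent systems, or scientific applications, whereas this paper is purely computer vision research, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Addressing the weakness of existing models at fine-grained image spatial editing, the paper proposes a complete evaluation framework comprising a benchmark, a synthetic dataset, and a baseline model, and substantially improves performance on spatial editing tasks.

Abstract (Chinese translation)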

图像空间编辑执行几何驱动的变换，实现对物体布局和相机视点的精确控制。现有模型难以实现细粒度的空间操控，这促使我们构建专门的评估体系。我们的贡献如下：（i）我们提出SpatialEdit-Bench，这是一个通过视点重建与构图分析联合度量感知合理性与几何保真度的完整基准测试框架，用于系统评估空间编辑能力。（ii）为突破可扩展训练的数据瓶颈，我们构建了SpatialEdit-500k合成数据集。该数据集通过可控的Blender管线生成，可在多样化背景中渲染物体并沿系统化相机轨迹采集图像，为物体中心与相机中心的操作提供精确的真实变换标注。（iii）基于此数据，我们开发了SpatialEdit-16B基线模型，专门用于细粒度空间编辑。该方法在通用编辑任务上达到可比性能，同时在空间操控任务上显著超越现有方法。所有资源将在https://github.com/EasonXiao-888/SpatialEdit公开。

Abstract

Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite. Our contributions are listed: (i) We introduce SpatialEdit-Bench, a complete benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.

Keywords: Image Spatial Editing, Fine-grained Manipulation, Benchmark Evaluation, Synthetic Dataset, Geometric Fidelity, Viewpoint Reconstruction, Camera Trajectories, Baseline Model

182. ❌ ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality

Authors: Dawar Khan, Alexandre Kouyoumdjian, Xinyu Liu, Omar Mena, Dominik Engel, Ivan Viola Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04905v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

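Scoring rationale: The paper studies an on-device multimodal vision-language interaction system for XR; its core innovation is combining controller-based click selection with local VLM inference. Keyword relevance: 1) 'Small Language Models OR SLMs OR On-device AI' is highly relevant (10 points), since the paper explicitly uses an on-device VLM and emphasizes the advantages of local inference; 2) 'Large Language Models OR LLMs OR Foundation Models' is partially relevant (5 points): the paper does not directly study LLM technique principles, but VLMs fall within the scope of large models and the system is compared with ChatGPT and Gemini; 3) the other keywords (such as MoE, Scaling Laws, training methods, and inference optimization) are not addressed and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes the ClickAIXR framework, which combines controller-based click selection with a local vision-language model to enable privacy-preserving, low-latency multimodal interaction with real-world objects in XR. A user study indicates acceptable latency and a good user experience.

Abstract (Chinese translation)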

本文提出ClickAIXR，一种用于扩展现实（XR）环境中物体多模态视觉-语言交互的新型设备端框架。与以往依赖云端人工智能（如ChatGPT）或基于注视的交互系统（如GazePointAR）不同，ClickAIXR将设备端视觉-语言模型（Vision-Language Model, VLM）与基于控制器的物体选择范式相结合，使用户能够在XR环境中精确点击真实世界物体。选定物体后，其图像由VLM在本地处理，通过文本和语音回答自然语言问题。这种以物体为中心的交互方式减少了纯注视或纯语音界面固有的歧义性，并通过在设备端执行全部推理提升了透明度，同时解决了隐私和延迟方面的顾虑。我们在Magic Leap SDK（C API）中实现了ClickAIXR，采用基于ONNX的本地VLM推理。通过一项用户研究，我们将ClickAIXR与Gemini 2.5 Flash和ChatGPT 5进行比较，评估了可用性、信任度和用户满意度。结果表明，系统延迟处于中等水平，用户体验可被接受。我们的研究证明了基于点击的物体选择与设备端人工智能相结合，在推进可信赖、保护隐私的XR交互方面的潜力。源代码及补充材料发布于：nanovis.org/ClickAIXR.html

Abstract

We present ClickAIXR, a novel on-device framework for multimodal vision-language interaction with objects in extended reality (XR). Unlike prior systems that rely on cloud-based AI (e.g., ChatGPT) or gaze-based selection (e.g., GazePointAR), ClickAIXR integrates an on-device vision-language model (VLM) with a controller-based object selection paradigm, enabling users to precisely click on real-world objects in XR. Once selected, the object image is processed locally by the VLM to answer natural language questions through both text and speech. This object-centered interaction reduces ambiguity inherent in gaze- or voice-only interfaces and improves transparency by performing all inference on-device, addressing concerns around privacy and latency. We implemented ClickAIXR in the Magic Leap SDK (C API) with ONNX-based local VLM inference. We conducted a user study comparing ClickAIXR with Gemini 2.5 Flash and ChatGPT 5, evaluating usability, trust, and user satisfaction. Results show that latency is moderate and user experience is acceptable. Our findings demonstrate the potential of click-based object selection combined with on-device AI to advance trustworthy, privacy-preserving XR interactions. The source code and supplementary materials are available at: nanovis.org/ClickAIXR.html

Keywords: on-device AI, vision-language model, extended reality, multimodal interaction, object selection, privacy-preserving, local inference, user study

183. ❌ HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes

Authors: Mauricio Soroco, Francesco Pittaluga, Zaid Tasneem, Abhishek Aich, Bingbing Zhuang, Wuyang Chen, Manmohan Chandraker, Ziyu Jiang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04887v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

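Scoring rationale: HorizonWeaver focuses on image editing for autonomous-driving scenes; its core contributions are dataset construction, language-guided masks, and training-loss design for multi-granularity, semantically rich driving-scene editing. Although it is an applied AI paper, every scoring keyword targets large-model techniques (e.g., LLM architecture, training methods, inference optimization), and the paper neither uses nor improves LLMs, MoE, scaling laws, fine-tuning methods, reasoning techniques, or agent systems. The keyword 'AI for Science' refers specifically to bioinformatics and cheminformatics and is unrelated to autonomous-driving engineering. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes HorizonWeaver, a framework that uses large-scale dataset generation, language-guided masks, and joint training losses to address multi-granularity semantic editing of autonomous-driving scenes, enabling photorealistic, instruction-driven editing of complex driving scenes and outperforming prior methods on L1, CLIP, and DINO metrics.

Abstract (Chinese translation)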

确保自动驾驶的安全性需要生成超越真实道路测试范围、可扩展且可控的逼真驾驶场景。然而，现有的指令引导图像编辑方法主要在以物体为中心或艺术类数据上训练，难以处理密集且安全关键的驾驶场景布局。我们提出HorizonWeaver，旨在解决驾驶场景编辑中的三个核心挑战：(1) 多层级粒度控制，要求在密集环境中实现物体层级与场景层级的连贯编辑；(2) 丰富的高层语义理解，在遵循详细指令的同时保持场景中多样物体的完整性；(3) 普遍存在的域适应问题，能够处理未知环境中气候、布局及交通状况的变化。HorizonWeaver的核心在于数据、模型与训练三方面的互补性贡献：(1) 数据层面：通过整合Boreas、nuScenes和Argoverse2构建大规模配对真实/合成数据集，以提升模型泛化能力；(2) 模型层面：引入语言引导掩码实现细粒度编辑，通过语义增强的掩码与提示实现精准的语言驱动编辑；(3) 训练层面：采用内容保持与指令对齐的联合损失函数，确保场景一致性与指令遵循度。综合而言，HorizonWeaver为复杂驾驶场景的逼真指令驱动编辑提供了一个可扩展框架，共收集涵盖13个编辑类别的25.5万张图像，在L1、CLIP和DINO指标上超越现有方法，获得+46.4%的用户偏好度，并将鸟瞰图分割交并比提升+33%。项目页面：https://msoroco.github.io/horizonweaver/

Abstract

Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-level granularity, requiring coherent object- and scene-level edits in dense environments; (2) rich high-level semantics, preserving diverse objects while following detailed instructions; and (3) ubiquitous domain shifts, handling changes in climate, layout, and traffic across unseen environments. The core of HorizonWeaver is a set of complementary contributions across data, model, and training: (1) Data: Large-scale dataset generation, where we build a paired real/synthetic dataset from Boreas, nuScenes, and Argoverse2 to improve generalization; (2) Model: Language-Guided Masks for fine-grained editing, where semantics-enriched masks and prompts enable precise, language-guided edits; and (3) Training: Content preservation and instruction alignment, where joint losses enforce scene consistency and instruction fidelity. Together, HorizonWeaver provides a scalable framework for photorealistic, instruction-driven editing of complex driving scenes, collecting 255K images across 13 editing categories and outperforming prior methods in L1, CLIP, and DINO metrics, achieving +46.4% user preference and improving BEV segmentation IoU by +33%. Project page: https://msoroco.github.io/horizonweaver/

Keywords: autonomous driving, scene editing, language-guided editing, multi-level granularity, domain generalization, photorealistic editing, instruction-driven editing, driving scenes

184. ❌ Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

Authors: Ahan Shabanov, Peter Hedman, Ethan Weber, Zhengqin Li, Denis Rozumny, Gael Le Lan, Naina Dhingra, Lei Luo, Andrea Vedaldi, Christian Richardt, Andrea Tagliasacchi, Bo Zhu, Numair Khan Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04874v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

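Scoring rationale: The paper is a computer-vision method for 3D Gaussian reconstruction, using techniques such as generative flow matching and a hierarchical patching scheme; it does not involve large language models, large-model techniques, or AI applications in science. All scoring keywords concern large models, their training and inference techniques, or AI for science, whereas this is purely 3D-reconstruction research, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

The paper proposes Free-Range Gaussians, a multi-view 3D reconstruction method that predicts non-grid-aligned 3D Gaussians via flow matching; with a hierarchical patching scheme and training- and inference-time refinements, it achieves higher-quality reconstruction than existing methods from only a few input images while using fewer Gaussians.

Abstract (Chinese translation)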

我们提出自由范围高斯模型（Free-Range Gaussians），这是一种多视角重建方法，能够仅从四张图像预测非像素、非体素对齐的三维高斯分布。该方法通过高斯参数流匹配实现。我们提出的生成式重建框架允许模型使用非网格对齐的三维数据进行监督，并使其能够在未观测区域合成合理内容。因此，本方法改进了先前方法中产生高度冗余的网格对齐高斯分布、在未观测区域出现空洞或模糊条件均值的问题。为处理高质量结果所需的高斯分布数量，我们引入分层分块方案，将空间相关的高斯分布分组为联合变换器（transformer）标记，在保持结构的同时将序列长度减半。我们进一步提出训练时采用时间步加权渲染损失，在推理时使用光度梯度引导和无分类器引导以提升保真度。在Objaverse和Google Scanned Objects数据集上的实验表明，相较于像素和体素对齐方法，本方法在使用更少高斯分布的同时实现了性能的持续提升，当输入视角存在物体局部未观测时优势尤为显著。

Abstract

We present Free-Range Gaussians, a multi-view reconstruction method that predicts non-pixel, non-voxel-aligned 3D Gaussians from as few as four images. This is done through flow matching over Gaussian parameters. Our generative formulation of reconstruction allows the model to be supervised with non-grid-aligned 3D data, and enables it to synthesize plausible content in unobserved regions. Thus, it improves on prior methods that produce highly redundant grid-aligned Gaussians, and suffer from holes or blurry conditional means in unobserved regions. To handle the number of Gaussians needed for high-quality results, we introduce a hierarchical patching scheme to group spatially related Gaussians into joint transformer tokens, halving the sequence length while preserving structure. We further propose a timestep-weighted rendering loss during training, and photometric gradient guidance and classifier-free guidance at inference to improve fidelity. Experiments on Objaverse and Google Scanned Objects show consistent improvements over pixel and voxel-aligned methods while using significantly fewer Gaussians, with large gains when input views leave parts of the object unobserved.

Keywords: 3D Gaussian reconstruction, multi-view reconstruction, flow matching, non-grid-aligned Gaussians, hierarchical patching, transformer tokens, photometric gradient guidance, classifier-free guidance
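
The abstract's idea of grouping spatially related Gaussians into joint tokens can be illustrated with a toy sketch. It is not the paper's hierarchical patching scheme: it simply assumes each Gaussian is a row of parameters whose first three columns are the center, sorts the rows by position, and concatenates adjacent pairs, which halves the sequence length. The function name and the sorting rule are assumptions made for illustration.

```python
import numpy as np


def pair_gaussians_into_tokens(params):
    """Toy grouping: sort Gaussians by center, then merge adjacent pairs.

    params: array of shape (N, D); columns 0..2 are assumed to hold the
    xyz center, the remaining columns any other Gaussian parameters.
    Returns an array of shape (N // 2, 2 * D), i.e. half as many tokens.
    """
    params = np.asarray(params, dtype=np.float64)
    n, d = params.shape
    if n % 2 != 0:
        raise ValueError("this sketch expects an even number of Gaussians")
    # np.lexsort treats the LAST key as primary: sort by x, then y, then z.
    order = np.lexsort((params[:, 2], params[:, 1], params[:, 0]))
    # Each pair of neighbouring rows becomes one wider token.
    return params[order].reshape(n // 2, 2 * d)


gaussians = np.array([
    [3.0, 0.0, 0.0, 9.0],
    [0.0, 0.0, 0.0, 1.0],
    [2.0, 0.0, 0.0, 7.0],
    [1.0, 0.0, 0.0, 5.0],
])
tokens = pair_gaussians_into_tokens(gaussians)
print(tokens.shape)
```

A real scheme would likely use a locality-preserving grouping rather than a plain coordinate sort; the sort here only keeps the example short.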

185. ❌ Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations

Authors: Tuan Dung Nguyen, Minh Khoi Ho, Qi Chen, Yutong Xie, Nguyen Cam-Tu, Minh Khoi Nguyen, Dang Huy Pham Nguyen, Anton van den Hengel, Johan W. Verjans, Phi Le Nguyen, Vu Minh Hieu Phan Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04863v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

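Scoring rationale: The paper focuses on hallucination detection in large vision-language models (LVLMs), an application of large models. Highly relevant keywords: (1) 'Hallucination Mitigation' (10 points), since hallucination detection is the paper's core topic; (2) 'Large Language Models' (8 points), since LVLMs are visual extensions of large language models; (3) 'Mechanistic Interpretability' (8 points), since the method analyzes attention patterns and semantic alignment to understand model behavior. Other keywords such as MoE, SFT, and RAG are not addressed in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

Targeting hallucination in large vision-language models, the paper proposes a detection framework based on fine-grained token grounding that analyzes attention patterns and semantic alignment, reaching up to 90% accuracy in token-level hallucination detection.

Abstract (Chinese translation)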

大型视觉语言模型（LVLMs）在视觉推理任务上展现出强大性能，但仍极易产生幻觉现象。现有的检测方法主要依赖于对象标记（object token）与输入图像之间粗粒度的全局关联度量。这种全局策略存在局限：幻觉标记可能在许多局部区域表现出微弱但广泛分散的相关性，这些相关性会聚合为具有欺骗性的整体高关联度，从而规避当前基于全局的幻觉检测器。我们从一个简单而关键的观察出发：一个真实的对象标记必须牢固地锚定在图像的特定区域。基于这一洞见，我们提出了一种基于图像块（patch-level）的幻觉检测框架，该框架细粒度地考察模型各层间的标记级交互。我们的分析揭示了幻觉标记的两个特征性信号：（i）它们产生分散、非局部化的注意力模式，与真实标记所表现出的紧凑、高度聚焦的注意力形成鲜明对比；（ii）它们未能与任何视觉区域展现出有意义的语义对齐。基于这些发现，我们开发了一种轻量级且可解释的检测方法，该方法利用图像块级别的统计特征，并结合隐藏层表征。我们的方法在标记级幻觉检测中达到了高达90%的准确率，证明了细粒度结构分析在检测幻觉方面的优越性。

Abstract

Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance, thus evading the current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention seen in faithful tokens; and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that leverages patch-level statistical features, combined with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the superiority of fine-grained structural analysis for detecting hallucinations.

Keywords: Large Vision-Language Models, Hallucination Detection, Token Grounding, Attention Patterns, Semantic Alignment, Fine-grained Analysis, Patch-level Detection, Interpretable Method
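
The first signature described in the abstract (diffuse versus compact attention) can be made concrete with a simple dispersion statistic. The sketch below uses normalized Shannon entropy over one token's patch attention; this is an assumed stand-in, not necessarily the patch-level feature the paper uses, and `attention_entropy` is a hypothetical helper.

```python
import numpy as np


def attention_entropy(attn):
    """Normalized Shannon entropy of one token's attention over image patches.

    attn: non-negative attention weights, one per patch (any shape).
    Returns a value in [0, 1]: 0 means all attention sits on a single
    patch, 1 means attention is spread uniformly over all patches.
    """
    attn = np.asarray(attn, dtype=np.float64).ravel()
    if attn.size < 2:
        raise ValueError("need at least two patches")
    total = attn.sum()
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    p = attn / total
    nonzero = p[p > 0]  # 0 * log(0) is treated as 0
    entropy = 0.0 - (nonzero * np.log(nonzero)).sum()
    return float(entropy / np.log(attn.size))


focused = np.zeros(16)
focused[5] = 1.0                  # compact, well-localized attention
diffuse = np.full(16, 1.0 / 16)   # attention scattered over every patch

print(attention_entropy(focused), attention_entropy(diffuse))
```

Under this assumption, a token whose entropy is close to 1 would be a candidate hallucination, while a grounded token should score near 0; any decision threshold would have to be tuned on labeled data.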

186. ❌ Unified Vector Floorplan Generation via Markup Representation

Authors: Kaede Shiohara, Toshihiko Yamasaki Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04859v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

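Scoring rationale: The paper studies automatic architectural floorplan generation and proposes a new markup representation (FML) and a Transformer-based generative model (FMLM). Although it uses a Transformer architecture, the paper targets a specific application in computer graphics and architectural design and does not contribute to large language models (LLMs) or their underlying techniques. All scoring keywords relate directly to LLMs, large-model techniques, or AI for Science (e.g., bioinformatics); the paper's application area (floorplan generation) falls outside these, and it does not discuss any of the techniques named in the keywords (e.g., MoE, RLHF, RAG). All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

For automatic residential floorplan generation, the paper proposes a general representation, Floorplan Markup Language (FML), and a Transformer-based generative model, FMLM, which surpasses previous task-specific methods on the RPLAN dataset.

Abstract (Chinese translation)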

住宅户型平面图自动生成长期以来是连接建筑学与计算机图形学的核心挑战，其目标在于提升空间设计的效率与可及性。早期基于约束满足或组合优化的方法虽能确保方案可行性，但缺乏多样性与灵活性。近期生成模型虽取得显著成果，却因表示方式欠佳，难以在异构条件任务（如依据场地边界、房间邻接图或局部布局生成完整平面图）中实现泛化。为弥补这一不足，我们提出了户型平面图标记语言（Floorplan Markup Language, FML），这是一种在单一结构化语法中编码户型信息的通用表示方法，将整个户型生成问题转化为下一标记预测任务。基于FML，我们开发了FMLM模型——一种基于Transformer的生成模型，能够在多样化条件下生成高保真且功能完备的户型平面图。在RPLAN数据集上的综合实验表明，FMLM作为单一模型，其性能超越了以往针对特定任务的先进方法。

Abstract

Automatic residential floorplan generation has long been a central challenge bridging architecture and computer graphics, aiming to make spatial design more efficient and accessible. While early methods based on constraint satisfaction or combinatorial optimization ensure feasibility, they lack diversity and flexibility. Recent generative models achieve promising results but struggle to generalize across heterogeneous conditional tasks, such as generation from site boundaries, room adjacency graphs, or partial layouts, due to their suboptimal representations. To address this gap, we introduce Floorplan Markup Language (FML), a general representation that encodes floorplan information within a single structured grammar, which casts the entire floorplan generation problem into a next token prediction task. Leveraging FML, we develop a transformer-based generative model, FMLM, capable of producing high-fidelity and functional floorplans under diverse conditions. Comprehensive experiments on the RPLAN dataset demonstrate that FMLM, despite being a single model, surpasses the previous task-specific state-of-the-art methods.

Keywords: floorplan generation, markup representation, transformer, generative model, architecture, computer graphics, FML, FMLM

Authors: Runhao Mao, Hanshi Wang, Yixiang Yang, Qianli Ma, Jingmeng Zhou, Zhipeng Zhang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04857v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

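Scoring rationale: The paper studies fine-tuning-based adaptation of Vision-Language Models (VLMs) for autonomous driving, with a focus on catastrophic forgetting. Keyword relevance: (1) the paper involves VLMs, which are large models, but its connection to pure LLMs/Foundation Models is weak (5 points); (2) it directly studies the fine-tuning process, which is highly relevant to Post-training/SFT (10 points); (3) it proposes the Drive Expert Adapter framework, which adapts in the prompt space rather than the weight space and thus counts as parameter-efficient adaptation, highly relevant to PEFT/LoRA (10 points); (4) it addresses domain adaptation (8 points); (5) it refers to pre-trained world knowledge, which is loosely related to the World Models concept (5 points). Other keywords such as MoE, SLMs, RAG, RLHF, and quantization are not addressed and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies the catastrophic forgetting that Vision-Language Models suffer when fine-tuned for autonomous driving and proposes the Drive Expert Adapter framework, which routes dynamically in the prompt space rather than the weight space, improving driving-task performance while preserving the model's generalization ability.

Abstract (Chinese translation)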

将视觉语言模型（Vision-Language Models, VLMs）集成到自动驾驶中，有望解决长尾场景问题，但这一范式面临着一个关键且尚未解决的挑战：灾难性遗忘。为适应驾驶专用数据而进行的微调过程，同时会侵蚀模型原本宝贵的预训练世界知识，从而形成一个自相矛盾的困境，削弱了其应用的核心价值。本文首次对这一现象进行了系统性研究。我们引入了一个包含18万场景的大规模新数据集，首次构建了专门用于量化自动驾驶中灾难性遗忘的基准。分析表明，现有方法存在显著的知识退化问题。为解决此问题，我们提出了驾驶专家适配器（Drive Expert Adapter, DEA），这是一种新颖的框架，通过将适应过程从权重空间转移到提示空间，从而规避了上述权衡。DEA能够根据场景特定线索，动态地将推理路由至不同的知识专家，在提升驾驶任务性能的同时，不破坏模型的基础参数。大量实验证明，我们的方法不仅在驾驶任务上取得了最先进的结果，还有效缓解了灾难性遗忘，保留了使VLMs成为自动驾驶变革性力量的关键泛化能力。数据和模型已在FidelityDrivingBench平台发布。

Abstract

The integration of Vision-Language Models (VLMs) into autonomous driving promises to solve long-tail scenarios, but this paradigm faces the critical and unaddressed challenge of catastrophic forgetting. The very fine-tuning process used to adapt these models to driving-specific data simultaneously erodes their invaluable pre-trained world knowledge, creating a self-defeating paradox that undermines the core reason for their use. This paper provides the first systematic investigation into this phenomenon. We introduce a new large-scale dataset of 180K scenes, which enables the first-ever benchmark specifically designed to quantify catastrophic forgetting in autonomous driving. Our analysis reveals that existing methods suffer from significant knowledge degradation. To address this, we propose the Drive Expert Adapter (DEA), a novel framework that circumvents this trade-off by shifting adaptation from the weight space to the prompt space. DEA dynamically routes inference through different knowledge experts based on scene-specific cues, enhancing driving-task performance without corrupting the model’s foundational parameters. Extensive experiments demonstrate that our approach not only achieves state-of-the-art results on driving tasks but also effectively mitigates catastrophic forgetting, preserving the essential generalization capabilities that make VLMs a transformative force for autonomous systems. Data and model are released at FidelityDrivingBench.

Keywords: Vision-Language Models, autonomous driving, catastrophic forgetting, fine-tuning, domain adaptation, parameter-efficient adaptation, prompt space, driving benchmark

188. ❌ Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Authors: Haoxuan Han, Weijie Wang, Zeyu Zhang, Yefei He, Bohan Zhuang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04838v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

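Scoring rationale: The paper proposes the Degradation-Driven Prompting (DDP) framework, which improves VLM performance on VQA by reducing image resolution and adding structural prompts. Keyword relevance: (1) 'Large Language Models' scores 5: the paper uses Vision-Language Models (VLMs), an application of large models to vision-language tasks. (2) 'Chain of Thought' and 'System 2 Thinking' each score 5: the paper is concerned with the model's reasoning process and uses structured prompts to guide deeper reasoning. (3) 'Hallucination Mitigation' scores 8: a core goal is reducing hallucinations and reasoning errors caused by high-resolution detail. (4) 'Mechanistic Interpretability' scores 5: the paper probes model behavior by controlling input quality, which touches on interpretability. (5) 'In-context Learning' scores 8: the paper explicitly uses In-Context Learning (ICL) as part of its method. The remaining keywords have no direct connection to the paper's focus on visual question answering and image processing and score 0.

!!! tip deepseek-chat TL;DR

To address hallucinations and reasoning errors caused by high-resolution detail in visual question answering, the paper proposes Degradation-Driven Prompting, which reduces image fidelity and adds structural prompts so that vision-language models focus on essential information, achieving higher reasoning accuracy on challenging visual benchmarks.

Abstract (Chinese translation)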

视觉语言模型（VLMs）的最新进展显著推动了视觉问答（VQA）领域的边界。然而，高分辨率细节有时会成为噪声，导致模型产生幻觉或推理错误。本文提出了一种新颖的框架，即退化驱动提示（Degradation-Driven Prompting, DDP），该框架通过策略性地降低图像保真度，迫使模型聚焦于关键的结构信息，从而提升VQA性能。我们在两项不同任务中评估了DDP。在物理属性任务中，针对易导致人类误判的图像，DDP结合了80p下采样、结构视觉辅助（白色背景遮罩与正交线条）以及上下文学习（In-Context Learning, ICL）来校准模型的注意力。在感知现象任务中，针对多种机器易受影响的视觉异常与错觉，包括视觉异常（VA）、颜色错觉（CI）、运动错觉（MI）、格式塔错觉（GI）、几何错觉（GSI）和视觉错觉（VI），DDP集成了任务分类阶段，并配合使用模糊遮罩、对比度增强及下采样等专用工具。我们的实验结果表明“少即是多”：通过有意降低视觉输入质量并提供有针对性的结构提示，DDP能够帮助VLMs避开干扰性纹理，在具有挑战性的视觉基准测试中实现更优的推理准确性。

Abstract

Recent advancements in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes become noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity to force models to focus on essential structural information. We evaluate DDP across two distinct tasks. Physical attributes targets images prone to human misjudgment, where DDP employs a combination of 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model’s focus. Perceptual phenomena addresses various machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement alongside downsampling. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.

Keywords: Vision-Language Models, Visual Question Answering, Degradation-Driven Prompting, Hallucination Mitigation, In-Context Learning, Image Downsampling, Structural Prompts, Reasoning Accuracy
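
The downsampling step mentioned in the abstract can be sketched as follows, assuming "80p" means resizing to a height of 80 pixels (the abstract does not define it) and using nearest-neighbour sampling for simplicity. `downsample_to_height` is a hypothetical helper, not code from the paper.

```python
import numpy as np


def downsample_to_height(image, target_height=80):
    """Nearest-neighbour downsampling that preserves the aspect ratio.

    image: array of shape (H, W) or (H, W, C).
    Returns a copy unchanged if the image is already small enough.
    """
    h, w = image.shape[:2]
    if target_height >= h:
        return image.copy()
    target_width = max(1, round(w * target_height / h))
    # Pick evenly spaced source rows and columns.
    rows = (np.arange(target_height) * h / target_height).astype(int)
    cols = (np.arange(target_width) * w / target_width).astype(int)
    return image[rows][:, cols]


frame = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
degraded = downsample_to_height(frame)
print(degraded.shape)
```

The degraded image would then be passed to the VLM together with the structural aids and in-context examples; those parts depend on the paper's prompts and are not sketched here.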

189. ❌ E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

Authors: Jiajun Zhai, Hao Shi, Shangwei Guo, Kailun Yang, Kaiwei Wang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04834v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

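Scoring rationale: E-VLA focuses on making robotic vision-language-action models robust under adverse visual conditions by augmenting perception with an event camera. It has no direct connection to most large-model keywords (e.g., LLM, MoE, RLHF), which target language models rather than vision-language-action models. Only two keywords are weakly related: (1) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points): the paper proposes pretrained-compatible event integration strategies, which touch on domain adaptation; (2) 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points): the work applies AI to robotics, which can loosely be viewed as a scientific application but lies outside the core bioinformatics and cheminformatics scope. No other keyword is addressed.

!!! tip deepseek-chat TL;DR

The paper addresses the fragile perception of robotic vision-language-action models in dark and blurred scenes by introducing an event-camera-augmented framework, substantially improving manipulation success rates under low light and motion blur.

Abstract (Chinese translation)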

机器人视觉-语言-动作（VLA）模型在开放式操作任务中展现出良好的泛化能力，但其感知系统在极端低光、运动模糊和黑场裁剪等传感退化条件下表现脆弱。本文提出E-VLA，一种事件增强的VLA框架，可在传统帧式视觉不可靠时提升操作鲁棒性。与从事件流重建图像的方法不同，E-VLA直接利用事件流中的运动与结构线索，在恶劣条件下保持语义感知与感知-动作一致性。我们搭建了配备DAVIS346事件相机的开源遥操作平台，采集了涵盖多任务与多光照条件的真实世界同步RGB-事件-动作操作数据集。同时，我们提出轻量级、与预训练模型兼容的事件集成策略，并研究了事件窗口化与融合方法以实现稳定部署。实验表明，即使采用简单的无参数融合（如在RGB图像上叠加累积事件图），也能显著提升黑暗与强模糊场景的鲁棒性：在20勒克斯照度的Pick-Place任务中，成功率从纯图像的0%提升至叠加融合的60%，采用我们的事件适配器后可达90%；在严重运动模糊条件下（1000毫秒曝光），Pick-Place任务成功率从0%提升至20-25%，Sorting任务从5%提升至32.5%。总体而言，E-VLA系统性地证明了事件驱动感知能有效融入VLA模型，为超越传统帧式成像的鲁棒具身智能指明了方向。代码与数据集将在https://github.com/JJayzee/E-VLA 公开。

Abstract

Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset across diverse tasks and illumination settings. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, could substantially improve robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at https://github.com/JJayzee/E-VLA.

Keywords: Vision-Language-Action, Event Camera, Robust Perception, Dark Scenes, Motion Blur, Manipulation Dataset, Event Fusion, Embodied Intelligence
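
The parameter-free fusion mentioned in the abstract (overlaying accumulated event maps onto RGB images) can be sketched roughly as below. The event layout `(x, y, polarity)`, the red/blue color coding, and both function names are assumptions for illustration; the paper's windowing and rendering choices may differ.

```python
import numpy as np


def accumulate_events(events, height, width):
    """Sum event polarities per pixel over one time window.

    events: integer array of shape (N, 3) with columns (x, y, polarity),
    where polarity is +1 or -1 (assumed layout).
    """
    acc = np.zeros((height, width), dtype=np.int64)
    ev = np.asarray(events, dtype=np.int64).reshape(-1, 3)
    # np.add.at handles repeated pixel indices correctly (unbuffered add).
    np.add.at(acc, (ev[:, 1], ev[:, 0]), ev[:, 2])
    return acc


def overlay_events(rgb, acc):
    """Paint pixels with net positive events red and net negative blue."""
    fused = rgb.copy()
    fused[acc > 0] = (255, 0, 0)
    fused[acc < 0] = (0, 0, 255)
    return fused


dark_frame = np.zeros((4, 4, 3), dtype=np.uint8)   # stand-in for a dark image
events = np.array([[1, 2, 1], [1, 2, 1], [3, 0, -1]])
fused = overlay_events(dark_frame, accumulate_events(events, 4, 4))
print(fused[2, 1], fused[0, 3])
```

Because the overlay keeps the image in ordinary RGB format, it can be fed to an image-based model without adding parameters, which matches the parameter-free framing in the abstract.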

190. ❌ AnyUser: Translating Sketched User Intent into Domestic Robots

Authors: Songyuan Yang, Huibin Tan, Kailun Yang, Wenjing Yang, Shaowu Yang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04811v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

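Scoring rationale: The paper focuses on a robotic instruction system that generates robot actions from multimodal input (sketch, vision, language), which places it in robotics and human-robot interaction. Although it involves AI techniques such as multimodal fusion, it does not explicitly address large models, novel deep learning techniques, or scientific applications. All keywords concern large-model techniques, training methods, inference optimization, alignment, compression, or AI for science, whereas the core of this paper is a robot task-execution system, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes AnyUser, a system that converts user intent into executable robot actions from multimodal input (sketch, vision, language). It addresses the difficulty non-expert users have when interacting with domestic robots, and its effectiveness and practicality are validated in simulated and real-world environments.

Abstract (Chinese translation)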

我们提出AnyUser，一种统一的机器人指令系统，通过相机图像上的自由手绘草图（可结合语言）实现直观的家庭任务指令。该系统将多模态输入（草图、视觉、语言）解析为空间语义基元，以生成无需先验地图或模型的可执行机器人动作。其创新组件包括用于理解的多模态融合模块和用于鲁棒动作生成的层次化策略。通过广泛评估验证了系统效能：（1）在大规模数据集上的定量基准测试表明，系统在多种模拟家庭场景中解读多样化草图指令的准确率较高。（2）在两个不同机器人平台上进行了真实世界验证：静态安装的7自由度辅助机械臂（KUKA LBR iiwa）和双臂移动操作机器人（Realman RMC-AIDAL），成功执行了目标擦拭和区域清洁等代表性任务，证实了系统在物理环境中具象化指令并可靠执行的能力。（3）涵盖多样化人群（老年人、模拟非言语使用者、低技术素养者）的综合用户研究表明，系统在可用性和任务指定效率方面有显著提升，实现了高任务完成率（85.7%-96.4%）和用户满意度。AnyUser弥合了先进机器人能力与非专业用户可及交互需求之间的鸿沟，为适应真实人类环境的实用辅助机器人奠定了基础。

Abstract

We introduce AnyUser, a unified robotic instruction system for intuitive domestic task instruction via free-form sketches on camera images, optionally with language. AnyUser interprets multimodal inputs (sketch, vision, language) as spatial-semantic primitives to generate executable robot actions requiring no prior maps or models. Novel components include multimodal fusion for understanding and a hierarchical policy for robust action generation. Efficacy is shown via extensive evaluations: (1) Quantitative benchmarks on the large-scale dataset showing high accuracy in interpreting diverse sketch-based commands across various simulated domestic scenes. (2) Real-world validation on two distinct robotic platforms, a statically mounted 7-DoF assistive arm (KUKA LBR iiwa) and a dual-arm mobile manipulator (Realman RMC-AIDAL), performing representative tasks like targeted wiping and area cleaning, confirming the system’s ability to ground instructions and execute them reliably in physical environments. (3) A comprehensive user study involving diverse demographics (elderly, simulated non-verbal, low technical literacy) demonstrating significant improvements in usability and task specification efficiency, achieving high task completion rates (85.7%-96.4%) and user satisfaction. AnyUser bridges the gap between advanced robotic capabilities and the need for accessible non-expert interaction, laying the foundation for practical assistive robots adaptable to real-world human environments.

Keywords: robotic instruction system, multimodal fusion, sketch-based interaction, domestic robots, hierarchical policy, user study, task execution, assistive robotics

Authors: Mayank Mayank, Bharanidhar Duraisamy, Florian Geiß, Abhinav Valada Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04797v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

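Scoring rationale: The paper focuses on multimodal sensor fusion (radar and camera) for autonomous driving, using deformable attention for feature alignment and fusion. All scoring keywords concern large language models (LLMs), novel deep learning techniques, or AI applications in science, whereas this paper is a specific engineering application of computer vision and sensor fusion. It has no direct connection to LLMs, MoE, scaling laws, alignment, reasoning, agents, or model compression, and it does not involve scientific AI applications such as bioinformatics or cheminformatics. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes MMF-BEV, a radar-camera BEV fusion framework based on deformable attention for 3D object detection in autonomous driving. On the VoD dataset it outperforms unimodal baselines and achieves competitive fusion performance.

Abstract (Chinese translation)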

自动驾驶的精确三维物体检测需要互补的传感器。相机提供密集的语义信息但深度不可靠，而毫米波雷达则能提供精确的距离和速度测量，但其几何信息稀疏。我们提出了MMF-BEV，一个雷达-相机鸟瞰图融合框架，该框架在View-of-Delft（VoD）四维雷达数据集[1]上利用可变形注意力机制进行跨模态特征对齐。MMF-BEV构建了一个BEVDepth[2]相机分支和一个RadarBEVNet[3]雷达分支，每个分支都通过可变形自注意力机制增强，并通过一个可变形交叉注意力模块进行融合。我们评估了三种配置：仅相机、仅雷达以及混合融合。一项传感器贡献分析量化了不同距离下的模态权重，为传感器的互补性提供了可解释的证据。采用两阶段训练策略——先用深度监督预训练相机分支，然后联合训练雷达和融合模块——以稳定学习过程。在VoD数据集上的实验表明，MMF-BEV始终优于单模态基线模型，并且在完整标注区域和近程感兴趣区域内，针对所有物体类别，相较于先前的融合方法都取得了具有竞争力的结果。

Abstract

Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV builds a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy (pre-training the camera branch with depth supervision, then jointly training radar and fusion modules) stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes in both the full annotated area and near-range Region of Interest.

Keywords: autonomous driving, 3D object detection, sensor fusion, radar-camera fusion, BEV (Bird’s Eye View), deformable attention, multi-modal learning, View-of-Delft dataset

192. ❌ AvatarPointillist: AutoRegressive 4D Gaussian Avatarization

Authors: Hongyu Liu, Xuan Wang, Yating Wang, Zijian Wu, Ziyu Wan, Yue Ma, Runtao Liu, Boyao Zhou, Yujun Shen, Qifeng Chen Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04787v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

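Scoring rationale: The paper focuses on 4D Gaussian avatar generation in computer vision and graphics, using an autoregressive Transformer to generate point clouds for 3D Gaussian Splatting. All scoring keywords concern large language models, deep learning techniques, or scientific AI applications, whereas this paper studies a specific vision task (avatar generation) and does not involve large-model techniques, training methods, inference optimization, alignment, agent systems, or scientific applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes AvatarPointillist, a new framework that uses an autoregressive Transformer to generate dynamic 4D Gaussian avatars from a single portrait image, producing high-quality, photorealistic, and controllable avatars.

Abstract (Chinese translation)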

我们提出了AvatarPointillist，一个从单张肖像图像生成动态4D高斯化身的创新框架。该方法的核心是一个仅包含解码器的Transformer，它通过自回归方式为3D高斯泼溅（3D Gaussian Splatting）生成点云。这种序列化方法实现了精确、自适应的构建，能够根据主体复杂度动态调整点密度与总点数。在点生成过程中，自回归模型还联合预测了每点的绑定信息，从而实现逼真的动画效果。生成完成后，专用的高斯解码器将这些点转换为完整、可渲染的高斯属性。我们证明，将解码器以自回归生成器的潜在特征为条件，能够实现阶段间的有效交互，并显著提升保真度。大量实验验证了AvatarPointillist能够生成高质量、照片级真实感且可控的数字化身。我们相信这种自回归建模方式代表了化身生成的新范式，并将公开代码以促进未来研究。

Abstract

We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject’s complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. We believe this autoregressive formulation represents a new paradigm for avatar generation, and we will release our code to inspire future research.

Keywords: Avatar generation, 4D Gaussian avatars, Autoregressive Transformer, Point cloud generation, 3D Gaussian Splatting, Dynamic animation, Decoder-only Transformer, Photorealistic avatars

193. ❌ CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Authors: Xiangzhao Hao, Zefeng Zhang, Zhenyu Zhang, Linhao Yu, Yao Chen, Yiqian Zhang, Haiyun Guo, Shuohuan Wang, Yu Sun Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04780v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

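Scoring rationale: The paper studies how unified multimodal models can apply their generative capacity to degraded-image understanding. It is highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (10), because it explicitly uses supervised fine-tuning to establish the reasoning pattern. It is moderately relevant to 'Large Language Models OR LLMs OR Foundation Models' (5), because unified multimodal models are typically built on large-model architectures, and to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (5), because it involves a generate-then-answer reasoning pattern. The other keywords are unrelated to the paper's core content (0).

!!! tip deepseek-chat TL;DR

The paper proposes the CLEAR framework, which combines supervised fine-tuning, a Latent Representation Bridge, and reinforcement learning to address the failure of unified multimodal models to exploit their own generative capacity when understanding degraded images, substantially improving robustness on degraded inputs.

Abstract (Chinese translation)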

图像因模糊、噪声、压缩及光照不足而产生的退化，严重削弱了现实场景中的多模态理解能力。将理解与生成结合于单一架构的统一多模态模型，天然适合应对这一挑战，因为其生成路径能够建模被退化破坏的细粒度视觉结构。然而，现有模型未能针对退化输入有效利用自身的生成能力。我们将这种脱节归因于两个相互强化的因素：现有训练机制从未要求模型在推理过程中调用生成能力，且标准的“解码-再编码”路径不支持有效的联合优化。本文提出CLEAR框架，通过三个渐进步骤连接这两种能力：（1）在感知退化的数据集上进行监督微调，以建立“先生成后回答”的推理模式；（2）引入潜在表示桥接，用生成与推理间直接、可优化的连接取代迂回的“解码-再编码”路径；（3）设计交错式GRPO（Interleaved GRPO），这是一种在答案正确性奖励下联合优化文本推理与视觉生成的强化学习方法。我们构建了MMD-Bench评测集，涵盖六个标准多模态基准上的三个退化严重程度等级。实验表明，CLEAR在显著提升退化输入鲁棒性的同时，保持了干净图像上的性能。进一步分析表明，移除像素级重建监督会催生具有更高感知质量的中间视觉状态，这提示任务驱动的优化与视觉质量本质上是协同一致的。

Abstract

Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.

Keywords: multimodal models, image degradation, generative capacity, supervised fine-tuning, reinforcement learning, robustness, visual generation, degraded image understanding
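The abstract does not spell out how Interleaved GRPO computes its learning signal. As background only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO-style methods, applied to answer-correctness rewards. The function name, the `eps` constant, and the 0/1 reward values are assumptions, and the paper's interleaved variant may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sampled response's reward against its own group.

    All responses in `rewards` are assumed to come from the same prompt.
    A response that beats the group mean gets a positive advantage.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)          # population standard deviation of the group
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Four sampled answers to one degraded-image question: 1.0 = correct, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# When every answer gets the same reward, the group carries no preference signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed in this formulation, which is one reason group-relative schemes are popular for reward signals as sparse as answer correctness.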

194. ❌ Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Authors: Lei Zhang, Junjiao Tian, Zhipeng Fan, Kunpeng Li, Jialiang Wang, Weifeng Chen, Markos Georgopoulos, Felix Juefei-Xu, Yuxiang Bao, Julian McAuley, Manling Li, Zecheng He Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04746v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

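Scoring rationale: The paper proposes a multi-step image generation method that decomposes synthesis into an interleaved reasoning trajectory of textual planning, visual drafting, textual reflection, and visual refinement. It is highly relevant to 'Chain of Thought' and 'System 2 Thinking' because it mimics the human multi-step reasoning process. It is also related to 'Self-Correction' and 'Explainable AI', because textual reflection and visual refinement provide self-correction and make the generation process interpretable. It is moderately relevant to 'Large Language Models', because the paper uses unified multimodal models without explicitly naming LLMs. The other keywords are unrelated, since the paper focuses on the image generation process rather than on model architecture, training techniques, compression, acceleration, or specific scientific applications.

!!! tip deepseek-chat TL;DR

The paper proposes a process-driven image generation method that decomposes synthesis into multiple iterations along an interleaved reasoning trajectory, making generation explicit, interpretable, and supervisable, with the aim of improving the quality and controllability of image generation.

Abstract (Chinese translation)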

人类以渐进方式绘制图像：他们规划整体布局、勾勒粗略草图、审视并细化细节，且最关键的是，每一步都基于不断演化的视觉状态。然而，在文本-图像交错数据集上训练的统一多模态模型是否也能构想出中间状态的链条？本文提出过程驱动的图像生成方法，这是一种多步骤范式，将合成过程分解为思维与动作交错的推理轨迹。我们的方法并非单步生成图像，而是通过多次迭代展开，每次迭代包含四个阶段：文本规划、视觉草图绘制、文本反思与视觉细化。文本推理显式地规定了视觉状态应如何演化，而生成的视觉中间结果反过来约束并锚定下一轮文本推理。过程驱动生成的核心挑战源于中间状态的模糊性：模型应如何评估每幅部分完成的图像？我们通过密集的逐步骤监督来解决这一问题，该监督保持两个互补约束：对于视觉中间状态，我们强制空间与语义一致性；对于文本中间状态，我们在保留先验视觉知识的同时，使模型能够识别并修正违反提示要求的元素。这使得生成过程变得显式化、可解释且可直接监督。为验证所提方法，我们在多种文本到图像生成基准测试中进行了实验。

Abstract

Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is grounded in the evolving visual states. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of 4 stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can models evaluate each partially-complete image? We address this through dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce the spatial and semantic consistency; for the textual intermediate states, we preserve the prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments under various text-to-image generation benchmarks.

Keywords: process-driven image generation, interleaved reasoning, multi-step paradigm, textual planning, visual refinement, intermediate states, interpretable generation, text-to-image generation
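The four-stage iteration described in the abstract (textual planning, visual drafting, textual reflection, visual refinement) can be sketched as a plain control loop. Everything here is hypothetical scaffolding: the four callables stand in for model calls, and strings stand in for images, so the sketch only shows how each stage feeds the next.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def process_driven_generate(
    prompt: str,
    plan: Callable[[str, Optional[Any]], str],     # textual planning
    draft: Callable[[str, Optional[Any]], Any],    # visual drafting
    reflect: Callable[[str, Any], str],            # textual reflection
    refine: Callable[[Any, str], Any],             # visual refinement
    num_iterations: int = 2,
) -> Tuple[Any, List[Dict[str, Any]]]:
    """Run the plan -> draft -> reflect -> refine cycle and record every step."""
    image: Optional[Any] = None
    trajectory: List[Dict[str, Any]] = []
    for _ in range(num_iterations):
        step_plan = plan(prompt, image)            # text conditions the next visual state
        step_draft = draft(step_plan, image)       # coarse visual intermediate
        critique = reflect(prompt, step_draft)     # find prompt-violating elements
        image = refine(step_draft, critique)       # corrected state grounds the next round
        trajectory.append({"plan": step_plan, "draft": step_draft,
                           "reflection": critique, "image": image})
    return image, trajectory


# Stub stages that only build strings, to show the data flow.
final_image, steps = process_driven_generate(
    "a red cube on a table",
    plan=lambda p, img: f"plan({p}|{img})",
    draft=lambda pl, img: f"draft[{pl}]",
    reflect=lambda p, d: "ok",
    refine=lambda d, c: f"refined[{d}]",
    num_iterations=2,
)
```

Recording the full trajectory, rather than only the final image, is what would make the step-wise supervision described in the abstract possible in such a loop.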

195. ❌ Don’t Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Authors: Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04722v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

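Scoring rationale: The paper's core topic is KV-cache quantization for optimizing on-device LLM inference. It is highly relevant (10) to 'Large Language Models', 'Small Language Models OR On-device AI', 'KV Cache Compression', 'Quantization OR Model Compression', and 'Speculative Decoding OR Inference Acceleration', because these are the technical problems the paper directly addresses. Other keywords, such as MoE, Scaling Laws, and Alignment, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an adaptive KV-cache quantization method that dynamically allocates bit-width to reduce memory and latency for on-device LLM inference, achieving higher accuracy and lower latency than static quantization on multiple benchmarks.

Abstract (Chinese translation)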

大语言模型（LLM）在推理、生成和决策任务上取得了显著进展，然而将其部署于移动设备、嵌入式系统和边缘设备上仍面临严峻挑战。设备端LLM推理主要受限于键值（KV）缓存的存储与带宽开销，该开销随上下文长度线性增长，并常成为解码成本的主导因素。现有的KV缓存量化方案通常依赖固定精度或人工设计的启发式策略，导致对低影响令牌浪费比特位，而对高信息量令牌过度压缩，从而造成可避免的精度损失。受霍夫曼编码变长分配原理的启发，我们提出自适应KV缓存量化方法，这是一种学习得到的策略，它按令牌重要性成比例地分配比特位宽，在不牺牲有竞争力的精度的前提下最小化预期内存与延迟。我们的框架提取轻量级令牌级特征，包括令牌频率、质量分数、注意力方差和基于熵的不确定性，并将其输入紧凑的数据驱动控制器；该控制器在解码过程中动态从{2比特、4比特、8比特、FP16}中选择KV精度。相比静态KV量化与基于规则的基线方法，这种自适应精度策略在降低KV内存占用和延迟的同时提升了精度，并在标准LLM基准测试中保持接近FP16推理的竞争力精度。通过在SmolLM-135M、SmolLM-360M和SmolLM-1.7B模型上对多个常识推理基准进行广泛实验，结果表明我们的控制器持续优化了精度-延迟权衡。例如，在HellaSwag基准上使用SmolLM-360M时，本方法相较于静态KV量化将解码延迟（毫秒/令牌）降低17.75%，精度提升7.60个百分点，且与FP16推理的精度差距仅为0.30个百分点。

Abstract

Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, thereby wasting bits on low-impact tokens while over-compressing informative ones, leading to avoidable accuracy degradation. Inspired by Huffman coding’s principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency while improving accuracy compared to static KV quantization and rule-based baselines, and maintaining competitive accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments across multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.

Keywords: KV-cache quantization, on-device LLMs, adaptive precision, inference acceleration, memory optimization, lightweight models, token importance, decoding latency
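To make the idea of importance-proportional bit allocation concrete, here is a minimal sketch that maps a per-token importance score to one of the four precisions named in the abstract and estimates the resulting KV memory relative to FP16. The paper uses a learned controller over several token-level features; the fixed thresholds and the single scalar `importance` used here are stand-in assumptions, closer to a rule-based baseline than to the authors' method.

```python
from typing import List, Sequence

BIT_CHOICES = (2, 4, 8, 16)   # 16 stands for FP16, as in the abstract's precision set


def select_bits(importance: float,
                thresholds: Sequence[float] = (0.25, 0.5, 0.75)) -> int:
    """Pick a KV precision for one token from an importance score in [0, 1].

    The thresholds are hypothetical; a learned controller would replace this rule.
    """
    for bits, threshold in zip(BIT_CHOICES, thresholds):
        if importance < threshold:
            return bits
    return BIT_CHOICES[-1]


def average_bits(importances: List[float]) -> float:
    """Mean bit-width over a sequence of tokens."""
    return sum(select_bits(score) for score in importances) / len(importances)


def memory_fraction_of_fp16(importances: List[float]) -> float:
    """KV-cache size relative to storing every token in FP16 (ignores scale/zero-point overhead)."""
    return average_bits(importances) / 16


token_importances = [0.1, 0.3, 0.6, 0.9]
chosen = [select_bits(score) for score in token_importances]
fraction = memory_fraction_of_fp16(token_importances)
```

As with Huffman coding, the saving comes from the skew of the distribution: the more tokens that fall into the low-importance buckets, the lower the average bit-width.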

196. ❌ OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

Authors: DataFlow Team, Bohan Zeng, Daili Hua, Kaixin Zhu, Yifan Dai, Bozhou Li, Yuran Wang, Chengzhuo Tong, Yifan Yang, Mingkun Chang, Jianbin Zhao, Zhou Liu, Hao Liang, Xiaochen Ma, Ruichuan An, Junbo Niu, Zimo Meng, Tianyi Bai, Meiyi Qiang, Huanyao Zhang, Zhiyou Xiao, Tianyu Guo, Qinhan Yu, Runhao Zhao, Zhengpin Li, Xinyi Huang, Yisheng Pan, Yiwen Tang, Yang Shi, Yue Ding, Xinlong Chen, Hongcheng Gao, Minglei Shi, Jialong Wu, Zekun Wang, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Yiren Song, Mike Zheng Shou, Wentao Zhang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04707v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

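Scoring rationale: The main contribution of "OpenWorldLib: A Unified Codebase and Definition of Advanced World Models" is a clear definition of world models, a systematic categorization, and a unified inference framework. Its core subject is world models themselves, not large language models (LLMs) or novel deep learning techniques. It does not cover LLM training, fine-tuning, alignment, inference optimization, agent systems, model compression, or similar techniques. Only the keyword 'World Models AND General World Models' is therefore highly relevant (10), and all other keywords are unrelated (0). Although the paper belongs to the field of AI, it touches no technical point on the scoring list other than the definition and framework of world models.

!!! tip deepseek-chat TL;DR

To address the lack of a unified definition of world models, the paper proposes a clear definition and a systematic categorization, and develops OpenWorldLib, a unified inference framework that integrates world models across tasks to enable efficient reuse and collaborative inference.

Abstract (Chinese translation)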

世界模型作为人工智能领域一个颇具前景的研究方向已获得广泛关注，但其清晰统一的定义仍显缺失。本文提出OpenWorldLib，一个面向先进世界模型的综合性标准化推理框架。借鉴世界模型的发展脉络，我们提出了明确的定义：世界模型是以感知为核心、具备交互与长期记忆能力，用于理解和预测复杂世界的模型或框架。我们进一步系统化地归纳了世界模型应具备的核心能力。基于此定义，OpenWorldLib将不同任务的模型整合在统一框架内，实现了高效复用与协同推理。最后，我们对世界模型未来可能的研究方向提出了进一步的思考与分析。代码链接：https://github.com/OpenDCAI/OpenWorldLib

Abstract

World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib

Keywords: World Models, Advanced World Models, Unified Framework, Inference Framework, Perception, Interaction, Long-term Memory, OpenWorldLib

197. ❌ Explainable Machine Learning for Sepsis Outcome Prediction Using a Novel Romanian Electronic Health Record Dataset

Authors: Andrei-Alexandru Bunea, Ovidiu Ghibea, Dan-Matei Popovici, Ion Daniel, Octavian Andronic Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04698v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

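Rationale: The paper uses traditional machine learning methods (such as XGBoost and random forests) to predict sepsis outcomes and emphasizes model explainability through SHAP analysis. It is unrelated to most keywords, which concern large-model techniques, training methods, inference optimization, and agents. It is highly relevant only to 'Mechanistic Interpretability OR Explainable AI' (SHAP is its core explanation method) and somewhat relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (a bioinformatics/medical AI application that uses neither deep learning nor large models).

!!! tip deepseek-chat TL;DR

Using a novel Romanian electronic health record dataset, this study develops explainable machine learning models for sepsis outcome prediction, achieves its best performance on the deceased vs. recovered task (AUC=0.983), and uses SHAP analysis to identify key clinical predictors such as cardiovascular comorbidities and urea levels.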

摘要翻译

本研究基于罗马尼亚一家大型急救医院12,286例住院病例的新型电子健康记录（EHR）数据集，开发并分析了用于脓毒症结局预测的可解释机器学习（ML）模型。该数据集包含人口统计学信息、国际疾病分类（ICD-10）诊断以及600种实验室检测类型。本研究旨在识别临床强预测因子，并在三项分类任务中取得先进成果：（1）死亡与出院，（2）死亡与康复，（3）康复与好转。我们训练了五种ML模型以捕捉复杂的分布特征，同时保持临床可解释性。实验通过使用10至50种最频繁实验室检测的子集，探讨了特征丰富度与患者覆盖率之间的权衡。模型性能采用准确率和曲线下面积（AUC）进行评估，可解释性则通过SHapley加性解释（SHAP）方法进行评估。在死亡与康复的案例研究中获得了最高性能（AUC=0.983，准确率=0.93）。SHAP分析识别出若干强预测因子，如心血管合并症、尿素水平、天冬氨酸氨基转移酶、血小板计数和嗜酸性粒细胞百分比。嗜酸性粒细胞减少症成为一个顶级预测因子，凸显了其作为当前评估标准未涵盖但未被充分利用的标志物价值，而模型的高性能表明这些模型在临床环境中具有适用性。

摘要 (Abstract)

We develop and analyze explainable machine learning (ML) models for sepsis outcome prediction using a novel Electronic Health Record (EHR) dataset from 12,286 hospitalizations at a large emergency hospital in Romania. The dataset includes demographics, International Classification of Diseases (ICD-10) diagnostics, and 600 types of laboratory tests. This study aims to identify clinically strong predictors while achieving state-of-the-art results across three classification tasks: (1) deceased vs. discharged, (2) deceased vs. recovered, and (3) recovered vs. ameliorated. We trained five ML models to capture complex distributions while preserving clinical interpretability. Experiments explored the trade-off between feature richness and patient coverage, using subsets of the 10–50 most frequent laboratory tests. Model performance was evaluated using accuracy and area under the curve (AUC), and explainability was assessed using SHapley Additive exPlanations (SHAP). The highest performance was obtained for the deceased vs. recovered case study (AUC=0.983, accuracy=0.93). SHAP analysis identified several strong predictors such as cardiovascular comorbidities, urea levels, aspartate aminotransferase, platelet count, and eosinophil percentage. Eosinopenia emerged as a top predictor, highlighting its value as an underutilized marker that is not included in current assessment standards, while the high performance suggests the applicability of these models in clinical settings.

Keywords: sepsis outcome prediction, explainable machine learning, electronic health records, SHAP analysis, clinical interpretability, Romanian dataset, laboratory tests, AUC performance
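
The abstract above reports accuracy and AUC as its two evaluation metrics. As a minimal sketch of what those numbers measure (not the paper's evaluation code), AUC can be computed as the fraction of positive/negative pairs that the model ranks correctly. The helper names `roc_auc` and `accuracy`, the 0.5 threshold, and the toy labels and scores are all assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Pairwise (Mann-Whitney) formulation; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


# Toy data (illustrative only): 1 = deceased, 0 = recovered.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_acc = accuracy(toy_labels, toy_scores)
```

Because AUC depends only on the ranking of scores, it is unaffected by the choice of threshold, whereas accuracy changes with it; this is one reason the two metrics are usually reported together.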

198. ❌ 3D Gaussian Splatting for Annular Dark Field Scanning Transmission Electron Microscopy Tomography Reconstruction

Authors: Beiyuan Zhang, Hesong Li, Ruiwen Shao, Ying Fu Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04693v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

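Rationale: The paper applies 3D Gaussian Splatting to ADF-STEM tomography reconstruction, which falls under computer vision and computational imaging. It is unrelated to the vast majority of keywords (large models, training methods, inference optimization, alignment, agents, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the work applies AI to science (specifically materials-science imaging); because that is not its core contribution (the core is adapting 3DGS to the domain), it receives 5 points (somewhat relevant).

!!! tip deepseek-chat TL;DR

The paper addresses the degradation of ADF-STEM tomography reconstruction under sparse views by proposing DenZa-Gaussian (a learnable scattering field, scattering view normalization, and a Fourier amplitude loss), which yields higher-fidelity 3D reconstructions and 2D projections on 45-view and 15-view data.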

摘要翻译

分析型暗场扫描透射电子显微镜（ADF-STEM）层析成像技术通过整合多视角倾斜系列图像，实现纳米材料的三维重构，从而能够精确分析其结构与成分特征。尽管增加倾斜视角数量可提升三维重构质量，但这需要延长电子束曝光时间，可能导致剂量敏感材料受损，并引入漂移与对准误差，使得重构保真度与样品保护难以兼顾。实践中常需采用稀疏视角采集，然而传统ADF-STEM方法在有限视角下性能退化，易产生伪影并降低结构保真度。为解决这些问题，本文通过三个关键组件将三维高斯溅射（3D GS）方法适配于该领域。我们首先将局部散射强度建模为可学习标量场denza，以解决3DGS与ADF-STEM成像物理机制不匹配的问题。随后引入系数$γ$以稳定跨倾斜角的散射强度，通过散射视角归一化确保denza的一致性。最后，我们设计了包含二维傅里叶振幅项的损失函数，以抑制稀疏视角重构中的缺失楔形伪影。在45视角和15视角倾斜系列上的实验表明，DenZa-高斯方法能生成高保真度重构结果，其二维投影与原始倾斜图像吻合度更高，在稀疏视角条件下展现出卓越的鲁棒性。

摘要 (Abstract)

Annular Dark Field Scanning Transmission Electron Microscopy (ADF-STEM) tomography reconstructs nanoscale materials in 3D by integrating multi-view tilt-series images, enabling precise analysis of their structural and compositional features. Although integrating more tilt views improves 3D reconstruction, it requires extended electron exposure that risks damaging dose-sensitive materials and introduces drift and misalignment, making it difficult to balance reconstruction fidelity with sample preservation. In practice, sparse-view acquisition is frequently required, yet conventional ADF-STEM methods degrade under limited views, exhibiting artifacts and reduced structural fidelity. To resolve these issues, in this paper, we adapt 3DGS to this domain with three key components. We first model the local scattering strength as a learnable scalar field, denza, to address the mismatch between 3DGS and ADF-STEM imaging physics. Then we introduce a coefficient $\gamma$ to stabilize scattering across tilt angles, ensuring consistent denza via scattering view normalization. Finally, we incorporate a loss function that includes a 2D Fourier amplitude term to suppress missing wedge artifacts in sparse-view reconstruction. Experiments on 45-view and 15-view tilt series show that DenZa-Gaussian produces high-fidelity reconstructions and 2D projections that align more closely with original tilts, demonstrating superior robustness under sparse-view conditions.

Keywords: 3D Gaussian Splatting, ADF-STEM tomography, sparse-view reconstruction, scattering strength field, missing wedge artifacts, tilt-series images, nanoscale materials, 3D reconstruction
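
The abstract above mentions a loss with a 2D Fourier amplitude term. The following is a rough sketch of that idea, not the DenZa-Gaussian implementation: it assumes an L1 pixel term plus an L1 distance between DFT amplitude spectra. The function names `dft2_amplitude`, `fourier_amplitude_loss`, and `total_loss`, and the weight `lam`, are hypothetical, and the naive DFT is only suitable for tiny images.

```python
import cmath


def dft2_amplitude(img):
    """Amplitude |F(u, v)| of the 2D DFT of a small image (list of rows)."""
    h, w = len(img), len(img[0])
    amp = []
    for u in range(h):
        row = []
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    phase = -2j * cmath.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(phase)
            row.append(abs(acc))
        amp.append(row)
    return amp


def fourier_amplitude_loss(pred, target):
    """Mean absolute difference between the two amplitude spectra."""
    a, b = dft2_amplitude(pred), dft2_amplitude(target)
    n = len(a) * len(a[0])
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def total_loss(pred, target, lam=0.1):
    """Assumed combination: L1 pixel loss + lam * Fourier amplitude loss."""
    n = len(pred) * len(pred[0])
    pixel = sum(abs(p - q) for rp, rt in zip(pred, target) for p, q in zip(rp, rt)) / n
    return pixel + lam * fourier_amplitude_loss(pred, target)


# Toy 2x2 "projections" (illustrative values only).
target = [[0.0, 1.0], [2.0, 3.0]]
shifted = [[1.0, 0.0], [3.0, 2.0]]  # circular shift of target along x
```

The amplitude spectrum discards phase, so a circularly shifted image has the same amplitudes as the original. For that reason the sketch pairs the frequency term with a pixel-space term rather than using it alone.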

199. ❌ Unsharp Measurement with Adaptive Gaussian POVMs for Quantum-Inspired Image Processing

Authors: Debashis Saikia, Bikash K. Behera, Mayukha Pal, Prasanta K. Panigrahi Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04685v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

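Rationale: The paper studies a quantum measurement framework for image processing, built on positive operator-valued measures (POVMs) and Hilbert-space embeddings from quantum mechanics; it sits at the intersection of quantum computing and image processing. All scoring keywords concern large models, deep learning, and their underlying techniques (training methods, inference optimization, alignment, agent systems, and so on). The paper touches none of these topics, uses no deep learning or large-model techniques, and discusses no related applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a quantum measurement framework based on adaptive Gaussian POVMs for the probabilistic transformation of grayscale images. A parameter controlling measurement localization gives a continuous transition from unsharp to projective measurements, and experiments on standard benchmark images show that the method preserves structural information.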

摘要翻译

本文提出一种基于量子测量的概率性灰度图像变换框架，该方法采用自适应正算子值测度（POVMs）。与现有主要围绕分割或阈值化的方法不同，本框架将图像变换表述为直接作用于像素强度的测量诱导过程。强度值被嵌入有限维希尔伯特空间，从而能够基于图像直方图的高斯模型构建数据自适应的测量算子。这些算子自然地定义了对强度可观测量的一种非锐化测量，重建图像通过测量结果的期望值获得。为控制测量局域化程度，我们引入一种带有锐化参数$γ$的非线性锐化变换，该参数可诱导测量从非锐化状态到投影测量的连续过渡。这一过渡反映了强度结构的概率性平滑与局域化之间的固有权衡。除非线性锐化参数外，我们还引入另一参数$k$（高斯中心数量），用于控制变换过程中图像的分辨率。在标准基准图像上的实验结果表明，所提方法在保持结构信息的同时，能够实现有效的数据自适应变换。

摘要 (Abstract)

We propose a quantum measurement-based framework for probabilistic transformation of grayscale images using adaptive positive operator-valued measures (POVMs). In contrast to existing approaches that are largely centered around segmentation or thresholding, the transformation is formulated here as a measurement-induced process acting directly on pixel intensities. The intensity values are embedded in a finite-dimensional Hilbert space, which allows the construction of data-adaptive measurement operators derived from Gaussian models of the image histogram. These operators naturally define an unsharp measurement of the intensity observable, with the reconstructed image obtained through expectation values of the measurement outcomes. To control the degree of measurement localization, we introduce a nonlinear sharpening transformation with a sharpening parameter, $\gamma$, that induces a continuous transition from unsharp measurements to projective measurements. This transition reflects an inherent trade-off between probabilistic smoothing and localization of intensity structures. In addition to the nonlinear sharpening parameter, we introduce another parameter $k$ (number of Gaussian centers) which controls the resolution of the image during the transformation. Experimental results on standard benchmark images show that the proposed method gives effective data-adaptive transformations while preserving structural information.

Keywords: quantum measurement, positive operator-valued measures (POVMs), image processing, grayscale images, Hilbert space, adaptive Gaussian models, unsharp measurement, sharpening transformation
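
One way to picture the unsharp-to-projective transition described in the abstract above is the sketch below. It is an assumption-laden illustration, not the authors' construction: the POVM effects are taken to be diagonal in the intensity basis, each effect is a Gaussian response raised to the power `gamma` and normalized so the weights sum to 1 (mirroring POVM completeness), and the output pixel is the expectation over the centers. The names `povm_weights` and `transform_pixel`, and the fixed `sigma`, are hypothetical.

```python
import math


def povm_weights(x, centers, sigma, gamma):
    """Outcome probabilities for intensity x under sharpened Gaussian effects.

    Computed in log space (shifted by the max) so large gamma does not underflow.
    """
    logits = [-gamma * (x - mu) ** 2 / (2.0 * sigma ** 2) for mu in centers]
    top = max(logits)
    raw = [math.exp(l - top) for l in logits]
    total = sum(raw)
    return [r / total for r in raw]


def transform_pixel(x, centers, sigma, gamma):
    """Expectation value of the outcome (center intensity) for pixel x."""
    weights = povm_weights(x, centers, sigma, gamma)
    return sum(w * mu for w, mu in zip(weights, centers))


# k = 3 Gaussian centers on an 8-bit intensity scale (illustrative values).
centers = [0.0, 128.0, 255.0]
soft = transform_pixel(100.0, centers, sigma=40.0, gamma=1.0)   # unsharp: blends centers
hard = transform_pixel(100.0, centers, sigma=40.0, gamma=50.0)  # near-projective: snaps to 128
```

With small `gamma` the pixel is a smooth blend of the centers; as `gamma` grows the weights approach a one-hot vector and the pixel snaps to a single center, which behaves like quantization to `k` levels.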

200. ❌ Batch Loss Score for Dynamic Data Pruning

Authors: Qing Zhou, Bingxuan Zhao, Tao Yang, Hongyuan Zhang, Junyu Gao, Qi Wang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04681v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

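Rationale: The paper proposes Batch Loss Score (BLS), a dynamic data pruning method for accelerating deep learning training. It scores sample importance with an exponential moving average (EMA) of batch losses, which suits complex models or loss functions where per-sample loss is hard to obtain. Its core contribution is more efficient, simpler data pruning rather than innovation in large-model techniques or a domain-specific application. The scoring keywords all target specific directions such as LLM techniques, training methods, inference optimization, alignment, agent systems, and AI for science, whereas this paper studies a general-purpose method for selecting deep learning training data and covers none of them (LLMs, MoE, quantization, inference acceleration, alignment, AI for science), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Batch Loss Score, which uses an exponential moving average of batch losses to estimate training-sample importance efficiently, allowing 20%-50% of the data to be pruned dynamically without performance loss in complex settings where per-sample loss is unavailable.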

摘要翻译

动态数据剪枝通过选择性忽略训练过程中信息量较少的样本来加速深度学习。虽然逐样本损失是常见的重要性度量指标，但对于复杂模型或损失函数而言，获取该指标可能具有挑战性或不可行，通常需要大量的实现工作。本研究提出批损失分数（Batch Loss Score, BLS），作为一种计算高效的替代方案，它利用现成可得的批损失指数移动平均（Exponential Moving Average, EMA）为单个样本分配分数。我们从单个样本的视角出发，将批损失视作其缩放后个体损失的一个含噪声测量值，噪声来源于随机的批次组合。理论分析表明，EMA机制起到一阶低通滤波器的作用，可衰减高频的批次组合噪声。由此产生的分数近似于个体样本对损失平滑且持续的贡献，这为BLS作为样本重要性代理指标提供了理论基础。BLS展现出显著的代码集成简洁性（仅需三行代码注入），并能轻松适配现有的基于逐样本损失的方法（一行代码代理）。其有效性通过增强两种此类方法得以验证，在14个数据集、11种任务和18个模型上无损剪除了20%-50%的样本，凸显了其实用性和广泛适用性，尤其适用于难以获取逐样本损失的复杂场景。代码发布于https://github.com/mrazhou/BLS。

摘要 (Abstract)

Dynamic data pruning accelerates deep learning by selectively omitting less informative samples during training. While per-sample loss is a common importance metric, obtaining it can be challenging or infeasible for complex models or loss functions, often requiring significant implementation effort. This work proposes the Batch Loss Score (BLS), a computationally efficient alternative using an Exponential Moving Average (EMA) of readily available batch losses to assign scores to individual samples. We frame the batch loss, from the perspective of a single sample, as a noisy measurement of its scaled individual loss, with noise originating from stochastic batch composition. It is formally shown that the EMA mechanism functions as a first-order low-pass filter, attenuating high-frequency batch composition noise. This yields a score approximating the smoothed and persistent contribution of the individual sample to the loss, providing a theoretical grounding for BLS as a proxy for sample importance. BLS demonstrates remarkable code integration simplicity (three-line injection) and readily adapts existing per-sample loss-based methods (one-line proxy). Its effectiveness is demonstrated by enhancing two such methods to losslessly prune 20%-50% of samples across 14 datasets, 11 tasks and 18 models, highlighting its utility and broad applicability, especially for complex scenarios where per-sample loss is difficult to access. Code is available at https://github.com/mrazhou/BLS.

Keywords: Dynamic Data Pruning, Batch Loss Score, Exponential Moving Average, Sample Importance, Training Acceleration, Deep Learning, Data Selection, Loss Approximation
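
The core idea in the abstract above, scoring each sample with an EMA of the batch losses it took part in, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the BLS repository: the class name `BatchLossScore`, the `momentum` value, the initialization with the first observed batch loss, and the rule of dropping the lowest-scoring fraction are all hypothetical choices.

```python
class BatchLossScore:
    """Hypothetical helper: per-sample EMA of the scalar losses of its batches."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.scores = {}  # sample id -> EMA of batch losses seen so far

    def update(self, sample_ids, batch_loss):
        """Fold one scalar batch loss into the score of every sample in the batch."""
        for sid in sample_ids:
            prev = self.scores.get(sid)
            if prev is None:
                self.scores[sid] = batch_loss  # first observation seeds the EMA
            else:
                self.scores[sid] = self.momentum * prev + (1.0 - self.momentum) * batch_loss

    def keep(self, sample_ids, prune_ratio):
        """Drop the lowest-scoring fraction; unseen samples are always kept."""
        ranked = sorted(sample_ids, key=lambda s: self.scores.get(s, float("inf")))
        n_drop = int(len(ranked) * prune_ratio)
        return ranked[n_drop:]


# Toy run: sample 1 appears in both batches, so its score blends the two losses.
bls = BatchLossScore(momentum=0.5)
bls.update([0, 1], batch_loss=2.0)
bls.update([1, 2], batch_loss=4.0)
kept = bls.keep([0, 1, 2], prune_ratio=0.34)
```

Only the scalar batch loss and the ids of the samples in the batch are needed, which is what makes this kind of score usable when per-sample losses are not exposed by the model or loss function.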

201. ❌ ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

Authors: Selim Ahmet Iz, Francesco Nex, Norman Kerle, Henry Meissner, Ralf Berger Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04667v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

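Rationale: The paper belongs to computer vision and remote sensing: it uses diffusion models for real-time depth reconstruction from UAV imagery and improves metric consistency with bundle adjustment. It is unrelated to all scoring keywords, which revolve around large language models and deep learning techniques and their applications. It covers no topics such as large models, language models, alignment, fine-tuning, reasoning, agents, or quantization, nor any of the specific AI for Science domains such as bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The paper proposes ZeD-MAP, a framework that integrates bundle adjustment to improve zero-shot diffusion depth models, addressing metric accuracy and temporal consistency in real-time depth reconstruction from UAV imagery; it achieves sub-meter accuracy while keeping real-time processing speed.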

摘要翻译

基于超高分辨率无人机影像的实时深度重建对于灾害响应等时效性要求高的地理空间任务至关重要，但由于宽基线视差、大图像尺寸、低纹理或镜面表面、遮挡以及严格的计算限制，该任务仍具挑战性。近期的零样本扩散模型无需针对特定任务进行重新训练即可实现快速的单幅图像密集预测，与基于Transformer的预测器相比，其所需标注数据集更少，同时避免了经典多视图立体视觉方法对固定采集几何结构的严格要求。然而，其概率性推断方式难以保证序列帧与重叠图块之间可靠的度量精度与时间一致性。我们提出了ZeD-MAP，一种集群级框架，通过集成增量式基于集群的光束法平差，将测试阶段的扩散深度模型转换为具有度量一致性、类似SLAM的建图流程。流式输入的无人机帧被分组为重叠的集群；周期性的光束法平差产生度量一致的位姿和稀疏三维连接点，这些点被重投影至选定帧中，并用作基于扩散的深度估计的度量引导。使用德国宇航中心模块化航空相机系统在约50米飞行高度（地面采样距离约为0.85厘米/像素，对应每帧约2650平方米的地面覆盖范围）采集的地面标记飞行数据进行验证，结果表明：我们的方法实现了亚米级精度，水平面误差约为0.87米，垂直方向误差约为0.12米，同时保持每幅图像处理时间在1.47至4.91秒之间。结果受到手动点云标注带来的轻微噪声影响。这些发现表明，基于光束法平差的度量引导能提供与经典摄影测量方法相当的一致性，同时显著加速处理过程，实现了实时三维地图生成。

摘要 (Abstract)

Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, and require fewer labelled datasets than transformer-based predictors while avoiding the rigid capture geometry requirement of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights captured at approximately 50 m altitude (GSD is approximately 0.85 cm/px, corresponding to 2,650 square meters ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.

Keywords: zero-shot depth estimation, diffusion models, bundle adjustment, UAV imagery, real-time 3D mapping, metric consistency, depth reconstruction, aerial imaging
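
The figures quoted in the abstract above (GSD of about 0.85 cm/px, about 2,650 square meters of ground coverage per frame, errors of about 0.87 m and 0.12 m) can be related with simple arithmetic. The sketch below only rearranges those reported numbers; the helper names are hypothetical, and the derived pixel counts are back-of-the-envelope estimates rather than values stated in the paper.

```python
def pixels_per_frame(coverage_m2, gsd_m_per_px):
    """Pixels needed to cover an area when each pixel spans gsd x gsd meters."""
    return coverage_m2 / (gsd_m_per_px ** 2)


def error_in_pixels(error_m, gsd_m_per_px):
    """Express a metric error as a number of ground-sample pixels."""
    return error_m / gsd_m_per_px


gsd = 0.85 / 100.0                            # 0.85 cm/px converted to m/px
frame_pixels = pixels_per_frame(2650.0, gsd)  # roughly 36.7 million pixels
xy_error_px = error_in_pixels(0.87, gsd)      # roughly 102 px horizontally
z_error_px = error_in_pixels(0.12, gsd)       # roughly 14 px vertically
```

Read this way, the reported ground coverage implies frames of roughly 37 megapixels, which is consistent with the abstract's emphasis on large image sizes and strict computational constraints.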

202. ❌ Synthesis4AD: Synthetic Anomalies are All You Need for 3D Anomaly Detection

Authors: Yihan Sun, Yuqi Cheng, Junjie Zu, Yuxiang Tan, Guoyang Xie, Yucheng Wang, Yunkang Cao, Weiming Shen Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04658v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

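Rationale: The paper studies 3D anomaly detection; its core innovation is using synthetic anomaly data to overcome the scarcity of anomalous samples in industrial settings. It is unrelated to most keywords because it involves no innovation in large-model techniques (MoE, quantization, inference acceleration, and so on), training methods (pre-training, fine-tuning, alignment, and so on), or reasoning techniques (CoT, RAG, and so on). The only highly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), since the paper applies AI to industrial science (3D inspection). 'Large Language Models OR LLMs OR Foundation Models' receives 8 points because the paper uses a multimodal large language model (MLLM) to interpret product design information and generate synthesis instructions, although this is not its core technical contribution (the core is the synthesis method and detection framework).

!!! tip deepseek-chat TL;DR

The paper proposes the Synthesis4AD framework, which uses large-scale, high-fidelity synthetic anomaly data to address the scarcity of anomalous samples in industrial 3D anomaly detection and achieves state-of-the-art detection performance.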

摘要翻译

工业三维异常检测性能从根本上受限于异常样本的稀缺性与长尾分布。为应对这一挑战，我们提出Synthesis4AD——一种端到端范式，通过利用大规模高保真合成异常来学习更具判别力的三维异常检测表征。该范式的核心是3D-DefectStudio，这是一个基于可控合成引擎MPAS构建的软件平台，该平台通过高维支撑基元引导注入几何真实的缺陷，同时生成精确的点级异常掩码。此外，Synthesis4AD融合了多模态大语言模型（MLLM）来解析产品设计信息，并自动将其转化为可执行的异常合成指令，从而实现可扩展的知识驱动型异常数据生成。为提升下游检测器在非结构化点云上的鲁棒性与泛化能力，Synthesis4AD进一步引入了基于空间分布归一化与几何保真数据增强的训练流程，缓解了点Transformer架构对绝对坐标的敏感性，并提升了在真实数据变化下的特征学习能力。大量实验在Real3D-AD、MulSen-AD及真实工业零件数据集上验证了其领先性能。所提出的合成方法MPAS与交互系统3D-DefectStudio将在https://github.com/hustCYQ/Synthesis4AD 开源发布。

摘要 (Abstract)

Industrial 3D anomaly detection performance is fundamentally constrained by the scarcity and long-tailed distribution of abnormal samples. To address this challenge, we propose Synthesis4AD, an end-to-end paradigm that leverages large-scale, high-fidelity synthetic anomalies to learn more discriminative representations for 3D anomaly detection. At the core of Synthesis4AD is 3D-DefectStudio, a software platform built upon the controllable synthesis engine MPAS, which injects geometrically realistic defects guided by higher-dimensional support primitives while simultaneously generating accurate point-wise anomaly masks. Furthermore, Synthesis4AD incorporates a multimodal large language model (MLLM) to interpret product design information and automatically translate it into executable anomaly synthesis instructions, enabling scalable and knowledge-driven anomalous data generation. To improve the robustness and generalization of the downstream detector on unstructured point clouds, Synthesis4AD further introduces a training pipeline based on spatial-distribution normalization and geometry-faithful data augmentations, which alleviates the sensitivity of Point Transformer architectures to absolute coordinates and improves feature learning under realistic data variations. Extensive experiments demonstrate state-of-the-art performance on Real3D-AD, MulSen-AD, and a real-world industrial parts dataset. The proposed synthesis method MPAS and the interactive system 3D-DefectStudio will be publicly released at https://github.com/hustCYQ/Synthesis4AD.

Keywords: 3D anomaly detection, synthetic anomalies, multimodal large language model, industrial inspection, point cloud, data generation, defect synthesis, transformer architecture

203. ❌ InCTRLv2: Generalist Residual Models for Few-Shot Anomaly Detection and Segmentation

Authors: Jiawen Zhu, Mengjia Niu, Guansong Pang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04632v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

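Rationale: InCTRLv2 targets anomaly detection and segmentation in computer vision; its core innovation is a few-shot generalist anomaly detection framework based on in-context residual learning. Relevance to the scoring keywords is as follows: 1) The paper uses large-scale vision-language models (VLMs) as semantic prior encoders, which relates to 'Large Language Models OR LLMs OR Foundation Models', but VLMs are cross-modal vision-language models rather than pure language models, hence 5 points. 2) The core methods InCTRL and InCTRLv2 build on the idea of in-context learning, detecting anomalies by learning in-context residuals from few-shot normal examples, which is highly relevant to 'In-context Learning OR Many-shot Learning', hence 8 points. 3) The remaining keywords mainly concern pure language-model techniques, alignment, reasoning, agents, compression, and so on, and have no direct connection to this computer vision paper, hence 0 points.

!!! tip deepseek-chat TL;DR

To address the limited generalization of existing anomaly detection models, the paper proposes InCTRLv2, which uses a dual-branch structure and vision-language semantic guidance to perform few-shot, cross-domain generalist anomaly detection and segmentation, reaching state-of-the-art performance on ten datasets.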

摘要翻译

尽管近年来的异常检测方法在识别特定领域内的异常模式方面取得了显著进展，但大多数方法属于专家模型，需依赖特定目标数据集的大量训练样本进行训练，难以泛化至未见过的数据集。为应对这一局限，通用异常检测范式近年来应运而生，其目标是学习一个单一的通用模型，无需重新训练即可跨多样领域检测异常。为此，本研究提出了InCTRLv2，一种新颖的小样本通用异常检测与分割框架，该框架显著扩展了我们先前提出的通用异常检测模型InCTRL。基于InCTRL中利用少量正常样本学习上下文残差以检测异常的核心思想，InCTRLv2在双分支框架下引入了两个新颖且互补的异常感知视角。这是通过在InCTRL基础上构建的两个新模块实现的：i）在主分支中利用正常与异常数据进行判别式异常分数学习，该模块学习一个语义引导的异常与正常空间，支持从异常和正常双重视角对查询样本进行分类；ii）在辅助分支中仅利用正常数据进行单类别异常分数学习，该模块在语义空间中学习泛化的正常模式，专注于仅从正常性视角检测异常。两个分支均受到大规模视觉-语言模型所编码的丰富视觉-文本语义先验的引导。二者共同为异常检测提供了双重语义视角：一个强调正常与异常的判别，另一个则强调偏离正常性的语义。在十个异常检测数据集上的大量实验表明，InCTRLv2在各种设置下的异常检测与分割任务中均达到了最先进的性能。

摘要 (Abstract)

While recent anomaly detection (AD) methods have made substantial progress in recognizing abnormal patterns within specific domains, most of them are specialist models that are trained on large training samples from a specific target dataset, struggling to generalize to unseen datasets. To address this limitation, the paradigm of Generalist Anomaly Detection (GAD) has emerged in recent years, aiming to learn a single generalist model to detect anomalies across diverse domains without retraining. To this end, this work introduces InCTRLv2, a novel few-shot Generalist Anomaly Detection and Segmentation (GADS) framework that significantly extends our previously proposed GAD model, InCTRL. Building on the idea of learning in-context residuals with few-shot normal examples to detect anomalies as in InCTRL, InCTRLv2 introduces two new, complementary perspectives of anomaly perception under a dual-branch framework. This is accomplished by two novel modules upon InCTRL: i) Discriminative Anomaly Score Learning (DASL) with both normal and abnormal data in the main branch, which learns a semantic-guided abnormality and normality space that supports the classification of query samples from both the abnormality and normality perspectives; and ii) One-class Anomaly Score Learning (OASL) using only the normal data, which learns generalized normality patterns in a semantic space via an auxiliary branch, focusing on detecting anomalies through the lens of normality solely. Both branches are guided by rich visual-text semantic priors encoded by large-scale vision-language models. Together, they offer a dual semantic perspective for AD: one emphasizes normal-abnormal discriminations, while the other emphasizes normality-deviated semantics. Extensive experiments on ten AD datasets demonstrate that InCTRLv2 achieves SotA performance in both anomaly detection and segmentation tasks across various settings.

Keywords: Anomaly Detection, Anomaly Segmentation, Generalist Model, Few-shot Learning, In-context Learning, Vision-Language Models, Dual-branch Framework, Semantic Guidance

Authors: Mei Qiu, Jianqiang Zhao, Yanyun Qu Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04608v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

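Scoring rationale: This paper studies the detection of AI-generated images (AIGC), using physical features (such as Laplacian variance and Sobel statistics) to distinguish real from synthetic images and integrating them into the CLIP multimodal model. It is unrelated to the vast majority of the keywords, which mainly concern the technical principles, training methods, inference optimization, and agent systems of large language models (LLMs). The only relevant keyword is 'Hallucination Mitigation OR Factuality OR Truthfulness', because the paper states that its method helps mitigate hallucinations and textual inaccuracies in large multimodal models. This is a potential application direction rather than the paper's core content, hence a relevance of 5 (some connection). The paper involves neither innovation in large-model technical principles nor applications of large models in science, so it does not meet the research-background requirements.

!!! tip deepseek-chat TL;DR

This paper studies how stable physical features (such as Laplacian variance) can be used to detect AI-generated images across datasets and generative architectures, and achieves state-of-the-art performance on multiple benchmarks by integrating these features into the CLIP model.

Abstract translation (Chinese)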

人工智能生成内容（AIGC）的快速发展模糊了真实图像与合成图像之间的界限，暴露出现有深度伪造检测器的局限性——这些检测器往往过度拟合特定的生成模型。这一适应性危机要求我们从根本上重新审视区分自然图像与AI生成图像的内在物理特征。本文旨在解决两个关键研究问题：（1）哪些物理特征能够稳定且鲁棒地跨不同数据集和生成架构区分AI生成图像？（2）这些客观的像素级特征能否集成到如CLIP等多模态模型中，以提升检测性能，同时减轻基于语言信息的不可靠性？为回答这些问题，我们对涵盖各类GAN和扩散模型生成的20余个数据集中的15种物理特征进行了全面探索，并提出一种新颖的特征选择算法，识别出包括拉普拉斯方差、索贝尔统计量和残差噪声方差在内的五个核心物理特征——这些特征在所有测试数据集中均表现出稳定的判别能力。随后，这些特征被转化为文本编码值，并与语义描述结合，以指导CLIP中的图像-文本表示学习。大量实验表明，我们的方法在多个GenImage基准测试中取得了最先进的性能，在如Wukong和SDv1.4等数据集上实现了接近完美的准确率（99.8%）。通过将像素级真实性分析与语义理解相融合，本研究开创了基于物理特征的可信视觉-语言建模方法，并为缓解大型多模态模型中的幻觉和文本不准确问题开辟了新方向。

Abstract

The rapid advancement of AI generated content (AIGC) has blurred the boundaries between real and synthetic images, exposing the limitations of existing deepfake detectors that often overfit to specific generative models. This adaptability crisis calls for a fundamental reexamination of the intrinsic physical characteristics that distinguish natural from AI-generated images. In this paper, we address two critical research questions: (1) What physical features can stably and robustly discriminate AI generated images across diverse datasets and generative architectures? (2) Can these objective pixel-level features be integrated into multimodal models like CLIP to enhance detection performance while mitigating the unreliability of language-based information? To answer these questions, we conduct a comprehensive exploration of 15 physical features across more than 20 datasets generated by various GANs and diffusion models. We propose a novel feature selection algorithm that identifies five core physical features including Laplacian variance, Sobel statistics, and residual noise variance that exhibit consistent discriminative power across all tested datasets. These features are then converted into text encoded values and integrated with semantic captions to guide image text representation learning in CLIP. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple Genimage benchmarks, with near-perfect accuracy (99.8%) on datasets such as Wukong and SDv1.4. By bridging pixel level authenticity with semantic understanding, this work pioneers the use of physically grounded features for trustworthy vision language modeling and opens new directions for mitigating hallucinations and textual inaccuracies in large multimodal models.

Keywords: AI generated content, deepfake detection, physical features, CLIP, multimodal models, hallucination mitigation, image-text representation, synthetic image detection
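
The abstract above names Laplacian variance and Sobel statistics among the selected physical features and says they are converted into text-encoded values for CLIP. The following is a minimal sketch of how two such features might be computed and rendered as text, assuming a grayscale image stored as a 2D list of floats. The function names, the choice of Sobel statistics (mean and variance of gradient magnitude), and the text template are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pvariance

# Standard 3x3 kernels (textbook definitions, not taken from the paper).
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def convolve3x3(img, kernel):
    """Apply a 3x3 kernel to the interior pixels of a 2D list (no padding)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            acc = 0.0
            for dy in range(3):
                for dx in range(3):
                    acc += kernel[dy][dx] * img[y + dy - 1][x + dx - 1]
            row.append(acc)
        out.append(row)
    return out


def laplacian_variance(img):
    """Variance of the Laplacian response, a common sharpness measure."""
    lap = convolve3x3(img, LAPLACIAN)
    return pvariance([v for row in lap for v in row])


def sobel_stats(img):
    """Mean and variance of the Sobel gradient magnitude (assumed statistics)."""
    gx = convolve3x3(img, SOBEL_X)
    gy = convolve3x3(img, SOBEL_Y)
    mags = [(a * a + b * b) ** 0.5
            for ra, rb in zip(gx, gy) for a, b in zip(ra, rb)]
    return mean(mags), pvariance(mags)


def features_to_text(img):
    """Hypothetical text encoding of the features for an image-text model."""
    lv = laplacian_variance(img)
    sm, sv = sobel_stats(img)
    return (f"laplacian variance {lv:.2f}, "
            f"sobel mean {sm:.2f}, sobel variance {sv:.2f}")
```

A perfectly flat image yields zero for every feature, while a checkerboard pattern yields a positive Laplacian variance, which is the kind of contrast such features are meant to expose.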

205. ❌ LP-GEMM: Integrating Layout Propagation into GEMM Operations

Authors: César Guedes Carneiro, Lucas Alvarenga, Guido Araujo, Sandro Rigo Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04599v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

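Scoring rationale: The LP-GEMM paper focuses on optimizing general matrix multiplication (GEMM) operations, using layout propagation to reduce redundant data packing and improve the performance of sequential GEMMs in scientific computing and machine learning workloads. Its core contribution is low-level compute optimization rather than innovation in large-model principles or applications. All keywords relate directly to large-model technology, training methods, alignment, reasoning, agents, or AI-for-science applications, whereas this paper only concerns low-level compute acceleration, so nearly all keywords have a relevance of 0. The only relevant keyword is 'Speculative Decoding OR Inference Acceleration': the paper optimizes GEMM, which indirectly helps inference acceleration, but it does not directly target large-model inference optimization, hence a relevance of 5 (some connection).

!!! tip deepseek-chat TL;DR

This paper proposes LP-GEMM, a method that optimizes sequential general matrix multiplication (GEMM) operations through layout propagation, eliminating redundant data packing. It is evaluated on x86 and RISC-V architectures, achieves an average 2.25x speedup over OpenBLAS on Intel x86, and is applied to a Llama-3.2 inference path to validate the performance gains.

Abstract translation (Chinese)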

在科学计算与现代机器学习（ML）工作负载中，一系列相互依赖的通用矩阵乘法（GEMM）操作往往主导着执行时间。尽管先进的BLAS库对单个GEMM调用进行了深度优化，但它们仍受限于BLAS应用程序接口（API）的约束——该接口要求每次调用都必须独立打包输入矩阵，并将输出恢复至标准内存布局。在连续的GEMM操作中，这些约束导致了冗余的打包与解包过程，浪费了宝贵的计算资源。
本文提出了LP-GEMM，这是一种对GEMM内核的分解方法，能够实现跨连续GEMM操作的打包布局传播。该方法在保持边界处完整BLAS语义正确性的同时，消除了不必要的数据重新打包过程。我们在x86（AVX-512）和RISC-V（RVV 1.0）架构上，针对类多层感知机（MLP-like）和类注意力机制（Attention-like）工作负载对LP-GEMM进行了评估。实验结果显示，在Intel x86平台上，对于连续GEMM操作，LP-GEMM相比OpenBLAS平均实现了2.25倍的加速；相较于Intel MKL等厂商优化库，也取得了具有竞争力的性能提升。
我们通过完全基于BLAS层级的GEMM调用，实现了一个独立的Llama-3.2推理路径的C++版本，从而在微基准测试之外验证了该方法的实用性。这些结果证实，利用操作间的数据布局传播能够显著提升系统性能。

Abstract

In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack input matrices and restore outputs to a canonical memory layout. In sequential GEMMs, these constraints cause redundant packing and unpacking, wasting valuable computational resources. This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads. Our results show average speedups of 2.25x over OpenBLAS on Intel x86 for sequential GEMMs and competitive gains relative to vendor-optimized libraries such as Intel MKL. We demonstrate the practicality of the approach beyond microbenchmarks by implementing a standalone C++ version of the Llama-3.2 inference path using exclusively BLAS-level GEMM calls. These results confirm that leveraging data layout propagation between operations can significantly boost performance.

Keywords: GEMM optimization, layout propagation, sequential GEMMs, BLAS libraries, inference acceleration, machine learning workloads, performance improvement, Llama-3.2 inference
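
The redundancy described in the abstract above can be shown with a toy model: when each GEMM call packs its input and unpacks its output, a chain of N calls performs N packs and N unpacks, whereas propagating the packed layout needs one of each. The sketch below is only a counting illustration under simplifying assumptions (row panels of a hypothetical height `mr`, only the left operand packed, pure-Python arithmetic). It is not the paper's kernel decomposition, and all names are made up for the example.

```python
# Counters for how often data is repacked (illustrative bookkeeping only).
stats = {"pack": 0, "unpack": 0}


def pack(a, mr=2):
    """Split a row-major matrix into panels of at most `mr` rows."""
    stats["pack"] += 1
    return [a[i:i + mr] for i in range(0, len(a), mr)]


def unpack(panels):
    """Restore the canonical row-major layout from row panels."""
    stats["unpack"] += 1
    return [row for panel in panels for row in panel]


def gemm_packed(a_panels, b):
    """Multiply each packed panel by b; the result stays in packed layout."""
    cols = list(zip(*b))
    return [[[sum(x * y for x, y in zip(row, col)) for col in cols]
             for row in panel]
            for panel in a_panels]


def gemm_blas_like(a, b):
    """Canonical layout in, canonical layout out, as a BLAS-style call requires."""
    return unpack(gemm_packed(pack(a), b))


def chain_baseline(x, weights):
    """Each GEMM in the chain packs and unpacks independently."""
    for w in weights:
        x = gemm_blas_like(x, w)
    return x


def chain_propagated(x, weights):
    """Pack once, keep the packed layout between GEMMs, unpack once."""
    panels = pack(x)
    for w in weights:
        panels = gemm_packed(panels, w)
    return unpack(panels)
```

Both chains return the same matrix. For a chain of two GEMMs the baseline records two packs and two unpacks, while the propagated version records one of each, with canonical layout preserved at the chain boundaries.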

206. ❌ Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

Authors: Quoc-Huy Trinh, Mustapha Abdullahi, Bo Zhao, Debesh Jha Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04579v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

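Scoring rationale: The paper studies efficient multimodal large language models (MLLMs), which fall within large-model technology, so it is highly relevant to 'Large Language Models' (10). It focuses on improving model efficiency for resource-constrained scenarios, which has some connection to 'Small Language Models OR SLMs OR On-device AI' and 'Speculative Decoding OR Inference Acceleration' (8 each) because it involves efficient inference and deployment. Other keywords such as MoE, Scaling Laws, and Alignment are not mentioned in the abstract and therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the high computational cost of multimodal large language models and the difficulty of deploying them in resource-constrained scenarios, this paper proposes Firebolt-VL, which uses a Liquid Foundation Model decoder and a Token-Grid Correlation Module to achieve efficient and accurate vision-language understanding.

Abstract translation (Chinese)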

多模态大语言模型（MLLMs）的最新进展显著推动了视觉-语言理解领域的进步，但其高昂的计算成本限制了在资源受限场景（如个人助手、文档理解和智能摄像头）中的部署。现有方法大多依赖基于Transformer的交叉注意力机制，其二次复杂度制约了效率。此外，小型视觉-语言模型往往难以精确捕捉细粒度、任务相关的视觉区域，导致在细粒度推理任务上性能下降，从而限制了其在实际应用中的有效性。为解决这些问题，我们提出了Firebolt-VL——一种高效的视觉-语言模型，它使用液态基础模型（Liquid Foundation Model, LFM）解码器替代了基于Transformer的解码器。为进一步增强视觉定位能力，我们提出了一种令牌-网格关联模块（Token-Grid Correlation Module），该模块计算文本令牌与图像块之间的轻量级关联，并通过状态空间模型结合FiLM条件进行调制。这使得模型能够选择性地强调与文本提示相关的视觉区域，同时保持线性时间推理。在多个基准测试上的实验结果表明，Firebolt-VL能够实现准确、细粒度的理解，并显著提升了效率。我们的模型和代码已公开于：https://fireboltvl.github.io

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates via the state-space model with FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io

Keywords: Vision-Language Model, Multimodal LLM, Efficient Inference, Cross-Modality Modulation, Liquid Foundation Model, Token-Grid Correlation, Fine-grained Understanding, Linear-time Inference
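
The abstract above describes computing lightweight correlations between text tokens and image patches and using them for FiLM conditioning. The sketch below shows one plausible reading of that idea: dot-product correlations are turned into a per-patch relevance score, which then drives a FiLM-style scale and shift. It omits the state-space model entirely, and the aggregation rule and the parameters `w_gamma`, `b_gamma`, `w_beta`, `b_beta` are hypothetical stand-ins for learned weights, not the paper's module.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def patch_relevance(text_tokens, patches):
    """Average, over text tokens, of the softmax-normalized token-patch dot products."""
    relevance = [0.0] * len(patches)
    for tok in text_tokens:
        scores = [sum(a * b for a, b in zip(tok, p)) for p in patches]
        for i, weight in enumerate(softmax(scores)):
            relevance[i] += weight / len(text_tokens)
    return relevance


def film(patches, relevance, w_gamma=1.0, b_gamma=1.0, w_beta=0.0, b_beta=0.0):
    """FiLM-style modulation: each patch feature becomes gamma * x + beta,
    with gamma and beta computed from that patch's relevance score."""
    out = []
    for patch, r in zip(patches, relevance):
        gamma = w_gamma * r + b_gamma
        beta = w_beta * r + b_beta
        out.append([gamma * v + beta for v in patch])
    return out
```

With these default parameters, patches that correlate more strongly with the prompt are scaled up more than the others. The cost is one pass over token-patch pairs, with no attention among the patches themselves.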

207. ❌ PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis

Authors: Inseong Choi, Siwoo Lee, Seung-Hun Nam, Soohwan Song Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04576v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

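Scoring rationale: This paper focuses on diffusion models for sparse-view novel view synthesis. It proposes a partial-reference image quality assessment method (PR-IQA) to evaluate the quality of diffusion-generated views and integrates it into a 3D Gaussian Splatting (3DGS) pipeline to improve 3D reconstruction quality. Its core content concerns computer vision, image processing, and 3D reconstruction, and is entirely unrelated to the scoring keywords (which mainly cover large language models, deep-learning technical principles, and their applications in science), so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes a partial-reference image quality assessment framework (PR-IQA) for evaluating novel-view images generated by diffusion models, and integrates it into a 3D Gaussian Splatting pipeline to filter inconsistent regions and improve 3D reconstruction.

Abstract translation (Chinese)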

扩散模型在稀疏视角新视图合成领域展现出潜力，因其能够生成伪真实视图以辅助如3D高斯溅射等三维重建流程。然而，这些合成图像常存在光度与几何不一致性，直接将其用于监督会损害重建质量。为解决此问题，我们提出部分参考图像质量评估框架，该框架利用不同位姿的参考图像评估扩散生成视图，无需真实数据作为基准。PR-IQA首先在重叠区域计算几何一致的部分质量图，随后通过质量补全将该部分图修复为稠密的完整图像质量图。此补全过程通过交叉注意力机制实现，该机制融合了参考视图的上下文信息，确保跨视图一致性并实现全面质量评估。当PR-IQA被集成至扩散增强的3DGS流程时，其质量图识别出的高置信度区域将作为监督约束范围。实验表明，PR-IQA优于现有图像质量评估方法，在无需真实监督的情况下达到全参考级别的精度。因此，我们提出的质量感知3DGS方法能更有效地过滤不一致信息，生成更优的三维重建与新视图合成结果。项目页面详见：https://kakaomacao.github.io/pr-iqa-project-page/。

Abstract

Diffusion models are promising for sparse-view novel view synthesis (NVS), as they can generate pseudo-ground-truth views to aid 3D reconstruction pipelines like 3D Gaussian Splatting (3DGS). However, these synthesized images often contain photometric and geometric inconsistencies, and their direct use for supervision can impair reconstruction. To address this, we propose Partial-Reference Image Quality Assessment (PR-IQA), a framework that evaluates diffusion-generated views using reference images from different poses, eliminating the need for ground truth. PR-IQA first computes a geometrically consistent partial quality map in overlapping regions. It then performs quality completion to inpaint this partial map into a dense, full-image map. This completion is achieved via a cross-attention mechanism that incorporates reference-view context, ensuring cross-view consistency and enabling thorough quality assessment. When integrated into a diffusion-augmented 3DGS pipeline, PR-IQA restricts supervision to high-confidence regions identified by its quality maps. Experiments demonstrate that PR-IQA outperforms existing IQA methods, achieving full-reference-level accuracy without ground-truth supervision. Thus, our quality-aware 3DGS approach more effectively filters inconsistencies, producing superior 3D reconstructions and NVS results. The project page is available at https://kakaomacao.github.io/pr-iqa-project-page/.

Keywords: Diffusion Models, Novel View Synthesis, Image Quality Assessment, 3D Gaussian Splatting, 3D Reconstruction, Partial-Reference, Cross-view Consistency, Quality Completion
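
The abstract above states that supervision is restricted to high-confidence regions identified by the quality map. A minimal sketch of that idea is a masked photometric loss: pixels whose quality falls below a threshold do not contribute. The L1 loss, the hard threshold of 0.5, and the function name are assumptions for illustration; the paper may weight or select regions differently.

```python
def masked_l1_loss(rendered, pseudo_gt, quality, threshold=0.5):
    """Mean absolute error over pixels whose quality score is at least `threshold`.

    All three inputs are 2D lists of floats with the same shape.
    Returns 0.0 when no pixel passes the threshold.
    """
    total, count = 0.0, 0
    for r_row, g_row, q_row in zip(rendered, pseudo_gt, quality):
        for r, g, q in zip(r_row, g_row, q_row):
            if q >= threshold:
                total += abs(r - g)
                count += 1
    return total / count if count else 0.0
```

In this form, a diffusion-generated pixel with a low quality score contributes nothing to the loss, however far it is from the rendered value.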

208. ❌ Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models

Authors: Arian Komaei Koma, Seyed Amir Kasaei, Ali Aghayari, AmirMahdi Sadeghzadeh, Mohammad Hossein Rohban Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04575v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

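Scoring rationale: This paper studies concept unlearning for text-to-image diffusion models, focusing on concept removal in Stable Diffusion and its effect on compositional generation. All scoring keywords relate to large language models (LLMs), whereas the subject here is text-to-image diffusion models (such as Stable Diffusion), a different model architecture and application domain. The paper does not involve LLM techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper systematically evaluates concept unlearning methods for text-to-image diffusion models and finds that strong unlearning methods substantially impair compositional generation (such as attribute binding and spatial reasoning), while methods that preserve compositional structure struggle to achieve robust concept erasure.

Abstract translation (Chinese)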

事后遗忘已成为从大型文本到图像扩散模型中移除不良概念的一种实用机制。然而，先前的研究主要通过擦除成功率来评估遗忘效果；其对更广泛生成能力的影响仍知之甚少。在本研究中，我们通过组合式文本到图像生成的视角，对概念遗忘进行了系统的实证研究。聚焦于Stable Diffusion 1.4中的裸露内容移除，我们使用T2I-CompBench++和GenEval评估平台，结合成熟的遗忘基准，对多种最先进的遗忘方法进行了评估。我们的结果揭示了遗忘效果与组合完整性之间始终存在的权衡：实现强擦除效果的方法常常导致属性绑定、空间推理和计数能力的大幅退化。相反，能保持组合结构的方法往往无法提供稳健的擦除效果。这些发现凸显了当前评估实践的局限性，并强调需要建立明确考虑超越目标抑制的语义保存的遗忘目标。

Abstract

Post-hoc unlearning has emerged as a practical mechanism for removing undesirable concepts from large text-to-image diffusion models. However, prior work primarily evaluates unlearning through erasure success; its impact on broader generative capabilities remains poorly understood. In this work, we conduct a systematic empirical study of concept unlearning through the lens of compositional text-to-image generation. Focusing on nudity removal in Stable Diffusion 1.4, we evaluate a diverse set of state-of-the-art unlearning methods using T2I-CompBench++ and GenEval, alongside established unlearning benchmarks. Our results reveal a consistent trade-off between unlearning effectiveness and compositional integrity: methods that achieve strong erasure frequently incur substantial degradation in attribute binding, spatial reasoning, and counting. Conversely, approaches that preserve compositional structure often fail to provide robust erasure. These findings highlight limitations of current evaluation practices and underscore the need for unlearning objectives that explicitly account for semantic preservation beyond targeted suppression.

Keywords: text-to-image diffusion models, concept unlearning, compositional generation, erasure effectiveness, semantic preservation, Stable Diffusion, post-hoc unlearning, generative capabilities

209. ❌ TAPE: A two-stage parameter-efficient adaptation framework for foundation models in OCT-OCTA analysis

Authors: Xiaofei Su, Zengshuo Wang, Minghe Sun, Xin Zhao, Mingzhu Sun Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04571v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

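Scoring rationale: The paper proposes the TAPE framework, which performs parameter-efficient fine-tuning of foundation models for medical image analysis. Its core topics are foundation models, parameter-efficient fine-tuning (PEFT), domain adaptation, and applications of AI in science (AI for Science/Bioinformatics). Other keywords such as MoE, SLMs, RLHF, and RAG are unrelated, because the paper focuses on a specific adaptation method for medical image segmentation rather than on language models, reasoning techniques, or agent systems.

!!! tip deepseek-chat TL;DR

To address the domain shift and task misalignment that foundation models face in OCT-OCTA image analysis, this paper proposes TAPE, a two-stage parameter-efficient adaptation framework that decouples domain alignment from task fitting and achieves superior parameter efficiency and generalization on retinal layer segmentation.

Abstract translation (Chinese)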

光学相干断层扫描（OCT）与光学相干断层扫描血管成像（OCTA）图像的自动化分析对于实现稳健的眼科诊断至关重要。现有主流方法依赖从零开始训练，严重受制于海量数据与模型规模，从而阻碍了其在资源受限的临床环境中的实际部署。尽管基于基础模型（Foundation Models, FMs）的迁移学习前景广阔，但仍面临显著挑战：领域偏移与任务失配。为解决这些问题，我们提出TAPE：一种通过参数高效微调（Parameter-Efficient Fine-tuning, PEFT）的两阶段适应框架，该框架策略性地将下游分割任务的适应过程解耦为领域对齐与任务拟合两个阶段。在领域适应阶段，我们创新性地将参数高效微调应用于掩码图像建模（masked image modeling）中以实现医学图像领域适应，据我们所知，这是一种新颖的方法。将TAPE应用于通用基础模型（掩码自编码器，MAE）与专用基础模型（RETFound）的视网膜层分割任务时，其展现出卓越的参数效率，并在多种病理条件下取得了最先进的泛化性能。

Abstract

Automated analysis of optical coherence tomography (OCT) and OCT angiography (OCTA) images is critical for robust ophthalmic diagnosis. Existing mainstream methods trained from scratch rely heavily on massive data and model scale, thereby hindering their practical deployment in resource-constrained clinical settings. Although transfer learning based on foundation models (FMs) is promising, it still faces significant challenges: domain shift and task misalignment. To address these, we propose TAPE: A Two-stage Adaptation Framework via Parameter-Efficient Fine-tuning, which strategically decouples adaptation into domain alignment and task fitting for downstream segmentation. The domain adaptation stage notably applies parameter-efficient fine-tuning (PEFT) in the context of masked image modeling for medical image domain adaptation, a novel approach to the best of our knowledge. Applying TAPE to retinal layer segmentation on both universal (masked auto-encoder, MAE) and specialized (RETFound) FMs, it demonstrates superior parameter efficiency and achieves state-of-the-art generalization performance across diverse pathologies.

Keywords: Foundation Models, Parameter-efficient Fine-tuning, Domain Adaptation, Medical Image Analysis, OCT-OCTA, Retinal Layer Segmentation, Masked Auto-encoder, RETFound
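
The abstract above claims superior parameter efficiency but does not name the PEFT technique used. As a rough illustration of what parameter efficiency means, the sketch below counts trainable parameters for a LoRA-style low-rank adapter on a single weight matrix. LoRA is chosen here only as a familiar example, and the layer size and rank are hypothetical, not values from the paper.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in a rank-`rank` adapter: a (d_in x rank) and a (rank x d_out) matrix."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in, d_out, rank):
    """Adapter parameters as a fraction of the frozen d_in x d_out weight matrix."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical example: a 768 x 768 projection adapted with rank 8.
adapter_params = lora_trainable_params(768, 768, 8)
fraction = trainable_fraction(768, 768, 8)
```

For this hypothetical layer the adapter has 12,288 trainable parameters against 589,824 frozen ones, which is 1/48 or roughly 2% of the layer.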

210. ❌ Visual Prompt Based Reasoning for Offroad Mapping using Multimodal LLMs

Authors: Abdelmoamen Nasser, Yousef Baba’a, Murad Mebrahtu, Nadya Abdel Madjid, Jorge Dias, Majid Khonji Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04564v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

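Scoring rationale: This paper proposes a zero-shot off-road navigation framework based on a vision-language model (VLM): SAM2 segments the environment and the VLM reasons about drivable areas. Its core is the application of a large model (the VLM) to robotics and autonomous driving, in particular using the VLM's reasoning ability in place of the traditional multi-model approach. It is therefore highly relevant to 'Large Language Models' (a VLM is a multimodal large model; 8); relevant to 'Chain of Thought' and 'System 2 Thinking', because the VLM is used for multi-step reasoning to identify drivable areas (8); relevant to 'LLM Agents', because the framework forms part of an autonomous navigation agent (8); and relevant to 'AI for Science', as an application of AI in the scientific domain of robotics and autonomous driving (8). Other keywords such as MoE, SFT, and RAG are not covered in the paper and have a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes a zero-shot off-road navigation framework based on a vision-language model (VLM), combining SAM2 segmentation with VLM reasoning to identify drivable areas. It surpasses state-of-the-art trainable models on segmentation datasets and enables a full off-road navigation stack.

Abstract translation (Chinese)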

传统越野自主系统通常依赖独立模型分别进行地形分类、高度估计以及滑移或坡度条件量化。使用多个模型需要分别训练各组件、准备特定任务的数据集并进行微调。本研究提出一种零样本方法，利用SAM2进行环境分割，并借助视觉语言模型（VLM）推理可通行区域。该方法同时向VLM输入原始图像及经过分割的图像——后者每个掩码均标注数字标签，随后通过提示要求VLM识别哪些由数字标签代表的区域具备可通行性。结合规划与控制模块，这一统一框架无需依赖显式的地形专用模型，转而利用VLM固有的推理能力。我们的方法在高分辨率分割数据集上超越了当前最先进的可训练模型，并在Isaac Sim越野环境中实现了全栈导航能力。

Abstract

Traditional approaches to off-road autonomy rely on separate models for terrain classification, height estimation, and quantifying slip or slope conditions. Utilizing several models requires training each component separately, having task specific datasets, and fine-tuning. In this work, we present a zero-shot approach leveraging SAM2 for environment segmentation and a vision-language model (VLM) to reason about drivable areas. Our approach involves passing to the VLM both the original image and the segmented image annotated with numeric labels for each mask. The VLM is then prompted to identify which regions, represented by these numeric labels, are drivable. Combined with planning and control modules, this unified framework eliminates the need for explicit terrain-specific models and relies instead on the inherent reasoning capabilities of the VLM. Our approach surpasses state-of-the-art trainable models on high resolution segmentation datasets and enables full stack navigation in our Isaac Sim offroad environment.

Keywords: vision-language model, zero-shot approach, off-road autonomy, drivable area reasoning, environment segmentation, autonomous navigation, multimodal LLM, visual prompt
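
The abstract above describes prompting the VLM with numerically labeled masks and asking which labels are drivable. The sketch below covers only the surrounding glue code: building a prompt from mask IDs, parsing the numbers out of a free-text reply, and turning them into a binary drivable mask. It does not call SAM2 or any VLM. The prompt wording, the function names, and the assumption that the reply mentions only drivable labels are all hypothetical.

```python
import re


def build_prompt(mask_ids):
    """Compose a hypothetical text prompt listing the numbered masks."""
    ids = ", ".join(str(i) for i in sorted(mask_ids))
    return (f"The second image shows segmentation masks numbered {ids}. "
            "List the numbers of the regions a ground vehicle can drive on.")


def parse_drivable_labels(reply, valid_ids):
    """Extract integers from the reply and keep only those that are real mask IDs.

    Naive by design: it assumes every valid number mentioned is drivable.
    """
    found = {int(tok) for tok in re.findall(r"\d+", reply)}
    return found & set(valid_ids)


def drivable_mask(label_map, drivable_ids):
    """Convert a 2D map of mask IDs into a binary map (1 = drivable)."""
    return [[1 if lab in drivable_ids else 0 for lab in row] for row in label_map]
```

A downstream planner could consume the binary map directly. A production parser would need to handle replies that also name non-drivable regions, for example by requesting a structured reply format.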

211. ❌ Relational Epipolar Graphs for Robust Relative Camera Pose Estimation

Authors: Prateeth Rao, Sachit Rao Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04554v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

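Scoring rationale: The paper "Relational Epipolar Graphs for Robust Relative Camera Pose Estimation" addresses relative camera pose estimation in computer vision, using a graph neural network approach to handle epipolar geometry relations. All scoring keywords concern large language models, innovations in deep learning techniques, or applications of AI in science, whereas this paper studies a classical geometric computer vision problem and does not involve any large-model technique, novel deep learning method, or application of AI to scientific fields such as biology or chemistry. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a relative camera pose estimation method based on relational epipolar graphs, which processes keypoint correspondences through graph operations and shows better robustness than classical and learning-guided methods on indoor and outdoor benchmarks.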

Abstract

A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses using matched keypoints. Accurate estimation is challenged by noisy correspondences. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, where matched keypoints are nodes and nearby ones are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a quaternion rotation, translation vector, and the Essential Matrix (EM). Minimizing a loss comprising (i) $\mathcal{L}_2$ differences with ground truth (GT), (ii) Frobenius norm between estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences, yields the relative pose between image pairs. The dense detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.
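
The abstract names two quantities that can be made concrete: the Essential Matrix built from a rotation and a translation, and the Frobenius distance between an estimated and a ground-truth matrix. The sketch below is illustrative only and is not the authors' implementation. It assumes the standard convention $E = [t]_\times R$ with $X_2 = R X_1 + t$, and all function names and the synthetic values are hypothetical.

```python
import math

def quat_to_rot(w, x, y, z):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def essential_matrix(R, t):
    """Assumed convention: E = [t]_x R."""
    return matmul(skew(t), R)

def frobenius_distance(A, B):
    """Frobenius norm of A - B, as in loss term (ii) of the abstract."""
    return math.sqrt(sum((A[i][j] - B[i][j]) ** 2
                         for i in range(3) for j in range(3)))

def epipolar_residual(E, x1, x2):
    """x2^T E x1, which is zero for a noise-free correspondence."""
    Ex1 = [sum(E[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * Ex1[i] for i in range(3))

# Synthetic ground truth: rotation of 0.2 rad about the y-axis plus a translation.
R_gt = quat_to_rot(math.cos(0.1), 0.0, math.sin(0.1), 0.0)
t_gt = [0.5, 0.0, 0.1]
E_gt = essential_matrix(R_gt, t_gt)

# One 3D point seen from both cameras, projected to normalized image coordinates.
X1 = [0.3, -0.2, 4.0]
X2 = [sum(R_gt[i][j] * X1[j] for j in range(3)) + t_gt[i] for i in range(3)]
x1 = [c / X1[2] for c in X1]
x2 = [c / X2[2] for c in X2]
residual = epipolar_residual(E_gt, x1, x2)

# A slightly wrong translation estimate gives a nonzero Frobenius distance.
E_est = essential_matrix(R_gt, [0.5, 0.05, 0.1])
em_loss = frobenius_distance(E_est, E_gt)
```

A clean correspondence gives a residual that is numerically zero, while a noisy match does not, which is why the abstract describes noisy correspondences as the main obstacle to accurate estimation.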

Keywords: relative camera pose estimation, epipolar correspondence graphs, graph neural networks, essential matrix, visual SLAM, robust estimation, LoFTR, geometric vision

212. ❌ G-EDF-Loc: 3D Continuous Gaussian Distance Field for Robust Gradient-Based 6DoF Localization

Authors: José E. Maese, Lucía Coto-Elena, Luis Merino, Fernando Caballero Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04525v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

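Scoring rationale: The paper focuses on robot localization and 3D distance field representation, using Gaussian mixture models and spatial partitioning. It is entirely unrelated to the scoring keywords (all of which concern large models, deep learning techniques, or AI applications in science), so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes a robust 6-DoF localization framework based on a continuous 3D Gaussian distance field (G-EDF). A block-sparse Gaussian mixture model provides memory efficiency and $C^1$ continuity; the method performs competitively against state-of-the-art methods on large-scale datasets and remains resilient under odometry degradation or without IMU priors.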

Abstract

This paper presents a robust 6-DoF localization framework based on a direct, CPU-based scan-to-map registration pipeline. The system leverages G-EDF, a novel continuous and memory-efficient 3D distance field representation. The approach models the Euclidean Distance Field (EDF) using a Block-Sparse Gaussian Mixture Model with adaptive spatial partitioning, ensuring $C^1$ continuity across block transitions and mitigating boundary artifacts. By leveraging the analytical gradients of this continuous map, which maintain Eikonal consistency, the proposed method achieves high-fidelity spatial reconstruction and real-time localization. Experimental results on large-scale datasets demonstrate that G-EDF-Loc performs competitively against state-of-the-art methods, exhibiting exceptional resilience even under severe odometry degradation or in the complete absence of IMU priors.

Keywords: 6-DoF localization, Gaussian Distance Field, scan-to-map registration, Block-Sparse Gaussian Mixture Model, Euclidean Distance Field, real-time localization, odometry degradation, IMU priors

213. ❌ MPTF-Net: Multi-view Pyramid Transformer Fusion Network for LiDAR-based Place Recognition

Authors: Shuyuan Li, Zihang Wang, Xieyuanli Chen, Wenkai Zhu, Xiaoteng Fang, Peizhou Ni, Junhao Yang, Dong Kong Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04513v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

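Scoring rationale: The paper addresses LiDAR point cloud place recognition in computer vision and robotics, proposing a Transformer-based multi-view fusion network. Its content is entirely unrelated to the scoring keywords, all of which center on large models, deep learning techniques, and their applications in science. The paper does not involve large language models of any kind, model training or alignment techniques, inference optimization, agent systems, or concrete AI for Science applications.

!!! tip deepseek-chat TL;DR

The paper proposes MPTF-Net, a new multi-view pyramid Transformer fusion network. Through NDT-BEV encoding and cross-view interaction, it addresses the inability of conventional BEV representations to capture fine-grained geometric structure in LiDAR point cloud place recognition, achieving state-of-the-art performance on multiple datasets while retaining real-time inference.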

Abstract

LiDAR-based place recognition (LPR) is essential for global localization and loop-closure detection in large-scale SLAM systems. Existing methods typically construct global descriptors from Range Images or BEV representations for matching. BEV is widely adopted due to its explicit 2D spatial layout encoding and efficient retrieval. However, conventional BEV representations rely on simple statistical aggregation, which fails to capture fine-grained geometric structures, leading to performance degradation in complex or repetitive environments. To address this, we propose MPTF-Net, a novel multi-view multi-scale pyramid Transformer fusion network. Our core contribution is a multi-channel NDT-based BEV encoding that explicitly models local geometric complexity and intensity distributions via Normal Distribution Transform, providing a noise-resilient structural prior. To effectively integrate these features, we develop a customized pyramid Transformer module that captures cross-view interactive correlations between Range Image Views (RIV) and NDT-BEV at multiple spatial scales. Extensive experiments on the nuScenes, KITTI and NCLT datasets demonstrate that MPTF-Net achieves state-of-the-art performance, specifically attaining a Recall@1 of 96.31% on the nuScenes Boston split while maintaining an inference latency of only 10.02 ms, making it highly suitable for real-time autonomous unmanned systems.
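
The Normal Distribution Transform mentioned above summarizes the points that fall into each grid cell by a Gaussian, that is, a mean and a covariance. The sketch below shows only that generic per-cell step on a BEV grid. It is an assumption-laden illustration and not the paper's multi-channel encoding: the function name `ndt_bev_cells`, the cell size, and the choice to ignore intensity are all hypothetical.

```python
import math
from collections import defaultdict

def ndt_bev_cells(points, cell_size):
    """Group (x, y, z) points into BEV cells and fit a Gaussian to each cell.

    Returns {(ix, iy): (count, mean, covariance)} with a 3-vector mean and a
    3x3 population covariance stored as nested lists.
    """
    cells = defaultdict(list)
    for p in points:
        key = (math.floor(p[0] / cell_size), math.floor(p[1] / cell_size))
        cells[key].append(p)

    stats = {}
    for key, pts in cells.items():
        n = len(pts)
        mean = [sum(p[d] for p in pts) / n for d in range(3)]
        cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pts) / n
                for j in range(3)] for i in range(3)]
        stats[key] = (n, mean, cov)
    return stats

# Three hypothetical LiDAR returns; the first two share the cell (0, 0).
points = [(0.1, 0.1, 0.0), (0.3, 0.1, 2.0), (1.5, 0.2, 1.0)]
stats = ndt_bev_cells(points, cell_size=1.0)
count, mean, cov = stats[(0, 0)]
```

A plain occupancy or max-height BEV would keep one number per cell. The covariance additionally records how the points are spread, which is one plausible reading of the "local geometric complexity" the abstract refers to.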

Keywords: LiDAR-based place recognition, Multi-view fusion, Pyramid Transformer, NDT-BEV encoding, Range Image Views, Autonomous unmanned systems, Real-time inference, Global localization

214. ❌ MedROI: Codec-Agnostic Region of Interest-Centric Compression for Medical Images

Authors: Jiwon Kim, Ikbeom Jang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04511v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

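Scoring rationale: The paper "MedROI: Codec-Agnostic Region of Interest-Centric Compression for Medical Images" focuses on medical image compression, proposing a codec-agnostic, region-of-interest-centric compression framework. The work belongs to medical image processing and is entirely unrelated to most keywords (large models, deep learning techniques, training methods, inference optimization, alignment, agent systems, and so on), which therefore score 0. The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI techniques to medical image processing (a scientific domain); this is the application setting rather than the core contribution, so it receives a relevance of 5 (some relation).

!!! tip deepseek-chat TL;DR

To improve the efficiency of medical image storage and transfer, the paper proposes MedROI, a general region-of-interest compression framework that crops away the background and compresses only the ROI, significantly improving compression ratio and encoding/decoding speed while preserving reconstruction quality.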

Abstract

Medical imaging archives are growing rapidly in both size and resolution, making efficient compression increasingly important for storage and data transfer. Most existing codecs compress full images/volumes (including non-diagnostic background) or apply differential ROI coding that still preserves background bits. We propose MedROI, a codec-agnostic, plug-and-play ROI-centric framework that discards background voxels prior to compression. MedROI extracts a tight tissue bounding box via lightweight intensity-based thresholding and stores a fixed 54-byte metadata record to enable spatial restoration during decompression. The cropped ROI is then compressed using any existing 2D or 3D codec without architectural modifications or retraining. We evaluate MedROI on 200 T1-weighted brain MRI volumes from ADNI using 6 codec configurations spanning conventional codecs (JPEG2000 2D/3D, HEIF) and neural compressors (LIC_TCM, TCM+AuxT, BCM-Net, SirenMRI). MedROI yields statistically significant improvements in compression ratio and encoding/decoding time for most configurations (two-sided t-test with multiple-comparison correction), while maintaining comparable reconstruction quality when measured within the ROI; HEIF is the primary exception in compression-ratio gains. For example, on JPEG2000 2D (lv3), MedROI improves CR from 20.35 to 27.37 while reducing average compression time from 1.701 s to 1.380 s. Code is available at https://github.com/labhai/MedROI.
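
The crop-then-restore idea described above can be sketched on a toy 2D image. This is a minimal illustration under stated assumptions, not the MedROI code: the functions `crop_roi` and `restore` are hypothetical, the metadata is a plain dictionary rather than the paper's 54-byte record (whose layout is not given here), and a real pipeline would pass the cropped ROI to a codec between the two steps.

```python
def crop_roi(image, threshold):
    """Return (roi, meta) for the tight bounding box of pixels above threshold."""
    rows = [r for r, row in enumerate(image) if any(v > threshold for v in row)]
    cols = [c for c in range(len(image[0]))
            if any(row[c] > threshold for row in image)]
    if not rows:
        raise ValueError("no pixel exceeds the threshold")
    r0, r1 = rows[0], rows[-1] + 1
    c0, c1 = cols[0], cols[-1] + 1
    roi = [row[c0:c1] for row in image[r0:r1]]
    # Hypothetical metadata: just enough to place the ROI back in the full grid.
    meta = {"shape": (len(image), len(image[0])), "origin": (r0, c0)}
    return roi, meta

def restore(roi, meta, fill=0):
    """Rebuild a full-size image with the ROI at its origin and `fill` elsewhere."""
    height, width = meta["shape"]
    r0, c0 = meta["origin"]
    out = [[fill] * width for _ in range(height)]
    for i, row in enumerate(roi):
        out[r0 + i][c0:c0 + len(row)] = row
    return out

# Toy slice: zero background around a small block of tissue intensities.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 7, 9, 0, 0],
    [0, 0, 8, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
roi, meta = crop_roi(image, threshold=0)
restored = restore(roi, meta)
```

In this toy case only 4 of the 24 pixels reach the codec. Restoration is exact here only because the background is already zero; with a noisy background, the pixels outside the box are replaced by the fill value, which is consistent with the abstract measuring reconstruction quality within the ROI.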

Keywords: medical image compression, region of interest, codec-agnostic, MRI, compression ratio, encoding time, decoding time, reconstruction quality

215. ❌ Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

Authors: Shizhan Gong, Minda Hu, Qiyuan Zhang, Chen Ma, Qi Dou Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04500v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

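Scoring rationale: The paper studies the interpretability and faithfulness of vision-language models (VLMs) and proposes the Saliency-R1 framework. It is highly related to "Hallucination Mitigation" and "Mechanistic Interpretability" (core content, 10 points), because it directly addresses hallucination, factuality, and interpretability; to "Instruction Tuning/Alignment" and "RLHF/DPO" (8 points), because it uses GRPO for alignment optimization; and to "Chain of Thought Reasoning" (8 points), because it traces the reasoning process visually. "Large Language Models" and "Post-training/SFT" receive 5 points, because VLMs fall under large models and the work involves post-training optimization; "System 2 Thinking" and "Self-Correction" receive 5 points, because the work touches on in-depth reasoning and self-improvement. The remaining keywords are unrelated or only marginally related to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address vision-language models' over-reliance on textual cues and their ungrounded or fabricated responses during reasoning, the paper proposes the Saliency-R1 framework, which uses a novel saliency map technique and GRPO optimization to improve interpretability, faithfulness, and overall task performance.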

Abstract

Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient parts and critical regions, encouraging models to focus on relevant areas when conducting reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
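
The reward described above, an overlap between a saliency map and an annotated box, followed by group-relative normalization, can be sketched as follows. This is an illustration under assumptions rather than the paper's definition: the abstract does not state which overlap measure is used, so the sketch assumes intersection-over-union on a thresholded map, and the names `overlap_reward` and `group_relative_advantages` as well as the 0.5 threshold are hypothetical. The advantage formula is the usual GRPO normalization of rewards within one group of sampled responses.

```python
import math

def overlap_reward(saliency, box, threshold=0.5):
    """IoU between the thresholded saliency mask and a box (r0, c0, r1, c1).

    The box uses end-exclusive row and column bounds and is assumed to lie
    inside the saliency map.
    """
    r0, c0, r1, c1 = box
    intersection = 0
    mask_area = 0
    for r, row in enumerate(saliency):
        for c, value in enumerate(row):
            if value >= threshold:
                mask_area += 1
                if r0 <= r < r1 and c0 <= c < c1:
                    intersection += 1
    box_area = (r1 - r0) * (c1 - c0)
    union = mask_area + box_area - intersection
    return intersection / union if union else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Toy 3x3 saliency map with three salient pixels, all inside a 2x2 box.
saliency = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.0],
    [0.1, 0.0, 0.0],
]
reward = overlap_reward(saliency, box=(0, 0, 2, 2))  # 3 / 4 = 0.75

# Hypothetical rewards for four sampled responses to the same prompt.
advantages = group_relative_advantages([reward, 0.25, 0.5, 0.0])
```

Responses whose saliency falls on the annotated region receive positive advantages relative to the rest of their group, which is the mechanism the abstract describes for encouraging the model to attend to relevant areas.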

Keywords: Vision-language models, Interpretability, Faithfulness, Saliency map, Grounded reasoning, Group Relative Policy Optimization, Hallucination mitigation, Visual evidence

216. ❌ The Indra Representation Hypothesis for Multimodal Alignment

Authors: Jianglin Lu, Hailing Wang, Kuo Yang, Yitian Zhang, Simon Jenni, Yun Fu Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04496v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

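Scoring rationale: The paper proposes the Indra Representation Hypothesis and studies the alignment of representations from unimodal foundation models. It is highly related to "Foundation Models" and "Alignment" (8 and 10 points respectively), because its core concern is the convergence and alignment of foundation-model representations. It is somewhat related to "Pre-training" (5 points), because it involves the representations foundation models learn during training, and to "Explainable AI" (5 points), because it explains representation convergence from a philosophical and theoretical perspective. Other keywords such as MoE, SFT, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Indra Representation Hypothesis, formalizes the representation convergence of unimodal foundation models using category theory, and shows experimentally that Indra representations improve robustness and alignment across models and modalities without additional training.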

Abstract

Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra’s Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra’s Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.
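
The phrase "a relational profile of each sample with respect to others", instantiated with angular distance, can be illustrated with a small sketch. It rests on assumptions and is not the released Indra code: the function names are hypothetical, and any normalization or anchor-set selection the paper may apply is omitted. The toy "second model" is simply the first model's embeddings rotated and rescaled, to show why relational profiles can agree even when raw embeddings do not.

```python
import math

def angular_distance(u, v):
    """Angle in radians between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)

def relational_profiles(embeddings):
    """Represent each sample by its angular distances to every sample."""
    return [[angular_distance(u, v) for v in embeddings] for u in embeddings]

def rotate2d(point, theta):
    """Rotate a 2D point counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * point[0] - s * point[1], s * point[0] + c * point[1]]

# Hypothetical embeddings of the same three samples from two "models":
# model_b is model_a rescaled by 3 and rotated by 0.7 rad.
model_a = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
model_b = [rotate2d([3.0 * x for x in p], 0.7) for p in model_a]

profiles_a = relational_profiles(model_a)
profiles_b = relational_profiles(model_b)

# Largest entrywise difference between the two sets of profiles.
max_gap = max(abs(a - b)
              for row_a, row_b in zip(profiles_a, profiles_b)
              for a, b in zip(row_a, row_b))
```

The raw coordinates of `model_a` and `model_b` differ in every entry, yet their relational profiles match up to floating-point error, because angles between samples are unchanged by rotation and positive scaling. Real models differ by more than such transforms, so this only illustrates the intuition behind comparing samples through their relations.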

Keywords: Indra Representation Hypothesis, unimodal foundation models, multimodal alignment, relational structure, category theory, training-free alignment, cross-modal scenarios, representation convergence

217. ❌ A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models

Authors: Tianmeng Fang, Yong Wang, Zetai Kong, Zengzhen Su, Jun Wang, Chengjin Yu, Wei Wang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04488v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

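Scoring rationale: The paper focuses on backdoor defense for multimodal large language models (MLLMs), centering on the security of large language models (LLMs) during supervised fine-tuning (SFT). It is therefore highly related to "Large Language Models" and "Post-training/Supervised Fine-tuning" (10 points). The paper does not touch on the other keywords, such as MoE, SLMs, Scaling Laws, RLHF, RAG, reasoning methods, agents, compression, or AI for Science applications, so they score 0.

!!! tip deepseek-chat TL;DR

To address the vulnerability of multimodal large language models to backdoor attacks during supervised fine-tuning, the paper proposes a defense framework based on patch augmentation and cross-view regularization, which effectively lowers the attack success rate while preserving the model's normal text generation ability.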

Abstract

Multimodal large language models have become an important infrastructure for unified processing of visual and linguistic tasks. However, such models are highly susceptible to backdoor implantation during supervised fine-tuning and will steadily output the attacker’s predefined harmful responses once a specific trigger pattern is activated. The core challenge of backdoor defense lies in suppressing attack success under low poisoning ratios while preserving the model’s normal generation ability. These two objectives are inherently conflicting. Strong suppression often degrades benign performance, whereas weak regularization fails to mitigate backdoor behaviors. To this end, we propose a unified defense framework based on patch augmentation and cross-view regularity, which simultaneously constrains the model’s anomalous behaviors in response to triggered patterns from both the feature representation and output distribution levels. Specifically, patch-level data augmentation is combined with cross-view output difference regularization to exploit the fact that backdoor responses are abnormally invariant to non-semantic perturbations and to proactively pull apart the output distributions of the original and perturbed views, thereby significantly suppressing the success rate of backdoor triggering. At the same time, we avoid over-suppression of the model during defense by imposing output entropy constraints, ensuring the quality of normal command generation. Experimental results across three models, two tasks, and six attacks show that our proposed defense method effectively reduces the attack success rate while maintaining a high level of normal text generation capability. Our work enables the secure, controlled deployment of large-scale multimodal models in realistic low-frequency poisoning and covert triggering scenarios.

Keywords: Multimodal Large Language Models, Backdoor Defense, Supervised Fine-tuning, Patch Augmentation, Cross-view Regularization, Attack Success Rate, Text Generation

Authors: Junyoung Park, Youngjin Oh, Nam Ik Cho Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04484v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

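Scoring rationale: The paper focuses on self-supervised image denoising in computer vision, proposing a new blind-spot network architecture (TM-BSN) to handle spatially correlated noise in real-world sRGB images. It involves convolutional network architecture design, knowledge distillation, and image processing, but does not touch on large language models, innovations in deep learning techniques, or any of the large-model topics among the scoring keywords (pre-training, fine-tuning, alignment, inference acceleration, and so on). All keywords concern large models, deep learning techniques, or AI for Science applications, whereas this paper is purely computer vision image processing research, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Triangular-Masked Blind-Spot Network (TM-BSN), which uses a triangular-masked convolution to accurately model the spatial correlation of real sRGB noise. It addresses the failure of existing blind-spot networks on real-world image denoising caused by spatially correlated noise, and achieves state-of-the-art performance on multiple real-world benchmarks.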

摘要 (Abstract)

Blind-spot networks (BSNs) enable self-supervised image denoising by preventing access to the target pixel, allowing clean signal estimation without ground-truth supervision. However, this approach assumes pixel-wise noise independence, which is violated in real-world sRGB images due to spatially correlated noise from the camera’s image signal processing (ISP) pipeline. While several methods employ downsampling to decorrelate noise, they alter noise statistics and limit the network’s ability to utilize full contextual information. In this paper, we propose the Triangular-Masked Blind-Spot Network (TM-BSN), a novel blind-spot architecture that accurately models the spatial correlation of real sRGB noise. This correlation originates from demosaicing, where each pixel is reconstructed from neighboring samples with spatially decaying weights, resulting in a diamond-shaped pattern. To align the receptive field with this geometry, we introduce a triangular-masked convolution that restricts the kernel to its upper-triangular region, creating a diamond-shaped blind spot at the original resolution. This design excludes correlated pixels while fully leveraging uncorrelated context, eliminating the need for downsampling or post-processing. Furthermore, we use knowledge distillation to transfer complementary knowledge from multiple blind-spot predictions into a lightweight U-Net, improving both accuracy and efficiency. Extensive experiments on real-world benchmarks demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing self-supervised approaches. Our code is available at https://github.com/parkjun210/TM-BSN.

Keywords: self-supervised image denoising, blind-spot network, spatial noise correlation, triangular-masked convolution, real-world sRGB images, knowledge distillation, U-Net, demosaicing
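The abstract describes a triangular-masked convolution that restricts the kernel to its upper-triangular region. The following is a minimal NumPy sketch of that masking idea only, not the authors' implementation. The function names are hypothetical, and keeping the diagonal inside the mask is an assumption, since the abstract does not say how the diagonal is treated or how the diamond-shaped blind spot is assembled from masked kernels.

```python
import numpy as np


def upper_triangular_mask(kernel_size: int) -> np.ndarray:
    """Return a k x k 0/1 mask keeping the upper-triangular taps.

    Assumption: the diagonal is kept; the abstract does not specify this.
    """
    return np.triu(np.ones((kernel_size, kernel_size), dtype=np.float32))


def masked_correlate2d(image: np.ndarray, kernel: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Valid-mode 2D cross-correlation using only the unmasked kernel taps."""
    k = kernel.shape[0]
    masked_kernel = kernel * mask  # zero out the lower-triangular taps
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * masked_kernel)
    return out


if __name__ == "__main__":
    mask = upper_triangular_mask(3)
    print(mask)
    print(masked_correlate2d(np.ones((5, 5)), np.ones((3, 3)), mask))
```

In a trainable layer the same mask would typically be multiplied into the weights on every forward pass so that the masked taps never receive gradient; that detail is likewise an assumption rather than something the abstract states.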

219. ❌ MVis-Fold: A Three-Dimensional Microvascular Structure Inference Model for Super-Resolution Ultrasound

Authors: Jincao Yao, Ke Zhang, Yahan Zhou, Jiafei Shen, Jie Liu, Mudassar Ali, Bojian Feng, Jiye Chen, Jinlong Fan, Ping Liang, Dong Xu Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04477v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

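Scoring rationale: The paper addresses medical image processing, specifically 3D microvascular reconstruction from super-resolution ultrasound (SRUS) with the deep learning model MVis-Fold. The keywords target large models, deep learning techniques, or AI applications in science, and the paper has no direct connection to most of them (LLMs, MoE, SFT, RAG, CoT, quantization, and so on). The only partial match is "AI for Science OR Bioinformatics OR Cheminformatics", since the work applies AI to biomedical imaging. The paper does not mention bioinformatics or cheminformatics, and its core is image reconstruction rather than typical bioinformatics analysis, so this keyword receives a relevance of 5 (some relation). All other keywords are unrelated and receive 0.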
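To make the arithmetic behind the score line concrete, here is a small sketch. The pass threshold follows the formula stated in the report's configuration (5.0 + 0.8 × total weight, which gives 26.6 for a total weight of 27.0). The per-keyword formula in `keyword_score` is an assumption for illustration only: the report does not state how weight, relevance, and the per-keyword maximum of 15 combine. Both function names are hypothetical.

```python
def pass_threshold(total_weight: float) -> float:
    """Pass threshold as stated in the report configuration."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float, max_per_keyword: float = 15.0) -> float:
    """Hypothetical per-keyword score: weight x (relevance / 10) x per-keyword maximum.

    Assumption: the report does not document this formula. It is chosen only
    because it reproduces the pattern in the table, where a weight of 0.0
    yields a score of 0.0 regardless of relevance.
    """
    return weight * (relevance / 10.0) * max_per_keyword


if __name__ == "__main__":
    print(round(pass_threshold(27.0), 1))
    print(keyword_score(0.0, 5.0))
```

Under this assumed formula, a relevance of 5.0/10 in a row with weight 0.0 contributes nothing, which matches the 0.0 total shown for this paper.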

!!! tip deepseek-chat TL;DR

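The paper tackles 3D microvascular reconstruction from 2D super-resolution ultrasound images. Its MVis-Fold model infers 3D microvascular networks with high fidelity and computes key 3D parameters, offering a new tool for disease diagnosis and monitoring.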

摘要翻译

超分辨率超声（SRUS，Super-Resolution Ultrasound）技术克服了传统超声的分辨率限制，实现了对微血管系统的微米级成像。然而，由于成像原理的特性，基于SRUS的微血管三维重建仍是一个开放的挑战。我们开发了微血管可视化折叠（MVis-Fold），这是一种创新的三维微血管重建模型，它融合了跨尺度网络架构。该模型能够从二维SRUS图像中，对三维微血管网络进行高保真度的推断与重建。它能精确计算传统二维SRUS难以轻易获取的三维空间关键参数。我们在实体肿瘤的三维微血管重建中验证了该模型的准确性与可靠性。本研究为微血管系统的三维定量分析奠定了基础，并为多种疾病的诊断与监测提供了新的工具与方法。

摘要 (Abstract)

Super-resolution ultrasound (SRUS) technology has overcome the resolution limitations of conventional ultrasound, enabling micrometer-scale imaging of microvasculature. However, due to the nature of imaging principles, three-dimensional reconstruction of microvasculature from SRUS remains an open challenge. We developed microvascular visualization fold (MVis-Fold), an innovative three-dimensional microvascular reconstruction model that integrates a cross-scale network architecture. This model can perform high-fidelity inference and reconstruction of three-dimensional microvascular networks from two-dimensional SRUS images. It precisely calculates key parameters in three-dimensional space that traditional two-dimensional SRUS cannot readily obtain. We validated the model’s accuracy and reliability in three-dimensional microvascular reconstruction of solid tumors. This study establishes a foundation for three-dimensional quantitative analysis of microvasculature. It provides new tools and methods for diagnosis and monitoring of various diseases.

Keywords: super-resolution ultrasound, microvascular reconstruction, three-dimensional inference, cross-scale network, solid tumor imaging, quantitative analysis, medical imaging, deep learning model

220. ❌ Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model’s Robustness to Natural Semantic Variation Across Diverse Tasks

Authors: Jia Chengyu, AprilPyone MaungMaung, Huy H. Nguyen, Jinyin Chen, Isao Echizen Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04473v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

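Scoring rationale: The paper systematically evaluates vision-language models (VLMs) under natural adversarial scenarios, analyzing the robustness of CLIP, BLIP2, SigLIP2, and other models on image classification, semantic segmentation, and visual question answering. The keywords target large language models (LLMs) or deep learning techniques, whereas the paper's subject is multimodal vision-language models, which has no direct connection to keywords about text-only LLMs or technical innovation. The paper does not cover scientific applications of large models, and it uses none of the specific methods named in the keywords (MoE, RLHF, RAG, and so on).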

!!! tip deepseek-chat TL;DR

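The paper systematically evaluates the robustness of vision-language models under natural adversarial scenarios. It finds that some robust CLIP models can amplify adversarial vulnerabilities, and that CLIP models lose significant performance on natural language-induced adversarial examples.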

摘要翻译

近期，基于网络规模图文对训练的视觉语言模型取得了显著进展，使其能够在多种视觉任务中实现令人印象深刻的零样本迁移。然而，要深入理解这些模型的鲁棒性、局限性和实际适用性，仅依靠标准基准测试是不够的，必须进行更全面且独立的评估。本文针对先前评估工作中被忽视的自然对抗场景，提出了一个面向多样化下游任务的视觉语言模型系统化评估框架。我们在精心构建的对抗数据集（包括排版攻击、ImageNet-A以及自然语言诱导的对抗样本）上，对一系列视觉语言模型（如CLIP、鲁棒CLIP、BLIP2和SigLIP2）进行了评估。我们测量了所选模型在零样本图像分类、语义分割和视觉问答任务中的自然对抗性能。分析表明，鲁棒CLIP模型可能放大自然对抗性漏洞，而CLIP模型在自然语言诱导的对抗样本上性能显著下降。此外，我们提供了可解释性分析以识别其失效模式。我们希望这些发现能启发未来在鲁棒且公平的多模态模式识别领域的研究。

摘要 (Abstract)

Recent advances in vision-language models (VLMs) trained on web-scale image-text pairs have enabled impressive zero-shot transfer across a diverse range of visual tasks. However, comprehensive and independent evaluation beyond standard benchmarks is essential to understand their robustness, limitations, and real-world applicability. This paper presents a systematic evaluation framework for VLMs under natural adversarial scenarios for diverse downstream tasks, which has been overlooked in previous evaluation works. We evaluate a wide range of VLMs (CLIP, robust CLIP, BLIP2, and SigLIP2) on curated adversarial datasets (typographic attacks, ImageNet-A, and natural language-induced adversarial examples). We measure the natural adversarial performance of selected VLMs for zero-shot image classification, semantic segmentation, and visual question answering. Our analysis reveals that robust CLIP models can amplify natural adversarial vulnerabilities, and CLIP models significantly reduce performance for natural language-induced adversarial examples. Additionally, we provide interpretable analyses to identify failure modes. We hope our findings inspire future research in robust and fair multimodal pattern recognition.

Keywords: vision-language models, robustness evaluation, adversarial scenarios, zero-shot transfer, multimodal pattern recognition, natural adversarial examples, interpretable analysis, failure modes

221. ❌ Group-DINOmics: Incorporating People Dynamics into DINO for Self-supervised Group Activity Feature Learning

Authors: Ryuki Tezuka, Chihiro Nakatani, Norimichi Ukita Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04467v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

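Scoring rationale: The paper studies group activity feature learning in computer vision, using DINO for self-supervised learning with a focus on people dynamics and group-aware tasks. The keywords target large language models, deep learning techniques, or AI applications in science. The paper concentrates on visual feature learning and covers no large models, no technical innovation of the kind the keywords describe, and no scientific applications such as biomedicine, so every keyword receives a relevance of 0.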

!!! tip deepseek-chat TL;DR

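The paper proposes a self-supervised method for learning group activity features without group activity annotations. Using pretext tasks such as person flow estimation and group-relevant object location estimation, it achieves state-of-the-art group activity retrieval and recognition on public datasets.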

摘要翻译

本文提出了一种无需群体活动标注的群体活动特征学习方法。与以往使用低层次静态局部特征学习群体活动特征的研究不同，我们提出利用动态感知与群体感知的预训练任务，并结合DINO提供的局部与全局特征，进行群体动态感知的群体活动特征学习。为使DINO及群体活动特征学习适应局部动态与全局群体特征，我们的预训练任务分别采用了人员流估计和群体相关物体位置估计。人员流估计用于表征每个个体的局部运动，这是理解群体活动的重要线索。相比之下，群体相关物体位置估计则促使群体活动特征学习场景上下文（例如人与物体的空间关系）作为全局特征。在公开数据集上的综合实验表明，我们的方法在群体活动检索与识别任务中达到了最先进的性能。消融研究验证了我们方法中各组件的有效性。代码：https://github.com/tezuka0001/Group-DINOmics。

摘要 (Abstract)

This paper proposes Group Activity Feature (GAF) learning without group activity annotations. Unlike prior work, which uses low-level static local features to learn GAFs, we propose leveraging dynamics-aware and group-aware pretext tasks, along with local and global features provided by DINO, for group-dynamics-aware GAF learning. To adapt DINO and GAF learning to local dynamics and global group features, our pretext tasks use person flow estimation and group-relevant object location estimation, respectively. Person flow estimation is used to represent the local motion of each person, which is an important cue for understanding group activities. In contrast, group-relevant object location estimation encourages GAFs to learn scene context (e.g., spatial relations of people and objects) as global features. Comprehensive experiments on public datasets demonstrate the state-of-the-art performance of our method in group activity retrieval and recognition. Our ablation studies verify the effectiveness of each component in our method. Code: https://github.com/tezuka0001/Group-DINOmics.

Keywords: Group Activity Feature Learning, Self-supervised Learning, DINO, Person Flow Estimation, Group-aware Pretext Tasks, Group Activity Recognition, Group Activity Retrieval, Scene Context Learning

222. ❌ Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse

Authors: Hao Liu, Ye Huang, Chenghuan Huang, Zhenyi Zheng, Jiangsu Du, Ziyang Ma, Jing Lyu, Yutong Lu Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04451v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

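Scoring rationale: The paper addresses inference acceleration for video diffusion Transformer models, improving serving efficiency through inter-request cache reuse. Most keywords are unrelated because they concern LLM techniques, training methods, alignment, inference optimization, and application scenarios, whereas this work accelerates inference for video generation models and belongs to computer vision and systems optimization. The only relevant keyword is "Speculative Decoding OR Inference Acceleration", because the core contribution is faster inference for video diffusion models. Since the target is not a large language model, the keyword receives a relevance of 8 (some relation, but not core).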

!!! tip deepseek-chat TL;DR

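The paper proposes Chorus, an inter-request cache reuse method that exploits feature similarity between similar requests to accelerate serving of video diffusion Transformer models. It achieves up to 45% speedup on industrial 4-step distilled models.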

摘要翻译

视频扩散变换器（Video Diffusion Transformer，简称DiT）模型是实现高质量视频生成的主流方法，但其迭代去噪过程导致推理成本高昂。现有的缓存方法主要利用单个请求在扩散过程中的内部相似性来跳过冗余的去噪步骤。本文提出Chorus，一种通过跨请求相似性来加速视频扩散模型服务的缓存方法。在工业级4步蒸馏模型上，Chorus实现了高达45%的加速效果，而此前基于请求内缓存的方法在此类场景中收效甚微。具体而言，Chorus沿去噪过程采用三阶段缓存策略：第一阶段对相似请求的潜在特征进行完全复用；第二阶段在中间去噪步骤中针对特定潜在区域实施跨请求缓存。该阶段结合了令牌引导注意力增强技术，以提升生成视频与条件提示之间的语义对齐度，从而将完全复用的适用性扩展至后续去噪步骤。

摘要 (Abstract)

Video Diffusion Transformer (DiT) models are a dominant approach for high-quality video generation but suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests to accelerate video diffusion model serving. Chorus achieves up to 45% speedup on industrial 4-step distilled models, where prior intra-request caching approaches are ineffective. Particularly, Chorus employs a three-stage caching strategy along the denoising process. Stage 1 performs full reuse of latent features from similar requests. Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps. This stage is combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts, thereby extending the applicability of full reuse to later denoising steps.

Keywords: Video Diffusion Transformer, inference acceleration, caching strategy, inter-request similarity, denoising process, model serving, latent feature reuse, Token-Guided Attention Amplification
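The abstract says Chorus reuses latent features from similar earlier requests. The sketch below illustrates only the general idea of an inter-request cache lookup, not the Chorus system itself. Measuring similarity as cosine similarity between prompt embeddings, the 0.9 threshold, and the class and method names are all assumptions; the abstract does not say how similar requests are identified.

```python
import numpy as np


class InterRequestCache:
    """Toy cache mapping request embeddings to previously computed latents.

    Assumptions: requests are compared by cosine similarity of non-zero
    embedding vectors, and a fixed threshold decides whether to reuse.
    """

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.keys = []      # unit-normalised embeddings of past requests
        self.latents = []   # cached latent features, one per past request

    def add(self, embedding, latent) -> None:
        e = np.asarray(embedding, dtype=np.float64)
        self.keys.append(e / np.linalg.norm(e))
        self.latents.append(latent)

    def lookup(self, embedding):
        """Return the cached latent of the most similar request, or None."""
        if not self.keys:
            return None
        q = np.asarray(embedding, dtype=np.float64)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.keys) @ q  # cosine similarity to every cached key
        best = int(np.argmax(sims))
        return self.latents[best] if sims[best] >= self.threshold else None


if __name__ == "__main__":
    cache = InterRequestCache(threshold=0.9)
    cache.add([1.0, 0.0], "latent-A")
    print(cache.lookup([0.99, 0.1]))  # close to the cached request
    print(cache.lookup([0.0, 1.0]))   # orthogonal to the cached request
```

A miss (`None`) would fall back to running the denoising steps from scratch; how the real system handles partial reuse in its later stages is not covered here.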

223. ❌ Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection

Authors: Weihao Cao, Runqi Wang, Xiaoyue Duan, Jinchao Zhang, Ang Yang, Liping Jing Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04444v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

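Scoring rationale: The paper addresses open-vocabulary object detection (OVOD) in computer vision, proposing the parameter-efficient semantic augmentation framework HSA-DINO. Its core contribution is a parameter-efficient fine-tuning (PEFT) approach that uses a multi-scale prompt bank and a semantic-aware router to adapt to new domains while preserving the generalization ability of the pre-trained model. It is highly relevant to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10), because the title and abstract explicitly emphasize a parameter-efficient method. It is also related to "Pre-training OR Continual Pre-training OR Domain Adaptation" (8), because it deals with domain adaptation of a pre-trained model. The remaining keywords concern large language model (LLM) techniques, reasoning methods, agent systems, and scientific AI applications, which are unrelated to this computer vision paper, so they receive 0.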

!!! tip deepseek-chat TL;DR

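The paper targets the performance drop of open-vocabulary object detection under domain shift. It proposes HSA-DINO, a parameter-efficient semantic augmentation framework whose multi-scale prompt bank and semantic-aware router achieve a better balance between domain adaptability and open-vocabulary generalization.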

摘要翻译

开放词汇目标检测（Open-vocabulary Object Detection, OVOD）使模型能够检测任意物体类别，包括未见过的类别。得益于大规模预训练，现有OVOD方法在通用场景（如OV-COCO）上取得了较强的检测性能，但在迁移至存在显著领域偏移的下游任务时，性能会严重下降。这种性能退化源于领域特定任务中类别标签的稀缺性与语义薄弱性，以及现有模型难以捕获粗粒度类别标签之外的辅助语义。为解决这些问题，我们提出了HSA-DINO，一种参数高效的语义增强框架，用于提升开放词汇目标检测性能。具体而言，我们设计了一个多尺度提示库，利用图像特征金字塔捕获层次化语义并选择领域特定的局部语义提示，从而从粗粒度到细粒度逐步丰富文本表示。此外，我们引入了一个语义感知路由器，在推理过程中动态选择合适的语义增强策略，从而避免参数更新损害预训练OVOD模型的泛化能力。我们在OV-COCO、多个垂直领域数据集以及修改的基准设置上评估了HSA-DINO。实验结果表明，HSA-DINO相较于以往先进方法具有优越性能，在领域适应性与开放词汇泛化能力之间实现了更佳的平衡。

摘要 (Abstract)

Open-vocabulary object detection (OVOD) enables models to detect any object category, including unseen ones. Benefiting from large-scale pre-training, existing OVOD methods achieve strong detection performance on general scenarios (e.g., OV-COCO) but suffer severe performance drops when transferred to downstream tasks with substantial domain shifts. This degradation stems from the scarcity and weak semantics of category labels in domain-specific task, as well as the inability of existing models to capture auxiliary semantics beyond coarse-grained category label. To address these issues, we propose HSA-DINO, a parameter-efficient semantic augmentation framework for enhancing open-vocabulary object detection. Specifically, we propose a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels. Furthermore, we introduce a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy during inference, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. We evaluate HSA-DINO on OV-COCO, several vertical domain datasets, and modified benchmark settings. The results show that HSA-DINO performs favorably against previous state-of-the-art methods, achieving a superior trade-off between domain adaptability and open-vocabulary generalization.

Keywords: open-vocabulary object detection, parameter-efficient, semantic augmentation, domain adaptation, multi-scale prompt bank, semantic-aware router, HSA-DINO, generalization

224. ❌ Estimating Central, Peripheral, and Temporal Visual Contributions to Human Decision Making in Atari Games

Authors: Henrik Krauss, Takehisa Yairi Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04439v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

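Scoring rationale: The paper studies human decision making in Atari games, using eye-tracking and behavioral analysis to quantify the contributions of different visual information sources (peripheral vision, gaze information, and past states). Its core is cognitive science and human behavior analysis, and it uses action-prediction networks as an analysis tool rather than studying large models or the techniques named in the keywords. The keywords target large models, deep learning techniques, or specific AI application areas such as bioinformatics, none of which the paper covers, so every keyword receives a relevance of 0.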

!!! tip deepseek-chat TL;DR

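By analyzing eye-tracking data from Atari gameplay, the paper quantifies how peripheral vision, gaze information, and past-state information contribute to human decisions. Peripheral visual information contributes the most, and the proposed framework can estimate the contributions of different information sources from behavior.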

摘要翻译

本研究探讨在动态视觉环境中，不同视觉信息来源对人类决策的贡献。利用Atari-HEAD（一个包含同步眼动追踪的大规模Atari游戏数据集），我们引入了一种受控消融框架，以逆向解析来自人类行为的周边视觉信息、以注视热图形式呈现的显性注视信息以及历史状态信息的贡献。我们在六种实验设置下训练了动作预测网络，这些设置选择性地包含或排除上述信息来源。在20款游戏中，周边信息显示出迄今为止最强的贡献度，当其被移除时，预测准确率中位数下降范围达35.27%-43.90%。注视信息导致的下降幅度较小，为2.11%-2.76%，而历史状态信息的影响范围较广（1.52%-15.51%），其较高区间的贡献可能更具信息量，因为此时周边信息泄露的影响已降低。为补充整体准确率分析，我们根据不同模型配置所分配的真实动作概率对游戏状态进行聚类。该分析识别出粗略的行为模式，包括注视主导型、周边主导型以及更具情境依赖性的决策情境。这些结果表明，人类在Atari游戏中的决策强烈依赖于当前注视焦点之外的信息，而所提出的框架为从行为中量化此类信息源的贡献提供了一种方法。

摘要 (Abstract)

We study how different visual information sources contribute to human decision making in dynamic visual environments. Using Atari-HEAD, a large-scale Atari gameplay dataset with synchronized eye-tracking, we introduce a controlled ablation framework as a means to reverse-engineer the contribution of peripheral visual information, explicit gaze information in form of gaze maps, and past-state information from human behavior. We train action-prediction networks under six settings that selectively include or exclude these information sources. Across 20 games, peripheral information shows by far the strongest contribution, with median prediction-accuracy drops in the range of 35.27-43.90% when removed. Gaze information yields smaller drops of 2.11-2.76%, while past-state information shows a broader range of 1.52-15.51%, with the upper end likely more informative due to reduced peripheral-information leakage. To complement aggregate accuracies, we cluster states by true-action probabilities assigned by the different model configurations. This analysis identifies coarse behavioral regimes, including focus-dominated, periphery-dominated, and more contextual decision situations. These results suggest that human decision making in Atari depends strongly on information beyond the current focus of gaze, while the proposed framework provides a way to estimate such information-source contributions from behavior.

Keywords: human decision making, visual information sources, eye-tracking, Atari games, peripheral vision, action-prediction networks, behavioral regimes, information-source contributions
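The abstract reports median prediction-accuracy drops across games when an information source is removed. As a rough illustration of that kind of summary, the sketch below computes a per-game drop and its median. The function name and the game names and accuracies are synthetic placeholders, not values from the paper, and treating the drop as a difference in percentage points is an assumption, since the abstract does not say whether the drops are absolute or relative.

```python
from statistics import median


def median_accuracy_drop(full_acc: dict, ablated_acc: dict) -> float:
    """Median over games of (accuracy with all inputs) - (accuracy with one input removed).

    Both dicts map a game name to an accuracy in percent.
    Assumption: the drop is measured in percentage points.
    """
    drops = [full_acc[game] - ablated_acc[game] for game in full_acc]
    return median(drops)


if __name__ == "__main__":
    # Synthetic numbers for illustration only; not taken from the paper.
    full = {"game_a": 80.0, "game_b": 70.0, "game_c": 90.0}
    no_periphery = {"game_a": 50.0, "game_b": 30.0, "game_c": 40.0}
    print(median_accuracy_drop(full, no_periphery))
```

Repeating the call once per ablation setting would give one median drop per removed information source, which is the shape of the comparison the abstract describes.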

225. ❌ HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance

Authors: Green Rosh, Prateek Kukreja, Vishakha SR, Pawan Prasad B H Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04425v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

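Scoring rationale: The paper addresses 3D hand model generation using diffusion models and the MANO hand model, and belongs to computer vision and graphics. The keywords target large language models (LLMs), and the paper involves neither LLMs nor innovation in the techniques the keywords name. It has only a weak link to "AI for Science" (it can be viewed as an AI application in virtual reality), so every keyword other than that one receives 0.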

!!! tip deepseek-chat TL;DR

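The paper proposes HandDreamer, which uses MANO hand model initialization and a hand-skeleton-guided diffusion process to resolve view inconsistency and geometric distortion in zero-shot text-to-3D hand model generation, outperforming existing methods.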

摘要翻译

虚拟现实的出现使得为虚拟世界交互生成精细且可定制的三维手部模型成为必要。然而，当前的三维手部模型生成方法成本高昂且流程繁琐，为用户提供的可定制性极为有限。尽管零样本文本到三维合成领域的最新进展，通过分数蒸馏采样技术，已能利用文本提示生成多样化且可定制的三维模型，但这些方法在三维手部模型生成上泛化能力不足，常导致手部结构不自然、视角不一致以及细节丢失等问题。为应对这些局限，我们提出了HandDreamer，这是首个基于文本提示实现零样本三维手部模型生成的方法。我们的研究发现，分数蒸馏采样中的视角不一致问题主要源于文本提示所描述的概率景观存在模糊性，导致相似视角收敛至分布的不同模式。由于手部在关节和姿态上存在巨大差异，这一问题尤为突出。为缓解此问题，我们提出采用基于MANO手部模型的初始化以及手部骨架引导的扩散过程，为手部结构提供强先验，并确保视角与姿态的一致性。此外，我们提出了一种新颖的校正性手部形状引导损失函数，以确保三维手部模型的所有视角都能收敛至视角一致的模式，同时避免几何失真。大量评估实验证明，我们的方法优于现有最先进技术，为三维手部模型生成开辟了新路径。

摘要 (Abstract)

The emergence of virtual reality has necessitated the generation of detailed and customizable 3D hand models for interaction in the virtual world. However, the current methods for 3D hand model generation are both expensive and cumbersome, offering very little customizability to the users. While recent advancements in zero-shot text-to-3D synthesis have enabled the generation of diverse and customizable 3D models using Score Distillation Sampling (SDS), they do not generalize very well to 3D hand model generation, resulting in unnatural hand structures, view-inconsistencies and loss of details. To address these limitations, we introduce HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. Our findings suggest that view-inconsistencies in SDS is primarily caused due to the ambiguity in the probability landscape described by the text prompt, resulting in similar views converging to different modes of the distribution. This is particularly aggravated for hands due to the large variations in articulations and poses. To alleviate this, we propose to use MANO hand model based initialization and a hand skeleton guided diffusion process to provide a strong prior for the hand structure and to ensure view and pose consistency. Further, we propose a novel corrective hand shape guidance loss to ensure that all the views of the 3D hand model converges to view-consistent modes, without leading to geometric distortions. Extensive evaluations demonstrate the superiority of our method over the state-of-the-art methods, paving a new way forward in 3D hand model generation.

Keywords: 3D hand model generation, zero-shot text-to-3D, Score Distillation Sampling, MANO hand model, view consistency, hand skeleton guidance, corrective shape guidance, virtual reality

226. ❌ BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing

Authors: Kaiwen Wang, Kaili Zheng, Rongrong Deng, Yiming Shi, Chenyi Guo, Ji Wu Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04419v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

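Scoring rationale: The paper studies the application of multimodal large language models (MLLMs) to boxing commentary generation, an application of large models to a specific domain (sports AI). It is highly relevant to the keyword 'Large Language Models OR LLMs OR Foundation Models' (8 points) because it explicitly names MLLMs and evaluates them. It is somewhat relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) because sports AI can be viewed as a branch of AI in scientific or applied domains, although the paper does not directly involve bioinformatics or cheminformatics. The remaining keywords, such as MoE, SFT, and RAG, are neither mentioned in nor related to the paper and are scored 0.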
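
The arithmetic behind the score table above can be sketched in a few lines. This is a minimal illustration, not the report generator's actual code. The pass threshold (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report's configuration section. The per-row formula in `row_score` is an assumption, because the report does not state how weight and relevance combine, and the names `KeywordRow`, `pass_threshold`, `row_score`, and `total_score` are hypothetical.

```python
from dataclasses import dataclass

# "Max score per keyword" from the report's configuration section.
MAX_SCORE_PER_KEYWORD = 15.0


@dataclass
class KeywordRow:
    keyword: str
    weight: float     # "Weight" column
    relevance: float  # "Relevance" column, on a 0-10 scale


def pass_threshold(total_weight: float) -> float:
    """Pass score as stated in the configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def row_score(row: KeywordRow) -> float:
    """ASSUMED formula: weight * (relevance / 10) * max score per keyword."""
    return row.weight * (row.relevance / 10.0) * MAX_SCORE_PER_KEYWORD


def total_score(rows) -> float:
    """Sum of the per-row scores."""
    return sum(row_score(r) for r in rows)


# The two rows with non-zero relevance in the table above; weights as recorded.
rows = [
    KeywordRow("Large Language Models OR LLMs OR Foundation Models", 0.0, 8.0),
    KeywordRow("AI for Science OR Bioinformatics OR Cheminformatics", 0.0, 5.0),
]
threshold = pass_threshold(27.0)  # 27 keywords, each with weight 1.0 in the config
score = total_score(rows)
verdict = "pass" if score >= threshold else "fail"
print(f"{score:.1f} / {threshold:.1f} ({verdict})")
```

Under this assumed formula, a recorded weight of 0.0 zeroes every row regardless of its relevance, which matches the 0.0 total shown for this paper.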

!!! tip deepseek-chat TL;DR

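The paper introduces the BoxComm dataset and evaluation framework for boxing commentary generation, finds that current multimodal large language models perform poorly on category-conditioned generation and commentary rhythm assessment, and obtains improvements with EIC-Gen, a baseline that incorporates punch event detection.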

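Abstract translation

Recently, multimodal large language models have shown strong capabilities in general video understanding, driving a surge of research on automatic sports commentary generation. However, existing benchmarks for this task focus only on team sports such as soccer and basketball and do not cover combat sports at all. Notably, combat sports pose distinct challenges: key actions unfold within milliseconds, with visual differences that are subtle yet semantically decisive, and professional commentary contains a markedly higher proportion of tactical analysis than in team sports. This paper presents BoxComm, a large-scale dataset of 445 World Boxing Championship match videos with more than 52K commentary sentences from professional broadcasts. We design a structured commentary taxonomy that classifies each sentence as play-by-play, tactical analysis, or contextual information, providing the first category-level annotation scheme for sports commentary benchmarks. Building on this taxonomy, we introduce two novel, complementary evaluations tailored to sports commentary generation: (1) category-conditioned generation, which evaluates whether a model can produce accurate commentary of a specified type given the video context; and (2) commentary rhythm assessment, which measures whether freely generated commentary shows appropriate temporal pacing and type distribution over continuous video segments, capturing a dimension of commentary competence that previous benchmarks did not cover. Experiments on several state-of-the-art multimodal large language models show that current models perform poorly on both evaluations. We further propose EIC-Gen, an improved baseline that incorporates detected punch events as structured action cues and yields consistent gains, underscoring the importance of perceiving fleeting, subtle actions for combat sports commentary.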

Abstract

Recent multimodal large language models (MLLMs) have shown strong capabilities in general video understanding, driving growing interest in automatic sports commentary generation. However, existing benchmarks for this task focus exclusively on team sports such as soccer and basketball, leaving combat sports entirely unexplored. Notably, combat sports present distinct challenges: critical actions unfold within milliseconds with visually subtle yet semantically decisive differences, and professional commentary contains a substantially higher proportion of tactical analysis compared to team sports. In this paper, we present BoxComm, a large-scale dataset comprising 445 World Boxing Championship match videos with over 52K commentary sentences from professional broadcasts. We propose a structured commentary taxonomy that categorizes each sentence into play-by-play, tactical, or contextual, providing the first category-level annotation for sports commentary benchmarks. Building on this taxonomy, we introduce two novel and complementary evaluations tailored to sports commentary generation: (1) category-conditioned generation, which evaluates whether models can produce accurate commentary of a specified type given video context; and (2) commentary rhythm assessment, which measures whether freely generated commentary exhibits appropriate temporal pacing and type distribution over continuous video segments, capturing a dimension of commentary competence that prior benchmarks have not addressed. Experiments on multiple state-of-the-art MLLMs reveal that current models struggle on both evaluations. We further propose EIC-Gen, an improved baseline incorporating detected punch events to supply structured action cues, yielding consistent gains and highlighting the importance of perceiving fleeting and subtle events for combat sports commentary.

Keywords: multimodal large language models, sports commentary generation, boxing, category-conditioned generation, commentary rhythm assessment, video understanding, combat sports, benchmark dataset

227. ❌ NAIMA: Semantics Aware RGB Guided Depth Super-Resolution

Authors: Tayyab Nasir, Daochang Liu, Ajmal Mian Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04407v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

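Scoring rationale: The paper focuses on depth map super-resolution, a computer vision task. It uses a pretrained vision transformer (DINOv2) to extract semantic priors and improves multi-modal fusion through the proposed NAIMA architecture. Although it uses a pretrained model, its core is a vision task rather than large language model technology. It is somewhat relevant only to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because it distills knowledge from a pretrained vision transformer. All other keywords concern large language models, reasoning, alignment, AI for Science applications, and similar topics unrelated to the paper, so they are scored 0.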

!!! tip deepseek-chat TL;DR

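The paper proposes NAIMA, a semantics-aware guided depth super-resolution method. By extracting global semantic priors from a pretrained vision transformer and designing a Guided Token Attention module, it addresses the blurred depth boundaries caused by misleading color and texture cues in RGB images, and it significantly outperforms existing methods across multiple datasets and scaling factors.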

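Abstract translation

Guided depth super-resolution (GDSR) is a multi-modal approach to depth map super-resolution that relies on a low-resolution depth map and a high-resolution RGB image to recover finer structural details. However, misleading color and texture cues that suggest depth discontinuities in the RGB image often lead to artifacts and blurred depth boundaries in the generated depth map. We propose a solution that introduces global contextual semantic priors generated from the token embeddings of a pretrained vision transformer. Our approach of distilling semantic knowledge from pretrained token embeddings is motivated by their demonstrated effectiveness in related monocular depth estimation tasks. We propose a Guided Token Attention (GTA) module that iteratively aligns encoded RGB spatial features with depth encodings through cross-attention and selectively injects global semantic context extracted from different layers of the pretrained vision transformer. In addition, we propose an architecture called Neural Attention for Implicit Multi-token Alignment (NAIMA), which combines the DINOv2 model with GTA modules to achieve semantics-aware GDSR. Thanks to its ability to distill semantic knowledge, the proposed architecture achieves significant improvements over existing methods across multiple scaling factors and datasets.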

Abstract

Guided depth super-resolution (GDSR) is a multi-modal approach for depth map super-resolution that relies on a low-resolution depth map and a high-resolution RGB image to restore finer structural details. However, the misleading color and texture cues indicating depth discontinuities in RGB images often lead to artifacts and blurred depth boundaries in the generated depth map. We propose a solution that introduces global contextual semantic priors, generated from pretrained vision transformer token embeddings. Our approach to distilling semantic knowledge from pretrained token embeddings is motivated by their demonstrated effectiveness in related monocular depth estimation tasks. We introduce a Guided Token Attention (GTA) module, which iteratively aligns encoded RGB spatial features with depth encodings, using cross-attention for selectively injecting global semantic context extracted from different layers of a pretrained vision transformer. Additionally, we present an architecture called Neural Attention for Implicit Multi-token Alignment (NAIMA), which integrates DINOv2 with GTA blocks for a semantics-aware GDSR. Our proposed architecture, with its ability to distill semantic knowledge, achieves significant improvements over existing methods across multiple scaling factors and datasets.

Keywords: Guided depth super-resolution, Semantic priors, Vision transformer, DINOv2, Cross-attention, Multi-modal fusion, Depth map enhancement, Neural architecture

228. ❌ 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

Authors: Ze-Xin Yin, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, Jin Xie Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04406v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

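Scoring rationale: The paper focuses on 3D scene generation and completion, a computer vision task, using diffusion models and geometry estimation methods. It does not involve large language models (LLMs), innovations in the principles of deep learning techniques, or specific AI for Science applications. All scoring keywords relate to large models, deep learning technique principles, or scientific AI applications, whereas this work belongs purely to 3D vision and is unrelated to all of them.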

!!! tip deepseek-chat TL;DR

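The paper proposes 3D-Fixer, a coarse-to-fine, in-place 3D scene completion method that works from a single image. It uses fragmented geometry as a spatial anchor to preserve layout fidelity, introduces the large-scale ARSG-110K dataset, and achieves state-of-the-art geometric accuracy.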

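Abstract translation

Generating compositional 3D scenes from a single view requires recovering the scene layout and the 3D assets at the same time. Existing methods fall mainly into two categories: feed-forward generation methods and per-instance generation methods. The former directly predict 3D assets with explicit 6DoF poses through efficient network inference but generalize poorly to complex scenes. The latter improve generalization through a divide-and-conquer strategy but suffer from time-consuming pose optimization. To bridge this gap, we propose 3D-Fixer, a novel in-place completion paradigm. Specifically, 3D-Fixer extends 3D object generative priors to generate complete 3D assets conditioned on partially visible point clouds at their original locations, which are cropped from the fragmented geometry produced by geometry estimation methods. Unlike prior work that requires explicit pose alignment, 3D-Fixer uses the fragmented geometry as a spatial anchor to preserve layout fidelity. At its core, we propose a coarse-to-fine generation scheme to resolve boundary ambiguity under occlusion, supported by a dual-branch conditioning network and an occlusion-robust feature alignment strategy for stable training. In addition, to address the data scarcity bottleneck, we present ARSG-110K, the largest scene-level dataset to date, containing more than 110K diverse scenes and 3M images annotated with high-fidelity 3D ground truth. Extensive experiments show that 3D-Fixer achieves state-of-the-art geometric accuracy while retaining the efficiency of the diffusion process, significantly outperforming baselines such as MIDI and Gen3DSR. Code and data will be made public at https://zx-yin.github.io/3dfixer.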

Abstract

Compositional 3D scene generation from a single view requires the simultaneous recovery of scene layout and 3D assets. Existing approaches mainly fall into two categories: feed-forward generation methods and per-instance generation methods. The former directly predict 3D assets with explicit 6DoF poses through efficient network inference, but they generalize poorly to complex scenes. The latter improve generalization through a divide-and-conquer strategy, but suffer from time-consuming pose optimization. To bridge this gap, we introduce 3D-Fixer, a novel in-place completion paradigm. Specifically, 3D-Fixer extends 3D object generative priors to generate complete 3D assets conditioned on the partially visible point cloud at the original locations, which are cropped from the fragmented geometry obtained from the geometry estimation methods. Unlike prior works that require explicit pose alignment, 3D-Fixer uses fragmented geometry as a spatial anchor to preserve layout fidelity. At its core, we propose a coarse-to-fine generation scheme to resolve boundary ambiguity under occlusion, supported by a dual-branch conditioning network and an Occlusion-Robust Feature Alignment (ORFA) strategy for stable training. Furthermore, to address the data scarcity bottleneck, we present ARSG-110K, the largest scene-level dataset to date, comprising over 110K diverse scenes and 3M annotated images with high-fidelity 3D ground truth. Extensive experiments show that 3D-Fixer achieves state-of-the-art geometric accuracy, which significantly outperforms baselines such as MIDI and Gen3DSR, while maintaining the efficiency of the diffusion process. Code and data will be publicly available at https://zx-yin.github.io/3dfixer.

Keywords: 3D scene completion, single image, in-place completion, coarse-to-fine generation, diffusion process, geometry estimation, occlusion handling, scene dataset

229. ❌ UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining

Authors: Pei Yang, Hai Ci, Beibei Lin, Yiren Song, Mike Zheng Shou Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04402v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

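Scoring rationale: The paper focuses on nighttime video deraining, a computer vision task, building a large-scale dataset through physical simulation and improving a video generation model. Although it is an AI application, all keywords concern specific areas such as large language models, deep learning technique principles, and AI for Science. The paper involves no large model techniques, language models, reasoning methods, alignment techniques, model optimization, or scientific AI applications, so the relevance of every keyword is 0.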

!!! tip deepseek-chat TL;DR

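The paper tackles the distinctive challenge that nighttime video deraining faces because raindrops interact with artificial lighting. By building the large-scale, physically simulated UENR-600K dataset and improving a video generation model, it significantly improves generalization to real-world videos.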

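Abstract translation

Nighttime video deraining is uniquely challenging because raindrops interact with artificial light sources. Unlike white daytime rain, nighttime raindrops take on multiple colors and show local illumination effects. Existing small-scale synthetic datasets rely on 2D rain overlays and cannot capture these physical properties, so models generalize poorly to real nighttime rain scenes. Meanwhile, collecting real paired nighttime videos remains impractical because rain effects cannot be separated from other degradations such as sensor noise. To close this gap, we present UENR-600K, a large-scale, physically grounded dataset containing 600,000 pairs of 1080p frames. We use Unreal Engine to simulate raindrops as 3D particles in virtual environments. This approach guarantees photorealistic, physically realistic raindrops and accurately captures details such as color refraction, scene occlusion, and rain curtains. With this high-quality data, we establish a new state-of-the-art baseline by adapting the Wan 2.2 video generation model. Our baseline treats deraining as a video-to-video generation problem and uses strong generative priors to almost completely close the sim-to-real gap. Extensive benchmarking shows that models trained on this dataset generalize significantly better to real-world videos. Project page: https://showlab.github.io/UENR-600K/.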

Abstract

Nighttime video deraining is uniquely challenging because raindrops interact with artificial lighting. Unlike daytime white rain, nighttime rain takes on various colors and appears locally illuminated. Existing small-scale synthetic datasets rely on 2D rain overlays and fail to capture these physical properties, causing models to generalize poorly to real-world night rain. Meanwhile, capturing real paired nighttime videos remains impractical because rain effects cannot be isolated from other degradations like sensor noise. To bridge this gap, we introduce UENR-600K, a large-scale, physically grounded dataset containing 600,000 1080p frame pairs. We utilize Unreal Engine to simulate rain as 3D particles within virtual environments. This approach guarantees photorealism and physically real raindrops, capturing correct details like color refractions, scene occlusions, rain curtains. Leveraging this high-quality data, we establish a new state-of-the-art baseline by adapting the Wan 2.2 video generation model. Our baseline treat deraining as a video-to-video generation task, exploiting strong generative priors to almost entirely bridge the sim-to-real gap. Extensive benchmarking demonstrates that models trained on our dataset generalize significantly better to real-world videos. Project page: https://showlab.github.io/UENR-600K/.

Keywords: nighttime video deraining, physically grounded dataset, Unreal Engine simulation, 3D rain particles, video-to-video generation, sim-to-real gap, photorealistic rain, state-of-the-art baseline

230. ❌ BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion

Authors: Tianzhi Jia, Kaixing Yang, Xiaole Yang, Xulong Tang, Ke Qiu, Shikui Wei, Yao Zhao Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04395v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

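Scoring rationale: The paper studies 3D conducting motion generation using a BiMamba-Transformer hybrid architecture and a diffusion model. It belongs to computer vision and graphics and is unrelated to all of the large model and deep learning technique keywords (such as LLMs, MoE, Scaling Laws, and RLHF), as well as to application-domain keywords such as AI for Science.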

!!! tip deepseek-chat TL;DR

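The paper addresses two challenges in 3D conducting motion generation: the lack of large-scale fine-grained datasets and the lack of efficient long-sequence generation methods. By building the CM-Data dataset and proposing the BiTDiff framework, it achieves state-of-the-art 3D conducting motion generation performance.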

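Abstract translation

3D conducting motion generation aims to synthesize fine-grained conducting motions from music and has broad application potential in music education, virtual performance, digital human animation, and human-AI co-creation. However, the task remains underexplored and faces two major challenges: (1) the lack of large-scale, fine-grained 3D conducting motion datasets; and (2) the lack of effective methods that support high-quality and efficient generation of long motion sequences at the same time. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and build CM-Data, a fine-grained SMPL-X dataset containing about 10 hours of conducting motion data. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a new framework for 3D conducting motion generation. The framework is built on a BiMamba-Transformer hybrid architecture for efficient long-sequence modeling and adopts a diffusion-based generative strategy combined with human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand-specific and body-specific forward-kinematics design to improve fine-grained motion modeling, while using BiMamba for memory-efficient long-sequence temporal modeling and the Transformer for cross-modal semantic alignment. In addition, BiTDiff supports training-free joint-level motion editing, which facilitates downstream human-AI interaction design. Extensive quantitative and qualitative experiments show that BiTDiff achieves state-of-the-art performance in 3D conducting motion generation on the CM-Data dataset. Code will be released upon acceptance of the paper.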

Abstract

3D conducting motion generation aims to synthesize fine-grained conductor motions from music, with broad potential in music education, virtual performance, digital human animation, and human-AI co-creation. However, this task remains underexplored due to two major challenges: (1) the lack of large-scale fine-grained 3D conducting datasets and (2) the absence of effective methods that can jointly support long-sequence generation with high quality and efficiency. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and construct CM-Data, a fine-grained SMPL-X dataset with about 10 hours of conducting motion data. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a novel framework for 3D conducting motion generation, built upon a BiMamba-Transformer hybrid model architecture for efficient long-sequence modeling and a Diffusion-based generative strategy with human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand-/body-specific forward-kinematics design for better fine-grained motion modeling, while leveraging BiMamba for memory-efficient long-sequence temporal modeling and Transformer for cross-modal semantic alignment. In addition, BiTDiff supports training-free joint-level motion editing, enabling downstream human-AI interaction design. Extensive quantitative and qualitative experiments demonstrate that BiTDiff achieves state-of-the-art (SOTA) performance for 3D conducting motion generation on the CM-Data dataset. Code will be available upon acceptance.

Keywords: 3D conducting motion generation, BiMamba-Transformer, Diffusion model, CM-Data dataset, SMPL-X, long-sequence modeling, human-kinematic decomposition, state-of-the-art performance

231. ❌ Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

Authors: Songyuan Yang, Weijiang Yu, Jilin Ma, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, Nong Xiao Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04379v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

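Scoring rationale: The paper proposes the RLER framework for video reasoning. Its core innovation is decoupling evidence generation from answer acquisition, optimizing the reasoning process with reinforcement learning, and introducing an evidence-weighted election mechanism. Keyword relevance: (1) Highly relevant (8-10 points): the paper builds large multimodal models (LMMs) on large language models (LLMs) and involves multi-step reasoning (Chain of Thought), in-depth reasoning (System 2 Thinking), self-improvement (Self-Correction), hallucination mitigation (Hallucination Mitigation), and explainable AI (Explainable AI), which are its core technical elements. (2) Irrelevant (0 points): the remaining keywords, such as MoE, quantization, RAG, and context extension, are not covered, and the paper does not mention AI applications in scientific domains.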

!!! tip deepseek-chat TL;DR

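To address the lack of evidence verification in the reasoning process of video reasoning models, the paper proposes RLER, a dual-paradigm framework that optimizes evidence generation with reinforcement learning and applies an evidence-weighted election mechanism. It achieves state-of-the-art performance on 8 benchmarks with an average improvement of 6.3% while remaining computationally efficient.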

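Abstract translation

Video reasoning has advanced with large multimodal models (LMMs), yet their inference is typically a single forward pass that returns an answer directly without verifying whether the reasoning is aligned with the evidence. We propose Reinforce to Learn, Elect to Reason (RLER), a two-stage paradigm that decouples learning to produce evidence from obtaining a reliable answer. In the RLER training stage, we optimize the policy with group-relative reinforcement learning (RL) and three novel task-driven rewards: a frame-sensitive reward grounds reasoning in explicit key frames; a think-transparency reward shapes readable, parsable reasoning traces; and an anti-repetition reward increases information density. These signals teach the model to output structured, machine-verifiable evidence and strengthen its reasoning ability. In the RLER inference stage, we apply a training-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against a range of open-source and RL-based LMMs on 8 representative benchmarks. RLER reaches the state of the art on all benchmarks, with an average improvement of 6.3% over the base models, while using only 3.1 candidates per question on average, indicating a good balance between compute cost and quality. The results support a simple thesis: producing evidence explicitly during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.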

Abstract

Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and 3 novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
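
The evidence-weighted election described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The abstract names four scoring criteria (evidence consistency, confidence, transparency, non-redundancy) but does not say how they are combined, so the unweighted mean in `candidate_weight` is an assumption, and the names `Candidate`, `candidate_weight`, and `elect` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    evidence_consistency: float  # each criterion assumed to lie in [0, 1]
    confidence: float
    transparency: float
    non_redundancy: float


def candidate_weight(c: Candidate) -> float:
    """ASSUMPTION: combine the four criteria with an unweighted mean."""
    return (c.evidence_consistency + c.confidence
            + c.transparency + c.non_redundancy) / 4.0


def elect(candidates) -> str:
    """Return the answer with the largest summed candidate weight."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    totals = defaultdict(float)
    for c in candidates:
        totals[c.answer] += candidate_weight(c)
    return max(totals, key=totals.get)


# Toy data: one well-supported answer against two weakly supported votes.
candidates = [
    Candidate("A", 0.9, 0.8, 0.9, 0.8),
    Candidate("B", 0.2, 0.6, 0.3, 0.4),
    Candidate("B", 0.1, 0.5, 0.3, 0.3),
]
winner = elect(candidates)
print(winner)
```

In this toy data the single well-supported answer outweighs two weakly supported votes, which is the behavior that distinguishes a weighted election from a plain majority vote.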

Keywords: video reasoning, large multimodal models, reinforcement learning, evidence alignment, reasoning traces, interpretability, state-of-the-art, trustworthy AI

232. ❌ Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

Authors: Songyuan Yang, Weijiang Yu, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, Nong Xiao Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04372v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

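Scoring rationale: The paper proposes Graph-to-Frame RAG (G2F-RAG), a retrieval-augmented generation method for video reasoning. Its core innovation is fusing external knowledge in visual space (as a reasoning frame) rather than in the traditional text or clip form. The method is highly relevant to RAG, LLM Agents, and Multi-agent Systems (8-10 points), because the system uses a multi-agent controller for knowledge retrieval and rendering and reasons with large models (LMMs) (8 points). The paper emphasizes interpretability and evidence tracing, which relates to Explainable AI (8 points). The reasoning process involves multi-step, in-depth analysis, giving some relevance to Chain of Thought and System 2 Thinking (5 points). Other keywords, such as MoE, SFT, and RLHF, are not covered and score 0.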

!!! tip deepseek-chat TL;DR

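To address external knowledge fusion in video reasoning, the paper proposes Graph-to-Frame RAG, a training-free, auditable method that renders a knowledge graph as a visual reasoning frame and reasons jointly in a unified visual domain. It significantly improves performance in knowledge-intensive settings and enhances interpretability.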

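Abstract translation

When video reasoning requires external knowledge, many systems based on large multimodal models (LMMs) use retrieval augmentation to supply the missing context. However, directly appending textual or multi-clip evidence forces heterogeneous signals into a single attention space. We observe diluted attention and increased cognitive load even on videos that are not long. The bottleneck lies not only in what to retrieve but also in how to represent external knowledge and fuse it with the video backbone. This paper proposes Graph-to-Frame RAG (G2F-RAG), a training-free, auditable paradigm that delivers knowledge in visual space. In the offline stage, an agent builds a problem-agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. In the online stage, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the end of the video. The large multimodal model then reasons jointly within a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail. G2F-RAG is plug-and-play across backbones and scales. It delivers consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Ablation studies further confirm the importance of how knowledge is represented and delivered. G2F-RAG reframes retrieval as visual-space knowledge fusion for robust, interpretable video reasoning.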

Abstract

When video reasoning requires external knowledge, many systems with large multimodal models (LMMs) adopt retrieval augmentation to supply the missing context. Appending textual or multi-clip evidence, however, forces heterogeneous signals into a single attention space. We observe diluted attention and higher cognitive load even on non-long videos. The bottleneck is not only what to retrieve but how to represent and fuse external knowledge with the video backbone. We present Graph-to-Frame RAG (G2F-RAG), a training free and auditable paradigm that delivers knowledge in the visual space. On the offline stage, an agent builds a problem-agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. On the online stage, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video. LMMs then perform joint reasoning in a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail. G2F-RAG is plug-and-play across backbones and scales. It yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Ablations further confirm that knowledge representation and delivery matter. G2F-RAG reframes retrieval as visual space knowledge fusion for robust and interpretable video reasoning.

Keywords: Retrieval-Augmented Generation, Video Reasoning, Knowledge Graph, Multi-agent Systems, Visual Space Fusion, Interpretable AI, Large Multimodal Models, Training-Free Paradigm

233. ❌ Spatially-Weighted CLIP for Street-View Geo-localization

Authors: Ting Han, Fengjiao Li, Chunsong Chen, Haoling Huang, Yiping Chen, Meiliu Wu Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04357v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

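Scoring rationale: This paper addresses multimodal learning in computer vision and geographic information science, proposing a modified CLIP model for street-view geo-localization. Although it uses deep learning, its content is unrelated to every scoring keyword, all of which center on large language model techniques, training methods, inference optimization, alignment, agent systems, and similar topics. The paper does not cover language models, MoE, scaling laws, pre-training or post-training, alignment, RAG, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI-for-science applications. All keywords therefore receive a relevance of 0.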

!!! tip deepseek-chat TL;DR

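The paper proposes a spatially weighted CLIP framework that incorporates geospatial autocorrelation into vision-language contrastive learning, significantly improving the accuracy and spatial consistency of street-view geo-localization.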

摘要翻译

本文提出空间加权CLIP（SW-CLIP），一种用于街景地理定位的新型框架，其将空间自相关特性显式融入视觉-语言对比学习。与传统基于CLIP的方法将所有不匹配样本视为同等负例不同，SW-CLIP依据托布勒地理学第一定律，通过距离感知的软监督建模地理关联。具体而言，我们引入位置文本表征对地理坐标进行编码，并利用基于大地测量距离生成的空间加权软标签替代独热编码的InfoNCE目标。此外，通过邻域一致性正则化保持嵌入空间中的局部空间结构。在多城市数据集上的实验表明，相较于标准CLIP，SW-CLIP显著提升了地理定位精度，减少了长尾误差，并增强了空间连贯性。研究结果凸显了从语义对齐转向地理对齐对鲁棒地理定位的重要性，并为将空间原理融入多模态表征学习提供了通用范式。

摘要 (Abstract)

This paper proposes Spatially-Weighted CLIP (SW-CLIP), a novel framework for street-view geo-localization that explicitly incorporates spatial autocorrelation into vision-language contrastive learning. Unlike conventional CLIP-based methods that treat all non-matching samples as equally negative, SW-CLIP leverages Tobler’s First Law of Geography to model geographic relationships through distance-aware soft supervision. Specifically, we introduce a location-as-text representation to encode geographic positions and replace one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. Additionally, a neighborhood-consistency regularization is employed to preserve local spatial structure in the embedding space. Experiments on a multi-city dataset demonstrate that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP. The results highlight the importance of shifting from semantic alignment to geographic alignment for robust geo-localization and provide a general paradigm for integrating spatial principles into multimodal representation learning.

Keywords: Spatially-Weighted CLIP, street-view geo-localization, vision-language contrastive learning, spatial autocorrelation, geographic alignment, multimodal representation learning, Tobler’s First Law of Geography
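
The SW-CLIP abstract above describes replacing one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. A minimal sketch of that idea follows. The exponential decay kernel, the `bandwidth_km` parameter, and all function names are assumptions made for illustration; the abstract does not specify the paper's actual weighting function.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def soft_labels(anchor, candidates, bandwidth_km=1.0):
    """Target distribution that decays with distance (hypothetical kernel)."""
    weights = [math.exp(-haversine_km(anchor, c) / bandwidth_km)
               for c in candidates]
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(logits, targets):
    """Cross-entropy between soft targets and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(targets, logits))


# Two nearby locations and one far away; the anchor is the first one.
locations = [(40.7128, -74.0060), (40.7138, -74.0060), (34.0522, -118.2437)]
targets = soft_labels(locations[0], locations, bandwidth_km=1.0)
loss = soft_cross_entropy([5.0, 4.0, -2.0], targets)
```

Under this kernel the nearby location keeps a sizeable share of the target mass while the distant one gets essentially none, which is the behaviour a one-hot target cannot express.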

234. ❌ OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

Authors: Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen, Shijian Deng, Kai Wang, Yunhui Guo, Yapeng Tian Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04348v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

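Scoring rationale: OmniSonic proposes a flow-matching-based diffusion framework for audio generation. Its core contributions are the TriAttn-DiT architecture and a Mixture-of-Experts (MoE) gating mechanism that jointly handle on-screen environmental sound, off-screen environmental sound, and speech conditions. It is therefore highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (10 points), since MoE is a core architectural component. It is somewhat relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), since audio generation can be viewed as an application of AI to multimedia science, although the paper does not specifically focus on bioinformatics or cheminformatics. The remaining keywords (LLMs, RLHF, RAG, and so on) are unrelated to the paper's audio-generation topic and receive 0.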
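
The report's configuration gives the passing score as 5.0 + 0.8 × total weight, which is 26.6 for a total weight of 27.0. The report does not state how a per-keyword score is derived from weight and relevance, so the multiplicative rule in `keyword_score` below is purely an assumption. It is used here only to illustrate why a row with weight 0.0 can show a relevance of 10.0/10 and still score 0.0.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Passing score, following the formula in the report's configuration."""
    return base + factor * total_weight


def keyword_score(weight, relevance):
    """Hypothetical rule (not stated in the report): weight times relevance."""
    return weight * relevance


threshold = pass_threshold(27.0)

# (weight, relevance) pairs as they appear in the table rows above.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 0.0)]
total = sum(keyword_score(w, r) for w, r in rows)
passed = total >= threshold
```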

!!! tip deepseek-chat TL;DR

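The paper proposes OmniSonic, a flow-matching-based diffusion framework that uses the TriAttn-DiT architecture and an MoE gating mechanism to generate holistic auditory scenes from video and text, covering on-screen and off-screen environmental sound as well as speech. It outperforms existing methods on benchmarks.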

摘要翻译

本文提出通用全场景音频生成任务，旨在合成包含跨领域（如环境事件、乐器声和人声）屏幕内与屏幕外声音的完整听觉场景。现有视频条件音频生成模型通常仅聚焦于生成与可见发声事件对应的屏幕内环境声，忽略了屏幕外听觉事件。近期出现的全场景文本-视频联合音频生成模型虽试图生成同时包含屏幕内外声音的听觉场景，但其局限于非语音声音，无法生成或整合人类语音。为突破这些限制，我们提出OmniSonic——一个基于流匹配的扩散框架，可同时接受视频与文本的联合条件输入。该框架采用TriAttn-DiT架构，通过三重交叉注意力操作并行处理屏幕内环境声、屏幕外环境声及语音条件，并引入专家混合门控机制，在生成过程中自适应平衡三者的贡献。此外，我们构建了包含千余样本的UniHAGen-Bench新基准数据集，涵盖三类具有代表性的屏幕内外语音-环境交互场景。大量实验表明，OmniSonic在客观指标与人工评估中均持续优于现有先进方法，为通用全场景音频生成建立了强基准。项目页面：https://weiguopian.github.io/OmniSonic_webpage/

摘要 (Abstract)

In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. While recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound, they are limited to non-speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/

Keywords: audio generation, video-conditioned, text-conditioned, Mixture-of-Experts, diffusion framework, holistic auditory scenes, on-screen sounds, off-screen sounds
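
The OmniSonic abstract above mentions a gating mechanism that adaptively balances three cross-attention branches. One common way to realise such a gate is a softmax over per-branch logits followed by a weighted sum, sketched below. This is a generic illustration, not the paper's TriAttn-DiT implementation; the function names and the toy two-dimensional branch outputs are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def gated_mix(branch_outputs, gate_logits):
    """Blend equally sized branch vectors using softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(branch_outputs[0])
    mixed = [sum(w * out[i] for w, out in zip(weights, branch_outputs))
             for i in range(dim)]
    return mixed, weights


# Toy outputs standing in for the three conditioning branches.
on_screen = [1.0, 0.0]
off_screen = [0.0, 1.0]
speech = [1.0, 1.0]

# Equal logits give each branch the same weight of 1/3.
mixed, weights = gated_mix([on_screen, off_screen, speech], [0.0, 0.0, 0.0])
```

In a trained model the gate logits would be predicted from the input, so the mix could shift toward the speech branch when speech is present.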

235. ❌ GA-GS: Generation-Assisted Gaussian Splatting for Static Scene Reconstruction

Authors: Yedong Shen, Shiqi Zhang, Sha Zhang, Yifan Duan, Xinran Zhang, Wenhao Yu, Lu Zhang, Jiajun Deng, Yanyong Zhang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04331v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

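Scoring rationale: GA-GS targets static 3D scene reconstruction in computer vision, using Gaussian splatting and a diffusion model to inpaint occluded regions; it belongs to computer graphics and vision. The scoring keywords all concern large models, the technical principles of deep learning, or AI for Science, but the paper involves no large-model techniques (such as LLMs, MoE, or RLHF), no large-model applications (such as RAG or agents), and no scientific AI applications (such as bioinformatics). It does use a diffusion model, but as an application of generative modeling to a vision task rather than as an innovation in large-model principles or a scientific application, so every keyword scores 0.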

!!! tip deepseek-chat TL;DR

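The paper proposes GA-GS, a generation-assisted Gaussian splatting method for reconstructing static 3D scenes from monocular video that contains dynamic objects. It inpaints occluded regions with a diffusion model and introduces a learnable authenticity scalar, achieving state-of-the-art performance on DAVIS and a self-built dataset.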

摘要翻译

从单目视频中重建包含动态物体的静态三维场景对于虚拟现实和自动驾驶等众多应用至关重要。现有方法通常依赖背景进行静态场景重建，限制了恢复被动态物体遮挡区域的能力。本文提出GA-GS，一种用于静态场景重建的生成辅助高斯溅射方法。我们工作的核心创新在于利用生成模型辅助重建被遮挡区域。我们采用运动感知模块分割并移除动态区域，随后使用扩散模型对遮挡区域进行修复，提供伪真实值监督。为平衡真实背景与生成区域的贡献，我们为每个高斯基元引入可学习的真实性标量，该标量能在溅射过程中动态调制不透明度，实现真实性感知的渲染与监督。由于现有数据集均未提供含动态物体视频的静态场景真实值，我们构建了名为Trajectory-Match的数据集，使用固定路径机器人记录每个场景在有/无动态物体时的状态，从而实现对遮挡区域重建的定量评估。在DAVIS数据集及自建数据集上的大量实验表明，GA-GS在静态场景重建中取得了最先进的性能，尤其在大规模持续遮挡的挑战性场景中表现突出。

摘要 (Abstract)

Reconstructing a static 3D scene from monocular video with dynamic objects is important for numerous applications such as virtual reality and autonomous driving. Current approaches typically rely on background for static scene reconstruction, limiting the ability to recover regions occluded by dynamic objects. In this paper, we propose GA-GS, a Generation-Assisted Gaussian Splatting method for Static Scene Reconstruction. The key innovation of our work lies in leveraging generation to assist in reconstructing occluded regions. We employ a motion-aware module to segment and remove dynamic regions, and then use a diffusion model to inpaint the occluded areas, providing pseudo-ground-truth supervision. To balance contributions from real background and generated region, we introduce a learnable authenticity scalar for each Gaussian primitive, which dynamically modulates opacity during splatting for authenticity-aware rendering and supervision. Since no existing dataset provides ground-truth static scene of video with dynamic objects, we construct a dataset named Trajectory-Match, using a fixed-path robot to record each scene with/without dynamic objects, enabling quantitative evaluation in reconstruction of occluded regions. Extensive experiments on both the DAVIS and our dataset show that GA-GS achieves state-of-the-art performance in static scene reconstruction, especially in challenging scenarios with large-scale, persistent occlusions.

Keywords: Static Scene Reconstruction, Gaussian Splatting, Diffusion Model, Occlusion Inpainting, 3D Reconstruction, Monocular Video, Dynamic Objects, Trajectory-Match Dataset
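
The GA-GS abstract above says a learnable authenticity scalar per Gaussian primitive modulates opacity during splatting. The abstract does not give the modulation rule, so the sketch below assumes one plausible form: the effective opacity is the base opacity multiplied by a sigmoid of an authenticity logit, followed by standard front-to-back alpha compositing along a single ray. All names and the scalar "colour" are illustrative.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def composite(primitives):
    """Front-to-back alpha compositing along one ray.

    Each primitive is (color, opacity, authenticity_logit), ordered from
    nearest to farthest.
    """
    color, transmittance = 0.0, 1.0
    for c, opacity, auth_logit in primitives:
        # Assumed modulation rule: low authenticity fades the primitive out.
        alpha = opacity * sigmoid(auth_logit)
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
    return color


# A generated primitive with very low authenticity sits in front of a
# trusted background primitive; the background dominates the pixel.
pixel = composite([(1.0, 1.0, -50.0), (0.5, 1.0, 50.0)])
```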

236. ❌ HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data

Authors: Stella Girtsou, Konstantinos Alexis, Giorgos Giannopoulos, Harris Kontoes Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04306v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

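Scoring rationale: HighFM studies foundation models for Earth observation. It is highly relevant to 'Foundation Models' (10 points) because it explicitly proposes a foundation model for high-temporal-resolution Earth observation data. It involves 'Pre-training' (10 points) because it pretrains with the SatMAE masked autoencoding framework, and 'Post-training' (10 points) because the pretrained models are fine-tuned on cloud masking and active fire detection. It falls under 'AI for Science' (10 points) because it applies AI to climate-disaster monitoring and Earth observation science. Other keywords such as MoE, SLMs, RLHF, and RAG are unrelated to the paper's computer-vision and Earth-observation focus and receive 0.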

!!! tip deepseek-chat TL;DR

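To address the reliance of existing Earth observation foundation models on satellite data with low revisit rates, the paper proposes HighFM, a first-cut foundation model for high-temporal-resolution multispectral Earth observation data. Pretrained on SEVIRI imagery and fine-tuned on cloud masking and active fire detection, it shows consistent gains over baselines and points to the potential of such models for real-time disaster monitoring.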

摘要翻译

气候相关灾害日益频繁和严重，这加强了对实时监测、早期预警和科学决策的需求。基于卫星数据和机器学习的地球观测技术为应对这些挑战提供了有力工具。基础模型通过在大规模遥感数据集上进行通用预训练，彻底改变了地球观测领域的机器学习应用。然而，现有模型大多依赖高空间分辨率但重访周期长的卫星影像，限制了其在快速演变现象和时间紧迫的应急响应中的适用性。本研究提出了HighFM，这是面向高时间分辨率多光谱地球观测数据构建基础模型的初步探索。我们利用来自第二代气象卫星平台超过2 TB的SEVIRI影像数据，改进SatMAE掩码自编码框架，以学习稳健的时空表征。为支持实时监测，我们在原始架构中引入细粒度时间编码机制，以捕捉短期动态变化。预训练模型随后在云掩膜和活跃火点检测任务上进行微调。我们将基于SEVIRI预训练的视觉变换器与传统基线模型及近期地理空间基础模型进行对比评估，结果显示其在平衡准确率和交并比指标上均取得稳定提升。我们的研究成果凸显了高时间分辨率静止轨道数据在实时地球观测中的应用潜力，为构建面向灾害检测与追踪的基础模型提供了可扩展的路径。

摘要 (Abstract)

The increasing frequency and severity of climate-related disasters have intensified the need for real-time monitoring, early warning, and informed decision-making. Earth Observation (EO), powered by satellite data and Machine Learning (ML), offers powerful tools to meet these challenges. Foundation Models (FMs) have revolutionized EO ML by enabling general-purpose pretraining on large-scale remote sensing datasets. However, most existing models rely on high-resolution satellite imagery with low revisit rates, limiting their suitability for fast-evolving phenomena and time-critical emergency response. In this work, we present HighFM, a first-cut approach towards an FM for high temporal resolution, multispectral EO data. Leveraging over 2 TB of SEVIRI imagery from the Meteosat Second Generation (MSG) platform, we adapt the SatMAE masked autoencoding framework to learn robust spatiotemporal representations. To support real-time monitoring, we enhance the original architecture with fine-grained temporal encodings to capture short-term variability. The pretrained models are then finetuned on cloud masking and active fire detection tasks. We benchmark our SEVIRI pretrained Vision Transformers against traditional baselines and recent geospatial FMs, demonstrating consistent gains across both balanced accuracy and IoU metrics. Our results highlight the potential of temporally dense geostationary data for real-time EO, offering a scalable path toward foundation models for disaster detection and tracking.

Keywords: Foundation Model, Earth Observation, High Temporal Resolution, SatMAE, SEVIRI, Cloud Masking, Active Fire Detection, Disaster Monitoring
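
The HighFM abstract above reports results in balanced accuracy and IoU. For readers unfamiliar with these metrics, the sketch below computes both for a binary mask using their standard definitions. The function names and the toy eight-pixel mask are illustrative and are not taken from the paper.

```python
def confusion(pred, truth):
    """Return (tp, fp, fn, tn) for two equally long lists of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def balanced_accuracy(pred, truth):
    """Mean of recall on the positive class and recall on the negative class."""
    tp, fp, fn, tn = confusion(pred, truth)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return 0.5 * (recall + specificity)


def iou(pred, truth):
    """Intersection over union of the positive class."""
    tp, fp, fn, _ = confusion(pred, truth)
    union = tp + fp + fn
    return tp / union if union else 0.0


# Toy mask with a rare positive class, as is typical for fire pixels.
truth = [1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 0, 0, 0, 0, 0, 0, 1]
ba = balanced_accuracy(pred, truth)  # 0.5 * (1/2 + 5/6)
score = iou(pred, truth)             # 1 / (1 + 1 + 1)
```

Plain accuracy on this mask would be 6/8 even though half the positives are missed, which is why balanced accuracy and IoU are the more informative choices for rare-class detection.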

237. ❌ A Persistent Homology Design Space for 3D Point Cloud Deep Learning

Authors: Prachi Kudeshia, Jiju Poovvancheri, Amr Ghoneim, Dong Chen Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04299v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

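Scoring rationale: This paper focuses on topological data analysis (persistent homology) for deep learning on 3D point clouds, within computer vision and geometric deep learning. Its content is unrelated to the vast majority of keywords, which concern large-model techniques, training methods, inference optimization, alignment, agents, and so on. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to 3D shape analysis, which falls under a broad reading of 'AI for Science' but not its core bioinformatics or cheminformatics subfields, so it receives 5 points (some relevance).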

!!! tip deepseek-chat TL;DR

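The paper proposes a unified design framework that systematically integrates persistent homology, a topological data analysis method, into deep learning on 3D point clouds. Experiments indicate that the framework can improve accuracy, robustness, and part consistency on classification and segmentation tasks.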

摘要翻译

持久同调（Persistent Homology，PH）通过捕捉在不同尺度下持续存在的连通分量、环状结构及空洞，提供了对内在形状结构稳定且多尺度的描述符，从而为三维数据的纯几何表示提供了互补的不变量。然而，尽管其理论保证坚实且实证应用日益增多，持久同调在点云深度学习中的整合仍大多处于临时性状态，且在架构中处于边缘位置。本研究为三维点云中的持久同调驱动学习（3DPHDL）引入了一个统一的设计空间，形式化了复形构建、过滤策略、持久性表示、神经主干网络及预测任务之间的相互作用。除了常见的拓扑图计算与向量化流程外，我们识别了六个原则性的注入点，通过这些点，拓扑学可作为结构性的归纳偏置，重塑采样过程、邻域图构建、优化动态、自监督学习、输出校准乃至内部网络正则化。我们在ModelNet40分类和ShapeNetPart分割任务上进行了受控实证研究，以此实例化该框架：系统地将持久性拓扑图、持久性图像及持久性景观等表示融入代表性主干网络（PointNet、DGCNN和Point Transformer），并分析其对准确性、噪声与采样变化的鲁棒性以及计算可扩展性的影响。结果表明，该方法在拓扑敏感的分类能力和部件一致性方面取得了持续改进，同时揭示了表示表达能力与组合复杂性之间有意义的权衡。通过将持久同调不仅视为辅助特征，更视为学习流程中的结构化组件，本研究为将拓扑推理融入三维点云学习提供了一个系统化框架。

摘要 (Abstract)

Persistent Homology (PH) offers stable, multi-scale descriptors of intrinsic shape structure by capturing connected components, loops, and voids that persist across scales, providing invariants that complement purely geometric representations of 3D data. Yet, despite strong theoretical guarantees and increasing empirical adoption, its integration into deep learning for point clouds remains largely ad hoc and architecturally peripheral. In this work, we introduce a unified design space for Persistent-Homology driven learning in 3D point clouds (3DPHDL), formalizing the interplay between complex construction, filtration strategy, persistence representation, neural backbone, and prediction task. Beyond the canonical pipeline of diagram computation and vectorization, we identify six principled injection points through which topology can act as a structural inductive bias reshaping sampling, neighborhood graphs, optimization dynamics, self-supervision, output calibration, and even internal network regularization. We instantiate this framework through a controlled empirical study on ModelNet40 classification and ShapeNetPart segmentation, systematically augmenting representative backbones (PointNet, DGCNN, and Point Transformer) with persistence diagrams, images, and landscapes, and analyzing their impact on accuracy, robustness to noise and sampling variation, and computational scalability. Our results demonstrate consistent improvements in topology-sensitive discrimination and part consistency, while revealing meaningful trade-offs between representational expressiveness and combinatorial complexity. By viewing persistent homology not merely as an auxiliary feature but as a structured component within the learning pipeline, this work provides a systematic framework for incorporating topological reasoning into 3D point cloud learning.

Keywords: Persistent Homology, 3D Point Clouds, Deep Learning, Topological Data Analysis, Design Space, PointNet, DGCNN, Point Transformer
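
The abstract above describes persistent homology as tracking connected components, loops, and voids across scales. The simplest case, 0-dimensional persistence (connected components), can be computed with a union-find pass over edges sorted by length, as sketched below. This is a textbook illustration, not the paper's pipeline. It assumes a Vietoris-Rips-style filtration indexed by pairwise Euclidean distance (some libraries use half that value), and `h0_deaths` is a hypothetical name.

```python
import math
from itertools import combinations


def h0_deaths(points):
    """Scales at which connected components merge (all are born at 0).

    One component never dies, so n points yield n - 1 finite deaths.
    """
    parent = list(range(len(points)))

    def find(i):
        # Find the component root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj        # two components merge at this scale
            deaths.append(length)  # so one of them dies here
    return deaths


# Two close points and one distant point in 3D.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 5.0)]
deaths = h0_deaths(cloud)
```

The long-lived component (death at scale 5) reflects the isolated point. Persistence diagrams, images, and landscapes are different encodings of such birth and death pairs.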

238. ❌ Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning

Authors: Donghuo Zeng, Hao Niu, Masato Taya Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04229v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

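Scoring rationale: This paper studies unsupervised audio-visual representation learning and proposes HSC-MAE, a hierarchical semantic correlation-aware masked autoencoder framework. Its core techniques are multimodal alignment, masked autoencoding, deep canonical correlation analysis (DCCA), and contrastive learning (InfoNCE), placing it in multimodal representation learning for computer vision and audio processing. The scoring keywords all relate directly to large language models (LLMs), innovations in deep learning principles, or applications of AI in science, whereas this paper addresses conventional audio-visual multimodal learning and involves no large models, no new deep learning principles, and no specific AI-for-Science application. All keywords therefore receive a relevance of 0.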

!!! tip deepseek-chat TL;DR

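To address the challenge of learning aligned audio-visual representations from weakly paired, label-free multimodal data, the paper proposes the HSC-MAE framework, which strengthens representation consistency at three complementary levels through hierarchical semantic correlation constraints and improves on existing unsupervised baselines on the AVE and VEGAS datasets.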

摘要翻译

从弱配对、无标签语料库中学习对齐的多模态嵌入具有挑战性：处理流程通常仅提供预提取特征，片段包含多个事件，且存在虚假共现。我们提出HSC-MAE（分层语义关联感知掩码自编码器），这是一种双路径师生框架，通过在三个互补的表征层级——从粗粒度到细粒度——强制实施语义一致性：（i）通过DCCA实现全局层级的规范几何关联，将音频和视觉嵌入对齐到共享的模态不变子空间中；（ii）通过教师网络挖掘的软性top-k亲和度实现局部层级的邻域语义关联，保留语义相似实例间的多重正例关系结构；（iii）通过掩码自编码实现样本层级的条件充分性关联，确保单个嵌入在部分观测下仍能保留区分性语义内容。具体而言，学生MAE路径通过掩码特征重建和亲和度加权的软性top-k InfoNCE进行训练；而通过CCA路径在未掩码输入上运行的EMA教师网络则提供稳定的规范几何与软性正例。可学习的多任务权重协调了相互竞争的目标，可选的蒸馏损失将教师网络的几何结构传递给学生网络。在AVE和VEGAS数据集上的实验表明，相较于强无监督基线方法，本方法在mAP指标上取得显著提升，验证了HSC-MAE能够产生鲁棒且结构良好的视听表征。

摘要 (Abstract)

Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation - from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and affinity-weighted soft top-k InfoNCE; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile competing objectives, and an optional distillation loss transfers teacher geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.

Keywords: unsupervised audio-visual representation learning, masked autoencoder, hierarchical semantic correlation, multimodal alignment, teacher-student framework, canonical correlation analysis, contrastive learning, weakly paired data
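
The HSC-MAE abstract above mentions an EMA teacher and teacher-mined soft top-k affinities. The sketch below shows generic versions of both ingredients: an exponential-moving-average parameter update, and a target distribution that keeps only the k most similar candidates. The momentum, temperature, and function names are assumptions, and the paper's exact formulation may differ.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter a small step toward the student's."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


def soft_topk_targets(similarities, k, temperature=0.1):
    """Softmax over the k largest similarities; zero weight elsewhere."""
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    top = order[:k]
    m = max(similarities[i] for i in top)
    exps = {i: math.exp((similarities[i] - m) / temperature) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0
            for i in range(len(similarities))]


# Teacher similarities between one anchor and four candidates.
targets = soft_topk_targets([0.9, 0.1, 0.8, -0.3], k=2)
teacher = ema_update([1.0, 2.0], [0.0, 0.0], momentum=0.9)
```

Targets of this kind can stand in for the one-hot positive in an InfoNCE-style loss, so several semantically similar instances count as positives at once.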

239. ❌ DriveVA: Video Action Models are Zero-Shot Drivers

Authors: Mengmeng Liu, Diankun Zhang, Jiuming Liu, Jianfeng Cui, Hongwei Xie, Guang Chen, Hangjun Ye, Michael Ying Yang, Francesco Nex, Hao Cheng Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04198v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

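Scoring rationale: DriveVA focuses on world models for autonomous driving and proposes a shared latent generative process that jointly decodes future visual forecasts and action sequences. It is highly relevant to 'World Models AND General World Models' (10 points) because building an autonomous driving world model is the paper's core. It is somewhat relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points) because the paper inherits pretrained priors from large-scale video generation models and demonstrates cross-domain generalization. The other keywords mainly concern LLM-specific techniques, training methods, inference techniques, or specific scientific applications; this paper studies video action models applied to autonomous driving and does not involve LLMs, MoE, alignment, RAG, inference acceleration, AI for Science, or related topics, so they receive 0.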

!!! tip deepseek-chat TL;DR

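To address the limited generalization of world models in autonomous driving and the inconsistency between planning and scene evolution, the paper proposes DriveVA, which jointly predicts future action sequences and video. It achieves a closed-loop PDM score of 90.9 on NAVSIM and substantially reduces average L2 error and collision rate on nuScenes and Bench2drive.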

摘要翻译

泛化能力是自动驾驶领域的核心挑战，因为实际部署要求系统在未见过的场景、传感器领域和环境条件下均具备鲁棒性能。近期基于世界模型的规划方法在场景理解和多模态未来预测方面展现出强大能力，但其在不同数据集和传感器配置间的泛化能力仍然有限。此外，这些方法采用的松散耦合规划范式往往导致视觉想象过程中的视频-轨迹一致性较差。为克服这些局限，我们提出DriveVA——一种新颖的自动驾驶世界模型，通过在共享的潜在生成过程中联合解码未来视觉预测与动作序列。DriveVA从经过良好预训练的大规模视频生成模型中继承了关于运动动力学和物理合理性的丰富先验知识，以捕捉连续的时空演化规律与因果交互模式。为此，DriveVA采用基于DiT（Diffusion Transformer）的解码器联合预测未来动作序列（轨迹）与视频，实现了规划与场景演化间更紧密的对齐。我们还引入了视频延续策略以增强长时推演的一致性。DriveVA在NAVSIM挑战赛中取得了90.9 PDM分数的卓越闭环性能。大量实验同时证明了DriveVA的零样本能力与跨域泛化性：相较于最先进的基于世界模型的规划器，其在nuScenes数据集上平均L2误差和碰撞率分别降低78.9%与83.3%，在基于CARLA v2构建的Bench2drive数据集上分别降低52.5%与52.4%。

摘要 (Abstract)

Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves an impressive closed-loop performance of 90.9 PDM score on the challenge NAVSIM. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes and 52.5% and 52.4% on the Bench2drive built on CARLA v2 compared with the state-of-the-art world-model-based planner.

Keywords: autonomous driving, world model, video generation, trajectory prediction, zero-shot generalization, cross-domain generalization, DiT-based decoder, closed-loop performance

240. ❌ Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

Authors: Adrienne Deganutti, Elad Hirsch, Haonan Zhu, Jaejung Seol, Purvanshi Mehta Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04192v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
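
The 26.6 threshold in this entry's score line follows from the report's configured formula, passing score = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each. A minimal sketch of that arithmetic is below. The function names are hypothetical, and treating a score equal to the threshold as passing is an assumption, since the report does not say how ties are handled.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold as configured in the report: base + factor * total weight."""
    return base + factor * total_weight


def passes(score, total_weight):
    """Assumption: a paper passes when its score reaches the threshold (>=)."""
    return score >= passing_score(total_weight)


if __name__ == "__main__":
    threshold = passing_score(27.0)  # 27 keywords, weight 1.0 each
    print(round(threshold, 1))
    print(passes(0.0, 27.0))
```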

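Scoring rationale: This paper focuses on an AI evaluation benchmark for graphic design, mainly covering computer vision tasks such as image generation, layout understanding, and spatial reasoning, whereas all scoring keywords center on the technical principles, training methods, inference optimization, alignment techniques, and application frameworks of large language models (LLMs). The abstract mentions no LLM-related technical terms (such as LLM, MoE, SFT, RLHF, RAG, or CoT), nor does it touch on bioinformatics or AI-for-science applications. Although the paper evaluates AI models, its focus is an evaluation benchmark for graphic design tasks rather than large-model techniques themselves or their innovative application in scientific domains, so it has no direct connection to any keyword.

!!! tip deepseek-chat TL;DR

The paper introduces GraphicDesignBench, the first comprehensive evaluation benchmark for professional graphic design tasks, and finds that current AI models still fall significantly short on core design challenges such as spatial reasoning, vector code generation, typographic perception, and animation decomposition.

Abstract (Chinese translation)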

我们推出GraphicDesignBench（GDB），这是首个专门用于评估AI模型在完整专业平面设计任务上的综合性基准测试套件。与现有聚焦于自然图像理解或通用文本到图像合成的基准不同，GDB针对专业设计工作的独特挑战：将传达意图转化为结构化布局、渲染字体保真的文本、操作分层构图、生成有效的矢量图形，以及对动画进行推理。该套件包含沿五个维度组织的50项任务：布局、字体排印、信息图表、模板与设计语义以及动画，每项任务均在理解和生成两种设置下进行评估，并基于从LICA分层构图数据集中提取的真实设计模板。我们使用涵盖空间准确性、感知质量、文本保真度、语义对齐和结构有效性的标准化度量分类法，评估了一系列前沿闭源模型。我们的结果表明，当前模型在专业设计的核心挑战上仍存在不足：对复杂布局的空间推理、保真的矢量代码生成、细粒度的字体排印感知，以及动画的时间分解等问题在很大程度上尚未解决。虽然高层语义理解已可触及，但随着任务对精确性、结构性和构图意识要求的提高，差距急剧扩大。GDB为追踪AI系统向具备能力的设计协作者方向发展提供了一个严谨、可复现的测试平台。完整的评估框架已公开提供。

Abstract

We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes: layout, typography, infographics, template & design semantics and animation, each evaluated under both understanding and generation settings, and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.

Keywords: graphic design benchmark, AI evaluation, spatial reasoning, vector graphics generation, typographic perception, animation decomposition, professional design tasks, layout understanding

241. ❌ AURA: Always-On Understanding and Real-Time Assistance via Video Streams

Authors: Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu, Hongsheng Li Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04184v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

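Scoring rationale: The paper explicitly focuses on Video Large Language Models (VideoLLMs), an application-level innovation of large models in video understanding, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It mainly addresses a framework for real-time video stream processing, covering application-level aspects such as context management, training objectives, and deployment optimization. It does not discuss in depth the specific techniques represented by the other keywords (such as MoE, quantization, inference acceleration, or alignment methods), nor does it involve scientific applications, so all other keywords are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes the AURA framework, which addresses the inability of existing video large language models to handle live video streams that require continuous observation and timely response. It enables end-to-end streaming visual interaction and achieves state-of-the-art performance on streaming benchmarks.

Abstract (Chinese translation)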

视频大语言模型（VideoLLMs）已在众多视频理解任务中展现出卓越性能，但现有系统大多仍处于离线状态，难以适配需要持续观察与即时响应的实时视频流。近期流式视频大语言模型虽取得进展，但当前方法通常依赖解耦的触发-响应流程，或局限于描述式旁白生成，限制了其在开放式问答与长程交互中的效能。我们提出AURA（持续感知与实时辅助系统），一种端到端的流式视觉交互框架，使统一的视频大语言模型能够持续处理视频流，同时支持实时问答与主动响应。AURA整合了面向稳定长程流式交互的上下文管理、数据构建、训练目标与部署优化方案。该框架在流式基准测试中取得了最先进的性能，并支持集成自动语音识别与文本转语音的实时演示系统，可在两块80G加速器上以每秒2帧的速度运行。我们同步开源AURA模型及实时推理框架，以促进未来研究。

Abstract

Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.

Keywords: Video Large Language Models, streaming video understanding, real-time question answering, proactive responses, end-to-end framework, context management, long-horizon interaction, inference optimization

242. ❌ Scale-Aware Vision-Language Adaptation for Extreme Far-Distance Video Person Re-identification

Authors: Ashwat Rajbhandari, Bharatesh Chakravarthi Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04183v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

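Scoring rationale: The paper studies extreme far-distance video person re-identification, building on large-scale vision-language models such as CLIP and adapting them through a backbone upgrade, selective fine-tuning, lightweight temporal attention pooling, adapter learning, and prompt-conditioned cross-view learning. Relevance to the keywords: (1) moderately related to "Large Language Models OR LLMs OR Foundation Models" (5 points), because it uses large-scale vision-language models such as CLIP; (2) highly related to "Pre-training OR Continual Pre-training OR Domain Adaptation" and "Post-training OR Supervised Fine-tuning OR SFT" (8 points each), because it involves adapting and fine-tuning a pretrained model; (3) moderately related to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (5 points), because it mentions adapter learning; (4) other keywords such as MoE, SLMs, RAG, and RLHF are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper studies how to adapt large-scale vision-language models to extreme far-distance video person re-identification. Using a backbone upgrade, selective fine-tuning, temporal attention pooling, and related techniques, it reports mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A) on the DetReIDX benchmark.

Abstract (Chinese translation)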

极端远距离视频行人重识别（ReID）因尺度压缩、分辨率退化、运动模糊及空地视角失配而极具挑战性。随着相机高度与目标距离的增加，基于近距离图像训练的模型性能显著下降。本研究探讨了如何使大规模视觉-语言模型适应此类极端条件并保持可靠性能。我们以基于CLIP的基线模型为起点，将视觉骨干网络从ViT-B/16升级至ViT-L/14，并引入骨干感知选择性微调策略以稳定更大规模Transformer的适配过程。针对噪声多且分辨率低的视频轨迹片段，我们融入轻量级时序注意力池化机制，以抑制低质量帧并强化信息丰富的观测片段。我们保留了基于适配器与提示条件化的跨视角学习方法以缓解空地领域偏移，并通过改进的优化策略与k-互逆重排序进一步优化检索效果。在DetReIDX压力测试基准上的实验表明，我们的方法在A2G（空对地）、G2A（地对空）和A2A（空对空）任务中分别取得46.69、41.23和22.98的mAP分数，总体mAP达到35.73。这些结果证明，大规模视觉-语言骨干网络结合以稳定性为核心的适配策略，能显著提升极端远距离视频行人重识别的鲁棒性。

Abstract

Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shifts, and further refine retrieval using improved optimization and k-reciprocal re-ranking. Experiments on the DetReIDX stress-test benchmark show that our approach achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), corresponding to an overall mAP of 35.73. These results show that large-scale vision-language backbones, when combined with stability-focused adaptation, significantly enhance robustness in extreme far-distance video person ReID.

Keywords: vision-language models, person re-identification, domain adaptation, fine-tuning, temporal attention, cross-view learning, video analysis, CLIP
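
The abstract above describes temporal attention pooling as a way to suppress degraded frames and emphasize informative ones. The sketch below shows one generic way such pooling can work: a softmax over per-frame scores gives the weights for averaging frame features. It is an illustration under stated assumptions, not the authors' implementation. In particular, the per-frame scores are taken as given here, whereas a real model would learn them, and the function name is hypothetical.

```python
import math


def temporal_attention_pool(frame_features, frame_scores):
    """Pool per-frame feature vectors into one tracklet vector.

    Weights are softmax(frame_scores), so low-scoring (degraded) frames
    contribute less. Scores are assumed inputs, not learned here.
    """
    if not frame_features or len(frame_features) != len(frame_scores):
        raise ValueError("need one score per frame and at least one frame")

    # Numerically stable softmax over the frame scores.
    top = max(frame_scores)
    exps = [math.exp(s - top) for s in frame_scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the feature vectors, dimension by dimension.
    dim = len(frame_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0]]
    pooled, weights = temporal_attention_pool(features, [10.0, -10.0])
    print(pooled, weights)
```

With equal scores the pooling reduces to a plain mean over frames, which makes the effect of the attention weights easy to compare against a uniform baseline.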

243. ❌ Incomplete Multi-View Multi-Label Classification via Shared Codebook and Fused-Teacher Self-Distillation

Authors: Xu Yan, Jun Yin, Shiliang Sun, Minghua Wan Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04170v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

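Scoring rationale: The paper studies multi-view multi-label classification, focusing on the scenario where both views and labels are missing, and proposes a method based on a shared codebook and fused-teacher self-distillation. Its content belongs to conventional multi-view learning, multi-label classification, and self-distillation, and does not involve large language models, innovations in the principles of deep learning techniques, or concrete AI-for-science applications. All scoring keywords relate to large models, deep learning techniques, or scientific AI applications, whereas this paper is a conventional machine learning methods study, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

For multi-view multi-label classification with both views and labels missing, the paper proposes a method based on a shared codebook and fused-teacher self-distillation, and validates its effectiveness in experiments on five benchmark datasets.

Abstract (Chinese translation)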

尽管多视图多标签学习已得到广泛研究，但针对视图和标签双重缺失场景的探索仍显不足。现有方法主要依赖对比学习或信息瓶颈理论来学习缺失视图条件下的一致性表示，然而缺乏显式结构约束的损失对齐方式限制了模型捕获稳定且具判别性的共享语义的能力。为解决这一问题，我们引入了一种更具结构化的表示一致性学习机制：通过多视图共享码本与跨视图重构学习离散的一致性表示，该方法将不同视图自然对齐到有限的共享码本嵌入空间中，并减少特征冗余。在决策层面，我们设计了一种权重估计方法，用于评估各视图保持标签关联结构的能力，并据此分配权重以提升融合预测的质量。此外，我们提出了一种融合教师自蒸馏框架，其中融合预测指导视图特定分类器的训练，并将全局知识反馈至单视图分支，从而增强模型在标签缺失条件下的泛化能力。通过在五个基准数据集上与先进方法进行广泛对比实验，充分证明了所提方法的有效性。代码发布于 https://github.com/xuy11/SCSD。

Abstract

Although multi-view multi-label learning has been extensively studied, research on the dual-missing scenario, where both views and labels are incomplete, remains largely unexplored. Existing methods mainly rely on contrastive learning or information bottleneck theory to learn consistent representations under missing-view conditions, but loss-based alignment without explicit structural constraints limits the ability to capture stable and discriminative shared semantics. To address this issue, we introduce a more structured mechanism for consistent representation learning: we learn discrete consistent representations through a multi-view shared codebook and cross-view reconstruction, which naturally align different views within the limited shared codebook embeddings and reduce feature redundancy. At the decision level, we design a weight estimation method that evaluates the ability of each view to preserve label correlation structures, assigning weights accordingly to enhance the quality of the fused prediction. In addition, we introduce a fused-teacher self-distillation framework, where the fused prediction guides the training of view-specific classifiers and feeds the global knowledge back into the single-view branches, thereby enhancing the generalization ability of the model under missing-label conditions. The effectiveness of our proposed method is thoroughly demonstrated through extensive comparative experiments with advanced methods on five benchmark datasets. Code is available at https://github.com/xuy11/SCSD.

Keywords: multi-view learning, multi-label classification, incomplete views, incomplete labels, shared codebook, self-distillation, fused-teacher, representation learning
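
The abstract above says that a shared codebook aligns different views within a limited set of discrete embeddings. The sketch below illustrates the basic idea with nearest-neighbor lookup: each available view's embedding is mapped to the index of its closest codebook entry, and missing views are skipped. This is a generic vector-quantization sketch, not the paper's method. The function names, the squared Euclidean distance, and the use of `None` to mark a missing view are all assumptions made for illustration.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    (squared Euclidean distance, an assumed choice)."""
    def sq_dist(entry):
        return sum((v - c) ** 2 for v, c in zip(vector, entry))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def quantize_views(view_embeddings, codebook):
    """Map every available view to a discrete code from the shared codebook.

    `view_embeddings` maps a view name to its embedding, or to None when
    that view is missing for this sample (hypothetical convention).
    """
    return {
        name: nearest_code(vec, codebook)
        for name, vec in view_embeddings.items()
        if vec is not None
    }


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    sample = {"image": [0.9, 1.1], "text": [0.8, 0.7], "audio": None}
    print(quantize_views(sample, codebook))
```

Because all views index into the same codebook, two views of the same sample that land on the same code share a representation directly, without a separate alignment loss.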

244. ❌ GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

Authors: Yaohan Guan, Pristina Wang, Najim Dehak, Alan Yuille, Jieneng Chen, Daniel Khashabi Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04172v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

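Scoring rationale: GENFIG1 focuses on evaluating vision-language models (VLMs) on scientific visualization, specifically generating a "Figure 1" that clearly conveys a paper's core idea. The task involves reasoning that combines scientific understanding with visual synthesis and is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points), because it applies directly to the scientific domain and advances multimodal AI in research. It is moderately related to "Large Language Models OR LLMs OR Foundation Models" (5 points), because VLMs are usually built on or combined with LLMs, although the paper emphasizes visual generation over text-only models. It is moderately related to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" and "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (5 points each), because the task requires multi-step reasoning (understanding concepts, identifying the salient ones, designing the graphic). The other keywords (such as MoE, Scaling Laws, and RLHF) are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper introduces the GENFIG1 benchmark for evaluating how well vision-language models generate a visual summary (Figure 1) of a scientific paper's core idea. Results show that even the best current models still face significant challenges.

Abstract (Chinese translation)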

在许多科学论文中，“图1”作为核心研究思想的主要视觉摘要。这些图像视觉简洁但概念丰富，通常需要作者投入大量精力并反复修改才能完善，这凸显了科学视觉传达的难度。基于此观察，我们提出了GENFIG1基准，用于评估生成式人工智能模型（例如视觉-语言模型）。GENFIG1以论文的核心要素（标题、摘要、引言和图表说明）作为输入，评估模型生成能清晰表达并阐释论文中心思想的图表的能力。解决GENFIG1任务不仅需要生成视觉上吸引人的图形，更要求模型在文本到图像的生成过程中进行推理，将科学理解与视觉合成相结合。具体而言，模型必须：（i）理解并把握论文的技术概念，（ii）识别其中最关键的部分，（iii）设计出连贯且具有美学效果的图形，以视觉方式传达这些概念并忠实于输入内容。我们从顶级深度学习会议发表的论文中精心构建该基准，实施严格的质量控制，并引入一种与专家人工评估高度相关的自动评估指标。我们在GENFIG1上评估了一系列代表性模型，结果表明即使对于性能最优的系统，该任务仍构成重大挑战。我们希望这一基准能为多模态人工智能的未来发展奠定基础。

Abstract

In many science papers, “Figure 1” serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, often requiring significant effort and iteration by human authors to get right, highlighting the difficulty of science visual communication. With this intuition, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-Language Models). GENFIG1 evaluates models for their ability to produce figures that clearly express and motivate the central idea of a paper (title, abstract, introduction, and figure caption) as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task entails reasoning for text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend and grasp the technical concepts of the paper, (ii) identify the most salient ones, and (iii) design a coherent and aesthetically effective graphic that conveys those concepts visually and is faithful to the input. We curate the benchmark from papers published at top deep-learning conferences, apply stringent quality control, and introduce an automatic evaluation metric that correlates well with expert human judgments. We evaluate a suite of representative models on GENFIG1 and demonstrate that the task presents significant challenges, even for the best-performing systems. We hope this benchmark serves as a foundation for future progress in multimodal AI.

Keywords: GENFIG1, vision-language models, scientific visualization, benchmark, text-to-image generation, multimodal AI, evaluation metric, deep learning conferences

245. ❌ Hierarchical Co-Embedding of Font Shapes and Impression Tags

Authors: Yugo Kubota, Kaito Shiku, Seiichi Uchida Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04158v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

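Scoring rationale: The paper studies hierarchical co-embedding of font shapes and impression tags, which sits at the intersection of computer vision and natural language processing, but does not involve large models, innovations in the principles of deep learning techniques, or applications in scientific domains. All keywords relate to large-model techniques, training methods, inference optimization, alignment, compression, agent systems, or scientific AI applications, whereas this paper focuses on representation learning for font images and text tags, using hyperbolic space to model entailment relations, and has no direct connection to the given keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a hyperbolic co-embedding framework that models the correspondence between fonts and impression tags through entailment rather than simple alignment, thereby quantifying how strongly an impression constrains font style (style specificity), and shows better bidirectional retrieval than one-to-one baselines on the MyFonts dataset.

Abstract (Chinese translation)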

字体形态能够唤起广泛的印象感知，但字体与印象描述之间并非一一对应关系：某些印象与多样化的字体风格广泛兼容，而另一些印象则强烈限制了可能的字体集合。我们将这种分级约束强度称为风格特异性。本文提出一种双曲协同嵌入框架，通过蕴含关系而非简单的配对对齐来建模字体与印象的对应关系。以单个标签或标签集合表示的字体图像与印象描述，被嵌入共享的双曲空间，并受到两种互补的蕴含约束：印象到字体的蕴含关系，以及印象间从低到高风格特异性的蕴含关系。该框架形成了一种径向结构：低风格特异性印象靠近空间原点，高风格特异性印象则分布于更外围区域，从而产生一种可解释的几何度量，用以量化印象对字体风格的约束强度。在MyFonts数据集上的实验表明，本方法相较于强一对一基线模型实现了更优的双向检索性能。此外，通过空间遍历与标签层级分析显示，所学空间能够捕捉从模糊印象到高风格特异性印象的连贯递进关系，并为风格特异性提供了有意义的数据驱动量化指标。

Abstract

Font shapes can evoke a wide range of impressions, but the correspondence between fonts and impression descriptions is not one-to-one: some impressions are broadly compatible with diverse styles, whereas others strongly constrain the set of plausible fonts. We refer to this graded constraint strength as style specificity. In this paper, we propose a hyperbolic co-embedding framework that models font–impression correspondence through entailment rather than simple paired alignment. Font images and impression descriptions, represented as single tags or tag sets, are embedded in a shared hyperbolic space with two complementary entailment constraints: impression-to-font entailment and low-to-high style-specificity entailment among impressions. This formulation induces a radial structure in which low style-specificity impressions lie near the origin and high style-specificity impressions lie farther away, yielding an interpretable geometric measure of how strongly an impression constrains font style. Experiments on the MyFonts dataset demonstrate improved bidirectional retrieval over strong one-to-one baselines. In addition, traversal and tag-level analyses show that the learned space captures a coherent progression from ambiguous to more style-specific impressions and provides a meaningful, data-driven quantification of style specificity.

Keywords: font shapes, impression tags, hyperbolic co-embedding, style specificity, entailment constraints, bidirectional retrieval, MyFonts dataset, geometric measure
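
The abstract above describes a radial structure in hyperbolic space, where low style-specificity impressions lie near the origin and high style-specificity impressions lie farther out. The sketch below computes distance in the Poincaré ball, one standard model of hyperbolic space, to show how "distance from the origin" can act as such a radial measure. The abstract does not say which hyperbolic model the authors use, so the Poincaré ball is an assumption, and the function names are hypothetical.

```python
import math


def _sq_norm(x):
    return sum(a * a for a in x)


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball:
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    if _sq_norm(u) >= 1.0 or _sq_norm(v) >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = _sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - _sq_norm(u)) * (1.0 - _sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def distance_from_origin(u):
    """Radial position of an embedding; in this sketch, a larger value would
    correspond to a more style-specific impression."""
    return poincare_distance(u, [0.0] * len(u))


if __name__ == "__main__":
    broad = [0.1, 0.0]      # hypothetical low style-specificity tag
    specific = [0.9, 0.0]   # hypothetical high style-specificity tag
    print(distance_from_origin(broad), distance_from_origin(specific))
```

Distances grow rapidly near the boundary of the ball, which is what lets a hierarchy with many specific items fit around a few general ones near the center.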

246. ❌ Uncertainty-Aware Test-Time Adaptation for Cross-Region Spatio-Temporal Fusion of Land Surface Temperature

Authors: Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04153v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

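Scoring rationale: The paper focuses on the spatio-temporal fusion regression task for land surface temperature in remote sensing and proposes an uncertainty-aware test-time adaptation framework. It is unrelated to most keywords, which mainly target large language models (LLMs) and related techniques (such as MoE, RLHF, and RAG), whereas this paper studies domain adaptation of conventional deep learning models in remote sensing applications. The only relevant keywords are: (1) "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points): the paper involves a pretrained model and domain adaptation, but specifically test-time adaptation (TTA) rather than typical pre-training or continual pre-training. (2) "AI for Science OR Bioinformatics OR Cheminformatics" (8 points): the paper is an application of AI in science (remote sensing, environmental science) and fits the "AI for Science" category, though not bioinformatics or cheminformatics. Other keywords such as LLMs, MoE, and SFT are not covered, because the paper does not use or discuss large language model techniques.

!!! tip deepseek-chat TL;DR

To address the difficulty of generalizing land surface temperature spatio-temporal fusion regression models to new geographic regions, the paper proposes an uncertainty-aware test-time adaptation framework that achieves average improvements of 24.2% in RMSE and 27.9% in MAE across four target regions with different climates.

Abstract (Chinese translation)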

深度学习模型在各类遥感应用中展现出巨大潜力。然而，由于领域偏移的存在，这些模型往往难以泛化至训练时未见过的新地理区域。当训练区域与目标区域因土地覆盖、气候及环境条件差异导致数据分布不同时，便会产生领域偏移。测试时适应方法已成为应对此类偏移的解决方案，但现有方法主要针对分类任务设计，无法直接应用于回归任务。本研究针对地表温度估计中的时空融合回归任务，提出一种不确定性感知的测试时适应框架。该框架在认知不确定性、土地利用与土地覆盖一致性以及偏差校正的指导下，仅更新预训练时空融合模型中的融合模块，且无需源数据或带标签的目标样本。在意大利罗马、埃及开罗、西班牙马德里和法国蒙彼利埃这四个气候各异的目标区域进行的实验表明，针对法国奥尔良预训练模型的均方根误差与平均绝对误差均获得持续改善。即使在目标区域未标注数据有限且仅进行10轮测试时适应训练的情况下，平均提升幅度仍分别达到24.2%和27.9%。

Abstract

Deep learning models have shown great promise in diverse remote sensing applications. However, they often struggle to generalize across geographic regions unseen during training due to domain shifts. Domain shifts occur when data distributions differ between the training region and new target regions, due to variations in land cover, climate, and environmental conditions. Test-time adaptation (TTA) has emerged as a solution to such shifts, but existing methods are primarily designed for classification and are not directly applicable to regression tasks. In this work, we address the regression task of spatio-temporal fusion (STF) for land surface temperature estimation. We propose an uncertainty-aware TTA framework that updates only the fusion module of a pre-trained STF model, guided by epistemic uncertainty, land use and land cover consistency, and bias correction, without requiring source data or labeled target samples. Experiments on four target regions with diverse climates, namely Rome in Italy, Cairo in Egypt, Madrid in Spain, and Montpellier in France, show consistent improvements in RMSE and MAE for a pre-trained model in Orléans, France. The average gains are 24.2% and 27.9%, respectively, even with limited unlabeled target data and only 10 TTA epochs.

Keywords: test-time adaptation, domain adaptation, spatio-temporal fusion, land surface temperature, remote sensing, uncertainty-aware, regression, deep learning
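
The abstract above reports gains in RMSE and MAE as percentages. The sketch below shows how these two error metrics are computed and how a percentage gain can be derived from a before/after pair. It assumes the reported gains are relative error reductions, (before - after) / before, which the abstract does not state explicitly. The function names and the sample numbers are illustrative only and are not taken from the paper.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def mae(predicted, observed):
    """Mean absolute error between two equal-length sequences."""
    n = len(predicted)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / n


def relative_improvement(error_before, error_after):
    """Percentage reduction in error (assumed meaning of 'gain')."""
    return 100.0 * (error_before - error_after) / error_before


if __name__ == "__main__":
    observed = [300.0, 302.0, 305.0]          # made-up temperatures in kelvin
    before_tta = [303.0, 300.0, 309.0]        # made-up predictions before adaptation
    after_tta = [301.0, 301.0, 307.0]         # made-up predictions after adaptation
    print(relative_improvement(rmse(before_tta, observed), rmse(after_tta, observed)))
    print(relative_improvement(mae(before_tta, observed), mae(after_tta, observed)))
```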

247. ❌ OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

Authors: Liyu Zhang, Kehan Li, Tingrui Han, Tao Zhao, Yuxuan Sheng, Shibo He, Chao Li Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04142v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

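Scoring rationale: The paper focuses on improving training efficiency for flow-matching models; its core contribution is an off-policy GRPO framework (OP-GRPO) that raises sample efficiency. Only the keyword 'Post-training OR Supervised Fine-tuning OR SFT' is highly relevant (10 points), because GRPO (Group Relative Policy Optimization) is a post-training method used here to improve generation quality. No other keyword applies: the paper does not discuss large language models, reasoning, alignment, compression, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

To address the low sample efficiency of GRPO for flow-matching models, the paper proposes OP-GRPO, the first off-policy GRPO framework for such models. By combining trajectory selection, importance sampling correction, and trajectory truncation, it matches or exceeds Flow-GRPO on image and video generation benchmarks with only 34.2% of the training steps on average.

Abstract (Chinese translation)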

通过GRPO进行的后训练已证明在提升流匹配模型生成质量方面具有显著效果。然而，由于其固有的在线策略训练范式，GRPO存在样本效率低下的问题。为解决这一局限，我们提出了OP-GRPO，这是首个专为流匹配模型设计的离线策略GRPO框架。首先，我们主动选择高质量轨迹，并自适应地将其纳入回放缓冲区，以供后续训练迭代重复使用。其次，为缓解离线策略样本引入的分布偏移，我们提出了一种序列级重要性采样校正方法，该方法在保持GRPO裁剪机制完整性的同时，确保了策略更新的稳定性。第三，我们从理论和实验上证明，后期去噪步骤会产生病态的离线策略比率，并通过在后期步骤截断轨迹来缓解这一问题。在图像和视频生成基准测试中，OP-GRPO仅需平均34.2%的训练步数即可达到与Flow-GRPO相当或更优的性能，在保持生成质量的同时显著提升了训练效率。

Abstract

Post training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO’s clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.

Keywords: OP-GRPO, Off-Policy GRPO, flow-matching models, sample efficiency, importance sampling, trajectory truncation, image generation, video generation
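
The second and third ideas in the abstract above (a sequence-level importance ratio that still goes through a clipping step, and dropping late denoising steps) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the choice of averaging per-step log-ratios into one ratio per trajectory, the clip range `eps`, and the `keep_steps` parameter are all assumptions made for illustration.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalise rewards within one group of samples (GRPO-style advantages)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_sequence_objective(logp_new, logp_old, advantages, eps=0.2, keep_steps=None):
    """Clipped surrogate objective with one importance ratio per trajectory.

    logp_new, logp_old: shape (num_trajectories, num_steps), per-step
        log-probabilities under the current policy and the behaviour policy.
    advantages: shape (num_trajectories,).
    keep_steps: if given, only the first `keep_steps` steps are used
        (hypothetical stand-in for truncating late denoising steps).
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    if keep_steps is not None:
        logp_new, logp_old = logp_new[:, :keep_steps], logp_old[:, :keep_steps]
    # Assumed sequence-level ratio: exponentiate the mean per-step log-ratio.
    ratio = np.exp((logp_new - logp_old).mean(axis=1))
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic minimum, as in PPO-style clipping.
    return float(np.minimum(ratio * advantages, clipped * advantages).mean())
```

When the two policies agree, every ratio is 1 and the objective reduces to the mean advantage. Replayed samples from an older policy move the ratio away from 1, and the clip bounds how much any single trajectory can contribute.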

248. ❌ Rethinking Exposure Correction for Spatially Non-uniform Degradation

Authors: Ao Li, Jiawei Sun, Le Dong, Zhenyu Wang, Weisheng Dong Venue/Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04136v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

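Scoring rationale: The paper addresses exposure correction in computer vision and proposes a new method for spatially non-uniform degradation, comprising a Spatial Signal Encoder, an HSL-based compensation module, and a non-uniform loss. The scoring keywords all concern large language models, their training and inference techniques, or AI-for-science applications, none of which this image-restoration paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper tackles spatially non-uniform degradation in real-world exposure correction with a new method built on spatially adaptive modulation weights and an uncertainty-guided loss, and reports better qualitative and quantitative results than existing methods.

Abstract (Chinese translation)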

真实场景下的曝光校正面临空间非均匀退化的根本性挑战，即单张图像中常同时存在多种曝光误差。然而，现有的曝光校正方法仍主要在均匀退化的假设下发展。在架构层面，这些方法通常依赖全局聚合的调制信号，仅能捕捉整体曝光趋势。从优化角度，传统的重建损失通常基于全局共享的尺度构建，因而忽视了不同区域间空间变化的校正需求。为应对这些局限，我们提出一种专为空间非均匀性设计的新型曝光校正范式。具体而言，我们引入空间信号编码器（Spatial Signal Encoder）来预测空间自适应的调制权重，这些权重用于引导多个查找表（look-up tables）进行图像变换，并结合基于HSL的补偿模块以提升色彩保真度。在架构设计之外，我们提出一种不确定性启发的非均匀损失，该损失根据局部修复不确定性动态分配优化重点，从而更好地匹配真实世界曝光误差的异质性特征。大量实验表明，与现有先进方法相比，我们的方法在定性和定量评估上均取得了更优的性能。代码公开于https://github.com/FALALAS/rethinkingEC。

Abstract

Real-world exposure correction is fundamentally challenged by spatially non-uniform degradations, where diverse exposure errors frequently coexist within a single image. However, existing exposure correction methods are still largely developed under a predominantly uniform assumption. Architecturally, they typically rely on globally aggregated modulation signals that capture only the overall exposure trend. From the optimization perspective, conventional reconstruction losses are usually derived under a shared global scale, thus overlooking the spatially varying correction demands across regions. To address these limitations, we propose a new exposure correction paradigm explicitly designed for spatial non-uniformity. Specifically, we introduce a Spatial Signal Encoder to predict spatially adaptive modulation weights, which are used to guide multiple look-up tables for image transformation, together with an HSL-based compensation module for improved color fidelity. Beyond the architectural design, we propose an uncertainty-inspired non-uniform loss that dynamically allocates the optimization focus based on local restoration uncertainties, better matching the heterogeneous nature of real-world exposure errors. Extensive experiments demonstrate that our method achieves superior qualitative and quantitative performance compared with state-of-the-art methods. Code is available at https://github.com/FALALAS/rethinkingEC.

Keywords: exposure correction, spatially non-uniform degradation, spatial signal encoder, look-up tables, HSL compensation, uncertainty-inspired loss, image restoration, computer vision
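
The abstract above describes spatially adaptive weights that guide multiple look-up tables. A minimal sketch of that blending step follows, under strong simplifying assumptions: the image is single-channel 8-bit, each look-up table is one-dimensional with 256 entries, and the per-pixel weights are given rather than predicted by an encoder. The function name and array layouts are hypothetical, and the paper's own look-up tables may well be structured differently.

```python
import numpy as np

def apply_blended_luts(image, luts, weights):
    """Blend the outputs of K look-up tables with per-pixel weights.

    image:   (H, W) integer array with values in 0..255.
    luts:    (K, 256) array, one tone curve per row.
    weights: (K, H, W) array; for each pixel the K weights should sum to 1.
    """
    image = np.asarray(image)
    luts = np.asarray(luts, dtype=float)
    weights = np.asarray(weights, dtype=float)
    per_lut = luts[:, image]            # (K, H, W): every LUT applied to every pixel
    return (weights * per_lut).sum(axis=0)
```

With a single global weight vector every pixel would receive the same tone curve. Letting the weights vary over `(H, W)` is what allows an under-exposed region and an over-exposed region of the same image to be corrected differently.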

249. ❌ NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results

Authors: Shuhong Liu, Chenyu Bao, Ziteng Cui, Xuangeng Chu, Bin Ren, Lin Gu, Xiang Chen, Mingrui Li, Long Ma, Marcos V. Conde, Radu Timofte, Yun Liu, Ryo Umagami, Tomohiro Hashimoto, Zijian Hu, Yuan Gan, Tianhan Xu, Yusuke Kurose, Tatsuya Harada, Junwei Yuan, Gengjia Chang, Xining Ge, Mache You, Qida Cao, Zeliang Li, Xinyuan Hu, Hongde Gu, Changyue Shi, Jiajun Ding, Zhou Yu, Jun Yu, Seungsang Oh, Fei Wang, Donggun Kim, Zhiliang Wu, Seho Ahn, Xinye Zheng, Kun Li, Yanyan Wei, Weisi Lin, Dizhe Zhang, Yuchao Chen, Meixi Song, Hanqing Wang, Haoran Feng, Lu Qi, Jiaao Shan, Yang Gu, Jiacheng Liu, Shiyu Liu, Kui Jiang, Junjun Jiang, Runyu Zhu, Sixun Dong, Qingxia Ye, Zhiqiang Zhang, Zhihua Xu, Zhiwei Wang, Phan The Son, Zhimiao Shi, Zixuan Guo, Xueming Fu, Lixia Han, Changhe Liu, Zhenyu Zhao, Manabu Tsukada, Zheng Zhang, Zihan Zhai, Tingting Li, Ziyang Zheng, Yuhao Liu, Dingju Wang, Jeongbin You, Younghyuk Kim, Il-Youp Kwak, Mingzhe Lyu, Junbo Yang, Wenhan Yang, Hongsen Zhang, Jinqiang Cui, Hong Zhang, Haojie Guo, Hantang Li, Qiang Zhu, Bowen He, Xiandong Meng, Debin Zhao, Xiaopeng Fan, Wei Zhou, Linzhe Jiang, Linfeng Li, Louzhe Xu, Qi Xu, Hang Song, Chenkun Guo, Weizhi Nie, Yufei Li, Xingan Zhan, Zhanqi Shi, Dufeng Zhang, Boyuan Tian, Jingshuo Zeng, Gang He, Yubao Fu, Weijie Wang, Cunchuan Huang Venue/Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04135v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

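Scoring rationale: The paper is a report on a computer-vision challenge for 3D restoration and reconstruction, evaluating 3D reconstruction methods under adverse conditions such as extreme low light and smoke. It does not involve large language models or any of the technical concepts in the scoring keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper reviews the NTIRE 2026 3D Restoration and Reconstruction Challenge, evaluates the 3D reconstruction methods submitted by 33 teams for adverse conditions such as extreme low light and smoke, and reports significant progress in handling 3D scene degradation.

Abstract (Chinese translation)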

本文对NTIRE 2026三维恢复与重建（3DRR）挑战赛进行了全面综述，详细阐述了所提出的方法与结果。该挑战赛旨在发掘在真实世界恶劣条件下——特别是我们提出的RealX3D基准数据集所捕获的极端低光与烟雾退化环境——具有鲁棒性的三维重建流程。本次竞赛共有279名参与者注册，其中33支团队提交了有效结果。我们通过对比前沿基线方法对提交方案进行了系统评估，揭示了在恶劣条件下三维重建技术的显著进展。我们的分析重点总结了表现优异方法之间的共性设计原则，并为处理三维场景退化问题提供了有效的策略见解。

Abstract

This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify reconstruction pipelines that are robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.

Keywords: 3D restoration, 3D reconstruction, adverse conditions, low-light, smoke-degraded, RealX3D benchmark, challenge results, robust pipelines

250. ❌ Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks

Authors: Rubén Moreno-Aguado, Alba Magallón, Victor Moreno, Yingying Fang, Guang Yang Venue/Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04133v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

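Scoring rationale: The paper proposes VoxelFM, a 3D foundation model for CT imaging, and belongs to medical-imaging AI. Relevant keywords: 1) 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points): direct application to biomedical and clinical science; 2) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points): the core is self-supervised pre-training with the DINO framework; 3) 'Large Language Models OR LLMs OR Foundation Models' (8 points): it is a foundation model, though not a language model; 4) 'Post-training OR Supervised Fine-tuning OR SFT' (5 points): it involves adaptation to downstream tasks; 5) 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (5 points): it emphasizes lightweight probes, which is similar in spirit to parameter-efficient fine-tuning. The remaining keywords (MoE, RLHF, RAG, and so on) have no direct connection to the paper's vision model or medical image processing and are scored 0.

!!! tip deepseek-chat TL;DR

The study proposes VoxelFM, a 3D CT foundation model that learns robust visual features through self-supervised learning and, without any language supervision, matches or outperforms existing CT foundation models across seven categories of clinical downstream tasks.

Abstract (Chinese translation)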

当前，开发人工智能系统以支持放射科医生完成从分割到报告生成等任务受到广泛关注。现有的计算机断层扫描（CT）基础模型主要侧重于构建能够执行问答和报告生成等任务的通用视觉-语言系统。然而，训练可靠的视觉-语言系统需要大规模配对的图像-文本数据，而这在CT领域仍难以获得。此外，将底层视觉表征适配到下游任务通常需要对骨干网络进行部分或全部微调，这一过程计算成本高昂，许多研究团队难以实现。相反，基础模型应优先学习鲁棒的视觉表征，使其能够以最少的标注数据且无需骨干网络微调即可高效迁移到新任务。我们提出了VoxelFM，这是一个基于DINO框架通过自蒸馏训练的三维CT基础模型，它无需语言监督即可学习语义丰富的特征。我们使用冻结的骨干网络表征配合轻量级探针，在七类临床相关下游任务上评估了VoxelFM：分类、回归、生存分析、实例检索、定位、分割和报告生成。在所有任务类别中，VoxelFM均达到或超越了四种现有的CT基础模型。尽管在预训练期间未接受任何语言监督，VoxelFM在包括报告生成在内的任务上，表现仍优于那些明确以语言对齐为目标进行训练的模型。我们的结果表明，当前的CT基础模型作为轻量级探针的特征提取器，其性能显著优于作为视觉-语言模型的视觉编码器。模型权重与训练代码均已公开。

Abstract

There is substantial interest in developing artificial intelligence systems to support radiologists across tasks ranging from segmentation to report generation. Existing computed tomography (CT) foundation models have largely focused on building generalist vision-language systems capable of tasks such as question answering and report generation. However, training reliable vision-language systems requires paired image-text data at a scale that remains unavailable in CT. Moreover, adapting the underlying visual representations to downstream tasks typically requires partial or full backbone fine-tuning, a computationally demanding process inaccessible to many research groups. Instead, foundation models should prioritise learning robust visual representations that enable efficient transfer to new tasks with minimal labelled data and without backbone fine-tuning. We present VoxelFM, a 3D CT foundation model trained with self-distillation using the DINO framework, which learns semantically rich features without language supervision. We evaluated VoxelFM across seven categories of clinically relevant downstream tasks using frozen backbone representations with lightweight probes: classification, regression, survival analysis, instance retrieval, localisation, segmentation, and report generation. VoxelFM matched or outperformed four existing CT foundation models across all task categories. Despite receiving no language supervision during pre-training, VoxelFM surpassed models explicitly trained with language-alignment objectives, including on report generation. Our results indicate that current CT foundation models perform significantly better as feature extractors for lightweight probes rather than as vision encoders for vision-language models. Model weights and training code are publicly available.

Keywords: computed tomography, foundation model, self-distillation, DINO, transfer learning, clinical tasks, visual representations, lightweight probes
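
The evaluation protocol in the abstract above, frozen backbone features plus a lightweight probe, can be sketched as a linear probe fitted in closed form. The sketch assumes the features have already been extracted by some frozen encoder; the function names and the ridge penalty `l2` are hypothetical and do not come from the paper.

```python
import numpy as np

def _with_bias(features):
    """Append a constant column so the probe can learn an intercept."""
    X = np.asarray(features, dtype=float)
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_linear_probe(features, targets, l2=1e-3):
    """Ridge-regression probe on fixed features; the backbone is never updated."""
    X = _with_bias(features)
    y = np.asarray(targets, dtype=float)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def predict(weights, features):
    """Apply a fitted probe to new feature vectors."""
    return _with_bias(features) @ weights
```

Only `features.shape[1] + 1` numbers are learned per output, which is why this kind of transfer stays cheap compared with partial or full backbone fine-tuning.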

251. ❌ Efficient Onboard Spacecraft Pose Estimation with Event Cameras and Neuromorphic Hardware

Authors: Arunkumar Rathinam, Jules Lecomte, Jost Reelsen, Gregor Lenz, Axel von Arnim, Djamila Aouada Venue/Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04117v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

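Scoring rationale: The paper belongs to computer vision and embedded systems, studying spacecraft pose estimation with event cameras and neuromorphic hardware. Its core techniques are event representations, lightweight MobileNet-style networks, quantization-aware training (8/4-bit), and conversion of the models to spiking neural networks for deployment on Akida hardware. This is unrelated to the large majority of the scoring keywords, which concern large language models (LLMs) and related techniques such as alignment, reasoning, and agents. The only weakly related keyword is 'Quantization OR Model Compression OR Low-bit Weights', because the paper explicitly mentions 'quantization-aware training (8/4-bit)'; this is only one step toward low-power deployment rather than the paper's core innovation or main subject, so it receives 5 points (some relevance). No other keyword is mentioned in the title or abstract or relates directly to the paper's topic, so all others are scored 0.

!!! tip deepseek-chat TL;DR

The paper presents an end-to-end pipeline for spacecraft 6-DoF pose estimation that couples event cameras with the BrainChip Akida neuromorphic processor, demonstrates real-time, low-power inference on Akida hardware, and offers a practical perception solution for future autonomous space missions.

Abstract (Chinese translation)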

可靠的相对姿态估计是实现自主交会与近距离操作的关键技术，然而空间图像因极端光照、高对比度和目标快速运动而极具挑战性。事件相机提供异步的、由变化驱动的测量数据，当基于帧的图像饱和或模糊时，事件数据仍能保持信息有效性；而神经形态处理器可利用稀疏激活实现低延迟、高能效的推理。本文提出了一种航天器六自由度姿态估计流程，将基于事件的视觉与BrainChip Akida神经形态处理器相结合。利用SPADES数据集，我们在轻量化事件帧表示上训练了紧凑型MobileNet风格的关键点回归网络，应用量化感知训练（8/4位），并将模型转换为兼容Akida的脉冲神经网络。我们对三种事件表示方法进行了基准测试，并在Akida V1硬件上实现了实时低功耗推理。此外，我们针对Akida V2设计了一种基于热图的模型，并在Akida云平台上进行评估，获得了更高的姿态估计精度。据我们所知，这是首次在Akida硬件上实现端到端的航天器姿态估计演示，为未来自主空间任务实现低延迟、低功耗感知提供了一条实用路径。

Abstract

Reliable relative pose estimation is a key enabler for autonomous rendezvous and proximity operations, yet space imagery is notoriously challenging due to extreme illumination, high contrast, and fast target motion. Event cameras provide asynchronous, change-driven measurements that can remain informative when frame-based imagery saturates or blurs, while neuromorphic processors can exploit sparse activations for low-latency, energy-efficient inferences. This paper presents a spacecraft 6-DoF pose-estimation pipeline that couples event-based vision with the BrainChip Akida neuromorphic processor. Using the SPADES dataset, we train compact MobileNet-style keypoint regression networks on lightweight event-frame representations, apply quantization-aware training (8/4-bit), and convert the models to Akida-compatible spiking neural networks. We benchmark three event representations and demonstrate real-time, low-power inference on Akida V1 hardware. We additionally design a heatmap-based model targeting Akida V2 and evaluate it on Akida Cloud, yielding improved pose accuracy. To our knowledge, this is the first end-to-end demonstration of spacecraft pose estimation running on Akida hardware, highlighting a practical route to low-latency, low-power perception for future autonomous space missions.

Keywords: spacecraft pose estimation, event cameras, neuromorphic hardware, Akida processor, quantization-aware training, spiking neural networks, low-power inference, real-time perception
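
Quantization-aware training, mentioned in the abstract above, typically simulates low-bit weights during the forward pass by rounding them to an integer grid and mapping them back to floats. The sketch below shows such a "fake quantization" step for symmetric signed integers. It is a generic illustration, not the Akida toolchain or the authors' code; the function name and the per-tensor scaling are assumptions.

```python
import numpy as np

def fake_quantize(weights, num_bits):
    """Round weights to a symmetric signed integer grid, then return floats again."""
    w = np.asarray(weights, dtype=float)
    qmax = 2 ** (num_bits - 1) - 1      # 127 for 8-bit, 7 for 4-bit
    max_abs = np.abs(w).max()
    if max_abs == 0.0:
        return w.copy()                 # nothing to scale
    scale = max_abs / qmax              # one scale for the whole tensor (assumed)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```

The rounding error per weight is at most half a grid step, `scale / 2`, so moving from 8 bits to 4 bits makes the grid roughly 18 times coarser for the same weight range.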

Authors: Peixin Chen, Guoxi Zhang, Jianwei Ma, Qing Li Venue/Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

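Scoring rationale: The paper studies a hypothesis graph refinement framework for embodied navigation, using vision-language models (VLMs) for semantic prediction and error correction. It has no direct connection to most of the large-model keywords (LLMs, MoE, Scaling Laws, training methods, and so on). Only a few keywords are weakly related: 1) 'Self-Correction' (5 points): the framework includes a verification-driven cascade correction mechanism, which can be seen as a form of self-correction; 2) 'LLM Agents' (5 points): it studies embodied agents, which are agents in the broad sense, but does not explicitly use LLMs; 3) 'Hallucination Mitigation' (5 points): it addresses the propagation of erroneous VLM predictions, which resembles hallucination mitigation but does not target LLMs. No other keyword is relevant.

!!! tip deepseek-chat TL;DR

The paper proposes the Hypothesis Graph Refinement (HGR) framework, which uses a semantic hypothesis module and a cascade error-correction mechanism to address the accumulation of vision-language model prediction errors in embodied navigation, improving navigation success rate and efficiency on multiple benchmarks.

Abstract (Chinese translation)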

具身智能体必须在部分可观测环境中进行探索，同时维持可靠的长时程记忆。现有的基于图的导航系统提升了可扩展性，但这些系统通常将未探索区域视为语义未知区域，导致前沿探索效率低下。尽管视觉-语言模型能够预测前沿语义，但错误的预测可能被嵌入记忆并通过下游推理传播，造成结构性的错误累积，而仅靠置信度衰减无法解决此问题。这些现象要求一个能够利用语义预测进行定向探索，同时能在新证据与预测矛盾时系统性地撤回错误的框架。我们提出假设图优化框架，该框架将前沿预测表示为依赖感知图记忆中的可修订假设节点。HGR包含两个核心部分：（1）语义假设模块，该模块估计基于上下文条件的前沿语义分布，并依据目标相关性、移动成本和不确定性对探索目标进行排序；（2）验证驱动的级联修正机制，该机制将现场观测与预测语义进行比对，一旦发现不匹配，则撤回被证伪的节点及其所有下游依赖节点。与累积式地图构建不同，该方法允许图通过剪除错误子图进行收缩，从而在整个长周期任务中保持记忆的可靠性。我们在多模态终身导航基准和具身问答基准上评估了HGR。HGR在GOAT-Bench上取得了72.41%的成功率和56.22%的SPL指标，并在两个问答基准上均展现出持续的性能提升。诊断分析表明，级联修正机制消除了约20%的结构性冗余假设节点，并将对错误区域的重复访问减少了4.5倍，其中镜面和透明表面导致的预测错误占已修正错误的67%。

Abstract

Embodied agents must explore partially observed environments while maintaining reliable long-horizon memory. Existing graph-based navigation systems improve scalability, but they often treat unexplored regions as semantically unknown, leading to inefficient frontier search. Although vision-language models (VLMs) can predict frontier semantics, erroneous predictions may be embedded into memory and propagate through downstream inferences, causing structural error accumulation that confidence attenuation alone cannot resolve. These observations call for a framework that can leverage semantic predictions for directed exploration while systematically retracting errors once new evidence contradicts them. We propose Hypothesis Graph Refinement (HGR), a framework that represents frontier predictions as revisable hypothesis nodes in a dependency-aware graph memory. HGR introduces (1) a semantic hypothesis module, which estimates context-conditioned semantic distributions over frontiers and ranks exploration targets by goal relevance, travel cost, and uncertainty, and (2) verification-driven cascade correction, which compares on-site observations against predicted semantics and, upon mismatch, retracts the refuted node together with all its downstream dependents. Unlike additive map-building, this allows the graph to contract by pruning erroneous subgraphs, keeping memory reliable throughout long episodes. We evaluate HGR on multimodal lifelong navigation (GOAT-Bench) and embodied question answering (A-EQA, EM-EQA). HGR achieves 72.41% success rate and 56.22% SPL on GOAT-Bench, and shows consistent improvements on both QA benchmarks. Diagnostic analysis reveals that cascade correction eliminates approximately 20% of structurally redundant hypothesis nodes and reduces revisits to erroneous regions by 4.5x, with specular and transparent surfaces accounting for 67% of corrected prediction errors.

Keywords: Embodied Navigation, Hypothesis Graph Refinement, Vision-Language Models, Cascade Error Correction, Semantic Hypothesis, Graph Memory, Long-horizon Memory, Frontier Exploration
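
The cascade correction described in the abstract above, retracting a refuted node together with all of its downstream dependents, amounts to a reachability traversal over dependency edges. The sketch below is a hypothetical minimal version: the `HypothesisGraph` class, its methods, and the string node labels are invented for illustration and say nothing about how HGR stores semantics or observations.

```python
from collections import deque

class HypothesisGraph:
    """Toy dependency-aware memory: each node records which nodes were derived from it."""

    def __init__(self):
        self.dependents = {}   # node -> set of nodes that depend on it

    def add(self, node, depends_on=()):
        self.dependents.setdefault(node, set())
        for parent in depends_on:
            self.dependents.setdefault(parent, set()).add(node)

    def retract(self, refuted):
        """Remove `refuted` and everything reachable from it; return the removed set."""
        if refuted not in self.dependents:
            return set()
        removed, queue = set(), deque([refuted])
        while queue:                       # breadth-first walk over dependency edges
            node = queue.popleft()
            if node in removed:
                continue
            removed.add(node)
            queue.extend(self.dependents.get(node, ()))
        for node in removed:
            self.dependents.pop(node, None)
        for children in self.dependents.values():
            children -= removed            # drop dangling edges to removed nodes
        return removed
```

For example, if a predicted "kitchen" frontier led to a "fridge" hypothesis and then a "milk" hypothesis, refuting "kitchen" removes all three while an unrelated "hallway" node survives, so the graph shrinks instead of only growing.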

253. ❌ A Physics-Informed, Behavior-Aware Digital Twin for Robust Multimodal Forecasting of Core Body Temperature in Precision Livestock Farming

Authors: Riasad Alvi, Mohaimenul Azam Khan Raiaan, Sadia Sultana Chowa, Arefin Ittesafun Abian, Reem E Mohamed, Md Rafiqul Islam, Yakub Sebastian, Sheikh Izzal Azid, Sami Azam Venue/Source: arxiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04098v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

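Scoring rationale: The paper addresses digital twins and machine learning in agriculture, forecasting the core body temperature of dairy cattle with a physical model, a Gaussian process, a Kalman filter, a Markov chain, and a LightGBM ensemble. Its content is unrelated to the large majority of keywords (large-model techniques, training methods, inference optimization, alignment, agent systems, and so on). Only the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', is somewhat related, because the paper applies AI to agricultural science (a scientific field in the broad sense), but it does not involve bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The study proposes a digital twin framework that combines a physical model with a stacked machine-learning ensemble for multimodal forecasting of dairy cattle core body temperature in precision livestock farming, enabling accurate early detection of heat stress.

Abstract (Chinese translation)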

精准畜牧业需要准确及时的热应激预测，以确保动物福利并优化牧场管理。本研究提出了一种物理信息数字孪生（Digital Twin, DT）框架，结合一种不确定性感知、专家加权的堆叠集成方法，用于奶牛核心体温（Core Body Temperature, CBT）的多模态预测。该框架利用高频异构的MmCows数据集，集成了以下组件：一个基于常微分方程（Ordinary Differential Equation, ODE）的体温调节模型，用于模拟代谢产热与散热；一个高斯过程，用于捕捉个体奶牛的特异性偏差；一个卡尔曼滤波器，用于将预测与实时传感器数据对齐；以及一个行为马尔可夫链，用于模拟变化环境条件下的活动状态转换。数字孪生输出的关键生理指标（如预测CBT、热应激概率和行为状态分布）与原始传感器数据融合，并通过多尺度时间分析和跨模态特征工程进行增强，形成一个综合特征集。预测方法设计为三阶段堆叠集成：第一阶段针对不同特征组训练特定模态的LightGBM“专家”模型；第二阶段收集其预测结果作为元特征；第三阶段通过Optuna优化的LightGBM元模型生成最终的CBT预测。预测不确定性通过自助法进行量化，并使用预测区间覆盖概率（Prediction Interval Coverage Probability, PICP）进行验证。消融分析证实，融入数字孪生衍生特征和多模态融合能显著提升性能。所提框架在提前2小时的预测中，实现了交叉验证R² 0.783、F1分数84.25%和PICP 92.38%的性能，为早期热应激检测和精准畜牧管理提供了一个稳健、不确定性感知且具有物理原理依据的系统。

Abstract

Precision livestock farming requires accurate and timely heat stress prediction to ensure animal welfare and optimize farm management. This study presents a physics-informed digital twin (DT) framework combined with an uncertainty-aware, expert-weighted stacked ensemble for multimodal forecasting of Core Body Temperature (CBT) in dairy cattle. Using the high-frequency, heterogeneous MmCows dataset, the DT integrates an ordinary differential equation (ODE)-based thermoregulation model that simulates metabolic heat production and dissipation, a Gaussian process for capturing cow-specific deviations, a Kalman filter for aligning predictions with real-time sensor data, and a behavioral Markov chain that models activity-state transitions under varying environmental conditions. The DT outputs key physiological indicators, such as predicted CBT, heat stress probability, and behavioral state distributions, which are fused with raw sensor data and enriched through multi-scale temporal analysis and cross-modal feature engineering to form a comprehensive feature set. The predictive methodology is designed as a three-stage stacked ensemble, where stage 1 trains modality-specific LightGBM 'expert' models on distinct feature groups, stage 2 collects their predictions as meta-features, and at stage 3 an Optuna-tuned LightGBM meta-model yields the final CBT forecast. Predictive uncertainty is quantified via bootstrapping and validated using Prediction Interval Coverage Probability (PICP). Ablation analysis confirms that incorporating DT-derived features and multimodal fusion substantially enhances performance. The proposed framework achieves a cross-validated R² of 0.783, F1 score of 84.25%, and PICP of 92.38% for 2-hour ahead forecasting, providing a robust, uncertainty-aware, and physically principled system for early heat stress detection and precision livestock management.

Keywords: Digital Twin, Precision Livestock Farming, Core Body Temperature, Multimodal Forecasting, Physics-Informed Model, Stacked Ensemble, Heat Stress Prediction, Uncertainty Quantification
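
The abstract validates its bootstrapped prediction intervals with PICP. As a minimal sketch of that metric, assuming a percentile bootstrap over an array of resampled predictions (the function names `percentile_interval` and `picp`, the 90% nominal level, and the sample values are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def percentile_interval(boot_preds, alpha=0.10):
    """Percentile bootstrap interval per sample.

    boot_preds: array of shape (n_bootstrap, n_samples), one row per
    bootstrap-resampled model (hypothetical layout for this sketch).
    """
    boot_preds = np.asarray(boot_preds, dtype=float)
    lower = np.percentile(boot_preds, 100 * alpha / 2, axis=0)
    upper = np.percentile(boot_preds, 100 * (1 - alpha / 2), axis=0)
    return lower, upper

def picp(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    inside = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(inside.mean())

# Made-up core body temperatures (deg C) and intervals, for illustration only.
y = [38.5, 39.0, 39.6, 40.1]
lo = [38.0, 38.8, 39.0, 39.0]
hi = [39.0, 39.5, 39.5, 40.0]
coverage = picp(y, lo, hi)  # two of the four observations are covered
```

A PICP close to the nominal level (for example, about 0.90 for a 90% interval) suggests well-calibrated intervals. The reported 92.38% would be read against whatever nominal level the authors used, which the abstract does not state.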

254. ❌ LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection

Authors: Dat Nguyen, Enjie Ghorbel, Anis Kacem, Marcella Astrid, Djamila Aouada Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04086v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

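Scoring rationale: This paper focuses on deepfake detection in computer vision and proposes a detection framework based on attention mechanisms and multi-task learning. Although it involves deep learning techniques (CNNs and Transformers), every keyword targets large language models (LLMs) and related techniques (training methods, inference optimization, alignment, applications, and so on), whereas this paper does not touch on language models, text generation, or any LLM-related technique. Its content has no direct connection to any keyword, so every keyword is scored 0.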

!!! tip deepseek-chat TL;DR

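The paper proposes LAA-X, a new deepfake detection framework that uses localized artifact attention and multi-task learning to improve robustness to high-quality forgeries and generalization to unseen manipulations, and that performs competitively with state-of-the-art methods across multiple benchmarks.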

Abstract (Chinese translation)

本文提出了一种新颖的深度伪造检测框架——局部伪影注意力X（Localized Artifact Attention X，简称LAA-X），该框架不仅对高质量伪造内容具有鲁棒性，还能泛化至未见过的篡改类型。现有方法通常依赖于结合隐式注意力机制的二元分类器，这些方法往往难以泛化到已知篡改类型之外。相比之下，LAA-X引入了一种基于多任务学习框架并结合混合式数据合成的显式注意力策略。辅助任务的设计旨在引导模型聚焦于局部的、易产生伪影（即脆弱）的区域。所提出的框架兼容CNN和Transformer骨干网络，从而衍生出两个不同版本，分别命名为LAA-Net和LAA-Former。尽管仅使用真实样本和伪伪造样本进行训练，LAA-X在多个基准测试中仍可与最先进方法相竞争。LAA-Net与LAA-Former的代码及预训练权重均已公开。

Abstract

In this paper, we propose Localized Artifact Attention X (LAA-X), a novel deepfake detection framework that is both robust to high-quality forgeries and capable of generalizing to unseen manipulations. Existing approaches typically rely on binary classifiers coupled with implicit attention mechanisms, which often fail to generalize beyond known manipulations. In contrast, LAA-X introduces an explicit attention strategy based on a multi-task learning framework combined with blending-based data synthesis. Auxiliary tasks are designed to guide the model toward localized, artifact-prone (i.e., vulnerable) regions. The proposed framework is compatible with both CNN and transformer backbones, resulting in two different versions, namely LAA-Net and LAA-Former, respectively. Despite being trained only on real and pseudo-fake samples, LAA-X competes with state-of-the-art methods across multiple benchmarks. Code and pre-trained weights for LAA-Net (https://github.com/10Ring/LAA-Net) and LAA-Former (https://github.com/10Ring/LAA-Former) are publicly available.

Keywords: Deepfake Detection, Localized Artifact Attention, Multi-task Learning, Generalizable Detection, CNN, Transformer, Face Forgery, Data Synthesis

255. ❌ Intelligent Traffic Monitoring with YOLOv11: A Case Study in Real-Time Vehicle Detection

Authors: Shkelqim Sherifi Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04080v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

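Scoring rationale: The paper focuses on computer vision and deep learning for traffic monitoring, using YOLOv11 for vehicle detection and counting together with BoT-SORT/ByteTrack for multi-object tracking. All scoring keywords concern large language models (LLMs), the technical principles of large models, or AI applications in science (such as bioinformatics), whereas this paper studies a CNN-based visual object detection system. It does not involve LLMs, MoE, scaling laws, training or alignment, inference optimization, agents, model compression, or other large-model techniques, nor does it apply large models in scientific fields such as bioinformatics. All keywords are therefore unrelated to the paper, and the score is 0.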

!!! tip deepseek-chat TL;DR

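The paper presents an offline, real-time traffic monitoring system built on YOLOv11 and BoT-SORT/ByteTrack for vehicle detection and counting. Under typical conditions it achieves high detection quality (for example, F1 scores of 0.90-1.00 for cars) and a counting accuracy of 66.67-95.83%.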

Abstract (Chinese translation)

人工智能驱动的计算机视觉领域近期进展显著提升了监控系统的性能。交通监控作为一项重要应用，结合了计算机视觉与基于深度学习的物体检测及计数技术。本文提出一种离线实时交通监控系统，该系统将预训练的YOLOv11检测器与BoT-SORT/ByteTrack多目标跟踪算法相结合，基于PyTorch/OpenCV框架实现，并封装于Qt开发的桌面用户界面中。该卷积神经网络（CNN）处理流程能够从视频流中高效实现车辆检测与计数，且无需依赖云端服务。在多样化场景测试中，系统达到（66.67-95.83%）的计数准确率。按类别检测显示高精度（轿车：0.97-1.00；卡车：1.00）与强召回率（轿车：0.82-1.00；卡车：0.70-1.00），相应F1分数为（轿车0.90-1.00，卡车0.82-1.00）。尽管恶劣天气条件可能对性能产生负面影响，但在常规环境下系统仍保持稳健表现。通过将轻量化模型与易用、不依赖云端的界面相集成，本研究展示了人工智能驱动交通监控系统的潜力，为未来智慧城市的现代化与发展提供了技术支撑。

Abstract

Recent advancements in computer vision, driven by artificial intelligence, have significantly enhanced monitoring systems. One notable application is traffic monitoring, which leverages computer vision alongside deep learning-based object detection and counting. We present an offline, real-time traffic monitoring system that couples a pre-trained YOLOv11 detector with BoT-SORT/ByteTrack for multi-object tracking, implemented in PyTorch/OpenCV and wrapped in a Qt-based desktop UI. The CNN pipeline enables efficient vehicle detection and counting from video streams without cloud dependencies. Across diverse scenes, the system achieves 66.67-95.83% counting accuracy. Class-wise detection yields high precision (cars: 0.97-1.00; trucks: 1.00) with strong recall (cars: 0.82-1.00; trucks: 0.70-1.00), resulting in F1 scores of 0.90-1.00 for cars and 0.82-1.00 for trucks. While adverse weather conditions may negatively impact this performance, results remain robust in typical conditions. By integrating lightweight models with an accessible, cloud-independent interface, this paper contributes to the modernization and development of future smart cities by showing the capacity of AI-driven traffic monitoring systems.

Keywords: traffic monitoring, vehicle detection, YOLOv11, real-time system, object tracking, computer vision, deep learning, smart cities
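
The F1 scores quoted in the abstract follow from precision and recall as their harmonic mean. A short sketch (the helper name `f1_score` is introduced here for illustration) checks that the lower bound reported for trucks, 0.82, is consistent with the reported precision of 1.00 and the lowest reported recall of 0.70:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truck_f1_low = f1_score(1.00, 0.70)   # 1.4 / 1.7, about 0.82
truck_f1_high = f1_score(1.00, 1.00)  # exactly 1.0
```

Because truck precision is constant at 1.00, F1 depends only on recall, so the reported 0.82-1.00 range matches the recall range of 0.70-1.00.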

256. ❌ Stratifying Reinforcement Learning with Signal Temporal Logic

Authors: Justin Curry, Alberto Speranzon Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04923v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

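Scoring rationale: The paper focuses on combining Signal Temporal Logic (STL) with stratification theory and applying the result to deep reinforcement learning (DRL), covering STL semantics, stratified spaces, and the analysis of DRL embedding spaces. All scoring keywords relate directly to large models, the technical principles of deep learning, or AI applications in science. This paper does not address large or small language models (LLMs/SLMs), model training or fine-tuning techniques, inference optimization, alignment methods, agent systems, or model compression, nor does it involve specific scientific application areas such as bioinformatics. All keywords therefore receive a relevance of 0.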

!!! tip deepseek-chat TL;DR

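The paper proposes a stratification-based semantics for Signal Temporal Logic (STL), reveals a correspondence between STL formulas and stratifications of space-time, and applies it to analyze the geometry of the embedding spaces of deep reinforcement learning (DRL) agents, with preliminary empirical evidence from Minigrid games.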

Abstract (Chinese translation)

本文为信号时序逻辑（STL）发展了一种基于分层理论的语义框架，其中每个原子谓词均被解释为对分层空间中成员资格的检验。这一视角揭示了分层理论与STL之间新颖的对应原理，表明大多数STL公式可被视为诱导了时空的分层结构。该解释的意义具有双重性：首先，它为分析深度强化学习（DRL）所生成的嵌入空间结构提供了全新的理论框架，并将其与环境决策空间的几何特性联系起来；其次，它建立了一个原则性框架，既能复用现有的高维分析工具，也激励着新型计算技术的创造。为夯实理论，我们（1）通过Minigrid游戏阐释分层理论的作用；（2）将数值技术应用于DRL智能体在同类游戏中的潜在嵌入表示，其中STL公式的鲁棒性被用作奖励函数。在此过程中，我们提出了计算高效的签名方法，初步证据表明该方法在揭示此类嵌入空间的分层结构方面具有潜力。

Abstract

In this paper, we develop a stratification-based semantics for Signal Temporal Logic (STL) in which each atomic predicate is interpreted as a membership test in a stratified space. This perspective reveals a novel correspondence principle between stratification theory and STL, showing that most STL formulas can be viewed as inducing a stratification of space-time. The significance of this interpretation is twofold. First, it offers a fresh theoretical framework for analyzing the structure of the embedding space generated by deep reinforcement learning (DRL) and relates it to the geometry of the ambient decision space. Second, it provides a principled framework that both enables the reuse of existing high-dimensional analysis tools and motivates the creation of novel computational techniques. To ground the theory, we (1) illustrate the role of stratification theory in Minigrid games and (2) apply numerical techniques to the latent embeddings of a DRL agent playing such a game where the robustness of STL formulas is used as the reward. In the process, we propose computationally efficient signatures that, based on preliminary evidence, appear promising for uncovering the stratification structure of such embedding spaces.

Keywords: Signal Temporal Logic, Stratification Theory, Deep Reinforcement Learning, Embedding Space Analysis, Minigrid Games, Robustness, Computational Signatures
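
The abstract uses the robustness of STL formulas as the reward signal. As a minimal sketch of the standard quantitative (robustness) semantics on a finite, discrete-time, one-dimensional signal (the function names and the example trace are illustrative assumptions, and this does not reproduce the paper's stratification-based semantics):

```python
def rob_greater_than(signal, c):
    """Robustness of the atomic predicate x > c at each time step: x_t - c."""
    return [x - c for x in signal]

def rob_always(rho):
    """Robustness of 'always phi' over the whole trace: the worst case."""
    return min(rho)

def rob_eventually(rho):
    """Robustness of 'eventually phi' over the whole trace: the best case."""
    return max(rho)

trace = [0.5, 1.2, 2.0]              # hypothetical signal values
rho = rob_greater_than(trace, 1.0)   # margins by which x_t exceeds 1.0
always_reward = rob_always(rho)          # negative: violated at t = 0
eventually_reward = rob_eventually(rho)  # positive: satisfied at t = 2
```

A positive value means the formula is satisfied with that margin and a negative value means it is violated, which is what makes robustness usable as a scalar reward.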

257. ❌ PINNs in PDE Constrained Optimal Control Problems: Direct vs Indirect Methods

Authors: Zhen Zhang, Shanqing Liu, Alessandro Alla, Jerome Darbon, George Em Karniadakis Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04920v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

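Scoring rationale: The paper studies physics-informed neural networks (PINNs) for PDE-constrained optimal control problems. It falls within AI for Science and is therefore somewhat related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). It does not involve large language models, innovations in the technical principles of deep learning, or any specific technique named in the other scoring keywords, so all other keywords are scored 0.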

!!! tip deepseek-chat TL;DR

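The paper studies two physics-informed neural network (PINN) formulations, direct and indirect, for PDE-constrained optimal control problems, and finds that the indirect PINN better preserves the PDE constraint and optimality structure and yields a more accurate neural approximation.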

Abstract (Chinese translation)

本研究探讨将物理信息神经网络（PINNs）作为半线性偏微分方程最优控制的数值工具。我们首先回顾了偏微分方程最优控制的经典直接法与间接法视角，随后提出了两种PINN建模框架：一种基于状态约束下最小化目标函数的直接建模法，另一种基于一阶最优性系统的间接建模法。针对一类半线性抛物型方程，我们推导了状态方程、伴随方程及平稳性条件，其形式与连续时间庞特里亚金型最优性条件保持一致。随后，我们将该框架具体应用于艾伦-卡恩（Allen-Cahn）控制问题，并比较了三种数值方法：（i）先离散后优化的伴随方法，（ii）直接PINN法，以及（iii）间接PINN法。数值结果表明，PINN参数化具有隐式正则化效应，即其倾向于生成更平滑的控制曲线。研究还表明，相较于直接PINN法，间接PINN法能更忠实保持偏微分方程约束与最优性结构，并得到更精确的神经网络近似解。

Abstract

We study physics-informed neural networks (PINNs) as numerical tools for the optimal control of semilinear partial differential equations. We first recall the classical direct and indirect viewpoints for optimal control of PDEs, and then present two PINN formulations: a direct formulation based on minimizing the objective under the state constraint, and an indirect formulation based on the first-order optimality system. For a class of semilinear parabolic equations, we derive the state equation, the adjoint equation, and the stationarity condition in a form consistent with continuous-time Pontryagin-type optimality conditions. We then specialize the framework to an Allen-Cahn control problem and compare three numerical approaches: (i) a discretize-then-optimize adjoint method, (ii) a direct PINN, and (iii) an indirect PINN. Numerical results show that the PINN parameterization has an implicit regularizing effect, in the sense that it tends to produce smoother control profiles. They also indicate that the indirect PINN more faithfully preserves the PDE constraint and optimality structure and yields a more accurate neural approximation than the direct PINN.

Keywords: Physics-informed neural networks, PINNs, Optimal control, Partial differential equations, Direct method, Indirect method, Allen-Cahn equation, Adjoint method
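
For orientation, the kind of first-order optimality system that an indirect formulation targets can be written out for a generic tracking-type problem. The cost functional, the target $y_d$, the weight $\alpha > 0$, the distributed control $u$, and the homogeneous boundary conditions are assumptions made for illustration here; the abstract does not give the paper's exact setup. For an Allen-Cahn-type nonlinearity one may take, up to scaling constants, $f(y) = y^3 - y$.

```latex
\begin{aligned}
\text{objective:}\quad & \min_{u}\; J(y,u) = \tfrac{1}{2} \int_0^T\!\!\int_\Omega (y - y_d)^2 \,dx\,dt
  + \tfrac{\alpha}{2} \int_0^T\!\!\int_\Omega u^2 \,dx\,dt, \\
\text{state:}\quad & y_t - \Delta y + f(y) = u, \qquad y(\cdot,0) = y_0, \\
\text{adjoint:}\quad & -p_t - \Delta p + f'(y)\,p = y - y_d, \qquad p(\cdot,T) = 0, \\
\text{stationarity:}\quad & \alpha u + p = 0 .
\end{aligned}
```

These conditions come from the Lagrangian $J(y,u) - \int_0^T\!\int_\Omega p\,(y_t - \Delta y + f(y) - u)\,dx\,dt$. Roughly speaking, a direct PINN would minimize $J$ together with a penalty on the state-equation residual, whereas an indirect PINN would penalize the residuals of the state, adjoint, and stationarity conditions together.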

258. ❌ Empowering Power Outage Prediction with Spatially Aware Hybrid Graph Neural Networks and Contrastive Learning

Authors: Xuyang Shen, Zijie Pan, Diego Cerrai, Xinxuan Zhang, Christopher Colorio, Emmanouil N. Anagnostou, Dongjin Song Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04916v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

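Scoring rationale: The paper focuses on predicting power outages caused by extreme weather using graph neural networks (SA-HGNN) and contrastive learning, an application of AI to science (specifically climate and infrastructure). It does not involve large language models (LLMs), innovations in the technical principles of deep learning, or any technique in the keyword list other than 'AI for Science'. 'AI for Science' receives 5 points because the paper does apply AI to a scientific problem (climate and energy infrastructure), although it is not core bioinformatics or cheminformatics. All other keywords receive 0 because the paper's content is unrelated to LLMs, MoE, scaling laws, training methods, inference optimization, agent systems, and the like.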

!!! tip deepseek-chat TL;DR

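The paper proposes Spatially Aware Hybrid Graph Neural Networks (SA-HGNN) combined with contrastive learning to improve the prediction of power outages caused by extreme weather, and reports state-of-the-art performance in four utility service territories.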

Abstract (Chinese translation)

气候变化加剧的极端天气事件，如强风暴、飓风、暴风雪和冰暴，频繁引发大范围停电事故。这些停电导致工业停滞、社区受影响、关键基础设施受损，严重扰乱经济运行，并对各行业产生深远影响。为减轻此类影响，康涅狄格大学与Eversource能源中心开发了停电预测建模系统，旨在极端天气发生前为配电网络提供预警性预测。然而，该系统现有预测模型未纳入极端天气事件的空间效应。为此，我们开发了融合对比学习的空间感知混合图神经网络，以提升极端天气致停电的预测能力。具体而言，我们首先通过空间感知混合图神经网络，对静态特征（如土地覆盖、基础设施）和事件特定动态特征（如风速、降水）的空间关系进行编码。随后，利用对比学习处理不同类型极端天气事件的数据不平衡问题，通过最小化相似位置在同类事件中的嵌入距离，同时最大化所有位置在跨事件间的嵌入距离，生成位置特定的特征表示。在康涅狄格州、西马萨诸塞州、东马萨诸塞州和新罕布什尔州四个公用事业服务区域的实证研究表明，该模型在停电预测任务中达到了最先进的性能水平。

Abstract

Extreme weather events, such as severe storms, hurricanes, snowstorms, and ice storms, which are exacerbated by climate change, frequently cause widespread power outages. These outages halt industrial operations, impact communities, damage critical infrastructure, profoundly disrupt economies, and have far-reaching effects across various sectors. To mitigate these effects, the University of Connecticut and Eversource Energy Center have developed an outage prediction modeling (OPM) system to provide pre-emptive forecasts for electric distribution networks before such weather events occur. However, existing predictive models in the system do not incorporate the spatial effect of extreme weather events. To this end, we develop Spatially Aware Hybrid Graph Neural Networks (SA-HGNN) with contrastive learning to enhance the OPM predictions for extreme weather-induced power outages. Specifically, we first encode spatial relationships of both static features (e.g., land cover, infrastructure) and event-specific dynamic features (e.g., wind speed, precipitation) via Spatially Aware Hybrid Graph Neural Networks (SA-HGNN). Next, we leverage contrastive learning to handle the imbalance problem associated with different types of extreme weather events and generate location-specific embeddings by minimizing intra-event distances between similar locations while maximizing inter-event distances across all locations. Thorough empirical studies in four utility service territories, i.e., Connecticut, Western Massachusetts, Eastern Massachusetts, and New Hampshire, demonstrate that SA-HGNN can achieve state-of-the-art performance for power outage prediction.

Keywords: power outage prediction, graph neural networks, contrastive learning, extreme weather events, spatial awareness, infrastructure resilience, weather forecasting, SA-HGNN
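
To make the contrastive objective concrete, here is a minimal sketch using the classic margin-based pairwise contrastive loss as a stand-in: embeddings from the same event are pulled together, and embeddings from different events are pushed at least `margin` apart. The paper's actual loss, its notion of "similar locations", and the names `contrastive_loss`, `event_ids`, and `margin` are not specified in the abstract and are assumptions of this sketch.

```python
import numpy as np

def contrastive_loss(embeddings, event_ids, margin=1.0):
    """Mean margin-based contrastive loss over all unordered pairs.

    Same-event pairs contribute d^2 (pull together); different-event
    pairs contribute max(0, margin - d)^2 (push apart up to the margin).
    """
    emb = np.asarray(embeddings, dtype=float)
    ids = np.asarray(event_ids)
    total, pairs = 0.0, 0
    for i in range(len(emb)):
        for j in range(i + 1, len(emb)):
            d = np.linalg.norm(emb[i] - emb[j])
            if ids[i] == ids[j]:
                total += d ** 2
            else:
                total += max(0.0, margin - d) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical location embeddings: two from event 0, one from event 1.
well_separated = contrastive_loss([[0, 0], [0, 0], [3, 4]], [0, 0, 1])
same_event_far = contrastive_loss([[0, 0], [3, 4]], [0, 0])
```

The first call yields zero loss because the same-event pair coincides and the cross-event pairs are already farther apart than the margin. The second is penalized by the squared distance between two same-event embeddings.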

259. ❌ Are Latent Reasoning Models Easily Interpretable?

Authors: Connor Dilgren, Sarah Wiegreffe Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04902v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

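Scoring rationale: The paper studies the interpretability of latent reasoning models (LRMs). It is highly relevant to reasoning methods (Chain of Thought and System 2 Thinking, 10 points each) because LRMs are a kind of reasoning model; highly relevant to explainable AI (10 points) because interpretability is its core topic; somewhat related to large models (5 points) because LRMs can be regarded as a reasoning architecture for large models; and somewhat related to self-correction (5 points) because interpretability can serve as a signal of prediction correctness. The remaining keywords are unrelated to the paper (0 points).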

!!! tip deepseek-chat TL;DR

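The paper investigates the interpretability of latent reasoning models (LRMs) and finds that current LRMs largely encode interpretable reasoning processes and that interpretability can serve as a signal of prediction correctness.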

Abstract (Chinese translation)

隐式推理模型因其较低的推理成本（相对于显式推理模型）以及在理论上能够并行探索多条推理路径的能力而引起了广泛的研究关注。然而，这些优势是以降低可解释性为代价的：隐式推理模型难以监控，因为它们并非使用自然语言进行推理。本文通过考察两种先进的隐式推理模型，对其可解释性进行了深入研究。首先，我们发现隐式推理标记对于模型的预测往往并非必需；在逻辑推理数据集上，隐式推理模型几乎总能在完全不使用隐式推理的情况下得出相同的最终答案。这种对推理标记的利用不足，可能部分解释了为何隐式推理模型并未持续优于显式推理方法，同时也对先前研究中这些标记所声称的作用提出了质疑。其次，我们证明当隐式推理标记对模型性能确有必要时，对于正确预测的实例，我们能够以高达65-93%的比例解码出标准推理轨迹。这表明隐式推理模型通常实现的是预期的解决方案，而非不可解释的推理过程。最后，我们提出一种方法，能够在无需预先知晓标准推理轨迹的情况下，从隐式标记中解码出经过验证的自然语言推理轨迹。实验表明，对于大多数正确预测能够找到验证轨迹，而对于错误预测则仅能覆盖少数情况。我们的研究结果强调，当前的隐式推理模型在很大程度上编码了可解释的推理过程，且可解释性本身可作为预测正确性的一个信号。

Abstract

Latent reasoning models (LRMs) have attracted significant research interest due to their low inference cost (relative to explicit reasoning models) and theoretical ability to explore multiple reasoning paths in parallel. However, these benefits come at the cost of reduced interpretability: LRMs are difficult to monitor because they do not reason in natural language. This paper presents an investigation into LRM interpretability by examining two state-of-the-art LRMs. First, we find that latent reasoning tokens are often unnecessary for LRMs’ predictions; on logical reasoning datasets, LRMs can almost always produce the same final answers without using latent reasoning at all. This underutilization of reasoning tokens may partially explain why LRMs do not consistently outperform explicit reasoning methods and raises doubts about the stated role of these tokens in prior work. Second, we demonstrate that when latent reasoning tokens are necessary for performance, we can decode gold reasoning traces up to 65-93% of the time for correctly predicted instances. This suggests LRMs often implement the expected solution rather than an uninterpretable reasoning process. Finally, we present a method to decode a verified natural language reasoning trace from latent tokens without knowing a gold reasoning trace a priori, demonstrating that it is possible to find a verified trace for a majority of correct predictions but only a minority of incorrect predictions. Our findings highlight that current LRMs largely encode interpretable processes, and interpretability itself can be a signal of prediction correctness.

Keywords: latent reasoning models, interpretability, reasoning tokens, explicit reasoning, natural language reasoning, logical reasoning, prediction correctness, verified trace

260. ❌ Data Attribution in Adaptive Learning

Authors: Amit Kiran Rege Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04892v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

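Scoring rationale: The paper studies data attribution in adaptive learning, mainly settings in which machine learning models (including language models) generate their own training data during training. It is related to 'Large Language Models' (5 points) because the abstract explicitly names post-training pipelines for language models as one example of adaptive learning, and to 'Post-training' (5 points) because the paper refers directly to adaptive learning in language-model post-training pipelines. No other keyword is mentioned or implied in the title or abstract, so the rest receive 0.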

!!! tip deepseek-chat TL;DR

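The paper studies data attribution in adaptive learning. It formalizes occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target, proves that replay-side information cannot recover this target in general, and identifies a structural class in which the target can be identified from logged data.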

Abstract (Chinese translation)

机器学习模型越来越多地自主生成训练数据——在线赌博机算法、强化学习以及语言模型的后训练流程是其主要范例。在这些自适应场景中，单个训练观测既更新学习器，又改变学习器未来将收集数据的分布。为静态数据集设计的标准归因方法忽略了这种反馈机制。我们通过条件干预目标的形式化方法，为有限时域自适应学习建立了观测层级的归因框架，证明回放侧信息通常无法还原该目标，并识别出一个可从已记录数据中识别该目标的结构类别。

Abstract

Machine learning models increasingly generate their own training data – online bandits, reinforcement learning, and post-training pipelines for language models are leading examples. In these adaptive settings, a single training observation both updates the learner and shifts the distribution of future data the learner will collect. Standard attribution methods, designed for static datasets, ignore this feedback. We formalize occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target, prove that replay-side information cannot recover it in general, and identify a structural class in which the target is identified from logged data.

Keywords: Data Attribution, Adaptive Learning, Machine Learning Models, Training Data Generation, Post-training Pipelines, Language Models, Conditional Interventional Target, Logged Data

261. ❌ Optimizing LLM Prompt Engineering with DSPy Based Declarative Learning

Authors: Shiek Ruksana, Sailesh Kiran Kurra, Thipparthi Sanjay Baradwaj Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04869v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

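Scoring rationale: The paper's core topic is optimizing prompt engineering for LLMs using the DSPy framework for declarative learning, so it is highly relevant to 'Large Language Models' (10 points). It explicitly reports evaluation on chain-of-thought benchmarks, so it is highly relevant to 'Chain of Thought' (10 points). It discusses reducing hallucinations and improving factual accuracy, so it is highly relevant to 'Hallucination Mitigation' (10 points). It reports evaluation on retrieval-augmented generation tasks, so it is related to 'Retrieval-Augmented Generation' (8 points). The other keywords are not mentioned in the title or abstract and are scored 0.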

!!! tip deepseek-chat TL;DR

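The paper studies the optimization of LLM prompt engineering with the DSPy declarative learning framework. Using an automated, modular approach, it aims to reduce hallucinations and improve factual accuracy, and reports gains of up to 30-45% in factual accuracy and a reduction of approximately 25% in hallucination rates across several benchmarks.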

Abstract (Chinese translation)

大型语言模型（LLM）在广泛的自然语言处理任务中展现出强大性能，但其效果高度依赖于提示词的设计、结构及其中嵌入的推理信号。传统的提示工程方法主要依赖于启发式的试错过程，这限制了其跨任务的可扩展性、可复现性和泛化能力。DSPy作为一种用于优化文本处理流程的声明式框架，通过为基于LLM的系统提供自动化、模块化且可学习的提示构建方案，提供了一种替代路径。本文对基于DSPy的声明式提示优化学习进行了系统性研究，重点关注提示合成、修正、校准与自适应推理控制。我们提出了一种统一的DSPy LLM架构，该架构结合了符号规划、无梯度优化与自动化模块重写，以减少幻觉、改善事实依据并避免不必要的提示复杂性。在推理任务、检索增强生成以及多步思维链基准测试上进行的实验评估表明，该框架在不同模型中均能持续提升输出可靠性、效率与泛化能力。结果显示，事实准确性最高可提升30%至45%，幻觉率降低约25%。最后，我们总结了当前的主要局限性，并对声明式提示优化框架的未来研究方向进行了探讨。

Abstract

Large Language Models (LLMs) have shown strong performance across a wide range of natural language processing tasks; however, their effectiveness is highly dependent on prompt design, structure, and embedded reasoning signals. Conventional prompt engineering methods largely rely on heuristic trial-and-error processes, which limits scalability, reproducibility, and generalization across tasks. DSPy, a declarative framework for optimizing text-processing pipelines, offers an alternative approach by enabling automated, modular, and learnable prompt construction for LLM-based systems. This paper presents a systematic study of DSPy-based declarative learning for prompt optimization, with emphasis on prompt synthesis, correction, calibration, and adaptive reasoning control. We introduce a unified DSPy LLM architecture that combines symbolic planning, gradient-free optimization, and automated module rewriting to reduce hallucinations, improve factual grounding, and avoid unnecessary prompt complexity. Experimental evaluations conducted on reasoning tasks, retrieval-augmented generation, and multi-step chain-of-thought benchmarks demonstrate consistent gains in output reliability, efficiency, and generalization across models. The results show improvements of up to 30 to 45% in factual accuracy and a reduction of approximately 25% in hallucination rates. Finally, we outline key limitations and discuss future research directions for declarative prompt optimization frameworks.

Keywords: LLM prompt engineering, DSPy, declarative learning, hallucination reduction, factual accuracy, chain-of-thought, retrieval-augmented generation, automated prompt optimization

262. ❌ FairLogue: A Toolkit for Intersectional Fairness Analysis in Clinical Machine Learning Models

Authors: Nick Souligne, Vignesh Subbian Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04858v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

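Scoring rationale: The paper develops a toolkit for algorithmic fairness in clinical machine learning, built on classical machine learning (logistic regression) and fairness evaluation methods. It is unrelated to nearly all of the large-model and deep learning keywords. Only the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", is partially relevant (5/10), because the work is an AI application in healthcare (clinical machine learning), although it involves no large models or deep learning.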

!!! tip deepseek-chat TL;DR

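This study develops Fairlogue, a toolkit for intersectional fairness analysis of clinical machine learning models, and evaluates it on a glaucoma surgery prediction task, where intersectional evaluation reveals larger fairness gaps than single-axis analysis.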

Abstract (Chinese translation)

目的：算法公平性对于医疗保健领域实现公平可信的机器学习至关重要。现有公平性工具多侧重于单维度人口统计学比较，可能忽略影响交叉人群的复合性差异。本研究推出Fairlogue工具包，旨在临床环境的观测性与反事实场景中实现交叉公平性评估的操作化。方法：Fairlogue是基于Python的工具包，包含三个组件：1）将人口均等、机会均等化与机会均等差异扩展至交叉人群的观测框架；2）评估治疗场景下公平性的反事实框架；3）针对交叉群体成员身份干预措施评估公平性的广义反事实框架。该工具包通过使用逻辑回归模型（以种族和性别作为受保护属性）的青光眼手术预测任务，在All of Us Controlled Tier V8数据集的电子健康记录数据中进行验证。结果：观测分析显示，尽管模型性能中等（AUROC = 0.709；准确率 = 0.651），仍存在显著的交叉差异。交叉评估揭示的公平性差距大于单维度分析，包括0.20的人口均等差异，以及机会均等化真阳性率和假阳性率差距分别达0.33和0.15。基于置换零分布的反事实分析得出接近零的不公平性（“u值”）估计值，表明在控制协变量后，观测到的差异与随机效应一致。结论：Fairlogue提供模块化工具包，整合了观测与反事实方法，可用于临床机器学习工作流中交叉偏倚的量化与评估。

Abstract

Objective: Algorithmic fairness is essential for equitable and trustworthy machine learning in healthcare. Most fairness tools emphasize single-axis demographic comparisons and may miss compounded disparities affecting intersectional populations. This study introduces Fairlogue, a toolkit designed to operationalize intersectional fairness assessment in observational and counterfactual contexts within clinical settings. Methods: Fairlogue is a Python-based toolkit composed of three components: 1) an observational framework extending demographic parity, equalized odds, and equal opportunity difference to intersectional populations; 2) a counterfactual framework evaluating fairness under treatment-based contexts; and 3) a generalized counterfactual framework assessing fairness under interventions on intersectional group membership. The toolkit was evaluated using electronic health record data from the All of Us Controlled Tier V8 dataset in a glaucoma surgery prediction task using logistic regression with race and gender as protected attributes. Results: Observational analysis identified substantial intersectional disparities despite moderate model performance (AUROC = 0.709; accuracy = 0.651). Intersectional evaluation revealed larger fairness gaps than single-axis analyses, including demographic parity differences of 0.20 and equalized odds true positive and false positive rate gaps of 0.33 and 0.15, respectively. Counterfactual analysis using permutation-based null distributions produced unfairness (“u-value”) estimates near zero, suggesting observed disparities were consistent with chance after conditioning on covariates. Conclusion: Fairlogue provides a modular toolkit integrating observational and counterfactual methods for quantifying and evaluating intersectional bias in clinical machine learning workflows.

Keywords: algorithmic fairness, intersectional fairness, clinical machine learning, healthcare AI, fairness toolkit, counterfactual analysis, demographic parity, equalized odds
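To make the contrast between single-axis and intersectional evaluation concrete, the following is a minimal sketch of a demographic parity gap computed over groups defined by one or several protected attributes. It is not Fairlogue's API: the helper names `parity_gap` and `make_group` and the toy predictions are assumptions made purely for illustration.

```python
def positive_rate(preds):
    """Fraction of positive (1) predictions in a group."""
    return sum(preds) / len(preds)


def parity_gap(records, keys):
    """Max minus min positive-prediction rate across the groups defined by `keys`."""
    groups = {}
    for rec in records:
        group = tuple(rec[k] for k in keys)
        groups.setdefault(group, []).append(rec["pred"])
    rates = [positive_rate(preds) for preds in groups.values()]
    return max(rates) - min(rates)


def make_group(race, gender, n_pos, n_neg):
    """Hypothetical helper: build toy records for one intersectional group."""
    pos = [{"race": race, "gender": gender, "pred": 1} for _ in range(n_pos)]
    neg = [{"race": race, "gender": gender, "pred": 0} for _ in range(n_neg)]
    return pos + neg


# Toy predictions: every race and every gender has a 50% positive rate overall,
# yet the four intersectional groups differ sharply.
records = (
    make_group("A", "F", 8, 2)
    + make_group("A", "M", 2, 8)
    + make_group("B", "F", 2, 8)
    + make_group("B", "M", 8, 2)
)

race_gap = parity_gap(records, ["race"])
gender_gap = parity_gap(records, ["gender"])
intersectional_gap = parity_gap(records, ["race", "gender"])
print(race_gap, gender_gap, intersectional_gap)
```

In this toy data both single-axis gaps vanish while the intersectional gap is large, which is the kind of compounded disparity the abstract says single-axis tools may miss.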

263. ❌ The Role of Generator Access in Autoregressive Post-Training

Authors: Amit Kiran Rege Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04855v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

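Scoring rationale: The paper's core topic is autoregressive post-training, which is highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10/10) because it studies post-training directly. It concerns autoregressive models and is therefore related to "Large Language Models OR LLMs OR Foundation Models" (8/10), since LLMs are typically trained autoregressively. The other keywords, such as MoE, SLMs, RLHF, and RAG, are not mentioned in the title or abstract and score 0.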

!!! tip deepseek-chat TL;DR

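This paper studies how the form of generator access in autoregressive post-training (being confined to fresh root-start rollouts versus being able to return to previously built prefixes) constrains the learner, and shows that changing only the generator interface creates an exponential gap for KL-regularized outcome-reward post-training.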

Abstract (Chinese translation)

本研究探讨生成器访问方式如何约束自回归式后训练。核心问题在于学习器是否仅限于从初始根节点开始的全新序列生成，抑或能够回溯至已构建的前缀序列并在该处查询下一词元生成规则。在根节点起始机制下，沿采样轨迹的输出采样、生成词元对数概率、top-$k$报告及完整下一词元分布均简化为单一范式化实验，其效能受限于到达信息丰富前缀的在线策略概率。弱前缀控制机制突破了此限制，一旦获得控制能力，条件采样或逻辑值等更丰富的观测方式将超越top-$1$访问模式。仅改变生成器接口即可为基于KL正则化的结果奖励后训练创造指数级差异。

Abstract

We study how generator access constrains autoregressive post-training. The central question is whether the learner is confined to fresh root-start rollouts or can return to previously built prefixes and query the next-token rule there. In the root-start regime, output sampling, generated-token log probabilities, top-$k$ reports, and full next-token distributions along sampled trajectories all reduce to one canonical experiment, limited by the on-policy probability of reaching informative prefixes. Weak prefix control breaks this barrier, and once control is available, richer observations such as conditional sampling or logits can outperform top-$1$ access. Changing only the generator interface creates an exponential gap for KL-regularized outcome-reward post-training.

Keywords: autoregressive post-training, generator access, prefix control, KL-regularized outcome-reward, next-token rule, sampled trajectories, exponential gap

264. ❌ A Robust SINDy Autoencoder for Noisy Dynamical System Identification

Authors: Kairui Ding Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04829v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

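Scoring rationale: The paper uses a SINDy autoencoder to identify nonlinear dynamical systems, which falls under AI for Science, so it is partially relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (5/10). It does not cover large language models (LLMs), innovations in deep learning principles, model training, alignment, inference optimization, or agent systems, so all other keywords score 0.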

!!! tip deepseek-chat TL;DR

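This paper proposes a robust SINDy autoencoder with a noise-separation module, which more reliably identifies simplified latent coordinates and sparse governing equations of nonlinear dynamical systems from noisy observations.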

Abstract (Chinese translation)

非线性动力学稀疏辨识（Sparse Identification of Nonlinear Dynamics，简称SINDy）已被广泛用于从数据中发现动力系统的控制方程。该方法利用稀疏回归技术，从候选函数库中辨识未知系统的简约模型。因此，其依赖于一个基本假设：动力学在所选坐标系中具有稀疏表示。为克服这一局限，研究者致力于寻找一种坐标变换，以获得能够重构原系统的简化坐标。近年来，SINDy自编码器通过将稀疏模型发现与自编码器架构相结合，进一步拓展了这一思路，能够同时学习简化的潜在坐标与简约的控制方程。该框架中的一个核心挑战是对测量误差的鲁棒性。受噪声分离神经网络结构的启发，我们在SINDy自编码器架构中引入了一个噪声分离模块，从而提升了鲁棒性，并能够更可靠地从含噪观测数据中辨识动力系统。在洛伦兹系统上的数值实验表明，所提方法能够恢复可解释的潜在动力学，并从含噪观测中准确估计测量噪声。

Abstract

Sparse identification of nonlinear dynamics (SINDy) has been widely used to discover the governing equations of a dynamical system from data. It uses sparse regression techniques to identify parsimonious models of unknown systems from a library of candidate functions. Therefore, it relies on the assumption that the dynamics are sparsely represented in the coordinate system used. To address this limitation, one seeks a coordinate transformation that provides reduced coordinates capable of reconstructing the original system. Recently, SINDy autoencoders have extended this idea by combining sparse model discovery with autoencoder architectures to learn simplified latent coordinates together with parsimonious governing equations. A central challenge in this framework is robustness to measurement error. Inspired by noise-separating neural network structures, we incorporate a noise-separation module into the SINDy autoencoder architecture, thereby improving robustness and enabling more reliable identification of noisy dynamical systems. Numerical experiments on the Lorenz system show that the proposed method recovers interpretable latent dynamics and accurately estimates the measurement noise from noisy observations.

Keywords: SINDy, autoencoder, dynamical system identification, sparse regression, noise robustness, latent coordinates, Lorenz system, measurement noise
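The sparse-regression step that SINDy relies on can be sketched with sequentially thresholded least squares over a small candidate library. This is a generic illustration, not the paper's autoencoder or its noise-separation module: the test system dx/dt = x - x^3, the polynomial library, the threshold, and the noise level are all assumptions chosen for the example.

```python
import numpy as np


def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: sparse xi with theta @ xi ~ dxdt."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0              # prune library terms with small coefficients
        active = ~small
        if not active.any():
            break
        # Refit using only the surviving library terms.
        xi[active] = np.linalg.lstsq(theta[:, active], dxdt, rcond=None)[0]
    return xi


rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 41)
# Assumed ground truth dx/dt = x - x^3, observed with mild measurement noise.
dxdt = x - x**3 + 0.01 * rng.standard_normal(x.shape)

# Candidate function library: [1, x, x^2, x^3].
theta = np.column_stack([np.ones_like(x), x, x**2, x**3])
xi = stlsq(theta, dxdt)
print(xi)  # coefficients for [1, x, x^2, x^3]
```

The recovered coefficient vector keeps only the `x` and `x^3` terms, illustrating the sparsity assumption the abstract describes: the dynamics must be sparsely representable in the chosen coordinates and library.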

265. ❌ Hybrid Fourier Neural Operator for Surrogate Modeling of Laser Processing with a Quantum-Circuit Mixer

Authors: Mateusz Papierz, Asel Sagingalieva, Alix Benoit, Toni Ivas, Elia Iseli, Alexey Melnikov Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04828v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

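Scoring rationale: The paper develops a hybrid quantum-classical Fourier Neural Operator (HQ-LP-FNO) for three-dimensional surrogate modeling of laser processing, a multiphysics problem coupling heat transfer, melt-pool convection, free-surface deformation, and phase change. Its core contribution is using a variational quantum circuit (VQC) as a compact spectral mixer to reduce the parameter count and improve model performance. Only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant (10/10), since the paper is clearly an application of AI to science (physics and engineering simulation in particular). The other keywords concern large language models, alignment, reasoning, agents, and related topics, and therefore score 0.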

!!! tip deepseek-chat TL;DR

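This paper proposes a hybrid quantum-classical Fourier Neural Operator (HQ-LP-FNO) for efficient surrogate modeling of multiphysics processes in three-dimensional laser processing. Using a variational quantum circuit as a compact spectral mixer, it reduces trainable parameters by 15.6% while lowering phase-fraction mean absolute error by 26% and relative temperature MAE from 2.89% to 2.56%.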

Abstract (Chinese translation)

数据驱动的代理模型可以替代参数化偏微分方程中昂贵的多物理场求解器，然而为三维问题构建紧凑且精确的神经算子仍具挑战性：在傅里叶神经算子（Fourier Neural Operator, FNO）中，密集的逐模式谱通道混合操作所需参数量与保留的傅里叶模式数量呈线性增长，导致参数量膨胀并限制了实时部署能力。我们提出HQ-LP-FNO，一种混合量子-经典FNO，它用紧凑的模式共享变分量子电路混合器替代了其中可配置比例的传统密集谱模块，该混合器的参数量独立于傅里叶模式数量。我们协同设计了一个参数匹配的经典瓶颈控制模型，以提供严格的评估框架。在高能激光加工的三维代理建模任务中（涉及热传导、熔池对流、自由表面形变与相变耦合），HQ-LP-FNO相比经典基线模型减少了15.6%的可训练参数，同时将相分数平均绝对误差降低26%，相对温度平均绝对误差从2.89%降至2.56%。通过对量子通道占比的扫描分析发现，适度的变分量子电路分配能在所有测试配置（包括全经典基线）中取得最佳温度指标，这指向了一种最优的经典-量子任务划分。消融实验证实，由变分量子电路通过其紧凑电路结构天然实现的模式共享混合机制，是这些性能提升的主要贡献因素。基于ibm-torino后端校准噪声的噪声模拟器研究表明，量子混合器在测试的测量次数范围内保持数值稳定性。这些结果表明，基于变分量子电路的参数高效谱混合能够改进针对复杂多物理场问题的神经算子代理模型，并为实践中混合量子算子学习建立了受控的评估框架。

Abstract

Data-driven surrogates can replace expensive multiphysics solvers for parametric PDEs, yet building compact, accurate neural operators for three-dimensional problems remains challenging: in Fourier Neural Operators, dense mode-wise spectral channel mixing scales linearly with the number of retained Fourier modes, inflating parameter counts and limiting real-time deployability. We introduce HQ-LP-FNO, a hybrid quantum-classical FNO that replaces a configurable fraction of these dense spectral blocks with a compact, mode-shared variational quantum circuit mixer whose parameter count is independent of the Fourier mode budget. A parameter-matched classical bottleneck control is co-designed to provide a rigorous evaluation framework. Evaluated on three-dimensional surrogate modeling of high-energy laser processing, coupling heat transfer, melt-pool convection, free-surface deformation, and phase change, HQ-LP-FNO reduces trainable parameters by 15.6% relative to a classical baseline while lowering phase-fraction mean absolute error by 26% and relative temperature MAE from 2.89% to 2.56%. A sweep over the quantum-channel budget reveals that a moderate VQC allocation yields the best temperature metrics across all tested configurations, including the fully classical baseline, pointing toward an optimal classical-quantum partitioning. The ablation confirms that mode-shared mixing, naturally implemented by the VQC through its compact circuit structure, is the dominant contributor to these improvements. A noisy-simulator study under backend-calibrated noise from ibm-torino confirms numerical stability of the quantum mixer across the tested shot range. These results demonstrate that VQC-based parameter-efficient spectral mixing can improve neural operator surrogates for complex multiphysics problems and establish a controlled evaluation protocol for hybrid quantum operator learning in practice.

Keywords: Hybrid quantum-classical FNO, Surrogate modeling, Laser processing, Multiphysics simulation, Variational quantum circuit, Parameter-efficient spectral mixing, Fourier Neural Operator, Three-dimensional surrogate
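The scaling argument in the abstract can be illustrated with simple parameter arithmetic. The sketch below assumes one complex channel-mixing matrix per retained Fourier mode for the dense case and a fixed-size shared mixer for the mode-shared case; the channel width, mode counts, and the mixer size of 64 are hypothetical and are not taken from the paper.

```python
def dense_spectral_params(channels, modes_per_axis, dims=3):
    """Real parameters when every retained mode has its own complex mixing matrix."""
    retained_modes = modes_per_axis ** dims
    return 2 * channels * channels * retained_modes  # 2 = real + imaginary part


def shared_mixer_params(mixer_size, modes_per_axis, dims=3):
    """A mode-shared mixer reuses one parameter set, so the mode budget is irrelevant."""
    return mixer_size


for modes in (8, 12, 16):
    dense = dense_spectral_params(channels=32, modes_per_axis=modes)
    shared = shared_mixer_params(mixer_size=64, modes_per_axis=modes)
    print(f"modes/axis={modes:2d}  dense={dense:>9,d}  mode-shared={shared}")
```

Doubling the modes kept per axis multiplies the dense count by eight in 3D, while the mode-shared count stays fixed. How HQ-LP-FNO splits channels between dense blocks and the quantum mixer is not specified in the abstract, so that split is not modeled here.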

266. ❌ Partially deterministic sampling for compressed sensing with denoising guarantees

Authors: Yaniv Plan, Matthew S. Scott, Ozgur Yilmaz Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04802v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

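Scoring rationale: The paper optimizes sampling schemes for compressed sensing, a signal processing topic, and is unrelated to all of the listed large-model and deep learning keywords. It does not involve language models, training methods, inference techniques, alignment, agent systems, or AI-for-science applications.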

!!! tip deepseek-chat TL;DR

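This paper derives an optimized compressed sensing scheme that combines random and deterministic sampling of rows. Compared with with-replacement and without-replacement sampling, it yields measurable improvements in image compressed sensing, supported by theoretical results and numerical experiments, together with improved sample complexity bounds and novel denoising guarantees.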

Abstract (Chinese translation)

我们研究当采样向量选自酉矩阵行向量时的压缩感知问题。现有文献通常采用随机方式选择这些采样向量；随机性的运用推动了该领域重要的实证与理论进展。然而，实践中常存在某些关键采样向量，此时研究者会偏离理论框架而确定性地选择这些行向量。在本工作中，我们针对伯努利选择器推导出一种优化的采样方案，该方案自然融合了行向量的随机选择与确定性选择，从而严格判定哪些行应被确定性采样。理论结果与数值实验表明，相较于有放回和无放回采样方案，该采样方案在生成式先验与稀疏先验的图像压缩感知中均实现了可量化的性能提升。此外，我们的理论保证相比前人研究具有更优的样本复杂度界限，并在此设定下提出了新颖的去噪保证。

Abstract

We study compressed sensing when the sampling vectors are chosen from the rows of a unitary matrix. In the literature, these sampling vectors are typically chosen randomly; the use of randomness has enabled major empirical and theoretical advances in the field. However, in practice there are often certain crucial sampling vectors, in which case practitioners will depart from the theory and sample such rows deterministically. In this work, we derive an optimized sampling scheme for Bernoulli selectors which naturally combines random and deterministic selection of rows, thus rigorously deciding which rows should be sampled deterministically. This sampling scheme provides measurable improvements in image compressed sensing for both generative and sparse priors when compared to with-replacement and without-replacement sampling schemes, as we show with theoretical results and numerical experiments. Additionally, our theoretical guarantees feature improved sample complexity bounds compared to previous works, and novel denoising guarantees in this setting.

Keywords: compressed sensing, sampling vectors, deterministic sampling, Bernoulli selectors, denoising guarantees, unitary matrix, image compressed sensing, sample complexity
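One way to picture how Bernoulli selectors can mix random and deterministic row selection is to cap each row's inclusion probability at 1: rows whose probability saturates are always sampled, and the rest are sampled at random. The sketch below is only an illustration of that idea and is not the optimized scheme derived in the paper. The importance scores, the budget, and the function `bernoulli_probabilities` are assumptions; scores are assumed positive and the budget at most the number of rows.

```python
import numpy as np


def bernoulli_probabilities(scores, budget):
    """Return p_i = min(1, c * scores_i), with c chosen so that sum(p) == budget."""
    scores = np.asarray(scores, dtype=float)
    p = np.zeros_like(scores)
    free = np.ones(scores.shape[0], dtype=bool)  # rows not yet fixed at p = 1
    remaining = float(budget)
    while free.any():
        c = remaining / scores[free].sum()
        saturated = free & (c * scores >= 1.0)
        if not saturated.any():
            p[free] = c * scores[free]
            break
        p[saturated] = 1.0            # these rows are sampled deterministically
        free &= ~saturated
        remaining -= int(saturated.sum())
    return p


scores = [10.0, 6.0, 1.0, 1.0, 1.0, 1.0]   # hypothetical row importance scores
p = bernoulli_probabilities(scores, budget=3)

rng = np.random.default_rng(0)
mask = rng.random(len(p)) < p   # independent Bernoulli selector for each row
print(p, mask)
```

Rows with `p == 1` appear in every draw, while the remaining rows are included independently with their assigned probabilities, so the expected number of sampled rows equals the budget.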

267. ❌ Forgetting to Witness: Efficient Federated Unlearning and Its Visible Evaluation

Authors: Houzhe Wang, Xiaojie Zhu, Chi Chen Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04800v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

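Scoring rationale: The paper focuses on federated unlearning, proposing an efficient unlearning method and a visual evaluation framework. Although it involves machine learning models (a classifier and a GAN), every keyword targets large models (LLMs) and related techniques (MoE, RLHF, RAG, and so on) or AI applications in science. The paper mentions no large-model technique, architecture, or training method and no AI-for-science application, so all keywords score 0.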

!!! tip deepseek-chat TL;DR

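This paper proposes an efficient federated unlearning method and a visual evaluation framework (Skyeye), aimed at ensuring that federated learning models no longer retain information about specific data once it has been deleted, and validates both through experiments.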

Abstract (Chinese translation)

随着数据隐私与安全的重要性日益凸显，联邦遗忘学习作为一个新兴研究领域应运而生，其致力于确保联邦学习模型在特定数据被删除后不再保留或泄露相关信息。本文首次提出了一个完整的联邦遗忘学习流程，包括一种联邦遗忘学习方法及相应的评估框架。我们提出的联邦遗忘学习方法无需存储历史数据，即可保证高效性与模型准确性。该方法有效结合了知识蒸馏模型与多种优化机制。此外，我们提出一个名为Skyeye的可视化框架，用于呈现联邦遗忘学习模型的遗忘能力。该框架将联邦遗忘学习模型作为分类器嵌入生成对抗网络（GAN）中，随后分类器与判别器共同指导生成器生成样本。在此过程中，生成器从分类器的知识中学习，并通过样本生成将知识可视化。最终，模型遗忘能力的评估基于删除数据与生成样本之间的关联度进行。我们通过全面的实验验证了所提出的联邦遗忘学习方法及其评估框架的有效性。

Abstract

With the increasing importance of data privacy and security, federated unlearning has emerged as a novel research field dedicated to ensuring that federated learning models no longer retain or leak relevant information once specific data has been deleted. In this paper, to the best of our knowledge, we propose the first complete pipeline for federated unlearning, which includes a federated unlearning approach and an evaluation framework. Our proposed federated unlearning approach ensures high efficiency and model accuracy without the need to store historical data. It effectively leverages the knowledge distillation model alongside various optimization mechanisms. Moreover, we propose a framework named Skyeye to visualize the forgetting capacity of federated unlearning models. It utilizes the federated unlearning model as the classifier integrated into a Generative Adversarial Network (GAN). Afterward, both the classifier and discriminator guide the generator in generating samples. Throughout this process, the generator learns from the classifier’s knowledge. The generator then visualizes this knowledge through sample generation. Finally, the model’s forgetting capability is evaluated based on the relevance between the deleted data and the generated samples. Comprehensive experiments are conducted to illustrate the effectiveness of the proposed federated unlearning approach and the corresponding evaluation framework.

Keywords: federated unlearning, federated learning, knowledge distillation, GAN, visualization, forgetting capacity, data privacy, model accuracy

268. ❌ Fine-Tuning Integrity for Modern Neural Networks: Structured Drift Proofs via Norm, Rank, and Sparsity Certificates

Authors: Zhenhang Shang, Kani Chen Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04738v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

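Scoring rationale: The paper's core topic is verifying the security and integrity of fine-tuning, which is highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10/10). It constrains model updates with norm, rank, and sparsity certificates, which is partially relevant to "Mixture of Experts OR MoE OR Sparse Models" and "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (5/10 each), since it involves sparsity and parameter-efficient updates. The paper addresses fine-tuning of large models, giving moderate relevance to "Large Language Models OR LLMs OR Foundation Models" (5/10). The remaining keywords (specific techniques such as RAG and RLHF, reasoning methods, and scientific applications) are not covered and score 0.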

!!! tip deepseek-chat TL;DR

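This paper addresses security risks in fine-tuning large models by proposing Fine-Tuning Integrity (FTI), a security goal enforced through Succinct Model Difference Proofs (SMDPs), which certify that a fine-tuned model differs from its trusted base model only within predefined norm, rank, or sparsity constraints.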

Abstract (Chinese translation)

微调现已成为适配大型神经网络的主要方法，但它也引入了新的完整性风险。不可信方可能植入后门、改变安全行为或在声称仅进行微小更新的同时覆盖模型的大部分内容。现有的验证工具侧重于推理正确性或全模型溯源，未能解决这一问题。
我们提出微调完整性作为受控模型演化的安全目标。FTI系统能够证明微调模型与可信基模型之间的差异仅存在于策略定义的漂移类别内。我们提出简洁模型差异证明作为一种新型密码学原语，用于强制执行这些漂移约束。SMDP可提供零知识证明，表明模型更新是范数有界的、低秩的或稀疏的。验证成本仅取决于漂移的结构，而与模型规模无关。
我们基于随机投影、多项式承诺和流式线性校验给出了具体的SMDP构造方案。同时通过信息论下界证明，简洁证明的实现必须以某种结构化为前提。最后，我们针对Transformer、CNN和MLP提出了架构感知的实例化方案，并构建了一个端到端系统，可将块级证明聚合为全局证书。

Abstract

Fine-tuning is now the primary method for adapting large neural networks, but it also introduces new integrity risks. An untrusted party can insert backdoors, change safety behavior, or overwrite large parts of a model while claiming only small updates. Existing verification tools focus on inference correctness or full-model provenance and do not address this problem. We introduce Fine-Tuning Integrity (FTI) as a security goal for controlled model evolution. An FTI system certifies that a fine-tuned model differs from a trusted base only within a policy-defined drift class. We propose Succinct Model Difference Proofs (SMDPs) as a new cryptographic primitive for enforcing these drift constraints. SMDPs provide zero-knowledge proofs that the update to a model is norm-bounded, low-rank, or sparse. The verifier cost depends only on the structure of the drift, not on the size of the model. We give concrete SMDP constructions based on random projections, polynomial commitments, and streaming linear checks. We also prove an information-theoretic lower bound showing that some form of structure is necessary for succinct proofs. Finally, we present architecture-aware instantiations for transformers, CNNs, and MLPs, together with an end-to-end system that aggregates block-level proofs into a global certificate.

Keywords: fine-tuning, integrity, security, model drift, succinct proofs, norm-bounded, low-rank, sparse
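The three drift classes named in the abstract (norm-bounded, low-rank, sparse) can be made concrete with a plain, non-cryptographic check on a single weight matrix. This sketch only shows what each constraint means for the update `tuned - base`; it is not an SMDP, provides no zero-knowledge or succinctness properties, and the helper names, tolerance, and policy thresholds are assumptions.

```python
import numpy as np


def drift_report(base, tuned, tol=1e-8):
    """Summarize the update tuned - base by Frobenius norm, rank, and sparsity."""
    delta = tuned - base
    return {
        "fro_norm": float(np.linalg.norm(delta)),
        "rank": int(np.linalg.matrix_rank(delta, tol=tol)),
        "nonzeros": int(np.count_nonzero(np.abs(delta) > tol)),
    }


def within_policy(report, max_norm=None, max_rank=None, max_nonzeros=None):
    """True if the drift satisfies every constraint the policy specifies."""
    checks = [
        max_norm is None or report["fro_norm"] <= max_norm,
        max_rank is None or report["rank"] <= max_rank,
        max_nonzeros is None or report["nonzeros"] <= max_nonzeros,
    ]
    return all(checks)


rng = np.random.default_rng(0)
base = rng.standard_normal((6, 6))

# Rank-1 update, similar in spirit to a rank-1 adapter.
u, v = rng.standard_normal(6), rng.standard_normal(6)
low_rank_report = drift_report(base, base + 0.01 * np.outer(u, v))

# Sparse update: only two weights change.
tuned_sparse = base.copy()
tuned_sparse[0, 1] += 0.5
tuned_sparse[3, 4] -= 0.5
sparse_report = drift_report(base, tuned_sparse)

print(low_rank_report, within_policy(low_rank_report, max_rank=1))
print(sparse_report, within_policy(sparse_report, max_nonzeros=2))
```

A direct check like this needs both full models in the clear and costs time proportional to the model size; the abstract's claim is that SMDPs certify the same kind of constraint with verifier cost that depends only on the structure of the drift.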

269. ❌ A Muon-Accelerated Algorithm for Low Separation Rank Tensor Generalized Linear Models

Authors: Xiao Liang, Shuang Li Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04726v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

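Scoring rationale: This paper focuses on an optimization algorithm (LSRTR-M) for low separation rank tensor generalized linear models. It belongs to computational mathematics and statistical modeling and has no direct connection to deep learning or large-model techniques. All keywords concern large models, deep learning, and related techniques (training, inference, alignment, applications, and so on), whereas this paper studies the computational acceleration of a classical tensor regression algorithm and involves no neural networks, large language models, or other deep learning techniques. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper mentions biomedical imaging applications; its core is algorithmic optimization rather than an AI-for-science application, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the low computational efficiency of estimating low separation rank tensor generalized linear models, this paper proposes LSRTR-M, an algorithm that incorporates Muon updates, and reports faster convergence and better computational efficiency on synthetic data and the Vessel MNIST 3D task.

Abstract (Chinese translation)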

张量数据在多维信号与成像问题（如生物医学成像）中自然产生。当将其纳入广义线性模型时，简单的向量化处理可能破坏其多维结构，并导致高维、不适定的估计问题。为应对这一挑战，低分离秩分解通过对系数张量施加低秩多线性结构来降低模型复杂度。估计基于低分离秩的张量广义线性模型的一种代表性方法是低分离秩张量回归算法，该算法采用块坐标下降法，并通过基于QR分解的重复投影强制因子矩阵正交化。然而，重复的投影步骤计算成本较高且收敛速度较慢。基于对此类数据进行可扩展估计与分类的需求，我们提出了LSRTR-M算法，该算法将穆恩更新（通过牛顿-舒尔茨正交化的动量）融入LSRTR框架。具体而言，LSRTR-M保留了原有的块坐标方案，同时将基于投影的因子更新替换为穆恩更新步骤。在合成的线性、逻辑与泊松低分离秩张量广义线性模型中，LSRTR-M在迭代次数和实际计算时间上均收敛更快，同时实现了更低的归一化估计误差与预测误差。在Vessel MNIST 3D任务中，该算法在保持竞争力的分类性能的同时，进一步提升了计算效率。

Abstract

Tensor-valued data arise naturally in multidimensional signal and imaging problems, such as biomedical imaging. When incorporated into generalized linear models (GLMs), naive vectorization can destroy their multi-way structure and lead to high-dimensional, ill-posed estimation. To address this challenge, Low Separation Rank (LSR) decompositions reduce model complexity by imposing low-rank multilinear structure on the coefficient tensor. A representative approach for estimating LSR-based tensor GLMs (LSR-TGLMs) is the Low Separation Rank Tensor Regression (LSRTR) algorithm, which adopts block coordinate descent and enforces orthogonality of the factor matrices through repeated QR-based projections. However, the repeated projection steps can be computationally demanding and slow convergence. Motivated by the need for scalable estimation and classification from such data, we propose LSRTR-M, which incorporates Muon (MomentUm Orthogonalized by Newton-Schulz) updates into the LSRTR framework. Specifically, LSRTR-M preserves the original block coordinate scheme while replacing the projection-based factor updates with Muon steps. Across synthetic linear, logistic, and Poisson LSR-TGLMs, LSRTR-M converges faster in both iteration count and wall-clock time, while achieving lower normalized estimation and prediction errors. On the Vessel MNIST 3D task, it further improves computational efficiency while maintaining competitive classification performance.
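The Newton-Schulz orthogonalization that gives Muon its name can be sketched in a few lines. The version below uses the classical cubic iteration, X ← 1.5·X − 0.5·X·XᵀX, which drives the singular values of a suitably scaled matrix toward 1 using only matrix products. Common Muon implementations use a tuned higher-order polynomial with only a few iterations, and the abstract does not say how LSRTR-M embeds the step in its block coordinate scheme, so `muon_like_step` and its `lr` and `beta` parameters are hypothetical.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def newton_schulz_orthogonalize(m, steps=30):
    """Approximate the orthogonal polar factor of m with the cubic
    Newton-Schulz iteration (no QR or SVD, only matrix products)."""
    fro = math.sqrt(sum(v * v for row in m for v in row))
    if fro == 0.0:
        return [list(row) for row in m]
    # Scaling by the Frobenius norm puts all singular values in (0, 1].
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(x, matmul(transpose(x), x))
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(xrow, crow)]
             for xrow, crow in zip(x, xxtx)]
    return x


def muon_like_step(weights, grad, momentum, lr=0.1, beta=0.9):
    """Hypothetical Muon-style update: accumulate momentum, orthogonalize
    it, then take a step along the orthogonalized direction."""
    momentum = [[beta * mv + gv for mv, gv in zip(mrow, grow)]
                for mrow, grow in zip(momentum, grad)]
    direction = newton_schulz_orthogonalize(momentum)
    weights = [[w - lr * d for w, d in zip(wrow, drow)]
               for wrow, drow in zip(weights, direction)]
    return weights, momentum


if __name__ == "__main__":
    factor = [[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]]
    q = newton_schulz_orthogonalize(factor)
    print(matmul(transpose(q), q))  # close to the 2x2 identity
```

Replacing a QR-based projection with an iteration of this kind trades an exact factorization for a few matrix multiplications, which is consistent with the efficiency motivation stated in the abstract.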

Keywords: Tensor Generalized Linear Models, Low Separation Rank, LSRTR-M, Muon updates, Block coordinate descent, Computational efficiency, Vessel MNIST 3D, Convergence acceleration

270. ❌ Towards protein folding pathways by reconstructing protein residue networks with a policy-driven model

Authors: Susan Khor Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04677v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

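Scoring rationale: This paper focuses on modeling protein folding pathways, using a policy-driven model to reconstruct protein residue networks, and belongs to bioinformatics. Its content is unrelated to most keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). It is related only to "AI for Science OR Bioinformatics OR Cheminformatics", since it applies AI methods to a bioinformatics problem (protein folding). However, the paper does not explicitly mention large models or deep learning and instead uses a policy-driven model and a hill-climber, so the relevance is limited and it receives 5 points.

!!! tip deepseek-chat TL;DR

This study proposes reconstructing protein residue networks with a policy-driven model to simulate protein folding pathways; the resulting numerical observations correlate strongly with known folding rates.

Abstract (Chinese translation)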

一种通过合适的节点选择与边恢复策略重构蛋白质残基网络的方法，产生了与已发表的52个双态折叠蛋白及21个多态折叠蛋白的折叠速率高度相关（皮尔逊相关系数 < -0.83）的数值观测结果；在折叠家族层面上相关性同样显著。这些结果是使用先前提出的ND模型偶然获得的，但本研究通过引入依据特征状态指导操作的策略对其进行了扩展。该结果表明起始搜索点与主导条件（随机种子）对于简单爬山算法快速实现策略搜索的成功至关重要。合适的策略与随机种子这两种条件（由强相关性统计证明）为在ND框架内模拟蛋白质折叠创造了有利环境，可类比于蛋白质自然折叠所需的适当生理条件。值得关注的是，可通过检查恢复边的序列来探索其作为潜在蛋白质折叠路径的可能性。为此，我们收集了轨迹数据用于分析及进一步的模型评估与开发。

Abstract

A method that reconstructs protein residue networks using suitable node selection and edge recovery policies produced numerical observations that correlate strongly (Pearson’s correlation coefficient < -0.83) with published folding rates for 52 two-state folders and 21 multi-state folders; correlations are also strong at the fold-family level. These results were obtained serendipitously with the ND model, which was introduced previously, but is here extended with policies that dictate actions according to feature states. This result points to the importance of both the starting search point and the prevailing condition (random seed) for the quick success of policy search by a simple hill-climber. The two conditions, suitable policies and random seed, which (evidenced by the strong correlation statistic) setup a conducive environment for modelling protein folding within ND, could be compared to appropriate physiological conditions required by proteins to fold naturally. Of interest is an examination of the sequence of restored edges for potential as plausible protein folding pathways. Towards this end, trajectory data is collected for analysis and further model evaluation and development.

Keywords: protein folding pathways, protein residue networks, policy-driven model, node selection, edge recovery, folding rates, ND model, trajectory data

271. ❌ Minimaxity and Admissibility of Bayesian Neural Networks

Authors: Daniel Andrew Coulson, Martin T. Wells Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04673v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

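Scoring rationale: This paper studies the optimality (minimaxity and admissibility) of Bayesian neural networks (BNNs) within statistical decision theory and falls under the theoretical analysis of deep learning. All scoring keywords focus on large language models (LLMs) and related techniques (training methods, inference optimization, applications, and so on), whereas this paper does not involve LLMs, large-model techniques, or novel applications of AI in science, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

From the perspective of statistical decision theory, this paper studies the optimality of deep ReLU Bayesian neural networks under quadratic loss and Kullback-Leibler loss, proposes a hyperprior design, and proves that the resulting decision rule is both admissible and minimax.

Abstract (Chinese translation)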

贝叶斯神经网络（BNNs）为深度学习模型中的推断提供了一种自然的概率框架。尽管其应用广泛，但从统计决策理论视角审视其最优性的研究仍较为有限。本文在二次损失下，研究了由深度全连接前馈ReLU贝叶斯神经网络在正态位置模型中诱导的决策规则。我们证明，在固定先验尺度的情况下，所诱导的贝叶斯决策规则并非极小化极大。随后，我们提出对贝叶斯神经网络先验的有效输出方差施加一个超先验，该超先验能产生超调和平方根边际密度，从而证明由此得到的决策规则同时具有可容许性与极小化极大性。我们进一步将这些结果从二次损失设定推广至使用Kullback-Leibler损失的预测密度估计问题。最后，我们通过数值模拟验证了理论发现。

Abstract

Bayesian neural networks (BNNs) offer a natural probabilistic formulation for inference in deep learning models. Despite their popularity, their optimality has received limited attention through the lens of statistical decision theory. In this paper, we study decision rules induced by deep, fully connected feedforward ReLU BNNs in the normal location model under quadratic loss. We show that, for fixed prior scales, the induced Bayes decision rule is not minimax. We then propose a hyperprior on the effective output variance of the BNN prior that yields a superharmonic square-root marginal density, establishing that the resulting decision rule is simultaneously admissible and minimax. We further extend these results from the quadratic loss setting to the predictive density estimation problem with Kullback–Leibler loss. Finally, we validate our theoretical findings numerically through simulation.
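As background for the superharmonicity condition, a standard identity from Stein-type risk analysis in the normal location model may help. The notation below ($X$, $\theta$, $m$, $\delta_m$, $d$) is introduced here for illustration and is not taken from the paper; unit covariance and the usual regularity conditions for Stein's identity are assumed.

```latex
% Assumed setting: X \sim N_d(\theta, I_d), loss \|\delta(X) - \theta\|^2,
% m = marginal density of X under the prior.
\delta_m(X) = X + \nabla \log m(X),
\qquad
R(\theta, \delta_m)
  = d + \mathbb{E}_\theta\!\left[
      2\,\frac{\Delta m(X)}{m(X)} - \frac{\|\nabla m(X)\|^2}{m(X)^2}
    \right]
  = d + 4\,\mathbb{E}_\theta\!\left[
      \frac{\Delta \sqrt{m(X)}}{\sqrt{m(X)}}
    \right].
```

The estimator $\delta_0(X) = X$ has constant risk $d$ and is minimax, so $\Delta\sqrt{m} \le 0$ (a superharmonic $\sqrt{m}$) gives $R(\theta, \delta_m) \le d$ for every $\theta$, which makes $\delta_m$ minimax as well. This is the general mechanism the abstract's hyperprior appears to exploit; the paper's own argument may differ in detail.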

Keywords: Bayesian neural networks, minimaxity, admissibility, statistical decision theory, quadratic loss, Kullback-Leibler loss, hyperprior, ReLU networks

272. ❌ Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions

Authors: Daniel Bloch Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04662v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

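Scoring rationale: This paper focuses on a new theoretical framework for reinforcement learning (RL), specifically for non-Markovian decision processes, path-dependent environments, and continuous-time stochastic processes. Its core contribution is Anticipatory Reinforcement Learning (ARL), which uses a signature-augmented manifold and a self-consistent field approach to handle path geometry when only a single trajectory is observed. All scoring keywords relate directly to large models, deep learning principles, or specific AI application areas (such as bioinformatics), whereas this paper does not involve large models, deep learning architectures, training methods, alignment techniques, inference optimization, agent systems, or AI-for-science applications. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes the Anticipatory Reinforcement Learning (ARL) framework, which lifts the state space into a signature-augmented manifold and uses a self-consistent field approach to address path-dependent geometry in non-Markovian decision processes. Under the single-observed-trajectory constraint, it reports lower computational complexity, reduced variance, and proactive risk management with improved policy stability in highly volatile continuous-time environments.

Abstract (Chinese translation)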

本文介绍了预期强化学习（Anticipatory Reinforcement Learning, ARL），这是一种新颖的框架，旨在弥合非马尔可夫决策过程与经典强化学习架构之间的差距，特别是在仅能观测到单一轨迹的约束条件下。在具有跳跃扩散和结构突变特征的环境中，传统的基于状态的方法往往无法捕捉准确预测所必需的路径依赖几何结构。我们通过将状态空间提升至一个特征增强流形来解决这一问题，其中过程的历史被嵌入为动态坐标。通过采用自洽场方法，智能体能够维持对未来路径律的预期代理，从而实现对期望收益的确定性评估。这种从随机分支到单次线性评估的转变，显著降低了计算复杂度和方差。我们证明该框架保留了基本的收缩性质，并确保即使在存在重尾噪声的情况下也能实现稳定的泛化。我们的结果表明，通过将强化学习建立在路径空间的拓扑特征之上，智能体能够在高度波动、连续时间的环境中实现主动的风险管理和卓越的策略稳定性。

Abstract

This paper introduces Anticipatory Reinforcement Learning (ARL), a novel framework designed to bridge the gap between non-Markovian decision processes and classical reinforcement learning architectures, specifically under the constraint of a single observed trajectory. In environments characterised by jump-diffusions and structural breaks, traditional state-based methods often fail to capture the essential path-dependent geometry required for accurate foresight. We resolve this by lifting the state space into a signature-augmented manifold, where the history of the process is embedded as a dynamical coordinate. By utilising a self-consistent field approach, the agent maintains an anticipated proxy of the future path-law, allowing for a deterministic evaluation of expected returns. This transition from stochastic branching to a single-pass linear evaluation significantly reduces computational complexity and variance. We prove that this framework preserves fundamental contraction properties and ensures stable generalisation even in the presence of heavy-tailed noise. Our results demonstrate that by grounding reinforcement learning in the topological features of path-space, agents can achieve proactive risk management and superior policy stability in highly volatile, continuous-time environments.

Keywords: Anticipatory Reinforcement Learning, non-Markovian decision processes, path-dependent geometry, signature-augmented manifold, self-consistent field approach, continuous-time environments, computational complexity reduction, policy stability

273. ❌ From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

Authors: Zhuohao Yu, Zhiwei Steven Wu, Adam Block Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04648v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

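Scoring rationale: The paper's core topic is reward hacking in BoN sampling. It proposes caution, a method based on the pessimism principle that uses an error model to lower reward estimates for atypical responses. It is highly relevant to "Large Language Models" (10 points) because the work concerns inference-time compute scaling for LLMs, and highly relevant to "RLHF" (10 points) because it involves reward models, reward hacking, and the pessimism principle in RL. Other keywords such as MoE, SLMs, and Scaling Laws have no direct connection to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

This paper studies reward hacking in BoN sampling and proposes caution, a pessimism-based method that trains an error model to lower reward estimates for atypical responses, mitigating reward hacking and improving generation quality.

Abstract (Chinese translation)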

推理时计算扩展已成为提升语言模型在广泛任务上性能的强大范式，但如何最优利用额外计算资源的问题仍未解决。一种流行的方法是BoN采样，即生成N个候选回答，依据奖励模型进行评分，并选择得分最高的回答。尽管这种方法能提升性能，但它容易受到奖励破解的影响——随着N值增大，性能反而下降，因为系统会选择那些利用奖励模型缺陷而非真正提升生成质量的回答。先前通过强化奖励模型或强分布正则化来缓解奖励破解的尝试，要么未能完全解决过度优化问题，要么过于保守而无法充分利用额外计算资源。本研究中，我们探索了强化学习中的悲观原则，该原则利用价值估计的置信下界来避免因奖励估计不确定而采取的分布外行动。我们提出的方法称为“谨慎”，可被视为“好奇心”的反向机制：好奇心将预测误差作为新颖性信号予以奖励，而谨慎则将预测误差作为分布不确定性的信号予以惩罚。具体而言，该方法在典型回答上训练误差模型，并利用其预测误差来降低非典型回答的奖励估计值。我们的大量实验评估表明，谨慎是一种简单且计算高效的方法，能显著缓解BoN采样中的奖励破解问题。我们还在简化的线性设定中提供了理论分析，证明谨慎方法在理论上优于标准BoN方法。综合而言，我们的研究不仅确立了谨慎作为解决奖励破解的实用方案，也为基于好奇心的方法可能成为大语言模型场景中通用的分布外检测技术提供了证据。

Abstract

Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is BoN sampling, where N candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as N increases due to the selection of responses that exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking, via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit additional compute. In this work, we explore the principle of pessimism in RL, which uses lower confidence bounds on value estimates to avoid OOD actions with uncertain reward estimates. Our approach, termed as caution, can be seen as the reverse of curiosity: where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can be a general OOD detection technique in LLM settings.
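The selection rule described in the abstract can be sketched as follows. Plain BoN takes the candidate with the highest reward; the cautious variant subtracts a penalty proportional to an error model's prediction error before taking the maximum. The subtraction form, the `penalty` coefficient, and the toy numbers are assumptions for illustration only, since the abstract does not give the exact formula, and `error_fn` stands in for an error model trained on typical responses.

```python
def best_of_n(candidates, reward_fn):
    """Standard BoN: return the candidate with the highest reward."""
    return max(candidates, key=reward_fn)


def cautious_best_of_n(candidates, reward_fn, error_fn, penalty=1.0):
    """BoN with a pessimistic score: reward minus penalty * prediction error.

    A large prediction error is treated as a sign that the response is
    atypical, so its reward estimate is trusted less.
    """
    return max(candidates,
               key=lambda c: reward_fn(c) - penalty * error_fn(c))


if __name__ == "__main__":
    # Toy candidates with precomputed (hypothetical) scores.
    candidates = [
        {"text": "clear answer", "reward": 0.80, "error": 0.05},
        {"text": "ok answer", "reward": 0.60, "error": 0.02},
        {"text": "reward-hacking answer", "reward": 0.95, "error": 0.90},
    ]

    def reward(c):
        return c["reward"]

    def error(c):
        return c["error"]

    print(best_of_n(candidates, reward)["text"])
    print(cautious_best_of_n(candidates, reward, error)["text"])
```

In the toy data the atypical response wins under plain BoN but loses once its prediction error is penalized, which mirrors the failure mode the abstract describes as N grows.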

Keywords: reward hacking, BoN sampling, pessimism, inference-time compute, reward model, caution, out-of-distribution detection, language model performance

274. ❌ Interpretation of Crystal Energy Landscapes with Kolmogorov-Arnold Networks

Authors: Gen Zu, Ning Mao, Claudia Felser, Yang Zhang Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04636v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

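Scoring rationale: This paper studies the use of Kolmogorov-Arnold Networks (KANs) for interpreting crystal energy landscapes and belongs to materials informatics. It is unrelated to most keywords (large-model techniques, training methods, inference optimization, agents, and so on). It has some relevance to "Mechanistic Interpretability OR Explainable AI" (5 points) because the paper emphasizes the interpretability of KANs, and it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points) because it applies AI directly to materials science, which falls under AI for Science.

!!! tip deepseek-chat TL;DR

This paper proposes an interpretable Kolmogorov-Arnold Network (KAN) framework for predicting and interpreting the formation energy, band gap, and work function of crystalline materials, achieving high-accuracy predictions and revealing chemical trends consistent with the periodic table and quantum mechanical principles.

Abstract (Chinese translation)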

表征晶体能量景观对于预测热力学稳定性、电子结构和功能行为至关重要。尽管机器学习（ML）能够实现快速的性质预测，但大多数模型的“黑箱”特性限制了其产生新科学见解的效用。在此，我们引入可解释的柯尔莫哥洛夫-阿诺德网络（KANs）作为弥合这一差距的框架。与使用固定激活函数的传统神经网络不同，KANs采用可学习的函数来揭示潜在的物理关系。我们开发了元素加权KAN，这是一个仅基于成分的模型，在预测大规模数据集的形成能、带隙和功函数方面达到了最先进的精度。至关重要的是，在没有任何显式物理约束的情况下，KANs通过嵌入分析、相关性研究和主成分分析，揭示了与元素周期表和量子力学原理一致的可解释化学趋势。这些结果表明，KANs提供了一个兼具高预测性能和科学可解释性的强大框架，为透明、基于化学的材料信息学建立了新范式。

Abstract

Characterizing crystalline energy landscapes is essential to predicting thermodynamic stability, electronic structure, and functional behavior. While machine learning (ML) enables rapid property predictions, the “black-box” nature of most models limits their utility for generating new scientific insights. Here, we introduce Kolmogorov-Arnold Networks (KANs) as an interpretable framework to bridge this gap. Unlike conventional neural networks with fixed activation functions, KANs employ learnable functions that reveal underlying physical relationships. We developed the Element-Weighted KAN, a composition-only model that achieves state-of-the-art accuracy in predicting formation energy, band gap, and work function across large-scale datasets. Crucially, without any explicit physical constraints, KANs uncover interpretable chemical trends aligned with the periodic table and quantum mechanical principles through embedding analysis, correlation studies, and principal component analysis. These results demonstrate that KANs provide a powerful framework with high predictive performance and scientific interpretability, establishing a new paradigm for transparent, chemistry-based materials informatics.

Keywords: Kolmogorov-Arnold Networks, crystal energy landscapes, interpretable machine learning, materials informatics, formation energy prediction, band gap, work function, scientific interpretability

275. ❌ Dynamic Free-Rider Detection in Federated Learning via Simulated Attack Patterns

Authors: Motoki Nakamura Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04611v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

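Scoring rationale: This paper focuses on detecting free-rider attacks in federated learning and proposes S2-WEF, a dynamic detection method based on simulated attack patterns. It covers federated learning, security attacks, model parameter analysis, clustering, and classification, but does not touch the keyword areas of large language models, deep learning principles, or AI for Science. None of the keywords, which concern large-model techniques, deep learning innovations, and AI-for-science applications, apply, so all receive 0 points.

!!! tip deepseek-chat TL;DR

To address the difficulty of detecting dynamic free-rider attacks in federated learning, this paper proposes S2-WEF, a detection method based on simulated attack patterns and deviation scores; experiments show it is more robust than existing methods.

Abstract (Chinese translation)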

联邦学习（FL）允许多个客户端通过聚合本地更新来协同训练全局模型，而无需共享私有数据。然而，FL常常面临“搭便车者”的挑战，这些客户端提交虚假模型参数而不进行实际训练，试图在不贡献的情况下获取全局模型。Chen等人提出了一种基于模型参数权重演化频率（WEF）的搭便车者检测方法。该检测方法是实际搭便车检测方法中的领先候选方案，因为它既不需要代理数据集，也不依赖预训练。然而，该方法难以检测“动态”搭便车者——即在早期轮次行为诚实、后期转为搭便车的客户端，尤其是在面对全局模型模仿攻击（如增量权重攻击及我们新提出的自适应WEF伪装攻击）时。本文提出一种新颖的检测方法S2-WEF，该方法在服务器端利用先前广播的全局模型模拟基于潜在全局模型攻击的WEF模式，并识别提交的WEF模式与模拟模式相似的客户端。为应对多种搭便车攻击策略，S2-WEF进一步将这种基于模拟的相似度分数与通过提交的WEF相互比较计算得出的偏差分数相结合，并通过二维聚类和单分数分类来区分良性客户端与搭便车者。该方法能够在无需代理数据集或预训练的情况下，动态检测在训练过程中转为搭便车行为的客户端。我们在三个数据集和五种攻击类型上进行了大量实验，结果表明S2-WEF相比现有方法具有更高的鲁棒性。

Abstract

Federated learning (FL) enables multiple clients to collaboratively train a global model by aggregating local updates without sharing private data. However, FL often faces the challenge of free-riders, clients who submit fake model parameters without performing actual training to obtain the global model without contributing. Chen et al. proposed a free-rider detection method based on the weight evolving frequency (WEF) of model parameters. This detection approach is a leading candidate for practical free-rider detection methods, as it requires neither a proxy dataset nor pre-training. Nevertheless, it struggles to detect “dynamic” free-riders who behave honestly in early rounds and later switch to free-riding, particularly under global-model-mimicking attacks such as the delta weight attack and our newly proposed adaptive WEF-camouflage attack. In this paper, we propose a novel detection method S2-WEF that simulates the WEF patterns of potential global-model-based attacks on the server side using previously broadcasted global models, and identifies clients whose submitted WEF patterns resemble the simulated ones. To handle a variety of free-rider attack strategies, S2-WEF further combines this simulation-based similarity score with a deviation score computed from mutual comparisons among submitted WEFs, and separates benign and free-rider clients by two-dimensional clustering and per-score classification. This method enables dynamic detection of clients that transition into free-riders during training without proxy datasets or pre-training. We conduct extensive experiments across three datasets and five attack types, demonstrating that S2-WEF achieves higher robustness than existing approaches.

Keywords: Federated Learning, Free-rider Detection, Dynamic Attack, WEF Pattern, Simulated Attack, Two-dimensional Clustering, Robustness Evaluation

276. ❌ Noisy Nonreciprocal Pairwise Comparisons: Scale Variation, Noise Calibration, and Admissible Ranking Regions

Authors: Jean-Pierre Magnot Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04588v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

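Scoring rationale: This paper studies nonreciprocity in pairwise comparison analysis and belongs to decision analysis and statistical modeling; it is unrelated to all of the large-model and deep learning keywords provided. It deals with classical statistical topics such as comparison matrices, scale variation, noise calibration, and ranking regions, and involves no artificial intelligence, machine learning, or deep learning techniques.

!!! tip deepseek-chat TL;DR

This paper studies scale variation and noise in nonreciprocal pairwise comparisons, proposes an additive model that separates scale variation from random perturbation, and develops methods for estimating the noise level and assessing the probabilities of ranking regions.

Abstract (Chinese translation)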

成对比较广泛应用于决策分析、偏好建模与评估问题中。在许多实际情境中，观测到的比较矩阵不具备互反性。这种互反性的缺失常被视为需要立即修正的缺陷。本文采用了一种不同的视角：部分非互反性可能反映了评估尺度的真实变异，而另一部分则源于随机扰动。
我们引入了一个加法模型，其中未知的潜在比较矩阵具有一致性但不一定满足互反性。互反分量承载全局排序信息，而对称分量则描述了可能的尺度变异。围绕这一结构化矩阵，我们添加了随机扰动，并展示了如何估计噪声水平、评估尺度变异是否保持适度，以及如何为严格成对比较意义下的可容许排序区域分配概率。我们还将此方法与直接投影至互反矩阵的粗暴处理方式进行了对比，后者会一次性消除所有对称信息。
本文采用高斯扰动模型并非因为人类决策完全符合高斯分布，而是因为观测到的判断误差往往源于许多微小效应的累积。在此背景下，中心极限定理为高斯噪声提供了自然的启发式论证。这使得我们能够在保持模型对决策问题可解释的同时，推导出显式估计量并进行概率评估。

Abstract

Pairwise comparisons are widely used in decision analysis, preference modeling, and evaluation problems. In many practical situations, the observed comparison matrix is not reciprocal. This lack of reciprocity is often treated as a defect to be corrected immediately. In this article, we adopt a different point of view: part of the nonreciprocity may reflect a genuine variation in the evaluation scale, while another part is due to random perturbations. We introduce an additive model in which the unknown underlying comparison matrix is consistent but not necessarily reciprocal. The reciprocal component carries the global ranking information, whereas the symmetric component describes possible scale variation. Around this structured matrix, we add a random perturbation and show how to estimate the noise level, assess whether the scale variation remains moderate, and assign probabilities to admissible ranking regions in the sense of strict ranking by pairwise comparisons. We also compare this approach with the brutal projection onto reciprocal matrices, which suppresses all symmetric information at once. The Gaussian perturbation model is used here not because human decisions are exactly Gaussian, but because observed judgment errors often result from the accumulation of many small effects. In such a context, the central limit principle provides a natural heuristic justification for Gaussian noise. This makes it possible to derive explicit estimators and probability assessments while keeping the model interpretable for decision problems.
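The split between a reciprocal and a symmetric component can be made concrete with a short sketch. It assumes that additive reciprocity means a[i][j] = -a[j][i] (for example after taking logarithms of ratio judgments), so the reciprocal part is the antisymmetric half of the matrix and the symmetric part is the remainder. The function names and the row-mean scoring are illustrative choices; the paper's estimators, noise calibration, and ranking-region probabilities are not reproduced here.

```python
def split_additive_comparisons(a):
    """Split a square additive comparison matrix into its reciprocal
    (antisymmetric) and symmetric parts, so that a = reciprocal + symmetric."""
    n = len(a)
    reciprocal = [[(a[i][j] - a[j][i]) / 2 for j in range(n)] for i in range(n)]
    symmetric = [[(a[i][j] + a[j][i]) / 2 for j in range(n)] for i in range(n)]
    return reciprocal, symmetric


def row_mean_scores(reciprocal):
    """Illustrative ranking scores: the mean of each row of the reciprocal part."""
    return [sum(row) / len(row) for row in reciprocal]


if __name__ == "__main__":
    # Hypothetical observed matrix; a[i][j] > 0 means item i is preferred to j.
    observed = [
        [0.0, 2.0, 3.0],
        [-1.0, 0.0, 1.0],
        [-3.0, 0.0, 0.0],
    ]
    reciprocal, symmetric = split_additive_comparisons(observed)
    print(row_mean_scores(reciprocal))
    print(symmetric)
```

Keeping only `reciprocal` corresponds to the direct projection onto reciprocal matrices that the abstract contrasts with its own approach; the `symmetric` matrix is the information that projection discards.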

Keywords: pairwise comparisons, nonreciprocal comparisons, scale variation, noise calibration, ranking regions, additive model, Gaussian perturbation, decision analysis
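
The decomposition described in the abstract can be sketched as follows. This is a minimal illustration, assuming comparisons are expressed on an additive scale where reciprocity means `a[i][j] == -a[j][i]`; the function names, the sample matrix, and the row-mean scoring rule are hypothetical and not taken from the paper. It omits the paper's noise model and probability assessments.

```python
def split_comparison_matrix(a):
    """Split an additive comparison matrix into reciprocal and symmetric parts.

    Assumes an additive scale, where reciprocity means a[i][j] == -a[j][i].
    """
    n = len(a)
    reciprocal = [[(a[i][j] - a[j][i]) / 2 for j in range(n)] for i in range(n)]
    symmetric = [[(a[i][j] + a[j][i]) / 2 for j in range(n)] for i in range(n)]
    return reciprocal, symmetric


def row_mean_scores(reciprocal):
    """Score each item by the mean of its row in the reciprocal part (assumed rule)."""
    n = len(reciprocal)
    return [sum(row) / n for row in reciprocal]


# Hypothetical observed matrix: not reciprocal, since a[0][1] != -a[1][0].
observed = [
    [0.0, 2.0, 3.0],
    [-1.0, 0.0, 1.0],
    [-3.0, -1.0, 0.0],
]
reciprocal, symmetric = split_comparison_matrix(observed)
scores = row_mean_scores(reciprocal)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

Keeping only `reciprocal` and discarding `symmetric` corresponds to the direct projection onto reciprocal matrices that the abstract contrasts with its own approach.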

277. ❌ SAIL: Scene-aware Adaptive Iterative Learning for Long-Tail Trajectory Prediction in Autonomous Vehicles

Authors: Bin Rao, Haicheng Liao, Chengyue Wang, Keqiang Li, Zhenning Li, Hai Yang Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04573v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

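Scoring rationale: The paper addresses long-tail trajectory prediction for autonomous vehicles and proposes SAIL, a framework that combines attribute-guided data augmentation, feature extraction, and an adaptive contrastive learning strategy. Although it applies AI to autonomous driving, every configured keyword targets large language models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on), reasoning methods (CoT, MCTS), or specific scientific domains (such as bioinformatics). The paper touches none of these, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes SAIL, a scene-aware adaptive iterative learning framework for autonomous-vehicle trajectory prediction in long-tail scenarios (rare but safety-critical events). Evaluated on the nuScenes and ETH/UCY datasets, it reduces prediction error by up to 28.8% on the hardest 1% of long-tail samples.

Abstract translation (Chinese)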

自动驾驶车辆（AVs）在多样化交通环境中依赖精确的轨迹预测以实现安全导航，然而现有模型在应对长尾场景——即那些罕见但安全关键、以突发机动、高碰撞风险和复杂交互为特征的事件——时仍面临困难。这些挑战源于数据不平衡、长尾轨迹定义不充分，以及优先学习常见行为而非低频次行为的学习策略欠佳。为解决这一问题，我们提出SAIL，一种创新框架，通过首先在三个关键属性维度（预测误差、碰撞风险和状态复杂性）上定义并建模轨迹，系统性地应对长尾问题。我们的方法进一步将属性引导的数据增强与特征提取过程，与一种高度自适应的对比学习策略相结合。该策略采用连续的余弦动量调度、相似度加权的困难负样本挖掘，以及基于动态特征聚类的伪标签生成机制。此外，它引入了一种聚焦机制，以强化在每个已识别类别中对困难正样本的学习。这一综合设计使SAIL能够出色地识别和预测多样且具有挑战性的长尾事件。在nuScenes和ETH/UCY数据集上的广泛评估表明，SAIL性能卓越，与最先进的基线方法相比，在最困难的1%长尾样本上预测误差降低了高达28.8%，同时在所有场景中保持具有竞争力的准确性。该框架推动了真实世界混合自动驾驶环境中可靠的AV轨迹预测技术发展。

Abstract

Autonomous vehicles (AVs) rely on accurate trajectory prediction for safe navigation in diverse traffic environments, yet existing models struggle with long-tail scenarios: rare but safety-critical events characterized by abrupt maneuvers, high collision risks, and complex interactions. These challenges stem from data imbalance, inadequate definitions of long-tail trajectories, and suboptimal learning strategies that prioritize common behaviors over infrequent ones. To address this, we propose SAIL, a novel framework that systematically tackles the long-tail problem by first defining and modeling trajectories across three key attribute dimensions: prediction error, collision risk, and state complexity. Our approach then synergizes an attribute-guided augmentation and feature extraction process with a highly adaptive contrastive learning strategy. This strategy employs a continuous cosine momentum schedule, similarity-weighted hard-negative mining, and a dynamic pseudo-labeling mechanism based on evolving feature clustering. Furthermore, it incorporates a focusing mechanism to intensify learning on hard-positive samples within each identified class. This comprehensive design enables SAIL to excel at identifying and forecasting diverse and challenging long-tail events. Extensive evaluations on the nuScenes and ETH/UCY datasets demonstrate SAIL’s superior performance, achieving up to 28.8% reduction in prediction error on the hardest 1% of long-tail samples compared to state-of-the-art baselines, while maintaining competitive accuracy across all scenarios. This framework advances reliable AV trajectory prediction in real-world, mixed-autonomy settings.

Keywords: autonomous vehicles, trajectory prediction, long-tail scenarios, contrastive learning, adaptive learning, scene-aware, collision risk, prediction error
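
The abstract names a continuous cosine momentum schedule but gives no formula. The sketch below shows the half-cosine form commonly used for momentum encoders in contrastive learning; that this matches SAIL's schedule is an assumption, as are the function names and the `base=0.99` and `final=1.0` values.

```python
import math


def cosine_momentum(step, total_steps, base=0.99, final=1.0):
    """Increase momentum smoothly from `base` to `final` along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final - (final - base) * (math.cos(math.pi * progress) + 1) / 2


def ema_update(target, online, momentum):
    """Move target parameters toward online parameters (exponential moving average)."""
    return [momentum * t + (1 - momentum) * o for t, o in zip(target, online)]
```

Early in training the momentum stays near `base`, so the target encoder tracks the online encoder quickly; it then approaches `final`, which freezes the target more and more.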

278. ❌ Generative Modeling under Non-Monotonic MAR Missingness via Approximate Wasserstein Gradient Flows

Authors: Gitte Kremling, Jeffrey Näf, Johannes Lederer Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04567v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

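Scoring rationale: The paper presents FLOWGEM, a generative modeling method for handling missing values that relies on Wasserstein gradient flows and KL divergence minimization, and it belongs to classical statistical machine learning. All scoring keywords relate directly to large models, deep learning techniques, or AI-for-science applications, whereas this paper involves none of these or their associated techniques (fine-tuning, alignment, inference optimization, and so on), so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes FLOWGEM, an iterative generative modeling method for data with non-monotone Missing at Random (MAR) missingness. It generates a complete dataset by minimizing KL divergence through a Wasserstein gradient flow and reports state-of-the-art performance on simulated and real-data benchmarks.

Abstract translation (Chinese)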

数据科学中缺失值的普遍存在对任何进一步分析都构成了重大风险。尽管已有大量研究，但处理一般非单调缺失性的原则性非参数方法仍然稀缺。实践中常使用临时插补方法，但这些方法能否恢复正确的分布尚不明确。本文提出FLOWGEM——一种从随机缺失（Missing at Random，MAR）数据集中生成完整数据集的原则性迭代方法。受忽略最大似然估计量收敛结果的启发，我们的方法最小化了观测数据分布与生成样本在不同缺失模式下的分布之间的期望KL散度（Kullback-Leibler divergence）。为最小化KL散度，我们采用对应Wasserstein梯度流的离散化粒子演化方案，其中速度场通过密度比的局部线性估计量进行近似。这一构建产生了一种数据生成方案，能够将初始粒子集合迭代地传输至目标分布。模拟研究和真实数据基准测试表明，FLOWGEM在一系列设定下（包括具有挑战性的非单调MAR机制案例）均达到了最先进的性能。这些结果共同确立了FLOWGEM作为现有插补方法的一种原则性实用替代方案，并为弥合理论严谨性与实证性能之间的差距迈出了决定性的一步。

Abstract

The prevalence of missing values in data science poses a substantial risk to any further analyses. Despite a wealth of research, principled nonparametric methods to deal with general non-monotone missingness are still scarce. Instead, ad-hoc imputation methods are often used, for which it remains unclear whether the correct distribution can be recovered. In this paper, we propose FLOWGEM, a principled iterative method for generating a complete dataset from a dataset with values Missing at Random (MAR). Motivated by convergence results of the ignoring maximum likelihood estimator, our approach minimizes the expected Kullback-Leibler (KL) divergence between the observed data distribution and the distribution of the generated sample over different missingness patterns. To minimize the KL divergence, we employ a discretized particle evolution of the corresponding Wasserstein Gradient Flow, where the velocity field is approximated using a local linear estimator of the density ratio. This construction yields a data generation scheme that iteratively transports an initial particle ensemble toward the target distribution. Simulation studies and real-data benchmarks demonstrate that FLOWGEM achieves state-of-the-art performance across a range of settings, including the challenging case of non-monotonic MAR mechanisms. Together, these results position FLOWGEM as a principled and practical alternative to existing imputation methods, and a decisive step towards closing the gap between theoretical rigor and empirical performance.

Keywords: missing values, generative modeling, Wasserstein gradient flow, KL divergence, non-monotone MAR, imputation, data generation, particle evolution

279. ❌ Safe and Near-Optimal Gate Control: A Case Study from the Danish West Coast

Authors: Martin Kristjansen, Kim Guldstrand Larsen, Marius Mikučionis, Christian Schilling Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04545v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

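Scoring rationale: The paper studies gate control at Ringkoebing Fjord on the Danish west coast. It builds a digital twin with the tool Uppaal Stratego and learns a gate controller online from sea-level and wind-speed forecasts. The work belongs to control engineering and environmental management and is unrelated to any of the configured keywords on large models, deep learning, or AI-for-science applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper studies safe and near-optimal control of the gates at Ringkoebing Fjord in Denmark. Using a digital twin and controllers learned online, it satisfies the safety requirements, which the baseline controller does not, while performing similarly to the baseline on the other requirements.

Abstract translation (Chinese)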

灵克宾湾是丹麦西海岸的一个内陆水域盆地，通过一组用于控制进出海湾水量的闸门与北海隔开。目前，由人工操作员决定何时开启或关闭多少闸门以调控海湾水位，其目标在于满足一系列相互冲突的安全与性能要求，例如将水位维持在目标范围内、保障海上交通通行以及促进鱼类洄游。本文提出了一种基于数字孪生（digital twin）和在线强化学习（online reinforcement learning）的自动闸门控制方法。我们首先使用Uppaal Stratego工具构建了灵克宾湾的数字孪生模型。随后，结合海平面和风速的预测数据，以在线方式训练出一个闸门控制器。我们在不同海平面情景下评估了学习得到的控制器，这些情景代表了正常潮汐活动、高水位和低水位条件。评估结果表明，与基线控制器不同，学习得到的控制器在满足安全要求的同时，在其他性能要求方面也表现出相近的水平。

Abstract

Ringkoebing Fjord is an inland water basin on the Danish west coast separated from the North Sea by a set of gates used to control the amount of water entering and leaving the fjord. Currently, human operators decide when and how many gates to open or close for controlling the fjord’s water level, with the goal to satisfy a range of conflicting safety and performance requirements such as keeping the water level in a target range, allowing maritime traffic, and enabling fish migration. We build a digital twin of the fjord using the tool Uppaal Stratego. We then use this digital twin along with forecasts of the sea level and the wind speed to learn a gate controller in an online fashion. We evaluate the learned controllers under different sea-level scenarios, representing normal tidal behavior, high waters, and low waters. Our evaluation demonstrates that, unlike a baseline controller, the learned controllers satisfy the safety requirements, while performing similarly regarding the other requirements.

Keywords: gate control, water level management, digital twin, Uppaal Stratego, online learning, safety requirements, coastal engineering, environmental control

280. ❌ Beyond Imbalance Ratio: Data Characteristics as Critical Moderators of Oversampling Method Selection

Authors: Yuwen Jiang, Songyun Ye Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04541v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

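Scoring rationale: The paper studies class imbalance in machine learning, specifically the choice of oversampling method. It uses controlled experiments to test the relationship between imbalance ratio (IR) and oversampling effectiveness, and proposes a framework that accounts for IR, class separability, and cluster structure. The content is confined to classical machine learning (Gaussian mixture data, oversampling techniques, analysis of dataset characteristics) and involves no large language models, deep learning techniques, AI-for-science applications, or other concepts covered by the keywords, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

Through controlled experiments, the paper finds that once data characteristics are held constant, imbalance ratio (IR) shows a weak to moderate negative correlation with oversampling benefit, while class separability is a stronger moderator. It proposes a framework that integrates IR, class separability, and cluster structure to guide the choice of oversampling method.

Abstract translation (Chinese)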

主流的不平衡比率阈值范式假定不平衡比率（IR）与过采样效果呈正相关，但这一假设尚未通过受控实验得到实证支持。我们进行了12项受控实验（涉及超过100个数据集变体），通过高斯混合数据集的算法生成，在保持数据特征（类可分性、聚类结构）恒定的前提下系统操纵IR。另有两项验证实验考察了天花板效应和指标依赖性。所有方法均在OpenML的17个真实数据集上进行了评估。在控制混杂变量后，IR与过采样效益呈现弱至中度的负相关。类可分性被证明是显著更强的调节变量，其解释方法效果变异的能力远超IR本身。我们提出了一个“情境重要性”框架，该框架整合IR、类可分性与聚类结构，为实践者提供基于证据的方法选择标准。

Abstract

The prevailing IR-threshold paradigm posits a positive correlation between imbalance ratio (IR) and oversampling effectiveness, yet this assumption remains empirically unsubstantiated through controlled experimentation. We conducted 12 controlled experiments (N > 100 dataset variants) that systematically manipulated IR while holding data characteristics (class separability, cluster structure) constant via algorithmic generation of Gaussian mixture datasets. Two additional validation experiments examined ceiling effects and metric-dependence. All methods were evaluated on 17 real-world datasets from OpenML. Upon controlling for confounding variables, IR exhibited a weak to moderate negative correlation with oversampling benefits. Class separability emerged as a substantially stronger moderator, accounting for significantly more variance in method effectiveness than IR alone. We propose a ‘Context Matters’ framework that integrates IR, class separability, and cluster structure to provide evidence-based selection criteria for practitioners.

Keywords: imbalance ratio, oversampling, class separability, data characteristics, Gaussian mixture, controlled experiments, method selection, cluster structure
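
For readers unfamiliar with the quantities involved, here is a minimal sketch of the imbalance ratio and of the simplest oversampling method. It assumes the usual definition of IR as majority-class count divided by minority-class count, which the abstract does not spell out. Plain random oversampling stands in for the unnamed methods the paper evaluates, and both function names are hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes up to the majority size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y
```

With 9 majority and 3 minority samples, `imbalance_ratio` returns 3.0 before oversampling and 1.0 after. The paper's point is that IR alone is a weak guide to whether this step helps.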

281. ❌ FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Authors: Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, I Made Aswin Nahendra, Takuma Seno, Sehee Min, Daniel Palenicek, Florian Vogt, Danica Kragic, Jan Peters, Jaegul Choo, Hojoon Lee Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04539v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

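Scoring rationale: The paper focuses on algorithmic innovation in reinforcement learning (RL), specifically for robot control. It is unrelated to the vast majority of keywords, which target large language models (LLMs) and their associated techniques (training, alignment, inference, deployment, and so on). Only two keywords are weakly related: 1) "Scaling Laws AND Data Quality": the paper says it is motivated by scaling laws observed in supervised learning, but it does not examine data quality and its domain is RL rather than LLMs, so relevance is 5/10. 2) "AI for Science OR Bioinformatics OR Cheminformatics": the paper concerns robot control, an application of AI to science and engineering, but not bioinformatics or cheminformatics, so relevance is 5/10. No other keyword is touched, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper proposes FlashSAC, which addresses the slow convergence and instability of off-policy reinforcement learning in high-dimensional robot control. By using fewer gradient updates, larger models, and higher data throughput, it improves final performance and training efficiency across more than 60 tasks.

Abstract translation (Chinese)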

强化学习（RL）是在缺乏专家示范时实现机器人控制的核心方法。诸如近端策略优化（PPO）这类同策略方法因其稳定性而被广泛采用，但其对分布狭窄的同策略数据的依赖，限制了在高维状态与动作空间中进行精确策略评估的能力。异策略方法能够通过从更广泛的状态-动作分布中学习来克服这一局限，却面临收敛缓慢与不稳定的问题——这是因为在多样化的数据上拟合价值函数需要大量梯度更新，导致通过自举法产生的评论家误差不断累积。本文提出FlashSAC，一种基于柔性演员-评论家（Soft Actor-Critic）框架的快速稳定异策略强化学习算法。受监督学习中观察到的缩放定律启发，FlashSAC大幅减少了梯度更新次数，同时以更大的模型和更高的数据吞吐量作为补偿。为了在扩大规模时保持稳定性，FlashSAC显式地约束了权重、特征及梯度的范数，从而抑制评论家误差的累积。在涵盖10种模拟器的超过60项任务中，FlashSAC在最终性能与训练效率上均持续优于PPO及多种强大的异策略基线方法，并在灵巧操作等高维任务中取得了最显著的提升。在从仿真到现实的人形机器人运动控制任务中，FlashSAC将训练时间从数小时缩短至数分钟，展现了异策略强化学习在仿真到现实迁移应用中的潜力。

Abstract

Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.

Keywords: Reinforcement Learning, Robot Control, Off-policy RL, Soft Actor-Critic, High-dimensional Tasks, Training Efficiency, Sim-to-real Transfer, Dexterous Manipulation
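
The abstract says FlashSAC bounds weight, feature, and gradient norms but does not say how. The sketch below illustrates the general idea with generic L2 clipping and projection. It is not FlashSAC's actual mechanism, and the function names and thresholds are hypothetical.

```python
import math


def clip_l2_norm(vector, max_norm):
    """Rescale `vector` so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm <= max_norm or norm == 0.0:
        return list(vector)
    scale = max_norm / norm
    return [v * scale for v in vector]


def bounded_sgd_step(weights, grads, lr=0.1, max_grad_norm=1.0, max_weight_norm=5.0):
    """One SGD step with a clipped gradient, then projection of weights into an L2 ball."""
    grads = clip_l2_norm(grads, max_grad_norm)
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return clip_l2_norm(updated, max_weight_norm)
```

Bounding both quantities limits how far a single noisy critic update can move the parameters, which is the kind of error accumulation through bootstrapping that the abstract describes.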

282. ❌ Learning from Equivalence Queries, Revisited

Authors: Mark Braverman, Roi Livni, Yishay Mansour, Shay Moran, Kobbi Nissim Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04535v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

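Scoring rationale: The paper studies the equivalence-query learning model from classical machine learning theory, mainly bounds on the number of learning rounds against symmetric adversaries, and belongs to theoretical computer science and learning theory. It involves none of the techniques, applications, or innovations covered by the keywords on large models, deep learning, or AI for Science, so every keyword is unrelated to its topic.

!!! tip deepseek-chat TL;DR

The paper revisits Angluin's model of learning from equivalence queries. By introducing symmetric adversarial environments, it obtains tight bounds on the number of learning rounds under both full-information and bandit feedback.

Abstract translation (Chinese)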

现代机器学习系统（如生成模型与推荐系统）通常遵循部署、用户交互与周期性模型更新的循环演进模式。这与标准监督学习框架不同，后者侧重于在固定序列的预测任务上最小化损失或遗憾。受此场景启发，我们重新审视了由Angluin（1988）提出的经典等价查询学习模型。在该模型中，学习者反复提出假设，并在部署的假设不适用时接收反例。然而，在完全对抗性的反例生成机制下，该模型可能过于悲观。此外，已有研究大多假设一种“全信息”场景，即学习者还能观察到反例的正确标签，这一假设在实际中并非总是自然成立。
为解决这些问题，我们将环境限制在一类广泛存在的、对抗性较弱的反例生成器上，称之为“对称”生成器。非正式地说，此类生成器仅根据假设与目标之间的对称差异来选择反例。这一类别涵盖了多种自然机制，例如随机反例生成（Angluin and Dohrn, 2017; Bhatia, 2021; Chase, Freitag, and Reyzin, 2024），以及根据预设复杂度度量返回最简反例的生成器。在此框架下，我们研究了全信息与赌博机反馈两种情境下的等价查询学习。我们在这两种设定下获得了关于学习轮数的紧致界，并指出了未来工作的方向。我们的分析结合了对称对抗者的博弈论视角、自适应加权方法以及极小极大论证。

Abstract

Modern machine learning systems, such as generative models and recommendation systems, often evolve through a cycle of deployment, user interaction, and periodic model updates. This differs from standard supervised learning frameworks, which focus on loss or regret minimization over a fixed sequence of prediction tasks. Motivated by this setting, we revisit the classical model of learning from equivalence queries, introduced by Angluin (1988). In this model, a learner repeatedly proposes hypotheses and, when a deployed hypothesis is inadequate, receives a counterexample. Under fully adversarial counterexample generation, however, the model can be overly pessimistic. In addition, most prior work assumes a full-information setting, where the learner also observes the correct label of the counterexample, an assumption that is not always natural. We address these issues by restricting the environment to a broad class of less adversarial counterexample generators, which we call symmetric. Informally, such generators choose counterexamples based only on the symmetric difference between the hypothesis and the target. This class captures natural mechanisms such as random counterexamples (Angluin and Dohrn, 2017; Bhatia, 2021; Chase, Freitag, and Reyzin, 2024), as well as generators that return the simplest counterexample according to a prescribed complexity measure. Within this framework, we study learning from equivalence queries under both full-information and bandit feedback. We obtain tight bounds on the number of learning rounds in both settings and highlight directions for future work. Our analysis combines a game-theoretic view of symmetric adversaries with adaptive weighting methods and minimax arguments.

Keywords: equivalence queries, learning theory, counterexample generation, symmetric adversaries, full-information feedback, bandit feedback, minimax analysis, game-theoretic view
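
The interaction protocol in the abstract can be sketched as a short loop. This toy version assumes a finite hypothesis class that contains the target, binary labels, and a uniformly random counterexample drawn from the symmetric difference (one of the symmetric generators the abstract mentions). The learner simply proposes the first hypothesis still consistent with the feedback; it is not the paper's algorithm, and the function name is hypothetical.

```python
import random


def learn_with_equivalence_queries(hypotheses, target, seed=0):
    """Learn `target` by proposing hypotheses and receiving random counterexamples.

    Each hypothesis is a frozenset of the points it labels positive.
    Assumes `target` is one of `hypotheses`. Returns the final hypothesis
    and the number of rounds used.
    """
    rng = random.Random(seed)
    version_space = list(hypotheses)
    rounds = 0
    while True:
        guess = version_space[0]
        rounds += 1
        # Symmetric generator: the counterexample depends only on guess XOR target.
        disagreement = sorted(guess ^ target)
        if not disagreement:
            return guess, rounds
        x = rng.choice(disagreement)
        label = x in target  # full-information feedback: the true label is revealed
        version_space = [h for h in version_space if (x in h) == label]
```

Each counterexample removes at least the current guess from the version space, so the loop ends within `len(hypotheses)` rounds. The paper's contribution is tight bounds on this round count, which the sketch does not attempt to match.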

283. ❌ SLSREC: Self-Supervised Contrastive Learning for Adaptive Fusion of Long- and Short-Term User Interests

Authors: Wei Zhou, Yue Shen, Junkai Ji, Yinglan Feng, Xing Tang, Xiuqiang He, Liang Feng, Zexuan Zhu Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04530v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

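Scoring rationale: The paper presents SLSRec, a session-based recommendation model that uses self-supervised contrastive learning to fuse long- and short-term user interests. Although it is an AI application, every scoring keyword targets large-model or deep learning techniques (LLM architectures, training methods, inference optimization, and so on) or AI applications in specific scientific domains (such as bioinformatics). The paper studies conventional recommendation algorithms and involves none of these, so it is unrelated to all keywords.

!!! tip deepseek-chat TL;DR

The paper proposes SLSRec, a session-based recommendation model that disentangles users' long- and short-term interests with self-supervised contrastive learning and aggregates them adaptively with an attention-based fusion network. It outperforms state-of-the-art models on three public datasets and is more robust.

Abstract translation (Chinese)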

用户兴趣通常涵盖长期偏好与短期意图，这反映了用户行为在不同时间尺度上的动态特性。用户交互行为在时间分布上的不均衡性凸显了兴趣的演化模式，使得利用完整历史行为准确捕捉兴趣变化面临挑战。为此，我们提出SLSRec，一种融合长期与短期推荐的创新会话模型，该模型通过按时间划分历史行为，有效捕捉用户兴趣的时间动态。与传统模型将长短期用户兴趣合并为单一表示从而损害推荐准确性不同，SLSRec采用自监督学习框架来解耦这两类兴趣。我们引入对比学习策略以确保长短期兴趣表征的精确校准。此外，设计了一个基于注意力的融合网络来自适应聚合兴趣表征，优化其整合方式以提升推荐性能。在三个公共基准数据集上的大量实验表明，SLSRec持续优于最先进的模型，并在多种场景下展现出卓越的鲁棒性。论文录用后我们将公开全部源代码。

Abstract

User interests typically encompass both long-term preferences and short-term intentions, reflecting the dynamic nature of user behaviors across different timeframes. The uneven temporal distribution of user interactions highlights the evolving patterns of interests, making it challenging to accurately capture shifts in interests using comprehensive historical behaviors. To address this, we propose SLSRec, a novel Session-based model with the fusion of Long- and Short-term Recommendations that effectively captures the temporal dynamics of user interests by segmenting historical behaviors over time. Unlike conventional models that combine long- and short-term user interests into a single representation, compromising recommendation accuracy, SLSRec utilizes a self-supervised learning framework to disentangle these two types of interests. A contrastive learning strategy is introduced to ensure accurate calibration of long- and short-term interest representations. Additionally, an attention-based fusion network is designed to adaptively aggregate interest representations, optimizing their integration to enhance recommendation performance. Extensive experiments on three public benchmark datasets demonstrate that SLSRec consistently outperforms state-of-the-art models while exhibiting superior robustness across various scenarios. We will release all source code upon acceptance.

Keywords: Session-based Recommendation, Long-term Interests, Short-term Interests, Self-supervised Learning, Contrastive Learning, Attention-based Fusion, User Behavior Modeling

284. ❌ Isokinetic Flow Matching for Pathwise Straightening of Generative Flows

Authors: Tauhid Khan Source: arxiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04491v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

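Scoring rationale: This paper focuses on improving Flow Matching for generative models, proposing Isokinetic Flow Matching to reduce path curvature and speed up sampling. The scoring keywords all concern large language models, deep learning techniques, or AI-for-science applications, whereas this paper studies generative flow models (specifically diffusion models / flow matching). It does not touch on large language models, MoE, alignment, reasoning, agents, compression, or similar topics, nor is it applied to scientific fields such as bioinformatics. All keyword relevance scores are therefore 0.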

!!! tip deepseek-chat TL;DR

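To address the high curvature caused by trajectory superposition in Flow Matching, the paper proposes Isokinetic Flow Matching, which penalizes pathwise acceleration to markedly improve few-step generation quality, achieving a 2.9x relative efficiency gain on CIFAR-10.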

Abstract (Chinese translation)

流匹配（Flow Matching，FM）通过构建线性条件概率路径进行建模，但由于轨迹叠加效应，学习得到的边际速度场不可避免地呈现强曲率特性。这种曲率会显著放大数值截断误差，成为限制少步采样的关键瓶颈。为克服此问题，我们提出等动能流匹配（Isokinetic Flow Matching，Iso-FM）：一种轻量级、无需雅可比矩阵计算的动力学正则化器，可直接对路径加速度进行惩罚。该方法通过采用物质导数Dv/Dt的自引导有限差分近似，在无需辅助编码器或昂贵二阶自动微分的情况下，实现了局部速度一致性约束。作为可即插即用的单阶段FM训练增强模块，Iso-FM显著提升了少步生成性能。在CIFAR-10数据集（DiT-S/2架构）的实验中，Iso-FM将2步采样时的条件非最优传输FID（conditional non-OT FID，即Fréchet Inception Distance）从78.82大幅降低至27.13，相对效率提升达2.9倍，并在4步采样时达到最佳观测FID值10.23。这些结果表明，加速度正则化是构建高效快速生成采样机制的一种原理清晰且计算高效的方法。

Abstract

Flow Matching (FM) constructs linear conditional probability paths, but the learned marginal velocity field inevitably exhibits strong curvature due to trajectory superposition. This curvature severely inflates numerical truncation errors, bottlenecking few-step sampling. To overcome this, we introduce Isokinetic Flow Matching (Iso-FM), a lightweight, Jacobian-free dynamical regularizer that directly penalizes pathwise acceleration. By using a self-guided finite-difference approximation of the material derivative Dv/Dt, Iso-FM enforces local velocity consistency without requiring auxiliary encoders or expensive second-order autodifferentiation. Operating as a pure plug-and-play addition to single-stage FM training, Iso-FM dramatically improves few-step generation. On CIFAR-10 (DiT-S/2), Iso-FM slashes conditional non-OT FID at 2 steps from 78.82 to 27.13 - a 2.9x relative efficiency gain - and reaches a best-observed FID at 4 steps of 10.23. These results firmly establish acceleration regularization as a principled, compute-efficient mechanism for fast generative sampling.
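
The finite-difference idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the step size `h`, the function names, and the two toy velocity fields are assumptions made for illustration. It only uses the identity Dv/Dt = ∂v/∂t + (v·∇)v and approximates it by evaluating the velocity field a short step further along its own flow.

```python
import numpy as np

def material_derivative_fd(velocity, x, t, h=1e-3):
    """Finite-difference estimate of Dv/Dt along the field's own flow.

    Dv/Dt = dv/dt + (v . grad) v  ~  [v(x + h*v(x, t), t + h) - v(x, t)] / h
    No Jacobian or second-order autodiff is needed, only two evaluations.
    """
    v_now = velocity(x, t)
    x_next = x + h * v_now            # step each sample along its own velocity
    v_next = velocity(x_next, t + h)
    return (v_next - v_now) / h

def acceleration_penalty(velocity, x, t, h=1e-3):
    """Mean squared pathwise acceleration over a batch of points."""
    accel = material_derivative_fd(velocity, x, t, h)
    return float(np.mean(np.sum(accel ** 2, axis=-1)))

# Hypothetical toy fields, standing in for a learned velocity network.
def constant_field(x, t):
    return np.ones_like(x)            # straight paths, zero acceleration

def linear_field(x, t):
    return x                          # paths accelerate: Dv/Dt = x

rng = np.random.default_rng(0)
points = rng.normal(size=(4, 2))
straight = acceleration_penalty(constant_field, points, t=0.0)
curved = acceleration_penalty(linear_field, points, t=0.0)
```

In a training loop, a term like this would presumably be added to the usual FM loss with some weight; the weight and whether gradients flow through the displaced point are details the abstract does not specify.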

Keywords: Flow Matching, Isokinetic Flow Matching, generative flows, few-step sampling, acceleration regularization, numerical truncation errors, pathwise straightening, velocity consistency

285. ❌ Generative modeling of granular flow on inclined planes using conditional flow matching

Authors: Xuyang Li, Rui Li, Teng Man, Yimin Lu Venue/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04453v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

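Scoring rationale: The paper studies a generative model that uses conditional flow matching (CFM) to reconstruct granular flows, which is an application of AI to science (specifically computational physics / granular systems). It is entirely unrelated to the vast majority of keywords (LLMs, MoE, alignment, reasoning, agents, and so on), because those keywords refer specifically to large language models and related techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to science (physical simulation) but not to bioinformatics or cheminformatics, so it receives a moderate relevance of 5. The paper's core is the application of a generative model (CFM) to a specific physical inverse problem, not an innovation in large-model techniques or their principles.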

!!! tip deepseek-chat TL;DR

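The paper proposes what the authors describe as the first generative framework based on conditional flow matching (CFM) for reconstructing the interior velocity fields and stress states of granular flows from sparse boundary observations. It outperforms a deterministic baseline under severely underdetermined conditions and provides uncertainty estimates.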

Abstract (Chinese translation)

颗粒流主导着众多自然与工业过程，但其内部运动学与力学机制在很大程度上仍无法直接观测，因为实验仅能获取边界或自由表面的信息。传统数值模拟方法在快速逆向重构中计算成本高昂，而确定性模型在不适定场景下往往退化为过度平滑的平均预测。本研究据作者所知，首次提出了基于稀疏边界观测的颗粒流重构条件流匹配框架。该生成模型通过高保真颗粒解析离散元模拟数据进行训练，在推理阶段由可微分前向算子与稀疏感知梯度引导机制共同指导，该机制无需超参数调优即可保证测量一致性，并防止在非物料区域产生非物理速度预测。物理解码器将重构的速度场映射至应力状态与能量波动量，包括平均应力、偏应力及颗粒温度。该框架能够从完整观测到仅占信息窗口16%的稀疏数据中准确恢复内部流场，并在空间分辨率大幅稀释至仅保留11%数据的条件下仍保持有效性。在最不适定的重构场景中，其性能优于确定性卷积神经网络基线，并通过集成生成提供空间分辨的不确定性估计。这些结果表明，条件生成建模为颗粒介质中隐藏体相力学的非侵入式推断提供了实用路径，在颗粒与多相系统的逆问题中具有更广泛的适用性。

Abstract

Granular flows govern many natural and industrial processes, yet their interior kinematics and mechanics remain largely unobservable, as experiments access only boundaries or free surfaces. Conventional numerical simulations are computationally expensive for fast inverse reconstruction, and deterministic models tend to collapse to over-smoothed mean predictions in ill-posed settings. This study, to the best of the authors’ knowledge, presents the first conditional flow matching (CFM) framework for granular-flow reconstruction from sparse boundary observations. Trained on high-fidelity particle-resolved discrete element simulations, the generative model is guided at inference by a differentiable forward operator with a sparsity-aware gradient guidance mechanism, which enforces measurement consistency without hyperparameter tuning and prevents unphysical velocity predictions in non-material regions. A physics decoder maps the reconstructed velocity fields to stress states and energy fluctuation quantities, including mean stress, deviatoric stress, and granular temperature. The framework accurately recovers interior flow fields from full observation to only 16% of the informative window, and it remains effective under strongly diluted spatial resolution with only 11% of data. It also outperforms a deterministic CNN baseline in the most ill-posed reconstruction regime and provides spatially resolved uncertainty estimates through ensemble generation. These results demonstrate that conditional generative modeling offers a practical route for non-invasive inference of hidden bulk mechanics in granular media, with broader applicability for inverse problems in particulate and multiphase systems.

Keywords: granular flow, conditional flow matching, generative modeling, inverse reconstruction, particle-resolved simulation, sparse boundary observations, physics decoder, uncertainty estimation

286. ❌ TinyNina: A Resource-Efficient Edge-AI Framework for Sustainable Air Quality Monitoring via Intra-Image Satellite Super-Resolution

Authors: Prasanjit Dey, Zachary Yahn, Bianca Schoen-Phelan, Soumyabrata Dev Venue/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04445v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

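Scoring rationale: TinyNina studies satellite image super-resolution for air quality monitoring, which falls under AI for Science applications (highly relevant, 10). Its core is a resource-efficient edge-AI framework involving a lightweight model (51K parameters), reduced computation (95% less overhead), and faster inference (47x), so it has some connection to Small Language Models/On-device AI (8), Quantization/Model Compression (8), and Speculative Decoding/Inference Acceleration (8). The remaining keywords mainly concern the technical principles of large language models and are unrelated to this paper's computer vision and edge-computing application (0).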

!!! tip deepseek-chat TL;DR

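The paper proposes the TinyNina framework, which performs satellite image super-resolution through a novel intra-image learning paradigm. Its lightweight 51K-parameter model reaches an MAE of 7.4 μg/m³ for air quality monitoring while cutting computational overhead by 95% and speeding up inference 47x.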

Abstract (Chinese translation)

二氧化氮（NO$_2$）是一种主要的大气污染物，是导致呼吸系统疾病和城市气候相关挑战的重要因素。尽管哨兵-2号（Sentinel-2）等卫星平台提供了全球覆盖，但其固有的空间分辨率往往限制了进行精细化NO$_2$评估所需的精度。为解决这一问题，我们提出了TinyNina——一个专为可持续环境监测设计的资源高效型边缘人工智能（Edge-AI）框架。TinyNina采用了一种新颖的图像内学习范式，该范式利用哨兵-2号的多光谱层级结构作为内部训练标签，从而有效消除了对成本高昂且通常难以获取的外部高分辨率参考数据集的依赖。该框架结合了波长特异性注意力门控和深度可分离卷积，在保持仅51K参数的超轻量级模型体积的同时，保留了污染物敏感的光谱特征。基于3276组匹配的卫星-地面站点对进行验证的实验结果表明，TinyNina实现了7.4 $μ$g/m$^3$的平均绝对误差（Mean Absolute Error, MAE），达到了当前最优水平。与EDSR和RCAN等高容量模型相比，此性能意味着计算开销降低了95%，推理速度提升了47倍。通过优先考虑任务特定效用和架构效率，TinyNina为智慧城市基础设施中的实时空气质量监测提供了一个可扩展、低延迟的解决方案。

Abstract

Nitrogen dioxide (NO$_2$) is a primary atmospheric pollutant and a significant contributor to respiratory morbidity and urban climate-related challenges. While satellite platforms like Sentinel-2 provide global coverage, their native spatial resolution often limits the precision required for fine-grained NO$_2$ assessment. To address this, we propose TinyNina, a resource-efficient Edge-AI framework specifically engineered for sustainable environmental monitoring. TinyNina implements a novel intra-image learning paradigm that leverages the multi-spectral hierarchy of Sentinel-2 as internal training labels, effectively eliminating the dependency on costly and often unavailable external high-resolution reference datasets. The framework incorporates wavelength-specific attention gates and depthwise separable convolutions to preserve pollutant-sensitive spectral features while maintaining an ultra-lightweight footprint of only 51K parameters. Experimental results, validated against 3,276 matched satellite-ground station pairs, demonstrate that TinyNina achieves a state-of-the-art Mean Absolute Error (MAE) of 7.4 $\mu$g/m$^3$. This performance represents a 95% reduction in computational overhead and 47$\times$ faster inference compared to high-capacity models such as EDSR and RCAN. By prioritizing task-specific utility and architectural efficiency, TinyNina provides a scalable, low-latency solution for real-time air quality monitoring in smart city infrastructures.
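
To see why depthwise separable convolutions help keep a model in the tens of thousands of parameters, the short sketch below compares parameter counts for a single layer. The layer sizes (64 input channels, 64 output channels, 3x3 kernel) are hypothetical and are not taken from TinyNina; only the counting formulas are standard.

```python
def standard_conv_params(c_in, c_out, k, bias=True):
    """Parameters of a standard k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def depthwise_separable_params(c_in, c_out, k, bias=True):
    """Parameters of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k + (c_in if bias else 0)   # one k x k filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)  # 1 x 1 conv that mixes channels
    return depthwise + pointwise

# Hypothetical layer shape, chosen only for illustration.
std = standard_conv_params(64, 64, 3)
sep = depthwise_separable_params(64, 64, 3)
ratio = std / sep
print(f"standard: {std}, separable: {sep}, reduction: {ratio:.1f}x")
```

For this layer shape the separable version needs roughly an eighth of the parameters, and the saving grows with kernel size and channel count.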

Keywords: Edge-AI, satellite super-resolution, air quality monitoring, resource-efficient, intra-image learning, ultra-lightweight model, computational efficiency, real-time inference

287. ❌ Explainable Autonomous Cyber Defense using Adversarial Multi-Agent Reinforcement Learning

Authors: Yiyao Zhang, Diksha Goel, Hussain Ahmad Venue/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04442v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

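Scoring rationale: The paper studies a multi-agent reinforcement learning system for autonomous cyber defense. It is highly relevant to 'Multi-agent Systems OR Agent Coordination' (10), since its core is an adversarial reinforcement learning framework in which two agents (Blue-Team and Red-Team) are coordinated. It has some connection to 'Mechanistic Interpretability OR Explainable AI' (8), because it proposes an Explainability-Transparency Score as an explainability interface. All other keywords concern large models, language models, training methods, or reasoning techniques and are unrelated, so they score 0. The paper is an AI application in cybersecurity rather than an innovation in large-model or deep learning principles, and it does not involve applications in scientific fields such as biomedicine.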

!!! tip deepseek-chat TL;DR

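The paper proposes the Causal Multi-Agent Decision Framework (C-MADF), which combines causal modeling with adversarial dual-policy reinforcement learning to address false positives caused by ambiguous or adversarial inputs in autonomous cyber defense. On the CICIoT2023 dataset it lowers the false-positive rate from 8.4%-11.2% for the baselines to 1.8%, while achieving 0.997 precision, 0.961 recall, and a 0.979 F1-score.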

Abstract (Chinese translation)

自主智能体在攻防两端网络行动中的部署日益增多，在关键基础设施环境中形成了高速闭环交互。高级持续性威胁（APT）行为体利用“离地生存”技术和定向遥测扰动，在监控系统中制造模糊性，导致自动化防御系统过度反应或将良性行为误判为恶意活动。现有的单体式与多智能体防御管道主要基于关联性信号运作，缺乏对响应行动的结构性约束，在模糊或对抗性输入下容易产生推理漂移。本文提出因果多智能体决策框架（C-MADF），这是一种用于自主网络防御的结构约束架构，将因果建模与对抗性双策略控制相结合。C-MADF首先从历史遥测数据中学习结构因果模型（SCM），并将其编译为调查级有向无环图（DAG），该图定义了允许的响应转移路径。此路线图被形式化为马尔可夫决策过程（MDP），其动作空间被显式限制于因果一致的转移路径。在此约束空间内的决策由双智能体强化学习系统执行，其中威胁优化的蓝队策略与保守塑造的红队策略相互制衡。策略间分歧通过策略分歧分数进行量化，并借助配备可解释性-透明度评分的人机交互界面进行呈现，该评分在不确定性情况下可作为升级响应信号。在真实世界数据集CICIoT2023上的实验表明，C-MADF将误报率从三种前沿文献基线的11.2%、9.7%和8.4%降低至1.8%，同时达到0.997精确率、0.961召回率与0.979 F1分数。

Abstract

Autonomous agents are increasingly deployed in both offensive and defensive cyber operations, creating high-speed, closed-loop interactions in critical infrastructure environments. Advanced Persistent Threat (APT) actors exploit “Living off the Land” techniques and targeted telemetry perturbations to induce ambiguity in monitoring systems, causing automated defenses to overreact or misclassify benign behavior as malicious activity. Existing monolithic and multi-agent defense pipelines largely operate on correlation-based signals, lack structural constraints on response actions, and are vulnerable to reasoning drift under ambiguous or adversarial inputs. We present the Causal Multi-Agent Decision Framework (C-MADF), a structurally constrained architecture for autonomous cyber defense that integrates causal modeling with adversarial dual-policy control. C-MADF first learns a Structural Causal Model (SCM) from historical telemetry and compiles it into an investigation-level Directed Acyclic Graph (DAG) that defines admissible response transitions. This roadmap is formalized as a Markov Decision Process (MDP) whose action space is explicitly restricted to causally consistent transitions. Decision-making within this constrained space is performed by a dual-agent reinforcement learning system in which a threat-optimizing Blue-Team policy is counterbalanced by a conservatively shaped Red-Team policy. Inter-policy disagreement is quantified through a Policy Divergence Score and exposed via a human-in-the-loop interface equipped with an Explainability-Transparency Score that serves as an escalation signal under uncertainty. On the real-world CICIoT2023 dataset, C-MADF reduces the false-positive rate from 11.2%, 9.7%, and 8.4% in three cutting-edge literature baselines to 1.8%, while achieving 0.997 precision, 0.961 recall, and 0.979 F1-score.
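
The idea of restricting an MDP's action space to transitions allowed by a DAG can be sketched as simple action masking. Everything below is hypothetical: the graph, the state and action names, and the Q-values are invented for illustration, and the abstract does not say that C-MADF implements its constraint this way.

```python
import numpy as np

# Hypothetical investigation DAG: each state lists the response
# transitions treated as admissible from it.
INVESTIGATION_DAG = {
    "alert": ["inspect_traffic", "inspect_host"],
    "inspect_traffic": ["isolate_device", "dismiss"],
    "inspect_host": ["isolate_device", "dismiss"],
    "isolate_device": [],   # terminal
    "dismiss": [],          # terminal
}
ACTIONS = ["inspect_traffic", "inspect_host", "isolate_device", "dismiss"]

def admissible_mask(state):
    """Boolean mask over ACTIONS marking transitions the DAG allows."""
    allowed = set(INVESTIGATION_DAG[state])
    return np.array([action in allowed for action in ACTIONS])

def masked_greedy_action(q_values, state):
    """Pick the highest-valued action among admissible ones, or None if terminal."""
    mask = admissible_mask(state)
    if not mask.any():
        return None
    masked = np.where(mask, q_values, -np.inf)   # inadmissible actions can never win
    return ACTIONS[int(np.argmax(masked))]

# Made-up Q-values in which the aggressive action scores highest.
q = np.array([0.1, 0.2, 0.9, 0.3])
first_step = masked_greedy_action(q, "alert")          # isolation is not allowed yet
second_step = masked_greedy_action(q, "inspect_host")  # isolation is now admissible
```

Even though `isolate_device` has the largest Q-value, the mask prevents the policy from jumping to it straight from an alert, which is the kind of structural constraint on response actions the abstract describes.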

Keywords: Autonomous Cyber Defense, Multi-Agent Reinforcement Learning, Causal Modeling, Adversarial Learning, Explainable AI, False-Positive Reduction, Structural Causal Model, Policy Divergence

288. ❌ Eliminating Vendor Lock-In in Quantum Machine Learning via Framework-Agnostic Neural Networks

Authors: Poornima Kumaresan, Shwetha Singaravelu, Lakshmi Rajendran, Santhosh Sivasubramani Venue/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04414v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

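Scoring rationale: The paper focuses on framework interoperability in quantum machine learning (QML) and proposes a framework-agnostic quantum neural network architecture. Its content is entirely unrelated to the vast majority of keywords (large language models, training techniques, inference optimization, alignment, agents, and so on), because those keywords refer specifically to Transformer-based large language model techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since quantum machine learning can be viewed as a branch of AI applied to scientific computing. However, the paper's core is solving an engineering framework problem rather than making specific scientific discoveries with AI in bioinformatics or cheminformatics, so it receives 5 (some connection).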

!!! tip deepseek-chat TL;DR

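To address vendor lock-in caused by fragmented software frameworks in quantum machine learning, the paper proposes a framework-agnostic quantum neural network architecture. A unified computational graph, a hardware abstraction layer, and an export pipeline provide interoperability across frameworks and hardware, and benchmarks show training time within 8% of native frameworks with identical classification accuracy.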

Abstract (Chinese translation)

量子机器学习（QML）处于量子计算与人工智能的交叉领域，为解决经典方法难以处理的问题提供了潜力。然而，当前的QML软件框架生态存在严重的碎片化问题：在TensorFlow Quantum中开发的模型无法在PennyLane后端上运行，基于Qiskit Machine Learning编写的量子电路无法部署到Amazon Braket硬件上，且研究人员若投入某一生态系统，在迁移至另一平台时将面临极高的转换成本。这种供应商锁定现象阻碍了研究的可复现性，限制了硬件访问，并拖慢了科学发现的进程。本文提出一种与框架无关的量子神经网络（QNN）架构，通过统一的计算图、硬件抽象层（HAL）以及多框架导出流水线，抽象掉供应商特定的接口。该核心架构支持同时与TensorFlow、PyTorch和JAX作为经典协处理器集成，而HAL则通过单一应用程序编程接口（API）透明地访问IBM Quantum、Amazon Braket、Azure Quantum、IonQ和Rigetti后端。我们引入了三种可插拔的数据编码策略（幅度编码、角度编码和瞬时量子多项式编码），这些策略与所有支持的后端兼容。利用开放神经网络交换（ONNX）元数据的导出模块，可实现Qiskit、Cirq、PennyLane和Braket表示之间的无损量子电路转换。我们在Iris、Wine和MNIST-4分类任务上对所提框架进行了基准测试，结果表明：与原生框架实现相比，训练时间相当（开销在8%以内），同时达到了完全一致的分类准确率。

Abstract

Quantum machine learning (QML) stands at the intersection of quantum computing and artificial intelligence, offering the potential to solve problems that remain intractable for classical methods. However, the current landscape of QML software frameworks suffers from severe fragmentation: models developed in TensorFlow Quantum cannot execute on PennyLane backends, circuits authored in Qiskit Machine Learning cannot be deployed to Amazon Braket hardware, and researchers who invest in one ecosystem face prohibitive switching costs when migrating to another. This vendor lock-in impedes reproducibility, limits hardware access, and slows the pace of scientific discovery. In this paper, we present a framework-agnostic quantum neural network (QNN) architecture that abstracts away vendor-specific interfaces through a unified computational graph, a hardware abstraction layer (HAL), and a multi-framework export pipeline. The core architecture supports simultaneous integration with TensorFlow, PyTorch, and JAX as classical co-processors, while the HAL provides transparent access to IBM Quantum, Amazon Braket, Azure Quantum, IonQ, and Rigetti backends through a single application programming interface (API). We introduce three pluggable data encoding strategies (amplitude, angle, and instantaneous quantum polynomial encoding) that are compatible with all supported backends. An export module leveraging Open Neural Network Exchange (ONNX) metadata enables lossless circuit translation across Qiskit, Cirq, PennyLane, and Braket representations. We benchmark our framework on the Iris, Wine, and MNIST-4 classification tasks, demonstrating training time parity (within 8% overhead) compared to native framework implementations, while achieving identical classification accuracy.
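
Of the three encoding strategies named in the abstract, angle encoding is the easiest to show without any quantum SDK. The NumPy sketch below uses one common convention, which is an assumption here: each feature x is loaded onto its own qubit with an RY rotation, giving the single-qubit state [cos(x/2), sin(x/2)]. The function name is hypothetical and is not the paper's API.

```python
import numpy as np

def angle_encode(features):
    """Statevector of a product state with one qubit per feature.

    Each feature x prepares RY(x)|0> = cos(x/2)|0> + sin(x/2)|1>.
    The full register is the tensor (Kronecker) product of the qubits.
    """
    state = np.array([1.0])
    for x in features:
        qubit = np.array([np.cos(x / 2.0), np.sin(x / 2.0)])
        state = np.kron(state, qubit)
    return state

# Two features -> two qubits -> four amplitudes.
# x = 0 leaves a qubit in |0>, x = pi flips it to |1>, so this is the state |01>.
psi = angle_encode([0.0, np.pi])
```

Angle encoding uses n qubits for n features, whereas amplitude encoding packs 2^n values into n qubits at the cost of a more involved state-preparation circuit. That trade-off is one reason a framework might expose encodings as pluggable strategies.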

Keywords: Quantum Machine Learning, Framework-Agnostic, Vendor Lock-In, Hardware Abstraction Layer, Quantum Neural Network, Cross-Framework Compatibility, ONNX, Quantum Computing

289. ❌ ReinVBC: A Model-based Reinforcement Learning Approach to Vehicle Braking Controller

Authors: Haoxin Lin, Junjie Zhou, Daheng Xu, Yang Yu Venue/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04401v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

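Scoring rationale: The paper studies an offline model-based reinforcement learning method for a vehicle braking controller, which belongs to conventional control engineering and reinforcement learning. It does not involve large language models, innovations in deep learning principles, or AI for Science, so every keyword scores 0.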

!!! tip deepseek-chat TL;DR

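The paper proposes a vehicle braking controller based on offline model-based reinforcement learning, which learns a braking policy within a data-driven dynamics model. Experiments indicate that the method is effective in real-world vehicle braking and has the potential to replace a production-grade anti-lock braking system.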

Abstract (Chinese translation)

制动系统作为保障现代车辆安全性与可操控性的核心模块，其生产过程中依赖大量人工标定。在保持车辆制动控制器性能的同时降低人力与时间成本，对汽车产业具有重要意义。基于模型的离线强化学习方法通过在数据驱动的动力学模型中进行策略探索，为应对现实世界控制任务提供了有前景的解决方案。本研究提出ReinVBC，采用基于模型的离线强化学习方法处理车辆制动控制问题。我们在模型学习与利用的框架中引入了多项有效的工程化设计，以获取可靠的车辆动力学模型与高效的制动策略。多项实验结果表明，该方法在实际车辆制动中具备优异性能，并展现出替代生产级防抱死制动系统的潜力。

Abstract

Braking system, the key module to ensure the safety and steer-ability of current vehicles, relies on extensive manual calibration during production. Reducing labor and time consumption while maintaining the Vehicle Braking Controller (VBC) performance greatly benefits the vehicle industry. Model-based methods in offline reinforcement learning, which facilitate policy exploration within a data-driven dynamics model, offer a promising solution for addressing real-world control tasks. This work proposes ReinVBC, which applies an offline model-based reinforcement learning approach to deal with the vehicle braking control problem. We introduce useful engineering designs into the paradigm of model learning and utilization to obtain a reliable vehicle dynamics model and a capable braking policy. Several results demonstrate the capability of our method in real-world vehicle braking and its potential to replace the production-grade anti-lock braking system.

Keywords: vehicle braking control, model-based reinforcement learning, offline reinforcement learning, dynamics model, anti-lock braking system, policy exploration, real-world control, engineering design

290. ❌ Finite-Time Analysis of Q-Value Iteration for General-Sum Stackelberg Games

Authors: Narim Jeong, Donghwan Lee Venue/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04394v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

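Scoring rationale: The paper analyzes the convergence of Q-value iteration for Stackelberg games in multi-agent reinforcement learning (MARL), which belongs to classical reinforcement learning theory. It does not involve large language models (LLMs), deep learning, large-model principles, or their application to science. The keywords all center on large-model techniques and applications, whereas this paper concentrates on the theoretical analysis of a traditional reinforcement learning algorithm in a game-theoretic setting, so the vast majority of keywords score 0. The only slightly relevant keyword is 'Multi-agent Systems OR Agent Coordination', because the paper studies a two-agent game. However, it focuses on theoretical convergence rather than coordination mechanisms, so it receives 5 (some connection).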

!!! tip deepseek-chat TL;DR

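The paper studies the convergence of Stackelberg Q-value iteration in two-player general-sum Markov games from a control-theoretic perspective and provides what the authors describe as the first finite-time convergence guarantees for Q-value iteration under Stackelberg interactions.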

Abstract (Chinese translation)

强化学习在单智能体环境中已取得实证与理论上的成功，但将这些成果扩展到一般和马尔可夫博弈中的多智能体强化学习仍具挑战性。本文从控制理论视角研究双玩家一般和马尔可夫博弈中斯塔克尔伯格Q值迭代的收敛性。我们提出一种适用于斯塔克尔伯格框架的松弛策略条件，并将学习动态建模为一个切换系统。通过构建上下比较系统，我们为Q函数建立了有限时间误差界，并刻画了其收敛特性。我们的研究结果为斯塔克尔伯格学习提供了新颖的控制理论视角。此外，据作者所知，本文首次在斯塔克尔伯格交互下为一般和马尔可夫博弈中的Q值迭代提供了有限时间收敛性保证。

Abstract

Reinforcement learning has been successful both empirically and theoretically in single-agent settings, but extending these results to multi-agent reinforcement learning in general-sum Markov games remains challenging. This paper studies the convergence of Stackelberg Q-value iteration in two-player general-sum Markov games from a control-theoretic perspective. We introduce a relaxed policy condition tailored to the Stackelberg setting and model the learning dynamics as a switching system. By constructing upper and lower comparison systems, we establish finite-time error bounds for the Q-functions and characterize their convergence properties. Our results provide a novel control-theoretic perspective on Stackelberg learning. Moreover, to the best of the authors’ knowledge, this paper offers the first finite-time convergence guarantees for Q-value iteration in general-sum Markov games under Stackelberg interactions.
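
For readers unfamiliar with the setting, the sketch below shows one textbook-style form of Stackelberg Q-value iteration on a tiny random two-player Markov game. It is an illustration under assumptions, not the paper's algorithm: the leader commits to an action, the follower best-responds (ties broken by lowest index), and both Q-tables are backed up using the resulting joint action. The game sizes, rewards, and function names are made up, and the sketch makes no convergence claim, which is the question the paper itself analyzes.

```python
import numpy as np

def stackelberg_policy(q_leader, q_follower):
    """Per-state Stackelberg joint action from Q-tables of shape (S, A, B)."""
    best_b = np.argmax(q_follower, axis=2)                  # follower reply to each leader action, (S, A)
    leader_vals = np.take_along_axis(q_leader, best_b[:, :, None], axis=2)[:, :, 0]
    a_star = np.argmax(leader_vals, axis=1)                 # leader anticipates the reply, (S,)
    b_star = best_b[np.arange(len(a_star)), a_star]         # follower's reply to a_star, (S,)
    return a_star, b_star

def stackelberg_q_iteration(P, r_leader, r_follower, gamma=0.9, iters=50):
    """P has shape (S, A, B, S); rewards have shape (S, A, B)."""
    q_l = np.zeros_like(r_leader)
    q_f = np.zeros_like(r_follower)
    states = np.arange(P.shape[0])
    for _ in range(iters):
        a_star, b_star = stackelberg_policy(q_l, q_f)
        v_l = q_l[states, a_star, b_star]                   # leader value of each next state
        v_f = q_f[states, a_star, b_star]                   # follower value of each next state
        q_l = r_leader + gamma * np.einsum("sabn,n->sab", P, v_l)
        q_f = r_follower + gamma * np.einsum("sabn,n->sab", P, v_f)
    return q_l, q_f

# Hypothetical random game: 3 states, 2 leader actions, 2 follower actions.
rng = np.random.default_rng(0)
S, A, B = 3, 2, 2
P = rng.random((S, A, B, S))
P /= P.sum(axis=-1, keepdims=True)                          # normalize into probabilities
r_leader = rng.random((S, A, B))
r_follower = rng.random((S, A, B))
q_l, q_f = stackelberg_q_iteration(P, r_leader, r_follower)
a_star, b_star = stackelberg_policy(q_l, q_f)
```

Because the follower's best response can switch discontinuously as the Q-tables change, the update behaves like a system that switches between different linear maps. This matches the abstract's description of modeling the learning dynamics as a switching system.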

Keywords: Stackelberg games, multi-agent reinforcement learning, Q-value iteration, finite-time convergence, general-sum Markov games, control-theoretic analysis, switching systems

291. ❌ Adversarial Robustness Analysis of Cloud-Assisted Autonomous Driving Systems

Authors: Maher Al Islam, Amr S. El-Wakeel Venue/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04349v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

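Scoring rationale: The paper analyzes adversarial robustness in cloud-assisted autonomous driving systems, covering deep learning (YOLOv8), adversarial attacks (FGSM, PGD), network delay and packet loss, and system security evaluation. The scoring keywords all relate directly to large models (LLMs) and associated techniques (such as MoE, SFT, RAG, and quantization) or to AI for Science (bioinformatics, cheminformatics). This paper involves neither large-model techniques nor scientific AI applications, so all keyword relevance scores are 0.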

!!! tip deepseek-chat TL;DR

该论文通过硬件在环测试平台评估了云辅助自动驾驶系统中对抗攻击和网络损伤对感知与控制性能的联合影响，发现对抗扰动和网络延迟/丢包会显著降低检测精度并破坏闭环控制稳定性。

摘要翻译

自动驾驶系统日益依赖基于深度学习的感知与控制技术，这对计算资源提出了巨大需求。云辅助架构将这些功能卸载至远程服务器，通过车联网（Internet of Vehicles, IoV）实现增强感知与协同决策。然而，该范式引入了跨层安全漏洞：感知模型的对抗性操纵与车-云链路的网络损伤可能共同危及安全关键型自动驾驶功能。本文提出了一种硬件在环车联网测试平台，集成实时感知、控制与通信模块，以评估云辅助自动驾驶中的此类脆弱性。部署于云端的基于YOLOv8的目标检测器遭受了使用快速梯度符号法（Fast Gradient Sign Method, FGSM）和投影梯度下降法（Projected Gradient Descent, PGD）的白盒对抗攻击，同时网络攻击者在车-云环路中注入延迟与丢包。实验结果表明，对抗性扰动显著降低了感知性能：在ε=0.04时，PGD攻击将检测精度与召回率从干净基准下的0.73和0.68分别降至0.22和0.15。150-250毫秒的网络延迟（对应约3-4帧的瞬时丢失）以及0.5-5%的丢包率进一步破坏了闭环控制的稳定性，导致执行延迟与规则违反。这些发现凸显了云辅助自动驾驶系统对跨层韧性的迫切需求。

Abstract

Autonomous vehicles increasingly rely on deep learning-based perception and control, which impose substantial computational demands. Cloud-assisted architectures offload these functions to remote servers, enabling enhanced perception and coordinated decision-making through the Internet of Vehicles (IoV). However, this paradigm introduces cross-layer vulnerabilities, where adversarial manipulation of perception models and network impairments in the vehicle-cloud link can jointly undermine safety-critical autonomy. This paper presents a hardware-in-the-loop IoV testbed that integrates real-time perception, control, and communication to evaluate such vulnerabilities in cloud-assisted autonomous driving. A YOLOv8-based object detector deployed on the cloud is subjected to white-box adversarial attacks using the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), while network adversaries induce delay and packet loss in the vehicle-cloud loop. Results show that adversarial perturbations significantly degrade perception performance, with PGD reducing detection precision and recall from 0.73 and 0.68 in the clean baseline to 0.22 and 0.15 at epsilon = 0.04. Network delays of 150-250 ms, corresponding to transient losses of approximately 3-4 frames, and packet loss rates of 0.5-5% further destabilize closed-loop control, leading to delayed actuation and rule violations. These findings highlight the need for cross-layer resilience in cloud-assisted autonomous driving systems.

Keywords: autonomous driving, cloud-assisted systems, adversarial robustness, YOLOv8, network impairments, hardware-in-the-loop, perception degradation, closed-loop control
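
The abstract above names two white-box attacks, FGSM and PGD. The sketch below shows how each perturbs an input within an L-infinity budget `eps`. It is only an illustration under loud assumptions: a toy logistic classifier with made-up weights `w`, `b` and a made-up input `x` stands in for the paper's YOLOv8 detector, and the step size and iteration count are arbitrary choices, not values from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(x, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic regression
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return min(hi, max(lo, v))

def fgsm(x, y, w, b, eps):
    """One signed-gradient step of size eps, kept inside the valid range [0, 1]."""
    _, g = loss_and_grad(x, y, w, b)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]

def pgd(x, y, w, b, eps, step, iters):
    """Repeated small signed-gradient steps, each projected back onto the eps-ball around x."""
    x_adv = list(x)
    for _ in range(iters):
        _, g = loss_and_grad(x_adv, y, w, b)
        x_adv = [xi + step * sign(gi) for xi, gi in zip(x_adv, g)]
        x_adv = [clip(xi, x0 - eps, x0 + eps) for x0, xi in zip(x, x_adv)]  # projection
        x_adv = [clip(xi, 0.0, 1.0) for xi in x_adv]
    return x_adv

# Hypothetical toy model and input (not from the paper).
w, b = [2.0, -3.0, 1.5], 0.1
x, y, eps = [0.6, 0.2, 0.7], 1, 0.04

clean_loss, _ = loss_and_grad(x, y, w, b)
x_fgsm = fgsm(x, y, w, b, eps)
x_pgd = pgd(x, y, w, b, eps, step=eps / 4, iters=10)
fgsm_loss, _ = loss_and_grad(x_fgsm, y, w, b)
pgd_loss, _ = loss_and_grad(x_pgd, y, w, b)
print(clean_loss, fgsm_loss, pgd_loss)
```

For a linear model like this one, the gradient sign never changes, so PGD ends at the same corner of the `eps`-ball as FGSM. On a nonlinear detector the iterated, projected steps usually find a stronger perturbation than the single FGSM step.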

292. ❌ Deep Kuratowski Embedding Neural Networks for Wasserstein Metric Learning

Authors: Andrew Qing He Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04343v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

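Scoring rationale: The paper studies neural network methods for learning Wasserstein distances, proposing two architectures based on the Kuratowski embedding theorem (DeepKENN and ODE-KENN) to approximate the Wasserstein-2 distance. Its content focuses entirely on metric learning, neural network architectures (CNN, Neural ODE), and computational efficiency, and has no direct connection to the scoring keywords, which concern large models, the technical principles of deep learning, and AI applications in science. It covers no keyword-related topic such as language models, model training or tuning techniques, reasoning methods, agent systems, or model compression, and no AI-for-science application such as bioinformatics.

!!! tip deepseek-chat TL;DR

The paper proposes two neural architectures based on the Kuratowski embedding theorem (DeepKENN and ODE-KENN) for learning to approximate the Wasserstein-2 distance. On MNIST, ODE-KENN achieves a 28% lower test MSE than the single-layer baseline, and the learned model can serve as a fast surrogate in downstream pairwise distance computations.

Abstract translation (Chinese)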

计算成对Wasserstein距离是数据分析流程中的基础性瓶颈。受经典库拉托夫斯基嵌入定理的启发，我们提出了两种神经架构用于从数据中学习逼近Wasserstein-2距离（$W_2$）。第一种架构DeepKENN通过可学习的正权重聚合卷积神经网络（CNN）所有中间特征图的距离。第二种架构ODE-KENN用神经常微分方程（Neural ODE）替代离散层堆叠，将每个输入嵌入到无限维巴拿赫空间$C^1([0,1], \mathbb{R}^d)$中，并通过轨迹平滑性提供隐式正则化。在使用预计算精确$W_2$距离的MNIST数据集上的实验表明：在参数量匹配的条件下，ODE-KENN的测试均方误差（MSE）比单层基线低28%，比DeepKENN低18%，同时表现出更小的泛化差距。所构建的快速替代模型可在下游成对距离计算中替代昂贵的$W_2$计算源。

Abstract

Computing pairwise Wasserstein distances is a fundamental bottleneck in data analysis pipelines. Motivated by the classical Kuratowski embedding theorem, we propose two neural architectures for learning to approximate the Wasserstein-2 distance ($W_2$) from data. The first, DeepKENN, aggregates distances across all intermediate feature maps of a CNN using learnable positive weights. The second, ODE-KENN, replaces the discrete layer stack with a Neural ODE, embedding each input into the infinite-dimensional Banach space $C^1([0,1], \mathbb{R}^d)$ and providing implicit regularization via trajectory smoothness. Experiments on MNIST with exact precomputed $W_2$ distances show that ODE-KENN achieves a 28% lower test MSE than the single-layer baseline and 18% lower than DeepKENN under matched parameter counts, while exhibiting a smaller generalization gap. The resulting fast surrogate can replace the expensive $W_2$ oracle in downstream pairwise distance computations.

Keywords: Wasserstein distance, metric learning, Kuratowski embedding, neural networks, Neural ODE, CNN, approximation, computational efficiency
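
The abstract above is motivated by the classical Kuratowski embedding theorem. As a minimal illustration of that theorem only (not of DeepKENN or ODE-KENN, which learn their embeddings), the sketch below maps each point of a finite metric space to its vector of distances to all points and checks that the sup-norm distance between those vectors equals the original distance. The random 2-D points and the Euclidean metric are assumptions standing in for images and $W_2$.

```python
import math
import random

def kuratowski_embed(points, dist):
    """Map each point x to the vector (d(x, p_1), ..., d(x, p_n))."""
    return [[dist(x, p) for p in points] for x in points]

def sup_distance(u, v):
    """Sup-norm (L-infinity) distance between two embedded vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

random.seed(0)
points = [(random.random(), random.random()) for _ in range(8)]  # hypothetical data
emb = kuratowski_embed(points, math.dist)

# Largest gap between the embedded sup-norm distance and the original distance.
max_err = max(
    abs(sup_distance(emb[i], emb[j]) - math.dist(points[i], points[j]))
    for i in range(len(points))
    for j in range(len(points))
)
print(max_err)
```

The embedding is isometric because the triangle inequality bounds every coordinate difference by `d(x, y)`, and the coordinate indexed by `y` itself attains that bound.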

293. ❌ Generative models for decision-making under distributional shift

Authors: Xiuyuan Cheng, Yunqin Zhu, Yao Xie Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04342v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

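Scoring rationale: The paper focuses on flow-based and score-based generative models for decision-making under distributional shift. It belongs to generative modeling and operations research and is unrelated to every keyword, all of which center on large models, the technical principles of deep learning, and their applications.

!!! tip deepseek-chat TL;DR

The tutorial presents a unified framework based on pushforward maps, continuity, and optimization in probability space. It uses generative models such as flow-based and score-based methods to construct decision-relevant distributions for scenario generation, robust decision-making, and uncertainty quantification under distributional shift.

Abstract translation (Chinese)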

许多数据驱动的决策问题使用基于历史数据估计的名义分布进行建模，而实际性能最终由部署分布决定——该分布可能存在偏移、依赖情境、部分可观测或受压力影响。本教程将现代生成模型，特别是基于流和得分的方法，作为构建与决策相关分布的数学工具进行阐述。从运筹学视角看，其主要价值不在于无约束的样本生成，而在于通过传输映射、速度场、得分场和引导随机动力学来实现分布的表征与变换。我们提出了一个基于推前映射、连续性、福克-普朗克方程、Wasserstein几何与概率空间优化的统一框架。在此框架内，生成模型可用于：学习名义不确定性、构建稳健性分析所需的压力分布或最不利分布、在侧信息与部分观测条件下生成条件分布或后验分布。同时我们强调了代表性的理论保证，包括迭代流模型的前向-反向收敛性、传输映射空间的一阶极小极大分析，以及生成先验下后验采样的误差传递界。本教程为使用生成模型进行场景生成、稳健决策、不确定性量化及分布偏移下的相关问题提供了系统性的原理介绍。

Abstract

Many data-driven decision problems are formulated using a nominal distribution estimated from historical data, while performance is ultimately determined by a deployment distribution that may be shifted, context-dependent, partially observed, or stress-induced. This tutorial presents modern generative models, particularly flow- and score-based methods, as mathematical tools for constructing decision-relevant distributions. From an operations research perspective, their primary value lies not in unconstrained sample synthesis but in representing and transforming distributions through transport maps, velocity fields, score fields, and guided stochastic dynamics. We present a unified framework based on pushforward maps, continuity, Fokker-Planck equations, Wasserstein geometry, and optimization in probability space. Within this framework, generative models can be used to learn nominal uncertainty, construct stressed or least-favorable distributions for robustness, and produce conditional or posterior distributions under side information and partial observation. We also highlight representative theoretical guarantees, including forward-reverse convergence for iterative flow models, first-order minimax analysis in transport-map space, and error-transfer bounds for posterior sampling with generative priors. The tutorial provides a principled introduction to using generative models for scenario generation, robust decision-making, uncertainty quantification, and related problems under distributional shift.

Keywords: generative models, distributional shift, decision-making, flow-based methods, score-based methods, robustness, uncertainty quantification, scenario generation

294. ❌ Thermodynamic-Inspired Explainable GeoAI: Uncovering Regime-Dependent Mechanisms in Heterogeneous Spatial Systems

Authors: Sooyoung Lim, Zhenlong Li, Zi-Kui Liu Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04339v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

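Scoring rationale: The paper proposes a thermodynamics-inspired explainable GeoAI framework that combines statistical mechanics with graph neural networks to uncover mechanisms in heterogeneous spatial systems. The work mainly concerns geospatial AI, explainable AI, and AI applications in science, and is unrelated to most large-model keywords (e.g., LLM, MoE, SFT, RLHF). The only relevant keywords are 'Mechanistic Interpretability OR Explainable AI' (highly relevant, 10 points), because improving the interpretability of GeoAI is the paper's core aim, and 'AI for Science OR Bioinformatics OR Cheminformatics' (related, 8 points), because the paper applies AI to geography and environmental science and therefore falls under 'AI for Science'. No other keyword is addressed.

!!! tip deepseek-chat TL;DR

The paper proposes a thermodynamics-inspired explainable GeoAI framework that integrates statistical mechanics with graph neural networks. It identifies regime-dependent role reversals of predictors that standard baselines miss, for example diagnosing the phase transition into a Burden-dominated regime during the 2023 Canadian wildfire event.

Abstract translation (Chinese)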

建模空间异质性及其相关的临界转变，始终是地理学与环境科学的一项基础性挑战。尽管传统的地理加权回归和深度学习模型提升了预测能力，但它们往往难以阐明状态依赖的非线性关系，即驱动因子的功能角色在异质性区域中表现出相反的效应。我们提出一个受热力学启发的可解释地理空间人工智能框架，该框架将统计力学与图神经网络相结合。通过将空间变异性概念化为系统负担与系统容量之间的一种热力学竞争，我们的模型揭示了驱动空间过程的潜在机制。利用三个模拟数据集和三个不同领域的真实数据集，我们展示了新框架成功识别出标准基线方法所遗漏的、依赖于系统状态的预测因子角色逆转现象。值得注意的是，该框架明确诊断出在2023年加拿大野火事件期间系统向负担主导状态的相变，从而将物理机制的转变与统计异常值区分开来。这些发现表明，热力学约束能够提升地理空间人工智能的可解释性，同时在复杂空间系统中保持强大的预测性能。

Abstract

Modeling spatial heterogeneity and associated critical transitions remains a fundamental challenge in geography and environmental science. While conventional Geographically Weighted Regression (GWR) and deep learning models have improved predictive skill, they often fail to elucidate state-dependent nonlinearities where the functional roles of drivers represent opposing effects across heterogeneous domains. We introduce a thermodynamics-inspired explainable geospatial AI framework that integrates statistical mechanics with graph neural networks. By conceptualizing spatial variability as a thermodynamic competition between system Burden (E) and Capacity (S), our model disentangles the latent mechanisms driving spatial processes. Using three simulation datasets and three real-world datasets across distinct domains (housing markets, mental health prevalence, and wildfire-induced PM2.5 anomalies), we show that the new framework successfully identifies regime-dependent role reversals of predictors that standard baselines miss. Notably, the framework explicitly diagnoses the phase transition into a Burden-dominated regime during the 2023 Canadian wildfire event, distinguishing physical mechanism shifts from statistical outliers. These findings demonstrate that thermodynamic constraints can improve the interpretability of GeoAI while preserving strong predictive performance in complex spatial systems.

Keywords: Thermodynamic-inspired AI, Explainable GeoAI, Spatial heterogeneity, Graph neural networks, Regime-dependent mechanisms, Phase transition, Geospatial modeling, Interpretability

295. ❌ Boosted Distributional Reinforcement Learning: Analysis and Healthcare Applications

Authors: Zequn Chen, Wesley J. Marrero Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04334v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

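Scoring rationale: The paper focuses on applying reinforcement learning (RL) algorithms, specifically distributional RL, to healthcare (hypertension management). Neither the title nor the abstract mentions large language models (LLMs), deep learning, MoE, quantization, inference acceleration, alignment, fine-tuning, RAG, in-context learning, or agent systems. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI (reinforcement learning) to healthcare (hypertension management), a subfield of AI for Science, although this is not its core focus (the core is the RL algorithm). That keyword therefore receives 5 points (related), and all other keywords receive 0 (unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes a Boosted Distributional Reinforcement Learning (BDRL) algorithm that optimizes agent-specific outcome distributions while enforcing comparability among similar agents, addressing the shortcomings of expectation-based reinforcement learning in highly uncertain medical decisions. Applied to hypertension management in a large subset of the US adult population, it improves the number and consistency of quality-adjusted life years compared with reinforcement learning baselines.

Abstract translation (Chinese)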

研究人员与实践者日益关注运用强化学习来优化机器人学与医疗健康等复杂领域的决策。当前，这些研究主要基于期望值学习范式。然而，在涉及多个异质群体的高度不确定情境中，依赖以期望值为核心的目标函数可能不足以支撑一致性决策。尽管已有分布强化学习算法被提出以建模结果的完整分布，但这些方法可能导致可比智能体之间实现效益的显著差异。这一挑战在医疗场景中尤为突出，因为医生（控制者）必须管理多名疾病进展不确定且治疗反应存在异质性的患者（从属智能体）。本文提出一种提升式分布强化学习算法，该算法在优化个体化结果分布的同时，强制相似智能体间的可比性，并分析了其收敛性。为进一步稳定学习过程，我们引入了一个后更新投影步骤，该步骤被构建为约束凸优化问题，能高效地将个体结果在指定容差范围内与高性能参照基准对齐。我们将该算法应用于美国成年人口的大规模子集，通过将个体按心血管疾病风险分组进行高血压管理。我们的方法通过模仿各风险组中高性能参照基准的行为，调整中位患者与脆弱患者的治疗方案。此外，与强化学习基线方法相比，BDRL算法在质量调整生命年的数量与一致性方面均表现出显著提升。

Abstract

Researchers and practitioners are increasingly considering reinforcement learning to optimize decisions in complex domains like robotics and healthcare. To date, these efforts have largely utilized expectation-based learning. However, relying on expectation-focused objectives may be insufficient for making consistent decisions in highly uncertain situations involving multiple heterogeneous groups. While distributional reinforcement learning algorithms have been introduced to model the full distributions of outcomes, they can yield large discrepancies in realized benefits among comparable agents. This challenge is particularly acute in healthcare settings, where physicians (controllers) must manage multiple patients (subordinate agents) with uncertain disease progression and heterogeneous treatment responses. We propose a Boosted Distributional Reinforcement Learning (BDRL) algorithm that optimizes agent-specific outcome distributions while enforcing comparability among similar agents and analyze its convergence. To further stabilize learning, we incorporate a post-update projection step formulated as a constrained convex optimization problem, which efficiently aligns individual outcomes with a high-performing reference within a specified tolerance. We apply our algorithm to manage hypertension in a large subset of the US adult population by categorizing individuals into cardiovascular disease risk groups. Our approach modifies treatment plans for median and vulnerable patients by mimicking the behavior of high-performing references in each risk group. Furthermore, we find that BDRL improves the number and consistency of quality-adjusted life years compared with reinforcement learning baselines.

Keywords: reinforcement learning, distributional reinforcement learning, healthcare applications, hypertension management, decision optimization, agent-specific outcomes, convergence analysis, quality-adjusted life years

296. ❌ Soft Tournament Equilibrium

Authors: Saad Alqithami Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04328v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

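Scoring rationale: The paper explicitly studies the evaluation of general-purpose artificial agents based on large language models (LLMs), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points each). It also deals with interactions among, and the evaluation of, multiple agents, which relates to 'Multi-agent Systems OR Agent Coordination' (5 points). The paper does not address the specific techniques, methods, or application areas of the other keywords, so they score 0.

!!! tip deepseek-chat TL;DR

Traditional ranking methods are unstable when general-purpose artificial agents based on large language models interact non-transitively. To address this, the paper proposes Soft Tournament Equilibrium (STE), a differentiable framework that learns and computes set-valued tournament solutions directly from pairwise comparison data, giving a more robust assessment of agent capabilities.

Abstract translation (Chinese)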

通用人工智能智能体（尤其是基于大语言模型的智能体）的评估面临重大挑战，这源于其交互的非传递性。当智能体A击败B、B击败C而C又击败A时，强制线性排序的传统排名方法可能产生误导且不稳定。我们认为，对于此类循环竞争领域，评估的基本对象不应是排名，而应是经典锦标赛理论中所概念化的集值核心。本文提出软锦标赛均衡（STE），这是一个可从成对比较数据中直接学习和计算集值锦标赛解的可微分框架。STE首先学习一个概率化锦标赛模型（该模型可受丰富上下文信息调节），随后采用新颖的可微分算子——软可达性与软覆盖性——来计算两种经典锦标赛解（即顶级循环与未被覆盖集）的连续模拟。其输出为一组核心智能体，每个智能体均附有经过校准的隶属度分数，从而提供对智能体能力细致而稳健的评估。我们建立了STE的理论基础，证明其在零温度极限下与经典解的一致性（这确立了其孔多塞包含性质），并分析了其稳定性与样本复杂度。我们制定了在合成与真实世界基准上验证STE的实验方案。本研究旨在提供一部完整独立的论述，将通用智能体评估重新置于更合适、更稳健的理论基础之上，从非稳定排名转向稳定的集值均衡。

Abstract

The evaluation of general-purpose artificial agents, particularly those based on large language models, presents a significant challenge due to the non-transitive nature of their interactions. When agent A defeats B, B defeats C, and C defeats A, traditional ranking methods that force a linear ordering can be misleading and unstable. We argue that for such cyclic domains, the fundamental object of evaluation should not be a ranking but a set-valued core, as conceptualized in classical tournament theory. This paper introduces Soft Tournament Equilibrium (STE), a differentiable framework for learning and computing set-valued tournament solutions directly from pairwise comparison data. STE first learns a probabilistic tournament model, potentially conditioned on rich contextual information. It then employs novel, differentiable operators for soft reachability and soft covering to compute continuous analogues of two seminal tournament solutions: the Top Cycle and the Uncovered Set. The output is a set of core agents, each with a calibrated membership score, providing a nuanced and robust assessment of agent capabilities. We develop the theoretical foundation for STE to prove its consistency with classical solutions in the zero-temperature limit, which establishes its Condorcet-inclusion properties, and analyzing its stability and sample complexity. We specify an experimental protocol for validating STE on both synthetic and real-world benchmarks. This work aims to provide a complete, standalone treatise that re-centers general-agent evaluation on a more appropriate and robust theoretical foundation, moving from unstable rankings to stable, set-valued equilibria.

Keywords: large language models, artificial agents, tournament theory, evaluation, non-transitive interactions, set-valued solutions, Soft Tournament Equilibrium, pairwise comparison
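
The abstract above builds on two classical tournament solutions, the Top Cycle and the Uncovered Set. The sketch below computes the classical (hard, non-differentiable) versions on a small win/loss tournament to show why a set can be a better answer than a ranking. It does not implement STE's soft reachability or soft covering operators. The agent names and the `beats` relation are hypothetical.

```python
def top_cycle(beats):
    """Agents that can reach every other agent along a chain of wins."""
    agents = list(beats)
    reach = {a: set(beats[a]) | {a} for a in agents}
    changed = True
    while changed:  # grow reachability sets until a fixed point (transitive closure)
        changed = False
        for a in agents:
            new = set().union(*(reach[b] for b in reach[a]))
            if not new <= reach[a]:
                reach[a] |= new
                changed = True
    return {a for a in agents if reach[a] == set(agents)}

def uncovered_set(beats):
    """Agents not covered by anyone: x covers y if x beats y and everything y beats."""
    def covers(x, y):
        return y in beats[x] and beats[y] <= beats[x]
    agents = list(beats)
    return {y for y in agents if not any(covers(x, y) for x in agents if x != y)}

# Hypothetical tournament: A > B > C > A forms a cycle, and everyone beats D.
beats = {
    "A": {"B", "D"},
    "B": {"C", "D"},
    "C": {"A", "D"},
    "D": set(),
}

tc = top_cycle(beats)
uc = uncovered_set(beats)
print(sorted(tc), sorted(uc))
```

In this example any linear ranking of A, B, and C is arbitrary, whereas both set-valued solutions return the whole cycle and exclude D.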

297. ❌ Minimising Willmore Energy via Neural Flow

Authors: Edward Hirst, Henrique N. Sá Earp, Tomás S. R. Silva Journal/Source: arXiv Published: 2026-04-06 arXiv link: http://arxiv.org/abs/2604.04321v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

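Scoring rationale: The paper uses neural architectures to minimize the Willmore energy, a geometric optimization problem in computational mathematics and geometric analysis. Although it uses neural networks, its core content is unrelated to every large-model (LLM) keyword (e.g., pre-training, fine-tuning, alignment, reasoning, agents). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies neural networks to scientific computing (geometric optimization), which falls under AI for Science in the broad sense. Since this is not the paper's core content, the keyword receives 5 points (related). No other keyword is addressed.

!!! tip deepseek-chat TL;DR

The paper proposes a neural-network-based Willmore flow for minimizing the Willmore energy of surface embeddings. Training reproduces the round sphere (genus 0) and the Clifford torus (genus 1), and the genus 2 experiment offers a new approach to searching for minimal Willmore surfaces.

Abstract translation (Chinese)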

神经Willmore流被提出作为一种自然演化过程，用以最小化闭有向$\mathbb{R}^3$中$2$-曲面的Willmore能量，该能量为平均曲率的$L^2$范数平方。研究采用神经架构对从拓扑$2d$区域到$3d$欧几里得空间的映射进行建模，其学习过程通过物理信息神经网络（PINN）风格的损失函数，最小化作为嵌入泛函的Willmore能量。训练结果分别重现了亏格$0$曲面预期的标准球面，以及亏格$1$曲面下的Clifford环面。此外，针对亏格$2$曲面的实验为探索这一开放问题中的极小Willmore曲面提供了一种新方法。

Abstract

The neural Willmore flow of a closed oriented $2$-surface in $\mathbb{R}^3$ is introduced as a natural evolution process to minimise the Willmore energy, which is the squared $L^2$-norm of mean curvature. Neural architectures are used to model maps from topological $2d$ domains to $3d$ Euclidean space, where the learning process minimises a PINN-style loss for the Willmore energy as a functional on the embedding. Training reproduces the expected round sphere for genus $0$ surfaces, and the Clifford torus for genus $1$ surfaces, respectively. Furthermore, the experiment in the genus $2$ case provides a novel approach to search for minimal Willmore surfaces in this open problem.

Keywords: Willmore energy, neural flow, surface embedding, PINN-style loss, genus 0, genus 1, Clifford torus, minimal surfaces
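
The abstract above defines the Willmore energy as the squared $L^2$-norm of mean curvature and names the Clifford torus as the expected genus 1 minimizer. As a rough numerical check of that target value (not the paper's neural flow), the sketch below integrates $H^2$ over a torus of revolution. It assumes the convention $W = \int H^2 \, dA$ with $H$ the average of the two principal curvatures, under which the round sphere gives $4\pi$ and the torus with radius ratio $R/r = \sqrt{2}$ gives $2\pi^2$. The function name and grid size are illustrative choices.

```python
import math

def torus_willmore_energy(R, r, n=2000):
    """Integrate H^2 over a torus of revolution (major radius R, minor radius r)."""
    dv = 2 * math.pi / n
    total = 0.0
    for k in range(n):
        v = k * dv
        rho = R + r * math.cos(v)                      # distance from the rotation axis
        H = (R + 2 * r * math.cos(v)) / (2 * r * rho)  # mean of principal curvatures
        total += H * H * r * rho * dv                  # area element is r * rho du dv
    return 2 * math.pi * total                         # integrand is independent of u

clifford = torus_willmore_energy(math.sqrt(2), 1.0)  # radius ratio sqrt(2)
thin = torus_willmore_energy(3.0, 1.0)               # a thinner torus for comparison
sphere = 4 * math.pi                                 # closed form: H = 1/r, area 4*pi*r^2
print(clifford, 2 * math.pi ** 2, thin, sphere)
```

The thinner torus has a strictly larger energy than the $\sqrt{2}$-ratio torus, which is consistent with the latter being the minimizer that training is expected to recover for genus 1.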

298. ❌ Effects of Generative AI Errors on User Reliance Across Task Difficulty

Authors: Jacy Reese Anthis, Hannah Cha, Solon Barocas, Alexandra Chouldechova, Jake Hofman Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04319v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

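Scoring rationale: The paper studies how generative AI errors affect user reliance. It is human-AI interaction research rather than an innovation in the technical principles of large models or deep learning, or a specific application of them. It involves none of the techniques named in the scoring keywords (e.g., LLM architecture, training methods, inference optimization, alignment) and no application of AI to science. No keyword is relevant to the paper.

!!! tip deepseek-chat TL;DR

Using a diagram generation experiment, the study examines how generative AI error rates (10%, 30%, 50%) affect user reliance at different task difficulties. More errors reduce use, but errors on easy tasks do not reduce use significantly more than errors on hard tasks, suggesting that people are not averse to AI's 'jagged' error pattern in this experimental setting.

Abstract translation (Chinese)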

人工智能（AI）的能力分布呈现为一条锯齿状前沿（jagged frontier），即AI系统在人类认为简单的任务上意外失败，却在人类认为困难的任务上取得成功。为探究用户对此现象的反应，我们基于图表生成任务开发了一种激励相容的实验方法，通过人为诱导生成式AI输出错误，并测试其对用户依赖程度的影响。我们在一项预先注册的3x2实验（N = 577）中展示了该实验界面，通过在简单或困难的图表生成任务中设置10%、30%或50%的错误率进行验证。实验证实，观察到更多错误会降低使用意愿，但我们意外发现：简单任务中的错误并未比困难任务中的错误显著降低使用意愿，这表明在当前实验情境下，人们并未对AI能力的锯齿状特征产生排斥。我们建议未来研究在调整任务难度的同时，结合AI错误的其他特征（例如锯齿状错误模式是否易于被用户学习）进行深入探索。

Abstract

The capabilities of artificial intelligence (AI) lie along a jagged frontier, where AI systems surprisingly fail on tasks that humans find easy and succeed on tasks that humans find hard. To investigate user reactions to this phenomenon, we developed an incentive-compatible experimental methodology based on diagram generation tasks, in which we induce errors in generative AI output and test effects on user reliance. We demonstrate the interface in a preregistered 3x2 experiment (N = 577) with error rates of 10%, 30%, or 50% on easier or harder diagram generation tasks. We confirmed that observing more errors reduces use, but we unexpectedly found that easy-task errors did not significantly reduce use more than hard-task errors, suggesting that people are not averse to jaggedness in this experimental setting. We encourage future work that varies task difficulty at the same time as other features of AI errors, such as whether the jagged error patterns are easily learned.

Keywords: Generative AI, AI errors, User reliance, Task difficulty, Jagged frontier, Diagram generation, Experimental methodology, Error rates

299. ❌ How Long short-term memory artificial neural network, synthetic data, and fine-tuning improve the classification of raw EEG data

Authors: Albert Nasybullin, Vladimir Maksimenko, Semen Kurkin Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04316v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

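Scoring rationale: The paper studies a machine learning pipeline for EEG data classification that uses an LSTM, synthetic data, and fine-tuning, which makes it an application of AI to biomedical signal processing (EEG). It is unrelated to most large-model keywords (such as LLMs, MoE, and RLHF). It is partially related only to "Post-training OR Supervised Fine-tuning OR SFT" (the paper mentions fine-tuning) and to "AI for Science OR Bioinformatics OR Cheminformatics" (EEG classification is a bioinformatics / AI-for-science application). All other keywords are irrelevant.

!!! tip deepseek-chat TL;DR

The paper proposes a machine learning pipeline that combines synthetic data generation, an LSTM neural network, and fine-tuning to improve the classification of raw EEG data from experiments with implicit visual stimuli (such as Necker cubes with different levels of ambiguity).

Abstract (Chinese translation)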

本文探讨了一种用于脑电图数据分类的机器学习流程。针对内隐视觉刺激实验（如具有不同模糊度的内克尔立方体）的分类问题，我们提出了一种结合合成数据生成、长短期记忆人工神经网络（Long Short-Term Memory Artificial Neural Network, LSTM）与微调技术的解决方案。所开发的方法提升了原始脑电图数据分类模型的质量。

Abstract

In this paper, we discuss a Machine Learning pipeline for the classification of EEG data. We propose a combination of synthetic data generation, long short-term memory artificial neural network (LSTM), and fine-tuning to solve classification problems for experiments with implicit visual stimuli, such as the Necker cube with different levels of ambiguity. The developed approach increased the quality of the classification model of raw EEG data.

Keywords: EEG classification, LSTM, synthetic data generation, fine-tuning, machine learning pipeline, Necker cube, implicit visual stimuli, raw EEG data

300. ❌ Convolutional Neural Network and Adversarial Autoencoder in EEG images classification

Authors: Albert Nasybullin, Semen Kurkin Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04313v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

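Scoring rationale: The paper studies EEG image classification with a convolutional neural network and an adversarial autoencoder, an application of AI to neuroscience. It does not involve core large-model techniques such as large language models, MoE, small language models, scaling laws, pre-training, post-training, instruction tuning, RLHF, parameter-efficient fine-tuning, RAG, long context, attention optimization, reasoning methods, agents, tool use, multi-agent systems, quantization, inference acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning. It is partially related only to "AI for Science OR Bioinformatics OR Cheminformatics", because it applies AI to neuroscience (EEG analysis). Since it does not explicitly involve bioinformatics or cheminformatics, that keyword receives a relevance of 5.

!!! tip deepseek-chat TL;DR

The paper addresses the classification of brain activity during hand movement by preprocessing raw EEG signals into 2D topograms and developing supervised and semi-supervised neural networks.

Abstract (Chinese translation)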

本文探讨了在神经科学领域的脑电图数据分析中，如何应用计算机视觉算法解决分类问题。我们的方法结合计算机视觉与神经网络技术，以对手部运动期间的人类大脑活动进行分类。研究首先对原始脑电信号进行预处理，并生成二维脑电地形图。随后，我们开发了监督式与半监督式神经网络，用于对不同运动皮层活动进行分类。

Abstract

In this paper, we consider applying computer vision algorithms for the classification problem one faces in neuroscience during EEG data analysis. Our approach is to apply a combination of computer vision and neural network methods to solve human brain activity classification problems during hand movement. We pre-processed raw EEG signals and generated 2D EEG topograms. Later, we developed supervised and semi-supervised neural networks to classify different motor cortex activities.

Keywords: EEG classification, Convolutional Neural Network, Adversarial Autoencoder, EEG topograms, motor cortex activity, supervised learning, semi-supervised learning, neuroscience

301. ❌ Out-of-Air Computation: Enabling Structured Extraction from Wireless Superposition

Authors: Seyed Mohammad Azimi-Abarghouyi Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04312v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

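Scoring rationale: The paper studies a computation framework for wireless communication (AirCPU), focusing on joint source-channel coding, a nested lattice architecture, and computation over wireless superposition. It is unrelated to all scoring keywords, which center on large models, deep-learning techniques, and their applications. Nothing matches.

!!! tip deepseek-chat TL;DR

The paper proposes AirCPU, a new wireless computation framework that extracts computation from the wireless superposition through structured coding instead of pre-embedding it before transmission. A multi-layer nested lattice architecture provides progressive resolution under a fixed power constraint, and additional collective and successive computation mechanisms for fading MACs significantly expand the reliable operating regime.

Abstract (Chinese translation)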

空中计算(AirComp)的传统构建依赖于将计算预嵌入传输波形或利用大规模天线阵列的原理，通常要求无线多址信道(MAC)在近似理想计算介质的条件下运行。本文提出一种新型计算框架，称为空外计算(AirCPU)，其建立了联合信源信道编码基础：计算并非在传输前嵌入，而是通过利用结构化编码从无线叠加信号中提取。AirCPU直接对连续值设备数据进行操作，避免了独立的信源量化阶段，并采用多层嵌套格架构，通过将每个输入分解为分层缩放的分量来实现渐进分辨率，所有分量均在固定功率约束下通过公共有界数字星座进行传输。我们形式化解耦分辨率的概念，证明在解码错误概率足够小的运行机制下，信道噪声和有限星座约束对失真的影响可忽略不计，最终计算误差主要由最精细格所设定的目标分辨率决定。针对衰落多址信道，除提出的直接计算机制外，我们进一步引入集体计算与逐次计算机制，通过利用多个已解码的整数系数函数和边信息函数作为无线叠加信号的结构化表征，显著扩展可靠运行区间；在此背景下，我们系统阐述并刻画了基础可靠性条件与整数优化问题，并提出一种结构化的低复杂度双群近似方法以解决这些问题。

Abstract

Over-the-air computation (AirComp) has traditionally been built on the principle of pre-embedding computation into transmitted waveforms or on exploiting massive antenna arrays, often requiring the wireless multiple-access channel (MAC) to operate under conditions that approximate an ideal computational medium. This paper introduces a new computation framework, termed out-of-air computation (AirCPU), which establishes a joint source-channel coding foundation in which computation is not embedded before transmission but is instead extracted from the wireless superposition by exploiting structured coding. AirCPU operates directly on continuous-valued device data, avoiding the need for a separate source quantization stage, and employs a multi-layer nested lattice architecture that enables progressive resolution by decomposing each input into hierarchically scaled components, all transmitted over a common bounded digital constellation under a fixed power constraint. We formalize the notion of decoupled resolution, showing that in operating regimes where the decoding error probability is sufficiently small, the impact of channel noise and finite constellation constraints on distortion becomes negligible, and the resulting computation error is primarily determined by the target resolution set by the finest lattice. For fading MACs, we further introduce collective and successive computation mechanisms, in addition to the proposed direct computation, which exploit multiple decoded integer-coefficient functions and side-information functions as structural representations of the wireless superposition to significantly expand the reliable operating regime; in this context, we formulate and characterize the underlying reliability conditions and integer optimization problems, and develop a structured low-complexity two-group approximation to address them.

Keywords: out-of-air computation, AirCPU, wireless superposition, joint source-channel coding, nested lattice architecture, progressive resolution, fading MACs, structured coding

302. ❌ CavMerge: Merging K-means Based on Local Log-Concavity

Authors: Zhili Qiao, Wangqian Ju, Peng Liu Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04302v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

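Scoring rationale: The paper "CavMerge: Merging K-means Based on Local Log-Concavity" improves the K-means clustering algorithm from classical machine learning by proposing a new merging algorithm for non-linearly separable data. Its content revolves around a classical clustering method (K-means), algorithmic improvement (the merging strategy), computational efficiency, parameter tuning, and statistical consistency. It does not involve large language models (LLMs), deep learning, large-model techniques, AI-for-science applications, or any specific technique in the keyword list (such as MoE, RLHF, or RAG). Every keyword concerns large models, deep learning, and related techniques, applications, or methodology, while this paper is purely classical clustering research, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

To address the poor performance of K-means clustering on non-linearly separable data, the paper proposes CavMerge, a new merging algorithm based on local log-concavity. It needs no parameter tuning, is computationally efficient, comes with consistency and convergence guarantees, and yields more reliable clusters than current state-of-the-art algorithms in empirical studies.

Abstract (Chinese translation)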

K-means聚类作为一种经典且广泛应用的聚类技术，在处理非线性可分数据时表现常欠佳。为解决这一问题，学界已提出多种调整与改进方法，其中包括将较大K值下的K-means结果进行合并以获得最终聚类分配的方案。然而，此类现有方法往往存在计算效率低下和超参数调优困难的问题。本文提出一种新型K-means合并算法 CavMerge，该算法具有直观性强、无需参数调优且计算高效的特点。在极小的局部分布假设条件下，本算法展现出强一致性与快速收敛的理论保证。通过对多种模拟数据集和真实数据集的实证研究，结果表明相较于当前主流先进算法，本方法能够产生更可靠的聚类结果。

Abstract

K-means clustering, a classic and widely-used clustering technique, is known to exhibit suboptimal performance when applied to non-linearly separable data. Numerous adjustments and modifications have been proposed to address this issue, including methods that merge K-means results from a relatively large K to obtain a final cluster assignment. However, existing methods of this nature often encounter computational inefficiencies and suffer from hyperparameter tuning. Here we present CavMerge, a novel K-means merging algorithm that is intuitive, free of parameter tuning, and computationally efficient. Operating under minimal local distributional assumptions, our algorithm demonstrates strong consistency and rapid convergence guarantees. Empirical studies on various simulated and real datasets demonstrate that our method yields more reliable clusters in comparison to current state-of-the-art algorithms.

Keywords: K-means clustering, cluster merging, local log-concavity, parameter-free, computational efficiency, non-linearly separable data, consistency guarantees, empirical evaluation

303. ❌ Correcting Source Mismatch in Flow Matching with Radial-Angular Transport

Authors: Fouad Oubari, Mathilde Mougeot Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04291v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

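Scoring rationale: The paper addresses source-distribution mismatch in Flow Matching and proposes Radial-Angular Flow Matching (RAFM) to improve generative modeling of heavy-tailed or anisotropic data. The work belongs to generative and probabilistic modeling and is unrelated to all of the keywords, which center on large language models, deep-learning techniques, and their applications. It involves no large models, no innovation in deep-learning technical principles, and no scientific application, and it neither uses nor improves any technique named in the keywords.

!!! tip deepseek-chat TL;DR

To address the mismatch between the Gaussian source distribution in Flow Matching and heavy-tailed or anisotropic data, the paper proposes Radial-Angular Flow Matching (RAFM). By matching the radial distribution of the data and reducing the remaining transport to angular alignment, it substantially improves over standard Gaussian Flow Matching while keeping a lightweight deterministic training procedure.

Abstract (Chinese translation)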

流匹配通常基于高斯源与欧几里得概率路径构建。然而，对于重尾或各向异性数据，高斯源在径向分布层面即已引发结构失配。本文提出径向-角向流匹配框架，该框架在标准免仿真流匹配模板内显式修正此类源失配问题。RAFM采用一种源分布，其径向律与数据径向律匹配，且其条件角向分布在球面上均匀，从而在构造上消除了高斯径向失配。这将剩余的传输问题简化为角向对齐，自然引出基于球面测地线插值定义的缩放球面上的条件路径。由此产生的框架可生成针对径向-角向传输定制的显式流匹配目标，且无需修改底层确定性训练流程。
我们建立了匹配径向源的精确密度表达式，证明了可分离高斯径向惩罚项的径向-角向KL分解定理，刻画了诱导的目标向量场特性，并推导出将流匹配误差与生成误差相关联的稳定性结果。我们进一步分析了径向律的经验估计方法，其中Wasserstein距离与累积分布函数度量可提供自然保证。实验表明，RAFM较标准高斯流匹配有显著提升，在与近期非高斯替代方案的比较中保持竞争力，同时维持轻量级确定性训练流程。总体而言，RAFM为重尾与极端事件数据的流匹配提供了原则性的源与路径设计范式。

Abstract

Flow Matching is typically built from Gaussian sources and Euclidean probability paths. For heavy-tailed or anisotropic data, however, a Gaussian source induces a structural mismatch already at the level of the radial distribution. We introduce Radial–Angular Flow Matching (RAFM), a framework that explicitly corrects this source mismatch within the standard simulation-free Flow Matching template. RAFM uses a source whose radial law matches that of the data and whose conditional angular distribution is uniform on the sphere, thereby removing the Gaussian radial mismatch by construction. This reduces the remaining transport problem to angular alignment, which leads naturally to conditional paths on scaled spheres defined by spherical geodesic interpolation. The resulting framework yields explicit Flow Matching targets tailored to radial–angular transport without modifying the underlying deterministic training pipeline. We establish the exact density of the matched-radial source, prove a radial–angular KL decomposition that isolates the Gaussian radial penalty, characterize the induced target vector field, and derive a stability result linking Flow Matching error to generation error. We further analyze empirical estimation of the radial law, for which Wasserstein and CDF metrics provide natural guarantees. Empirically, RAFM substantially improves over standard Gaussian Flow Matching and remains competitive with recent non-Gaussian alternatives while preserving a lightweight deterministic training procedure. Overall, RAFM provides a principled source-and-path design for Flow Matching on heavy-tailed and extreme-event data.
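
The idea of a conditional path on a scaled sphere can be illustrated with a small sketch. It is not taken from the paper: it assumes that a source point shares the radius of the data point it is paired with (so the source radial law equals the data radial law by construction), that the source direction is uniform on the sphere, and that the path between directions is a standard spherical linear interpolation (slerp). The function names `uniform_direction`, `slerp`, and `radial_angular_point` are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def norm(v: Vector) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def uniform_direction(dim: int, rng: random.Random) -> Vector:
    """Uniform point on the unit sphere: normalize an isotropic Gaussian draw."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = norm(g)
    return [x / n for x in g]


def slerp(u0: Vector, u1: Vector, t: float) -> Vector:
    """Spherical geodesic interpolation between unit vectors u0 and u1."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u0, u1))))
    omega = math.acos(dot)
    s = math.sin(omega)
    if s < 1e-8:
        # Degenerate case (identical or antipodal directions): the geodesic is
        # trivial or not unique, so this sketch simply snaps to an endpoint.
        return list(u0) if t < 0.5 else list(u1)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(u0, u1)]


def radial_angular_point(x1: Vector, t: float, rng: random.Random) -> Vector:
    """Point at time t on a path from a uniform direction to x1, at radius ||x1||.

    Assumes x1 is nonzero. The same-radius pairing is an assumption of this sketch.
    """
    r = norm(x1)
    u1 = [x / r for x in x1]
    u0 = uniform_direction(len(x1), rng)
    return [r * x for x in slerp(u0, u1, t)]
```

Every point returned by `radial_angular_point` has norm `||x1||`, so under these assumptions only the angular part moves along the path.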

Keywords: Flow Matching, Radial-Angular Transport, Source Mismatch, Heavy-tailed Data, Anisotropic Data, Gaussian Source, Spherical Geodesic Interpolation, Deterministic Training

304. ❌ DAGAF: A directed acyclic generative adversarial framework for joint structure learning and tabular data synthesis

Authors: Hristo Petkov, Calum MacLellan, Feng Dong Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04290v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

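Scoring rationale: The paper focuses on a generative adversarial framework for causal structure learning and tabular data synthesis, using DAGs and causal models (ANM, LiNGAM, PNL). It does not involve large models, deep-learning technical principles, or AI-for-science applications. All keywords concern large models, deep learning, and AI for science, and are unrelated to the paper's themes of causal inference and generative modeling.

!!! tip deepseek-chat TL;DR

The paper proposes DAGAF, a dual-step generative adversarial framework that jointly performs causal structure learning and tabular data synthesis under multiple causal model assumptions. Experiments show that it outperforms many existing methods in structure learning and generates high-quality samples.

Abstract (Chinese translation)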

理解数据变量间的因果关系能为表格数据集的构建提供关键洞见。现有因果学习方法大多侧重于应用单一可识别因果模型——如加性噪声模型（Additive Noise Model, ANM）或线性非高斯无环模型（Linear non-Gaussian Acyclic Model, LiNGAM）——来发现观测数据中呈现的依赖关系。我们对此进行改进，提出一种新颖的双阶段框架，能够在多重因果模型假设下同时执行因果结构学习与表格数据合成。该方法采用有向无环图（Directed Acyclic Graph, DAG）表征数据变量间的因果关系。通过应用包括ANM、LiNGAM及后非线性模型（Post-Nonlinear model, PNL）在内的多种函数因果模型，我们隐式学习DAG的结构以模拟观测数据的生成过程，从而有效复现真实数据分布。理论分析为此提供了支撑，阐释了构成该框架目标函数的多重损失项。实验结果表明，DAGAF在结构学习方面优于许多现有方法，在真实数据集与基准数据集上均实现了显著更低的结构汉明距离（Structural Hamming Distance, SHD）得分（相较于前沿方法，在Sachs数据集提升47%，Child数据集提升11%，Hailfinder数据集提升5%，Pathfinder数据集提升7%），同时能够生成多样化、高质量的合成样本。

Abstract

Understanding the causal relationships between data variables can provide crucial insights into the construction of tabular datasets. Most existing causality learning methods typically focus on applying a single identifiable causal model, such as the Additive Noise Model (ANM) or the Linear non-Gaussian Acyclic Model (LiNGAM), to discover the dependencies exhibited in observational data. We improve on this approach by introducing a novel dual-step framework capable of performing both causal structure learning and tabular data synthesis under multiple causal model assumptions. Our approach uses Directed Acyclic Graphs (DAG) to represent causal relationships among data variables. By applying various functional causal models including ANM, LiNGAM and the Post-Nonlinear model (PNL), we implicitly learn the contents of DAG to simulate the generative process of observational data, effectively replicating the real data distribution. This is supported by a theoretical analysis to explain the multiple loss terms comprising the objective function of the framework. Experimental results demonstrate that DAGAF outperforms many existing methods in structure learning, achieving significantly lower Structural Hamming Distance (SHD) scores across both real-world and benchmark datasets (Sachs: 47%, Child: 11%, Hailfinder: 5%, Pathfinder: 7% improvement compared to state-of-the-art), while being able to produce diverse, high-quality samples.
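
The Structural Hamming Distance (SHD) reported above can be made concrete with a short sketch. It uses one common convention, counting each missing, extra, or reversed edge once; some tools count a reversal as two errors, and the abstract does not say which convention the authors use. The function name and the example graphs in the usage note are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

Edge = Tuple[str, str]  # directed edge (parent, child)


def structural_hamming_distance(true_edges: Set[Edge], learned_edges: Set[Edge]) -> int:
    """Count edge additions, deletions, and reversals between two DAGs.

    A reversed edge is counted once in this sketch.
    """
    def skeleton(edges: Set[Edge]) -> Set[FrozenSet[str]]:
        # Drop orientation so that (a, b) and (b, a) map to the same item.
        return {frozenset(e) for e in edges}

    # Edges present in only one skeleton are missing or extra edges.
    distance = len(skeleton(true_edges) ^ skeleton(learned_edges))
    # Edges present in both skeletons but with opposite orientation.
    for parent, child in learned_edges:
        if (child, parent) in true_edges and (parent, child) not in true_edges:
            distance += 1
    return distance
```

For example, with true edges `{("A", "B"), ("B", "C"), ("A", "C")}` and learned edges `{("A", "B"), ("C", "B"), ("C", "D")}`, the learned graph misses A→C, adds C→D, and reverses B→C, so the distance is 3. Lower is better.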

Keywords: causal structure learning, tabular data synthesis, directed acyclic graphs, generative adversarial framework, functional causal models, structural hamming distance, observational data, data distribution

305. ❌ A Logical-Rule Autoencoder for Interpretable Recommendations

Authors: Jinhao Pan, Bowen Wei, Ziwei Zhu Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04270v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

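Scoring rationale: The paper proposes a Logical-rule Interpretable Autoencoder (LIA) for collaborative filtering, focusing on interpretability by design in recommender systems. Its core contribution is intrinsic interpretability through a learnable logical rule layer, which is highly relevant to the keyword "Mechanistic Interpretability OR Explainable AI" (relevance 10), because the paper directly addresses the black-box problem of deep learning models and provides a traceable decision process. However, it does not cover large language models (LLMs), MoE, small language models, scaling laws, pre-training/post-training, alignment, RAG, context extension, inference acceleration, hallucination mitigation, world models, model merging, in-context learning, or AI for science, so those keywords receive 0. The paper is conventional deep-learning recommender-system research rather than an innovative application of large models or frontier deep-learning techniques.

!!! tip deepseek-chat TL;DR

The paper proposes a Logical-rule Interpretable Autoencoder (LIA) to address the black-box problem of deep learning recommendation models. It achieves intrinsic interpretability through a learnable logical rule layer and shows better recommendation performance than traditional baselines in experiments.

Abstract (Chinese translation)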

大多数深度学习推荐模型以黑箱方式运行，依赖难以解释的潜在表征，这使其决策过程缺乏透明度。这种内在可解释性的缺失在需要透明度和问责机制的应用场景中引发了担忧。本研究提出一种专为协同过滤设计的逻辑规则可解释自编码器（Logical-rule Interpretable Autoencoder, LIA），其结构本身具备可解释性。LIA引入了一个可学习的逻辑规则层，其中每个规则神经元配备门控参数，可在训练过程中自动选择AND或OR逻辑运算符，使模型能够直接从数据中发现多样化的逻辑模式。为实现功能完备性同时避免输入维度翻倍，LIA通过连接权重的符号编码否定关系，为每条规则中同时表达肯定和否定的项目条件提供了参数高效的机制。通过学习显式、人类可读的重构规则，LIA使用户能够直接追溯每条推荐背后的决策过程。大量实验表明，该方法在保持完全可解释性的同时，其推荐性能优于传统基线模型。代码与数据已公开于https://github.com/weibowen555/LIA。

Abstract

Most deep learning recommendation models operate as black boxes, relying on latent representations that obscure their decision process. This lack of intrinsic interpretability raises concerns in applications that require transparency and accountability. In this work, we propose a Logical-rule Interpretable Autoencoder (LIA) for collaborative filtering that is interpretable by design. LIA introduces a learnable logical rule layer in which each rule neuron is equipped with a gate parameter that automatically selects between AND and OR operators during training, enabling the model to discover diverse logical patterns directly from data. To support functional completeness without doubling the input dimensionality, LIA encodes negation through the sign of connection weights, providing a parameter-efficient mechanism for expressing both positive and negated item conditions within each rule. By learning explicit, human-readable reconstruction rules, LIA allows users to directly trace the decision process behind each recommendation. Extensive experiments show that our method achieves improved recommendation performance over traditional baselines while remaining fully interpretable. Code and data are available at https://github.com/weibowen555/LIA.
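
A gated rule neuron of the kind described above might look roughly like the following sketch. The formulation is an assumption made for illustration, not the paper's definition: inputs are item indicators in [0, 1], the sign of each weight selects the item or its negation, the weight magnitude (clipped to 1) says how strongly the item participates, AND and OR are product-style soft operators, and a sigmoid of `gate_logit` blends the two. The name `rule_neuron` is hypothetical; the authors' actual code is in the repository linked in the abstract.

```python
import math
from typing import Sequence


def rule_neuron(x: Sequence[float], weights: Sequence[float], gate_logit: float) -> float:
    """Soft logical rule over item indicators x in [0, 1].

    weight > 0 uses the item, weight < 0 uses its negation, weight = 0 ignores it.
    gate_logit >> 0 behaves like AND, gate_logit << 0 behaves like OR.
    """
    and_value = 1.0   # product over literals; any false literal pulls it to 0
    none_true = 1.0   # product tracking "no selected literal is true" (for OR)
    for x_i, w_i in zip(x, weights):
        strength = min(abs(w_i), 1.0)
        literal = x_i if w_i >= 0 else 1.0 - x_i  # negation via the weight sign
        and_value *= 1.0 - strength * (1.0 - literal)
        none_true *= 1.0 - strength * literal
    or_value = 1.0 - none_true
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # 1 selects AND, 0 selects OR
    return gate * and_value + (1.0 - gate) * or_value
```

With `weights = [1.0, -1.0, 0.0]` the neuron reads as "item 0 AND NOT item 1" when the gate is near 1, and as "item 0 OR NOT item 1" when the gate is near 0. Encoding negation in the sign avoids a second input column for each negated item.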

Keywords: interpretable recommendations, logical-rule autoencoder, collaborative filtering, explainable AI, transparent decision process, human-readable rules, parameter-efficient mechanism, recommendation performance

306. ❌ Beyond Fluency: Toward Reliable Trajectories in Agentic IR

Authors: Anushree Sinha, Srivaths Ranganathan, Debanshu Das, Abhishek Dharmaratnakar Journal/Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04269v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

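Scoring rationale: The paper focuses on multi-step workflows of autonomous agents in information retrieval, discussing error modes and reliability in the planning, retrieval, reasoning, and execution of Agentic IR systems. Highly relevant keywords: LLM Agents / Tool Use (the paper's core topic), Chain of Thought (Reason-Act-Observe loops), Self-Correction (verification gates and systematic abstention), Hallucination Mitigation (countering deceptive fluency), and RAG (retrieval-augmented generation). Moderately relevant keywords: LLMs / Foundation Models (agents are usually built on large models), System 2 Thinking (in-depth reasoning), Alignment (functional alignment), Multi-agent Systems (coordination may be involved), and Explainable AI (causal attribution). The remaining keywords concern technical details (such as model architecture, training methods, and optimization techniques) that the paper does not address.

!!! tip deepseek-chat TL;DR

The paper examines the reliability of multi-step workflows of autonomous agents in information retrieval and proposes verification gates and systematic abstention to ensure trajectory integrity and correct execution.

Abstract (Chinese translation)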

信息检索正从被动的文档排序转向在多步骤“推理-行动-观察”循环中运行的自主智能体工作流。在此类长程任务轨迹中，早期的微小错误可能产生级联效应，导致系统内部推理与外部工具执行之间的功能失调，尽管其语言表达仍保持流畅。本立场论文综合了在工业级智能体系统中观察到的故障模式，将错误归类于规划、检索、推理与执行等环节。我们认为，安全部署需超越终端准确性指标，转向关注轨迹完整性与因果归因。为应对复合型错误与具有欺骗性的流畅性，我们提出在每个交互单元设置验证节点，并主张在校准不确定性的前提下进行系统性弃权。可靠的智能体信息检索系统必须优先确保过程正确性与基于现实的执行，而非追求看似合理但未经核验的任务完成度。

Abstract

Information Retrieval is shifting from passive document ranking toward autonomous agentic workflows that operate in multi-step Reason-Act-Observe loops. In such long-horizon trajectories, minor early errors can cascade, leading to functional misalignment between internal reasoning and external tool execution despite continued linguistic fluency. This position paper synthesizes failure modes observed in industrial agentic systems, categorizing errors across planning, retrieval, reasoning, and execution. We argue that safe deployment requires moving beyond endpoint accuracy toward trajectory integrity and causal attribution. To address compounding error and deceptive fluency, we propose verification gates at each interaction unit and advocate systematic abstention under calibrated uncertainty. Reliable Agentic IR systems must prioritize process correctness and grounded execution over plausible but unverified completion.
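
The proposal of a verification gate at each interaction unit, combined with abstention under uncertainty, can be sketched as a simple control loop. This is an illustration under assumptions, not the authors' design: each step is assumed to report a calibrated confidence, `verify` stands in for whatever check a system applies, and the names `StepResult`, `run_trajectory`, and `min_confidence` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepResult:
    output: str
    confidence: float  # assumed to be a calibrated probability that the step is correct


def run_trajectory(
    steps: List[Callable[[str], StepResult]],
    verify: Callable[[str, StepResult], bool],
    query: str,
    min_confidence: float = 0.7,
) -> Optional[str]:
    """Run steps in order; return None (abstain) as soon as a gate fails."""
    state = query
    for step in steps:
        result = step(state)
        # Gate: stop instead of letting an early error cascade through later steps.
        if result.confidence < min_confidence or not verify(state, result):
            return None
        state = result.output
    return state
```

Returning `None` at the first failed gate trades task completion for process correctness, which is the priority the abstract argues for.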

Keywords: Agentic IR, autonomous agents, multi-step workflows, trajectory integrity, verification gates, systematic abstention, causal attribution, functional misalignment

307. ❌ Avoiding Non-Integrable Beliefs in Expectation Propagation

Authors: Zilu Zhao, Jichao Chen, Dirk Slock Source: arXiv Published: 2026-04-05 arXiv link: http://arxiv.org/abs/2604.04264v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

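Scoring rationale: The paper studies a mathematical optimization problem for the Expectation Propagation (EP) algorithm in Bayesian inference, specifically the problem of non-integrable beliefs, and applies it to signal recovery in the Generalized Linear Model (GLM). Its content falls entirely within classical probabilistic graphical models, variational inference, and Bayesian statistics, and involves no large language models, deep learning, AI for Science, or related techniques (such as fine-tuning, alignment, or inference acceleration). The paper matches none of the keywords, which all concern large models or applications of deep learning in science, so every relevance score is 0.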
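
The pass/fail verdict above can be reproduced with a short sketch. The passing-score formula (5.0 + 0.8 × total weight) comes from the report's scoring settings; the assumption that a paper's total is the plain sum of its per-keyword scores is not stated in the report, and the function names and placeholder keyword labels are hypothetical.

```python
from typing import List, Tuple

BASE_SCORE = 5.0     # constant term of the report's passing-score formula
WEIGHT_FACTOR = 0.8  # multiplier applied to the total keyword weight


def passing_score(total_weight: float) -> float:
    """Passing score as defined in the report's scoring settings."""
    return round(BASE_SCORE + WEIGHT_FACTOR * total_weight, 1)


def total_score(rows: List[Tuple[str, float]]) -> float:
    """ASSUMPTION: the paper total is the sum of the per-keyword scores."""
    return round(sum(score for _, score in rows), 1)


def verdict(rows: List[Tuple[str, float]], total_weight: float) -> str:
    """Return the pass/fail mark used in the report."""
    return "✅" if total_score(rows) >= passing_score(total_weight) else "❌"


# All 27 keyword rows of this paper score 0.0 (placeholder labels).
rows = [(f"keyword_{i}", 0.0) for i in range(1, 28)]
print(total_score(rows), "/", passing_score(27.0), verdict(rows, 27.0))
```

With a total keyword weight of 27.0 the threshold evaluates to 26.6, which matches the value shown in the score line for this paper.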

!!! tip deepseek-chat TL;DR

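The paper proposes two new expectation propagation frameworks that keep the algorithm's beliefs integrable while allowing non-integrable messages, and applies them to the signal recovery problem in generalized linear models.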

Abstract

Expectation Propagation (EP) is a widely used iterative message-passing algorithm that decomposes a global inference problem into multiple local ones. It approximates marginal distributions as "beliefs" using intermediate functions called "messages". It has been shown that the stationary points of EP are the same as corresponding constrained Bethe Free Energy (BFE) optimization problem. Therefore, EP is an iterative method of optimizing the constrained BFE. However, the iterative method may fall out of the feasible set of the BFE optimization problem, i.e., the beliefs are not integrable. In most literature, the authors use various methods to keep all the messages integrable. In most Bayesian estimation problems, limiting the messages to be integrable shrinks the actual feasible set. Furthermore, in extreme cases where the factors are not integrable, making the message itself integrable is not enough to have integrable beliefs. In this paper, two EP frameworks are proposed to ensure that EP has integrable beliefs. Both of the methods allows non-integrable messages. We then investigate the signal recovery problem in Generalized Linear Model (GLM) using our proposed methods.

Keywords: Expectation Propagation, Beliefs, Integrability, Bethe Free Energy, Generalized Linear Model, Signal Recovery, Bayesian Inference, Message-passing Algorithm
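
A standard one-dimensional Gaussian-form example shows why a belief can be integrable even when some of its messages are not. This is a textbook illustration rather than the paper's framework, and the symbols $\eta_i$ and $\lambda_i$ are introduced here only for the example.

```latex
m_i(x) = \exp\!\left(\eta_i x - \tfrac{1}{2}\lambda_i x^2\right),
\qquad
b(x) \propto \prod_i m_i(x)
     = \exp\!\left(\Big(\sum_i \eta_i\Big) x
       - \tfrac{1}{2}\Big(\sum_i \lambda_i\Big) x^2\right),
\qquad
\int_{-\infty}^{\infty} b(x)\,dx < \infty
\iff \sum_i \lambda_i > 0 .
```

For instance, with $\lambda_1 = 3$ and $\lambda_2 = -1$, the message $m_2$ is not integrable on its own, yet the belief has total precision $2 > 0$ and is a proper Gaussian. Conversely, requiring every $\lambda_i > 0$ is a stronger condition than requiring only that the sum be positive, which is one way to read the abstract's remark that restricting messages to be integrable shrinks the feasible set.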

Token usage statistics

Total: 1,021,277 tokens (input 715,016 / output 306,261)

Model	Input	Output	Total
deepseek-chat	550,763	306,261	857,024
glm-4.7	164,253	0	164,253
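
The totals can be cross-checked against the per-model rows with a few lines of Python. The figures are copied from the table above; the dictionary layout and variable names are only illustrative.

```python
# Per-model token counts copied from the table above.
usage = {
    "deepseek-chat": {"input": 550_763, "output": 306_261},
    "glm-4.7": {"input": 164_253, "output": 0},
}

# Sum each column across models, then combine into the grand total.
total_input = sum(counts["input"] for counts in usage.values())
total_output = sum(counts["output"] for counts in usage.values())
grand_total = total_input + total_output

print(f"input {total_input:,} / output {total_output:,} / total {grand_total:,}")
```

The per-model sums agree with the reported totals of 715,016 input tokens, 306,261 output tokens, and 1,021,277 tokens overall.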

📋 All papers

1. ✅ Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents

2. ✅ PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning

3. ✅ GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

4. ✅ How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

5. ✅ Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework

6. ✅ MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

7. ✅ DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

8. ✅ MERIT: Multilingual Expert-Reward Informed Tuning for Chinese-Centric Low-Resource Machine Translation

9. ✅ SODA: Semi On-Policy Black-Box Distillation for Large Language Models

10. ✅ Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs

11. ✅ Temporal Inversion for Learning Interval Change in Chest X-Rays

12. ✅ A Family of Open Time-Series Foundation Models for the Radio Access Network

13. ✅ AI Trust OS – A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Compliance in Enterprise Environments

14. ✅ Optimizing Service Operations via LLM-Powered Multi-Agent Simulation

15. ❌ Uncertainty as a Planning Signal: Multi-Turn Decision Making for Goal-Oriented Conversation

16. ❌ CPT: Controllable and Editable Design Variations with Language Models

17. ❌ An AI Teaching Assistant for Motion Picture Engineering

18. ❌ MolDA: Molecular Understanding and Generation via Large Language Diffusion Model

19. ❌ Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Validation

20. ❌ Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

21. ❌ Lighting Up or Dimming Down? Exploring Dark Patterns of LLMs in Co-Creativity

22. ❌ High-Stakes Personalization: Rethinking LLM Customization for Individual Investor Decision-Making

23. ❌ SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection

24. ❌ HI-MoE: Hierarchical Instance-Conditioned Mixture-of-Experts for Object Detection

25. ❌ Training-Free Image Editing with Visual Context Integration and Concept Alignment

26. ❌ ECG Biometrics with ArcFace-Inception: External Validation on MIMIC and HEEDB

27. ❌ Multimodal Backdoor Attack on VLMs for Autonomous Driving via Graffiti and Cross-Lingual Triggers

28. ❌ Your Pre-trained Diffusion Model Secretly Knows Restoration

29. ❌ Vero: An Open RL Recipe for General Visual Reasoning

30. ❌ Early Stopping for Large Reasoning Models via Confidence Dynamics

31. ❌ How AI Aggregation Affects Knowledge

32. ❌ Analyzing Symbolic Properties for DRL Agents in Systems and Networking

33. ❌ FileGram: Grounding Agent Personalization in File-System Behavioral Traces

34. ❌ Agentic Federated Learning: The Future of Distributed Training Orchestration

35. ❌ QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

36. ❌ Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation

37. ❌ Learning, Potential, and Retention: An Approach for Evaluating Adaptive AI-Enabled Medical Devices

38. ❌ Incompleteness of AI Safety Verification via Kolmogorov Complexity

39. ❌ Muon Dynamics as a Spectral Wasserstein Flow

40. ❌ DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

41. ❌ Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN’s Attention Mechanisms

42. ❌ InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement

43. ❌ Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not

44. ❌ LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection

45. ❌ Selecting Decision-Relevant Concepts in Reinforcement Learning

46. ❌ ANX: Protocol-First Design for AI Agent Interaction with a Supporting 3EX Decoupled Architecture

47. ❌ A Quantum Search Approach to Magic Square Constraint Problems with Classical Benchmarking

48. ❌ SkillX: Automatically Constructing Skill Knowledge Bases for Agents

49. ❌ Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems

50. ❌ Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange

51. ❌ Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

52. ❌ Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations

53. ❌ Artificial Intelligence and Cost Reduction in Public Higher Education: A Scoping Review of Emerging Evidence

54. ❌ Sampling Parallelism for Fast and Efficient Bayesian Learning

55. ❌ Discovering Failure Modes in Vision-Language Models using RL

56. ❌ Metaphors We Compute By: A Computational Audit of Cultural Translation vs. Thinking in LLMs

57. ❌ Neuromorphic Computing for Low-Power Artificial Intelligence

58. ❌ Individual and Combined Effects of English as a Second Language and Typos on LLM Performance

59. ❌ AI Assistance Reduces Persistence and Hurts Independent Performance